Workflow - Swif



What is Swif?

Swif, the" scientific workflow indefatigable
factotum", is a system that aims to simplify the use of Jefferson Lab's
batch system.  As the name implies, it
will work tirelessly on your behalf so that that you need not expend
unnecessary effort to make good use of the compute farm.

The goal of this initial release is to provide some features
that are lacking in Auger.  While future
versions of Swif may bypass Auger entirely, the current version functions as a
middleman between you and Auger, providing the following enhanced capabilities:

  • Tape-savvy job scheduling
  • Job grouping and phased release
  • Automatic classification of errors
  • Mass job modification, resubmission, recall,
    cancelation
  • Ability to specify job outputs at runtime
  • Mapping of jobs to product files
  • At-a-glance status information
  • Detailed metrics

All of these features are accessed via the single command-line tool /site/bin/swif.  You can run swif -help for specific usage information.
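For example:

    /site/bin/swif -help

Each subcommand likewise accepts command-line help for its own options.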

How does Swif work?

Swif is an always-on service.  Once it knows about the jobs you want to run,
it will dispatch them to Auger in a well-considered manner and monitor their
progress.  If any of your jobs encounter
problems that require intervention, it will suspend further dispatches until
those errors are resolved.  You can add
new jobs, cancel, modify and/or resubmit existing jobs at any time.  Swif will handle the busy work for you. 

Improving throughput

One factor that greatly influences job throughput is the accessibility of input files.  In particular, the distribution of files across tapes can impose complex constraints on an efficient ordering of jobs.  A fundamental fact of tape storage is that files acquire a fixed ordering as they are written, and that ordering dictates the optimal retrieval sequence: it is most efficient to start reading at the beginning of a tape and stream through to the end.  Rewinding, seeking, loading and unloading tapes are costly operations, not only because of the time involved but also in terms of media degradation.  The tape system therefore takes a greedy approach when scheduling reads: it will hold onto a tape as long as fairness allows, reading files in the order they were written, which may not necessarily yield the best results for your jobs.

Because the tape system provides reliable long-term storage for the experimental halls as well as for all of the lab's scientific computing users, it can become quite busy.  It can nonetheless provide very high throughput owing to the number of high-speed tape drives that can operate in parallel.  Swif releases jobs to Auger in a manner that provides the best possible throughput given the prevailing load.  To the extent possible, it will release jobs in bunches that share a common set of input tapes and whose data sit close together on those tapes.  It attempts to co-schedule the dispatch of jobs such that the set of input tapes can occupy parallel drives when they are available.  If the jobs released in a batch were limited to a smaller number of tapes, the average wait time would increase, since files further down a tape would not be read before their antecedents.

Swif also ensures that already dispatched jobs will have the opportunity to start running before it releases more jobs that might cause the tape system to hold onto a tape instead of loading one required by a previously released job.  One common scenario with Auger is that each job in a group with a large number of inputs can end up waiting around longer than necessary for one or two straggling files, even after having retrieved the bulk of its input data.  If the data set is large enough, the cache system might end up flushing some of the previously fetched files in order to make room for the stragglers!

In
short, Swif avoids releasing too many jobs at a time, but carefully chooses the
ones it does release to balance efficient tape access against job latency. 

The life cycle of a job

When you add a job to a workflow, it is placed into a pending queue.  When it is dispatched to Auger, it moves to the running queue.  If it encounters a problem, it is moved to the problem queue; otherwise it ends up in the success list.  From the problem queue it can be moved back to the pending queue, possibly with some modifications to prevent a recurrence of the initial problem (e.g. an increased RAM requirement to handle an over-resource-limit error), or it can be moved to the failed or canceled lists.

Starting and stopping a workflow

The swif commands run and pause will start (or resume) and suspend a workflow.  Note that a newly created workflow is initially suspended; you must explicitly start it.  Unless you specify a limit, starting a workflow will cause it to continue running jobs, including new ones that you may add.  Any errors, however, will cause the workflow to stop dispatching jobs, and you must resolve them before it will resume.  You can always add new jobs to a workflow, even after canceling it (unless you use the swif cancel -delete command).
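As a sketch, assuming your workflow is named my_workflow and is identified with a -workflow option (neither is shown above; check swif -help for the exact usage):

    swif run -workflow my_workflow     # start or resume dispatching jobs
    swif pause -workflow my_workflow   # stop dispatching; already-dispatched jobs finish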

You can limit the release of jobs in two ways.  If you wish to group jobs in sequence, you can assign phase numbers to them.  No job with a higher phase number will start until all those with a lower phase number have completed or been explicitly abandoned.  When you run your workflow, you can specify a phase limit via the -phaselimit option.  Doing so will cause it to run all jobs with phase numbers up to and including the limit.  This feature is completely optional and has no effect if you do not assign phases.

You can also limit the total number of job attempts before the workflow suspends, via the -joblimit option.  An attempt corresponds to one job run; since jobs can be re-run, one job may have many attempts.
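Continuing the sketch above (the my_workflow name and -workflow option are still assumptions), the two limits look like:

    swif run -workflow my_workflow -phaselimit 2    # run jobs in phases up to and including 2
    swif run -workflow my_workflow -joblimit 500    # suspend after 500 job attempts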

Pausing
a workflow will prevent further jobs from being dispatched, but will allow
those already dispatched to complete.  If
you wish to terminate running jobs as well, you can specify swif pause -now.
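For example (same assumptions as above):

    swif pause -workflow my_workflow        # let running jobs complete
    swif pause -workflow my_workflow -now   # terminate running jobs as well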

Handling problem jobs

You
can inspect jobs in the problem queue by running the command swif status -problems.  This will show some information about each
job in the problem queue, including the reason for its failure. 
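For example (workflow name and -workflow option assumed, as above):

    swif status -workflow my_workflow -problems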

If the most reasonable thing to do is simply to retry the jobs (for example, if the error was owing to a system issue), you could issue the command swif retry-jobs -problems SWIF-SYSTEM-ERROR.  This will place all jobs that failed with SWIF-SYSTEM-ERROR back into the pending list.
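Put together (workflow name and -workflow option assumed as before):

    swif retry-jobs -workflow my_workflow -problems SWIF-SYSTEM-ERROR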

On the other hand, if you feel that the jobs need to be modified (maybe they need more time), you would issue a command like swif modify-jobs -time add 2h -problems AUGER-TIMEOUT.  This will add two hours to every job in the problem queue that failed with a timeout error.  Of course, if some of your jobs fail this way, it may happen that others will too, so you might be better off using the command swif modify-jobs -time add 2h all.  This will handle the problem jobs, but will also apply to any other jobs that have not completed.  Any running jobs will be cancelled, modified, then placed back into the dispatch pool.
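For example (same assumptions as before):

    swif modify-jobs -workflow my_workflow -time add 2h -problems AUGER-TIMEOUT   # only problem jobs that timed out
    swif modify-jobs -workflow my_workflow -time add 2h all                       # every job that has not completed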

The most drastic approach is simply to abandon the jobs via the command swif abandon-jobs.  This will move problem jobs into the failed list.
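By analogy with retry-jobs, a selector can be given to choose which problem jobs to give up on; the selector shown here, like the -workflow option, is an assumption:

    swif abandon-jobs -workflow my_workflow -problems AUGER-TIMEOUT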

All three of these commands (retry-jobs, modify-jobs, abandon-jobs) select and operate upon jobs in a similar manner.

Canceling a workflow

Use the command swif cancel to terminate a workflow.  This will cancel any running jobs and move pending ones into the canceled list.  If you wish to completely obliterate everything about the workflow, you can specify swif cancel -delete -discard-tape-files -discard-disk-files.  This will completely delete the workflow as well as any files it wrote to tape or disk via job output specification.
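For example (workflow name and -workflow option assumed, as elsewhere):

    swif cancel -workflow my_workflow
    swif cancel -workflow my_workflow -delete -discard-tape-files -discard-disk-files   # remove the workflow and its output files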

Monitoring a workflow

The command swif status provides a variety of glimpses into the state of your workflow.  By default, it will show how many jobs have yet to be dispatched, how many are currently in the batch system, and how many have succeeded, failed, or been canceled.  It also shows how many attempts have been made to run jobs, and whether the workflow is suspended or has problem jobs.

Requesting other views on the workflow status will display problem jobs, job products, resource utilization for job runs, and the Auger/PBS status of dispatched jobs.

As with every swif command, you can get further usage details via command-line help.

Dynamically specifying job output files

If you don’t know exactly what files your job will produce
until it runs, you can postpone specifying them until you do know.  Within your job you can use the command swif outfile.  Using this feature, you can get the benefit
of having your output files copied by the batch system and recorded as products
associated with the job.  Also, any
output files that exist on job failure will be copied to /volatile in case they
may prove useful in debugging, or may in fact represent valid output.
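A minimal sketch of a job script that registers its products at runtime is shown below; the script name my_analysis, the output-file pattern, the destination path, and the argument order for swif outfile are all placeholders and assumptions, so consult the command-line help for the actual syntax.

    #!/bin/bash
    # run the analysis; the output file names are only known once it finishes
    ./my_analysis input.evio
    # register each product with Swif so it is copied and recorded (arguments assumed)
    for f in output_*.root; do
        swif outfile "$f" /work/mydir/"$f"
    done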