Workflow - Swif

What is Swif?

Swif, the" scientific workflow indefatigable factotum", is a system that aims to simplify the use of Jefferson Lab's batch system.  As the name implies, it will work tirelessly on your behalf so that that you need not expend unnecessary effort to make good use of the compute farm.

The goal of this initial release is to provide some features that are lacking in Auger.  While future versions of Swif may bypass Auger entirely, the current version functions as a middleman between you and Auger, providing the following enhanced capabilities:

  • Tape-savvy job scheduling
  • Job grouping and phased release
  • Automatic classification of errors
  • Mass job modification, resubmission, recall, cancelation
  • Ability to specify job outputs at runtime
  • Mapping of jobs to product files
  • At-a-glance status information
  • Detailed metrics

All of these features are accessed via the single command line tool /site/bin/swif.  You can run swif -help for specific usage information.
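
For example (the exact help text depends on the installed version, and it is assumed here that per-command usage can be requested the same way):

  /site/bin/swif -help
  swif run -help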

How does Swif work?

Swif is an always-on service.  Once it knows about the jobs you want to run, it will dispatch them to Auger in a well-considered manner and monitor their progress.  If any of your jobs encounter problems that require intervention, it will suspend further dispatches until those errors are resolved.  You can add new jobs, cancel, modify and/or resubmit existing jobs at any time.  Swif will handle the busy work for you. 
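
As a rough sketch of this interaction (the add-job subcommand and the -workflow, -project, -track, and -input options shown below are assumptions for illustration; consult swif -help for the exact syntax on your system):

  # Assumed syntax for illustration only -- check swif -help for the real options.
  swif add-job -workflow my_analysis -project myproject -track analysis \
      -input run001.evio mss:/mss/halld/rawdata/run001.evio \
      ./myscript.sh run001.evio
  swif run -workflow my_analysis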

Improving throughput

One factor that greatly influences job throughput is the accessibility of input files.  In particular, the distribution of files across tapes can impose complex constraints on an efficient ordering of jobs.  Data written to tape acquires a fixed ordering that dictates the optimal retrieval sequence: it is most efficient to start reading at the beginning of a tape and stream through to the end.  Rewinding, seeking, loading and unloading tapes are costly operations, not only because of the time involved but also in terms of media degradation.  The tape system therefore takes a greedy approach when scheduling reads: it will hold onto a tape as long as fairness allows, reading files in the order they were written, which may not necessarily yield the best results for your jobs.

Because the tape system provides reliable long-term storage for the experimental halls as well as for all of the lab's scientific computing users, it can become quite busy.  It can nevertheless provide very high throughput owing to the number of high-speed tape drives that can operate in parallel.  Swif releases jobs to Auger in a manner that provides the best possible throughput given the prevailing load.  To the extent possible, it will release jobs in bunches that share a common set of input tapes and whose data lie close together on those tapes.  It attempts to co-schedule the dispatch of jobs such that the set of input tapes can occupy parallel drives when they are available.  If all of the jobs released in a batch were confined to a smaller number of tapes, the average wait time would increase, since files further down a tape cannot be read before their antecedents.

Swif also ensures that already dispatched jobs will have the opportunity to start running before releasing more jobs that might cause the tape system to hold onto a tape instead of loading one required by a previously released job.  One common scenario with Auger is that each job in a group with a large number of inputs can end up waiting around longer than necessary for one or two straggling files, even after having retrieved the bulk of its input data.  If the data set is large enough, the cache system might end up flushing out some of the previously fetched files in order to make room for the stragglers!

In short, Swif avoids releasing too many jobs at a time, but carefully chooses the ones it does release to balance efficient tape access against job latency. 

The life cycle of a job

When you add a job to a workflow, it is placed into a pending queue.  When it is dispatched to Auger, it is placed into the running queue.  If it encounters some problem, it will be moved to the problem queue; otherwise it will end up in the success list.  From the problem queue, it can be moved back to the pending queue, possibly with some modifications to prevent a recurrence of the initial problem (e.g. an increased RAM requirement to handle an over-resource-limit error), or it can be moved to the failed or canceled lists.

Starting and stopping a workflow

The swif commands run and pause will start (or resume) and suspend a workflow.  Note that a newly created workflow is initially suspended; you must explicitly start it.  Unless you specify a limit, starting a workflow will cause it to continue running jobs, including new ones that you may add.  Any errors, however, will cause the workflow to stop dispatching jobs, and you must resolve these errors before it will resume.  You can always add new jobs to a workflow, even after canceling (unless you use the swif cancel -delete command).
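
For example (the -workflow selector shown here is an assumption for illustration; see swif -help for the exact form):

  swif run -workflow my_analysis      # start or resume dispatching jobs
  swif pause -workflow my_analysis    # stop dispatching; already running jobs continue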

You can limit the release of jobs in two ways.  If you wish to group jobs in sequence, you can assign phase numbers to them.  No job with a higher phase number will start until all those with a lower phase number have completed or been explicitly abandoned.  When you run your workflow, you can specify a phase limit via the -phaselimit option.  Doing so will cause it to run all jobs with phase numbers up to and including the limit.  This feature is completely optional and has no effect if you do not use it.
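
As a hedged example, suppose jobs were added with phase 1 for calibration and phase 2 for reconstruction (how the phase number is attached at job-creation time is not shown here and will depend on your add-job options):

  swif run -workflow my_analysis -phaselimit 1    # release only phase-1 jobs
  swif run -workflow my_analysis -phaselimit 2    # later, release phase-2 jobs as well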

You can also specify a limit on the total number of job attempts before suspending via the -joblimit option.  An attempt corresponds to one job run.  Since jobs can be re-run, one job may have many attempts.
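
For instance, to stop dispatching after one hundred attempts (again assuming a -workflow selector):

  swif run -workflow my_analysis -joblimit 100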

Pausing a workflow will prevent further jobs from being dispatched, but will allow those already dispatched to complete.  If you wish to terminate running jobs as well, you can specify swif pause -now.

Handling problem jobs

You can inspect jobs in the problem queue by running the command swif status -problems.  This will show some information about each job in the problem queue, including the reason for its failure. 

If it seems the most reasonable thing to do is simply to retry the jobs (for example, if the error was owing to a system issue), you could issue the command swif retry-jobs -problems SWIF-SYSTEM-ERROR.  This will place all jobs that failed with SWIF-SYSTEM-ERROR back into the pending list.  

On the other hand, if you feel that the jobs need to be modified (maybe they need more time), you would issue a command like swif modify-jobs -time add 2h -problems AUGER-TIMEOUT.  This will add two hours to every job in the problem queue that failed with a timeout error.  Of course, if some of your jobs fail this way, it may happen that others will too, so you might be better off using the command swif modify-jobs -time add 2h all.  This will handle the problem jobs, but will also apply to any other jobs that have not completed.  Any running jobs will be canceled, modified, then placed back into the dispatch pool.

The most drastic approach is simply to abandon the jobs via the command swif abandon-jobs.  This will move problem jobs into the failed list.

All three of these commands (retry-jobs, modify-jobs, abandon-jobs) select and operate upon jobs in a similar manner.
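
Putting these together, a typical triage sequence might look like the following (the -problems selector for abandon-jobs and the final error label are assumptions for illustration):

  swif status -problems                                   # see what went wrong and why
  swif retry-jobs -problems SWIF-SYSTEM-ERROR             # re-run jobs that hit system errors
  swif modify-jobs -time add 2h -problems AUGER-TIMEOUT   # give timed-out jobs two more hours
  swif abandon-jobs -problems <error-type>                # give up on anything not worth fixing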

Canceling a workflow

Use the command swif cancel to terminate a workflow.  This will cancel any running jobs and move pending ones into the canceled list.  If you wish to completely obliterate everything about the workflow, you can specify swif cancel -delete -discard-tape-files -discard-disk-files.  This will completely delete the workflow as well as any files it wrote to tape or disk via job output specification.
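
In other words, the two forms differ in how much they remove (the -workflow selector is again an assumption):

  swif cancel -workflow my_analysis                       # stop jobs; workflow and outputs remain
  swif cancel -workflow my_analysis -delete -discard-tape-files -discard-disk-files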

Monitoring a workflow

The command swif status provides a variety of glimpses into the state of your workflow.  By default, it will show how many jobs have yet to be dispatched, how many are currently in the batch system, and how many have succeeded, failed, or been canceled.  It also shows how many attempts have been made to run jobs, and whether the workflow is suspended or has problem jobs.

Requesting other views of the workflow status will display problem jobs, job products, resource utilization for job runs, and the Auger/PBS status of dispatched jobs.

As with every swif command, you can get further usage details via command-line help.

Dynamically specifying job output files

If you don’t know exactly what files your job will produce until it runs, you can postpone specifying them until you do know.  Within your job you can use the command swif outfile.  Using this feature, you can get the benefit of having your output files copied by the batch system and recorded as products associated with the job.  Also, any output files that exist on job failure will be copied to /volatile in case they may prove useful in debugging, or may in fact represent valid output.
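
A minimal sketch of how this might appear inside a job script follows; the argument order shown for swif outfile (local file first, then destination) is an assumption, so check its command-line help for the actual syntax:

  #!/bin/sh
  # Run the analysis; the names of the output files are only known at runtime.
  ./run_analysis config.txt
  for f in output_*.root; do
      # Assumed argument order: local file, then destination path.
      swif outfile "$f" /mss/halld/myanalysis/"$f"
  done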