SWIF2 - Scientific Workflow Indefatigable Factotum, Version 2

Swif2 is a software system developed at Jefferson Lab that aims to help users schedule, run, and track computation jobs on batch systems both locally and at remote sites. Users create named workflows to which they add units of work (jobs) with associated runtime parameters and optional input/output files. The swif service handles moving data files to and from compute sites, dealing with tape I/O and network transfers where needed. Users can monitor the overall status of a workflow as well as its associated jobs, and can interact with jobs to cancel, retry, and modify them to resolve the issues that inevitably arise in large workflows. Within a workflow, users can define inter-job dependencies and phased ordering for jobs.

Swif2 (hereafter simply referred to as swif) evolved out of the original swif1 system that was designed as a layer above the Auger tool also developed at Jefferson Lab. Swif2 differs from swif1 in several notable ways. First, it assumes that the underlying batch systems to which it dispatches jobs support the SLURM scheduling and execution system (SLURM is used at most large-scale scientific compute facilities, including Jefferson Lab, NERSC, and many other Department of Energy sites). Second, it has built-in support for running jobs at multiple sites. Importantly, while many of the commands are similar to those in swif1, they have changed in some subtle ways and the output is vastly different. Any tools written to parse the output of swif1 commands will need to be significantly modified to work with swif2.

Read the command line reference

Differences from swif1

Most commands are essentially the same. The primary difference is in the information output. For example, swif status adds several fields, such as finer details about the status of dispatched attempts, and removes a few, such as individual error-type counts. Consult the command reference for output details.

Migrating from Auger

The simplest way to migrate from Auger is to use the swif add-jsub command, which takes an Auger submission file as passed to jsub and uses it to create workflow jobs.
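For example, a migration might look like the sketch below. The flag names shown are illustrative, not verified syntax (check the command reference for add-jsub), and the first line substitutes a dry-run stub so the sketch can be exercised where the real client is not installed:

```shell
# Dry-run stand-in so the sketch runs where the real client is absent.
command -v swif2 >/dev/null 2>&1 || swif2() { echo "[dry-run] swif2 $*"; }

# Create a workflow, then populate it from an existing Auger submission file.
# Flag names are illustrative; consult the command reference for add-jsub.
swif2 create -workflow auger_migration
swif2 add-jsub -workflow auger_migration my_jobs.jsub
```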

Main Concepts

The system makes heavy use of a MySQL database to track the status of user jobs and associated data files. The main conceptual data objects of swif, most of which map directly to database table records, are:

Workflow

Job

Site Job

Job Attempt

Site

General Implementation Details

The swif daemon runs at Jefferson Lab and is a service maintained by the Scientific Computing group. All interaction with the system is done via the command-line tool /site/bin/swif2. All functionality is supported via sub-commands such as 'swif2 create', 'swif2 add-job', 'swif2 cancel', etc. These will be described later. This section describes in general terms how swif works. Note that while the system has been in use for some time supporting off-site production work for Hall D, it is still a work in progress.

When a user invokes the swif command all supplied information is bundled into an HTTP message that is sent over HTTPS to the swif service. The swif client requires that its user has a valid scientific computing certificate generated via the jcert tool. This certificate is read by the client and used to authenticate the JLab user with the server.

In most cases a swif user request will primarily involve database manipulations. The service monitors the database and reacts accordingly to changes that require action. Several independent threads handle the various tasks thus required, as outlined below, typically sleeping for several minutes before repeating. Given the asynchronous nature of the system, user actions will not typically yield instantaneous results.

Analysis Loop

Transfer Loop

Dispatch Loop

Reap Loop

SLURM Polling Loop

Condition Loop

General Usage Overview

Using swif is quite simple. After creating a workflow, jobs are added to it, and the workflow is started. Users check the status of their workflows and may resolve problem job attempts if necessary. Once a workflow has gone dormant for an extended period it will become archived. Following are somewhat more specific details about these steps.
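The whole cycle can be sketched as a short shell session. The workflow name and flag forms are illustrative (consult the command reference for exact syntax), and the first line substitutes a dry-run stub when the real client is not on the path:

```shell
# Dry-run stand-in so the sketch runs where the real client is absent.
command -v swif2 >/dev/null 2>&1 || swif2() { echo "[dry-run] swif2 $*"; }

swif2 create -workflow my_analysis            # new workflows start suspended
swif2 add-job -workflow my_analysis recon.sh  # add one or more jobs
swif2 run -workflow my_analysis               # allow attempts to be scheduled
swif2 status -workflow my_analysis            # check progress and problems
```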

Create a Workflow

In order to run jobs you need first to create a workflow to which they will belong. Before creating a workflow, the user must:

  1. Have or obtain a Jefferson Lab Scientific Computing certificate
  2. Provide a name for the workflow. Names must be unique on a per-user basis.
  3. Determine the default site where contained jobs will run
    • Supported sites are configured in the swif database. Adding new sites is not currently a user-accessible operation.
    • If no site is specified, the default site 'jlab/enp' is assumed. This designates the Jefferson Lab Experimental Nuclear Physics batch farm. Other potential sites are listed below.
    • For off-site locations the user must also specify site storage and site login configurations.  These are not needed for local workflows.
  4. Choose whether default parameters are acceptable.
    • The maximum number of job attempts that may be active concurrently (default is 500).
    • Whether to stop launching new job attempts once the number of unresolved problems reaches a given limit (default is no limit).
  5. Remote site: To run jobs at a supported remote site, the user must configure ssh so that the swif service can run an ssh process on their behalf without needing to enter a passphrase. If the site does not permit ssh authentication with an empty passphrase, one way to achieve this is to use ssh-agent. Consult the ssh-agent man page, cross-reference with the help materials for the remote site of interest, then log in to the machine 'swif-egress-21.jlab.org' to configure the ssh client accordingly.
  6. Remote site: Transferring files to and from remote sites requires configuring globus for command-line access. To do so, log in to 'swif-egress-21.jlab.org' and run the command 'globus login', which will walk you through the steps necessary to set up the command-line environment. Once this is done, swif will be able to interact with globus on your behalf by launching a shell process with your user id to invoke commands.
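The remote-site preparation in steps 5 and 6 amounts to a one-time setup session on swif-egress-21.jlab.org. The key path below is an example (use the key registered with the remote site), and the globus step is interactive, so it is shown as a comment:

```shell
# One-time remote-site setup, run on swif-egress-21.jlab.org.

# Step 5: register the key used for the remote site with ssh-agent so
# swif's ssh sessions need no interactive passphrase.
eval "$(ssh-agent -s)"
if [ -f "$HOME/.ssh/id_rsa" ]; then   # example key path; adapt as needed
    ssh-add "$HOME/.ssh/id_rsa"
fi

# Step 6: configure globus for command-line access (interactive; run once):
#   globus login
```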

Add a Job to a Workflow

Once a workflow exists to contain the jobs you wish to run, you can add them via the command 'swif2 add-job'. This command lets you specify each job's runtime parameters and any associated input and output files; consult the command reference for the full set of options.
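As a sketch, adding a single job with runtime parameters and file bindings might look like the following. The option names, tape (mss:) path, and script are illustrative rather than verified syntax, and the first line stubs the client for a dry run:

```shell
# Dry-run stand-in so the sketch runs where the real client is absent.
command -v swif2 >/dev/null 2>&1 || swif2() { echo "[dry-run] swif2 $*"; }

# Illustrative job: runtime parameters plus one input and one output file.
# Check the command reference for the authoritative add-job option names.
swif2 add-job -workflow my_analysis \
      -name run_00042 \
      -cores 4 -ram 2gb -disk 5gb \
      -input evio.dat mss:/mss/halld/run_00042/evio.dat \
      -output hists.root /work/halld/results/hists.root \
      /group/halld/bin/recon.sh evio.dat
```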

Start and Stop a Workflow

A newly created workflow begins in a suspended state; none of its jobs will be attempted until it is un-suspended by invoking 'swif2 run'. The corollary is 'swif2 pause', which will suspend the workflow, preventing new attempts from being scheduled or launched.
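In command form (the -workflow flag is illustrative; the first line stubs the client for a dry run):

```shell
# Dry-run stand-in so the sketch runs where the real client is absent.
command -v swif2 >/dev/null 2>&1 || swif2() { echo "[dry-run] swif2 $*"; }

swif2 run -workflow my_analysis    # un-suspend; attempts may now be made
swif2 pause -workflow my_analysis  # suspend; no new attempts are launched
```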

Monitor a Workflow

The command 'swif2 status' provides summary information about a workflow and optionally detailed information about job attempts and unresolved problems.

Life Cycle of a Job

A swif job can be in one of the following states:

Job State    Meaning
pending      The job has no active attempt.
attempting   A job attempt is in progress.
done         Most recent job attempt completed successfully. No more attempts will be made.
abandoned    Job has been abandoned. No more attempts will be made.

When a job is first added to a workflow it will remain in the pending state until the scheduler determines that it is time to make an attempt. Then a new job attempt record is created to track its progress, and the job transitions into the attempting state. The job will remain in this state until its attempt has (1) completely succeeded, at which point it will transition to done, (2) been canceled or modified via user interaction, which would cause a transition back to pending, or (3) designated by the user as not worth further attempts, which would transition it into the abandoned state. Once a job is attempting the associated job attempt record tracks the status of its attempt in more detail.

A job attempt can be in one of the following states:

Job Attempt State   Meaning
preparing           Inputs are being transferred to job location.
ready               Batch job can be created for attempt.
dispatched          A batch job has been created.
reaping             Batch job has completed. Outputs are being surveyed and retrieved.
done                Batch job succeeded, outputs retrieved.
problem             An error has occurred requiring manual intervention to resolve.

If a job attempt encounters a problem, it will record one of the following error codes, usually along with a more specific message.

Job Attempt Problem   Meaning
SITE_PREP_FAIL        Something went wrong while attempting to create the site job directory, link input files, etc.
SITE_LAUNCH_FAIL      Batch job could not be created.
SLURM_<>              Batch job terminated with slurm code <> (e.g. SLURM_FAILED or SLURM_OUT_OF_MEMORY).
SWIF_INPUT_FAIL       A required input could not be transferred from its source.
SWIF_MISSING_OUTPUT   Batch job failed to produce an output file that user specified.
SWIF_OUTPUT_FAIL      A generated output file could not be transferred to its destination.
SWIF_SYSTEM_ERROR     An internal error likely requiring operational intervention or possibly a software fix.
SWIF_EXPIRATION       Batch job status failed to update within five days.
USER_CANCEL           User has requested that batch job be canceled.

Note that a problem job attempt will remain in that state until the user intervenes, for example by retrying, modifying, or abandoning the affected job; consult the command reference for the relevant commands.

Job Attempt Mechanics - File Management

Every job attempt is associated with a site job specification, which is in turn associated with a workflow site configuration. This configuration includes information about file system locations at the compute site. Crucially, it specifies the root directory, designated $SWIF_DIR, under which all of swif's activities occur. Following is an explanation of its subdirectories.

Location                             Purpose
$SWIF_DIR/input                      All files specified as job inputs reside here. The files have numeric names that correspond to their swif file catalog numbers.
$SWIF_DIR/jobs/$USER                 Directory under which job attempt working directories are created.
$SWIF_DIR/jobs/$USER/$JOB/$ATTEMPT   Directory for a job attempt.

Note that swif has two views of $SWIF_DIR: first, the local path as seen by a user logged in to the compute site; second, the externally visible path used to access the directory via globus online. When files are moved to a compute site they are sent via globus to $SWIF_DIR/input. Once an input file has been transferred to a compute site, it will remain there until all dependent workflow jobs have either succeeded or been abandoned.

When a job attempt completes, its working directory is scanned for expected output files. All such files are then transferred to their respective destinations, via globus if necessary, and to tape if requested.

Once a job attempt has successfully completed or been abandoned, then its working directory and all of its contents will be deleted. Note that this will only happen once all specified outputs have been moved to their appropriate destinations.

Job Attempt Mechanics - Automation

The swif daemon executes commands on behalf of Jefferson Lab users, locally and remotely. In order to do this, the daemon process needs to spawn a sub-processes for each unique user/site combination that executes as that user. This sub-process then launches an ssh session to open a tunneled bash shell with which it can interact to issue commands. The swif daemon is administratively constrained to launching only 'ssh' and 'globus' commands as a Jefferson Lab user, from the machine swif-egress-21.jlab.org.