Swif2 is a software system developed at Jefferson Lab that helps users schedule, run, and track computational jobs on batch systems, both locally and at remote sites.
Users create named workflows to which they add units of work (jobs) with associated runtime parameters and optional input/output files. The swif service handles moving data files to and from compute sites, managing tape I/O and network transfers where needed. Users can monitor the overall status of a workflow as well as its associated jobs, and can interact with jobs to cancel, retry, and modify them to resolve the issues that inevitably arise in large workflows. Within a workflow, users can define inter-job dependencies and phased ordering for jobs.
Swif2 (hereafter simply referred to as swif) evolved out of the original swif1 system, which was designed as a layer above the Auger tool, also developed at Jefferson Lab. Swif2 differs from swif1 in several notable ways. First, it assumes that the underlying batch systems to which it dispatches jobs use the SLURM scheduling and execution system (SLURM is used at most large-scale scientific compute facilities, including Jefferson Lab, NERSC, and many other Department of Energy sites). Second, it has built-in support for running jobs at multiple sites. Importantly, while many of the commands are similar to those in swif1, they have changed in some subtle ways and the output is vastly different. Any tools written to parse the output of swif1 commands will need to be significantly modified to work with swif2.
See also: the command line reference, and the differences from swif1 and Auger.
The system makes heavy use of a MySQL database to track the status of user jobs and associated data files. The main conceptual data objects of swif, most of which map directly to database table records, are:
Workflow
Job
Site Job
Job Attempt
Site
The swif daemon runs at Jefferson Lab and is a service maintained by the Scientific Computing group. All interaction with the system is done via the command-line tool /site/bin/swif2. All functionality is supported via sub-commands such as 'swif2 create', 'swif2 add-job', 'swif2 cancel', etc. These will be described later. This section describes in general terms how swif works. Note that while the system has been in use for some time supporting off-site production work for Hall D, it is still a work in progress.
When a user invokes the swif command, all supplied information is bundled into an HTTP message sent over HTTPS to the swif service. The swif client requires a valid scientific computing certificate generated via the jcert tool; the client reads this certificate and uses it to authenticate the JLab user with the server.
In most cases a swif user request primarily involves database manipulations. The service monitors the database and reacts to changes that require action. Several independent threads handle the resulting tasks, typically sleeping for several minutes before repeating. Given the asynchronous nature of the system, user actions will not typically yield instantaneous results.
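The polling behavior described above can be sketched as a simple worker loop. This is an illustrative model only; the task names, sleep interval, and loop structure are assumptions, not the actual daemon implementation.

```python
import threading

def polling_worker(task, interval_seconds, stop_event):
    """Illustrative worker loop: perform a task, then sleep before repeating.

    The real swif daemon runs several independent threads of this general
    shape (input staging, dispatch, output retrieval, etc.), each typically
    sleeping for several minutes between passes.
    """
    while not stop_event.is_set():
        task()  # e.g. scan the database for records that need action
        stop_event.wait(interval_seconds)  # sleep, but wake promptly on shutdown
```

Using `Event.wait()` for the sleep (rather than `time.sleep()`) lets the loop exit promptly when a shutdown is requested.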
Using swif is quite simple. After creating a workflow, jobs are added to it, and the workflow is started. Users check the status of their workflows and may resolve problem job attempts if necessary. Once a workflow has gone dormant for an extended period it will become archived. Following are somewhat more specific details about these steps.
In order to run jobs, you first need to create a workflow to which they will belong. Before creating a workflow, the user must:
Once a workflow exists to contain the jobs you wish to run, you can add them via the command 'swif2 add-job'. This command allows for
A newly created workflow begins in a suspended state; none of its jobs will be attempted until it is un-suspended by invoking 'swif2 run'. Its complement is 'swif2 pause', which suspends the workflow, preventing new attempts from being scheduled or launched.
The command 'swif2 status' provides summary information about a workflow and, optionally, detailed information about job attempts and unresolved problems.
A swif job can be in one of the following states:
Job State | Meaning |
---|---|
pending | The job has no active attempt. |
attempting | A job attempt is in progress. |
done | Most recent job attempt completed successfully. No more attempts will be made. |
abandoned | Job has been abandoned. No more attempts will be made. |
When a job is first added to a workflow it will remain in the pending state until the scheduler determines that it is time to make an attempt. Then a new job attempt record is created to track its progress, and the job transitions into the attempting state. The job will remain in this state until its attempt has (1) completely succeeded, at which point the job transitions to done, (2) been canceled or modified via user interaction, which returns the job to pending, or (3) been designated by the user as not worth further attempts, which moves the job to the abandoned state. Once a job is attempting, the associated job attempt record tracks the status of its attempt in more detail.
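The job lifecycle described above can be modeled as a small state machine. The sketch below encodes only the transitions documented here; the event names and the helper function are illustrative, not swif internals.

```python
# Illustrative model of the documented job states and transitions.
JOB_TRANSITIONS = {
    "pending":    {"schedule": "attempting"},
    "attempting": {"succeed": "done",
                   "cancel_or_modify": "pending",   # returns to pending for a new attempt
                   "abandon": "abandoned"},
    "done":       {},   # terminal: no more attempts will be made
    "abandoned":  {},   # terminal: no more attempts will be made
}

def next_job_state(state, event):
    """Return the job's new state, or raise on an undocumented transition."""
    try:
        return JOB_TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"no transition '{event}' from state '{state}'") from None
```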
A job attempt can be in one of the following states:
Job Attempt State | Meaning |
---|---|
preparing | Inputs are being transferred to job location. |
ready | Batch job can be created for attempt. |
dispatched | A batch job has been created. |
reaping | Batch job has completed. Outputs are being surveyed and retrieved. |
done | Batch job succeeded, outputs retrieved. |
problem | An error has occurred requiring manual intervention to resolve. |
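A normal attempt walks through these states in order, diverting to problem when an error occurs. A sketch of that progression (the function and its signature are hypothetical; only the ordering comes from the table above):

```python
# Documented happy-path ordering of job attempt states.
ATTEMPT_ORDER = ["preparing", "ready", "dispatched", "reaping", "done"]

def advance_attempt(state, error=None):
    """Move an attempt to its next state, or to 'problem' if an error occurred."""
    if error is not None:
        return "problem"   # requires manual intervention to resolve
    i = ATTEMPT_ORDER.index(state)
    if i == len(ATTEMPT_ORDER) - 1:
        return state       # 'done' is terminal
    return ATTEMPT_ORDER[i + 1]
```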
If a job attempt encounters a problem, it will record one of the following error codes, usually along with a more specific message.
Job Attempt Problem | Meaning |
---|---|
SITE_PREP_FAIL | Something went wrong while attempting to create the site job directory, link input files, etc. |
SITE_LAUNCH_FAIL | Batch job could not be created. |
SLURM_<> | Batch job terminated with slurm code <> (e.g. SLURM_FAILED or SLURM_OUT_OF_MEMORY). |
SWIF_INPUT_FAIL | A required input could not be transferred from its source. |
SWIF_MISSING_OUTPUT | Batch job failed to produce an output file that the user specified. |
SWIF_OUTPUT_FAIL | A generated output file could not be transferred to its destination. |
SWIF_SYSTEM_ERROR | An internal error likely requiring operational intervention or possibly a software fix. |
SWIF_EXPIRATION | Batch job status failed to update within five days. |
USER_CANCEL | User has requested that batch job be canceled. |
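The SLURM_<> family is parameterized by the terminal Slurm state of the batch job. A sketch of how such codes relate to the fixed codes in the table (the mapping function is an assumption for illustration):

```python
# Fixed problem codes from the table above.
SWIF_PROBLEMS = {
    "SITE_PREP_FAIL", "SITE_LAUNCH_FAIL", "SWIF_INPUT_FAIL",
    "SWIF_MISSING_OUTPUT", "SWIF_OUTPUT_FAIL", "SWIF_SYSTEM_ERROR",
    "SWIF_EXPIRATION", "USER_CANCEL",
}

def slurm_problem_code(slurm_state):
    """Form a swif problem code from a terminal Slurm job state.

    e.g. FAILED -> SLURM_FAILED, OUT_OF_MEMORY -> SLURM_OUT_OF_MEMORY.
    """
    return f"SLURM_{slurm_state}"
```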
Note that a problem job attempt will remain in that state until the user intervenes. The possible actions for such a situation are as follows:
Automation: The swif daemon executes commands on behalf of Jefferson Lab users, locally and remotely. In order to do this, the daemon process spawns a sub-process for each unique user/site combination, executing as that user. This sub-process then launches an ssh session to open a tunneled bash shell with which it can interact to issue commands. The swif daemon is administratively constrained to launching only 'ssh' and 'globus' commands as a Jefferson Lab user, from the machine swif-egress-21.jlab.org.
File Management: Every job attempt is associated with a site job specification, which is in turn associated with a workflow site configuration. This configuration includes information about file system locations at the compute site. Crucially, it specifies the root directory, designated $SWIF_DIR, under which all of swif's activities occur. Following is an explanation of its subdirectories.
Location | Purpose |
---|---|
$SWIF_DIR/input | All files specified as job inputs reside here. The files have numeric names that correspond to their swif file catalog numbers. |
$SWIF_DIR/jobs/$USER | Directory under which job attempt working directories are created. |
$SWIF_DIR/jobs/$USER/$SWIF_JOB_NAME/$SWIF_JOB_ATTEMPT_ID | Staging / working directory for a job attempt. |
NB: At Jefferson Lab, SWIF_DIR=/lustre/enp/swif2.
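The layout above amounts to simple path construction. A sketch using the Jefferson Lab value of $SWIF_DIR (the helper functions themselves are hypothetical, not part of swif):

```python
import posixpath

SWIF_DIR = "/lustre/enp/swif2"   # Jefferson Lab value, per the note above

def input_file_path(catalog_number):
    """Inputs live under $SWIF_DIR/input, named by swif file catalog number."""
    return posixpath.join(SWIF_DIR, "input", str(catalog_number))

def attempt_work_dir(user, job_name, attempt_id):
    """Staging/working directory for a single job attempt."""
    return posixpath.join(SWIF_DIR, "jobs", user, job_name, str(attempt_id))
```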
Note that swif has two views of $SWIF_DIR: the local path as seen by a user logged in to the compute site, and the externally visible path used to access the directory via globus online. When files are moved to a compute site they are sent via globus to $SWIF_DIR/input. Once an input file has been transferred to a compute site, it will remain there until all dependent workflow jobs have either succeeded or been abandoned.
When a job attempt completes, its staging directory is scanned for expected output files. All such files are then transferred to their respective destinations, via globus if necessary, and to tape if requested.
Once a job attempt has successfully completed or been abandoned, then its staging directory and all of its contents will be deleted. Note that this will only happen once all specified outputs have been moved to their appropriate destinations.
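The cleanup rule reduces to a simple predicate: a staging directory may be deleted only when the attempt has finished (successfully or by abandonment) and every specified output has reached its destination. A sketch (names are illustrative):

```python
def staging_dir_removable(attempt_state, outputs_delivered):
    """True when a staging directory may be deleted, per the rule above.

    attempt_state:     the attempt's final state; only a finished attempt
                       ('done' or abandoned) qualifies.
    outputs_delivered: whether all specified outputs have been moved to
                       their appropriate destinations.
    """
    return attempt_state in ("done", "abandoned") and outputs_delivered
```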
Execution: When a job attempt is launched via slurm, its working directory will be $SWIF_DIR/jobs/$USER/$SWIF_JOB_NAME/$SWIF_JOB_ATTEMPT_ID. At Jefferson Lab, however, unless the user has explicitly requested otherwise, swif will create a wrapper script that