Workflow - Swif2

Swif2 is a software system developed at Jefferson Lab that aims to help users schedule, run, and track computation jobs on batch systems both locally and at remote sites.

Users create named workflows to which they add units of work (jobs) with associated runtime parameters and optional input / output files. The swif service handles moving data files to and from compute sites, dealing with tape I/O and network transfers where needed. Users can monitor the overall status of a workflow as well as its associated jobs, and can interact with jobs to cancel, re-try, and modify them to resolve the issues that inevitably arise in large workflows. Within a workflow, users can define inter-job dependencies and phased ordering for jobs.

Swif2 (hereafter simply referred to as swif) evolved out of the original swif1 system, which was designed as a layer above the Auger tool, also developed at Jefferson Lab. Swif2 differs from swif1 in several notable ways. First, it assumes that the underlying batch systems to which it dispatches jobs use SLURM for scheduling and execution (SLURM is used at most large-scale scientific compute facilities, including Jefferson Lab, NERSC, and many other Department of Energy sites). Second, it has built-in support for running jobs at multiple sites. Importantly, while many of the commands are similar to those in swif1, they have changed in some subtle ways and the output is vastly different. Any tools written to parse the output of swif1 commands will need to be significantly modified to work with swif2.

Read the command line reference.

Differences from swif1 and Auger

  1. Most swif2 commands are essentially the same as their swif1 counterparts. The primary difference is in the information output. For example, swif status adds several fields (such as finer-grained details about the status of dispatched attempts) and removes a few (such as individual error type counts). Consult the command reference for output details.
  2. The notions of "track" and "project", which swif1 inherited from Auger, have been replaced with "partition" and "account", as used by SLURM. This is reflected in the names of the flags used with swif add-job.
  3. Swif2 does not pass through any environment variables from the user's submission shell. Instead, the job environment is initialized in a manner similar to the way it is initialized when you log in to an interactive farm node. This differs from the default behavior of SLURM and, consequently, the current behavior of Auger (note that swif1 has always launched user code with a limited environment). The user's job script must therefore construct an appropriate runtime environment, if necessary, by use of facilities such as the 'module' command or sourcing of project-specific setup scripts. A minimal sketch of such a script follows this list.
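
The following is a minimal, illustrative job script for this situation, assuming a bash job script and a site that provides the 'module' command; the module name, setup script path, and program name are placeholders, not real values:

  #!/bin/bash
  # Swif2 launches the job with a login-like environment, so project-specific
  # setup must happen inside the script itself.
  source /etc/profile.d/modules.sh 2>/dev/null || true   # ensure 'module' is defined (typical path, not guaranteed)
  module load gcc                                         # placeholder: load whatever toolchain the job needs
  source /path/to/project_setup.sh                        # placeholder: project-specific environment setup
  exec /path/to/my_analysis_program "$@"                  # run the actual payload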

 

Main Concepts


The system makes heavy use of a MySQL database to track the status of user jobs and associated data files. The main conceptual data objects of swif, most of which map directly to database table records, are:

Workflow

  • This is a named container that groups user jobs. It also has some associated parameters that affect how jobs are handled.

Job

  • Specifies a unit of work that will be executed by slurm on a compute cluster, including required input files and expected output files.

Site Job

  • Specifies how to run a job at a specific compute site. One job can have several site job specifications, since the details of running a job (project codes, user names, filesystem paths, etc.) usually vary from one compute site to another.

Job Attempt

  • When a swif job is dispatched to a compute system, a new job attempt record is created that stores information about the actual running batch job and is used to monitor its status, from transfer of input files through execution to retrieval of output files. One job may have multiple attempts if issues prevent it from running correctly on the first go.

Site

  • Designates a compute site supported by swif. Non-local sites require:
    • login host
    • login user
    • internal file system path
    • globus endpoint

General Implementation Details


The swif daemon runs at Jefferson Lab and is a service maintained by the Scientific Computing group. All interaction with the system is done via the command-line tool /site/bin/swif2. All functionality is supported via sub-commands such as 'swif2 create', 'swif2 add-job', 'swif2 cancel', etc. These will be described later. This section describes in general terms how swif works. Note that while the system has been in use for some time supporting off-site production work for Hall D, it is still a work in progress.

When a user invokes the swif command all supplied information is bundled into an HTTP message that is sent over HTTPS to the swif service. The swif client requires that its user has a valid scientific computing certificate generated via the jcert tool. This certificate is read by the client and used to authenticate the JLab user with the server.

In most cases a swif user request will primarily involve database manipulations. The service monitors the database and reacts accordingly to changes that require action. Several independent threads handle the various tasks thus required, as outlined below, typically sleeping for several minutes before repeating. Given the asynchronous nature of the system, user actions will not typically yield instantaneous results.

Analysis Loop

  • Runs database stored procedures to analyze workflows and schedule work.
  • During the analysis phase information about various tasks is assimilated and reconciled with expectation. Batch jobs that have terminated are matched with user job attempts, which are then updated to reflect their status. Job attempts with completed transfers are transitioned to appropriate states. Summary information about workflows is generated.
  • During the scheduling phase, all of this assimilated information is used to decide which workflows are eligible to have new batch job attempts, the selection of which is heavily influenced by the 'data cost' of preparing them for launch. The scheduler favors attempting jobs that require less data transport. This means that jobs will typically be attempted together if they depend on files from the same tape, or if they depend upon files that previously attempted jobs also used and that are therefore already staged at the job site.

Transfer Loop

  • Initiates and monitors transfer of files to and from job sites.
  • When a new attempt is made to run a job that requires input files, any files that are not already present at the job site will have been placed by the scheduler into the job attempt input queue. The transfer loop works to move the files from source to destination, accessing the JLab tape system if necessary, and launching globus to transfer files for off-site work.
  • When a job attempt has completed, any output files specified by the user will be moved to its indicated output location, typically onto a JLab file system or into the tape library.

Dispatch Loop

  • Prepares and launches jobs on batch systems.
  • When all of the inputs required by a job attempt are available on the system where it will run, a new attempt-specific directory is created, the input files linked into it, and a slurm job created to launch in that directory. The slurm job id is gathered and stored in the swif database mapped to the corresponding job attempt record.

Reap Loop

  • Surveys completed jobs for output files, manages remote storage, and cleans up job directories.
  • When a job attempt completes successfully it is inserted into the reap queue, where it will be found and extracted by the reap loop. The process of reaping a job entails looking for expected output files and inserting them into the job attempt output queue so they can be handled by the transfer loop.
  • Once a job attempt has fully completed and all output files retrieved, the site job directory will be removed. Any associated input files will be removed from the site storage location as well, unless there are other non-terminated jobs that also depend upon them.

SLURM Polling Loop

  • Polls slurm for status of all previously scheduled and non-terminated jobs.
  • Updates database table with current status to inform analysis loop.
  • Cancels jobs if required by user action (e.g. modify-jobs or abandon-jobs).

Condition Loop

  • Evaluates job preconditions to inform analysis loop.

 

General Usage Overview


Using swif is quite simple. After creating a workflow, jobs are added to it, and the workflow is started. Users check the status of their workflows and may resolve problem job attempts if necessary. Once a workflow has gone dormant for an extended period it will become archived. Following are somewhat more specific details about these steps.

Create a Workflow

In order to run jobs you first need to create a workflow to which they will belong. Before creating a workflow, the user must:

  1. Have or obtain a Jefferson Lab Scientific Computing certificate
  2. Provide a name for the workflow. Names must be unique on a per-user basis.
  3. Determine the default site where contained jobs will run
    • Supported sites are configured in the swif database. Adding new sites is not currently a user-accessible operation.
    • If no site is specified, the default site 'jlab/enp' is assumed. This designates the Jefferson Lab Experimental Nuclear Physics batch farm. Other potential sites are listed below.
    • For off-site locations the user must also specify site storage and site login configurations.  These are not needed for local workflows.
  4. Choose whether default parameters are acceptable.
    • The maximum number of job attempts that may be active concurrently (default is 500).
    • Whether to stop launching new job attempts once the number of unresolved problems reaches a given limit (default is no limit).
  5. Remote site: To run jobs at a supported remote site, the user must configure ssh in a manner that allows the swif service to run an ssh process on their behalf without the need to enter a passphrase. One way to achieve this, if the site does not permit ssh authentication with an empty passphrase, is to use ssh-agent. Consult the ssh-agent man page, cross-reference it with help materials for the remote site of interest, then log in to the machine named 'swif-egress-21.jlab.org' to configure the ssh client accordingly.
  6. Remote site: Transferring files to and from remote sites requires configuring globus for command-line access. To do so, log in to 'swif-egress-21.jlab.org' and run the command 'globus login'. This will walk you through the steps necessary to set up the command-line environment. Once this is done, swif will be able to interact with globus on your behalf by launching a shell process with your user id to invoke commands.
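
Once these prerequisites are in place, creating a workflow is a single command. The invocations below are a sketch: the 'jlab/enp' default site is described above, but the remote site name is a placeholder and the -site and -max-concurrent option spellings should be verified against the command line reference:

  # Create a workflow that will run at the default site (jlab/enp)
  swif2 create -workflow my_recon_2024

  # Create a workflow targeting a remote site with a lower concurrency cap
  # ('nersc/perlmutter' is a hypothetical site name)
  swif2 create -workflow my_remote_recon -site nersc/perlmutter -max-concurrent 200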

Add a Job to a Workflow

Once a workflow exists to contain the jobs you wish to run, you can add them via the command 'swif2 add-job'. This command allows for the following (an illustrative invocation is shown after this list):

  • Naming the job,
  • Providing general parameters such as RAM limit, expected wall time, etc.,
  • Specifying the command to run,
  • Providing specific flags to pass through to the slurm sbatch command,
  • Assigning arbitrary tags to the job,
  • Assigning the job to a specific workflow 'phase',
  • Specifying antecedent jobs or other conditions that must be met before the job can be attempted.
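
For example, a job might be added as sketched below. This is illustrative only: the -partition and -account flags follow the naming noted earlier, but the remaining option spellings, the memory and time formats, and all paths and names are assumptions to check against the 'swif2 add-job' help:

  swif2 add-job -workflow my_recon_2024 -name recon_run_042 \
      -partition production -account myproject \
      -cores 4 -ram 8GB -time 6hours -phase 1 \
      -input run042.evio mss:/mss/myexpt/rawdata/run042.evio \
      -output run042.root /volatile/myexpt/recon/run042.root \
      /path/to/my_job_script.sh run042

Everything after the options is the command (and its arguments) that slurm will execute.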

Start and Stop a Workflow

A newly created workflow begins in a suspended state; none of its jobs will be attempted until it is un-suspended by invoking 'swif2 run'. Its counterpart is 'swif2 pause', which will suspend the workflow, preventing new attempts from being scheduled or launched.

  • When running a workflow you may specify a maximum number of jobs to dispatch, and/or a maximum job 'phase' from which to launch jobs.
  • When pausing a workflow you may optionally request that all active job attempts be canceled immediately.
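
For example (the -workflow option spelling is an assumption; see the command line reference for the dispatch-count and phase limits mentioned above):

  swif2 run -workflow my_recon_2024     # un-suspend the workflow and begin attempting jobs
  swif2 pause -workflow my_recon_2024   # suspend the workflow; no new attempts will be scheduled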

Monitor a Workflow

The command 'swif2 status' provides summary information about a workflow and, optionally, detailed information about job attempts and unresolved problems.
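
For example (the option names for requesting extra detail are assumptions; consult the command line reference):

  swif2 status -workflow my_recon_2024             # workflow summary
  swif2 status -workflow my_recon_2024 -jobs       # per-job detail (flag name assumed)
  swif2 status -workflow my_recon_2024 -problems   # unresolved problem attempts (flag name assumed)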

 

Life Cycle of a Job


A swif job can be in one of the following states:

Job State    Meaning
pending      The job has no active attempt.
attempting   A job attempt is in progress.
done         Most recent job attempt completed successfully. No more attempts will be made.
abandoned    Job has been abandoned. No more attempts will be made.

When a job is first added to a workflow it will remain in the pending state until the scheduler determines that it is time to make an attempt. Then a new job attempt record is created to track its progress, and the job transitions into the attempting state. The job will remain in this state until its attempt has (1) completely succeeded, at which point it will transition to done, (2) been canceled or modified via user interaction, which would cause a transition back to pending, or (3) been designated by the user as not worth further attempts, which would transition it into the abandoned state. Once a job is attempting, the associated job attempt record tracks the status of its attempt in more detail.

A job attempt can be in one of the following states:

Job Attempt State   Meaning
preparing           Inputs are being transferred to job location.
ready               Batch job can be created for attempt.
dispatched          A batch job has been created.
reaping             Batch job has completed. Outputs are being surveyed and retrieved.
done                Batch job succeeded, outputs retrieved.
problem             An error has occurred requiring manual intervention to resolve.

If a job attempt encounters a problem, it will record one of the following error codes, usually along with some more specific message.

Job Attempt Problem    Meaning
SITE_PREP_FAIL         Something went wrong while attempting to create the site job directory, link input files, etc.
SITE_LAUNCH_FAIL       Batch job could not be created.
SLURM_<>               Batch job terminated with slurm code <> (e.g. SLURM_FAILED or SLURM_OUT_OF_MEMORY).
SWIF_INPUT_FAIL        A required input could not be transferred from its source.
SWIF_MISSING_OUTPUT    Batch job failed to produce an output file that the user specified.
SWIF_OUTPUT_FAIL       A generated output file could not be transferred to its destination.
SWIF_SYSTEM_ERROR      An internal error likely requiring operational intervention or possibly a software fix.
SWIF_EXPIRATION        Batch job status failed to update within five days.
USER_CANCEL            User has requested that the batch job be canceled.

Note that a problem job attempt will remain in that state until the user intervenes. The possible actions for such a situation are as follows:

  • Retry the job. It will be transitioned back to pending. This course would be most appropriate for transient errors or easily fixed system issues such as incorrect directory permissions, expired login credentials, etc.
  • Modify the job. It will transition back to pending with some options having been modified for the next attempt. One would choose this course if the attempt failed owing to an improperly specified batch parameter, such as resource limits, expected wall time, etc.
  • Abandon the job. It will transition into the abandoned state. This is the appropriate course when the user determines that there is no point in further attempts.
  • 'Bless' a job attempt. This is done to inform swif that a job's problem attempt should be treated as successful, even if slurm reported an error. Use this approach if a job did useful work before, for example, running out of wall time or exiting with a non-zero return code. After blessing, the job attempt will transition into the reaping state, in the same way it would have if slurm had not reported an error.
  • Resume an attempt. This applies only to attempts that have input or output failures. This will transition the attempt into either preparing or reaping.
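
Commands corresponding to several of these actions are sketched below. The modify-jobs and abandon-jobs sub-commands are mentioned above (in the SLURM polling loop); the retry-jobs name, all option spellings, and the modify syntax are assumptions to verify against the command line reference:

  swif2 retry-jobs -workflow my_recon_2024 -problems SLURM_OUT_OF_MEMORY                  # retry all attempts with this problem
  swif2 modify-jobs -workflow my_recon_2024 -ram add 4GB -problems SLURM_OUT_OF_MEMORY    # raise the RAM limit and retry (syntax assumed)
  swif2 abandon-jobs -workflow my_recon_2024 -names recon_run_042                         # stop attempting a specific job (flag assumed)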

 

Job Attempt Mechanics


Automation: The swif daemon executes commands on behalf of Jefferson Lab users, locally and remotely. In order to do this, the daemon process needs to spawn a sub-process for each unique user/site combination, which executes as that user. This sub-process then launches an ssh session to open a tunneled bash shell with which it can interact to issue commands. The swif daemon is administratively constrained to launching only 'ssh' and 'globus' commands as a Jefferson Lab user, from the machine swif-egress-21.jlab.org.

File Management: Every job attempt is associated with a site job specification, which is in turn associated with a workflow site configuration. This configuration includes information about file system locations at the compute site. Crucially, it specifies the root directory, designated $SWIF_DIR, under which all of swif's activities occur. Following is an explanation of its subdirectories.

Location                                                     Purpose
$SWIF_DIR/input                                              All files specified as job inputs reside here. The files have numeric names that correspond to their swif file catalog numbers.
$SWIF_DIR/jobs/$USER                                         Directory under which job attempt working directories are created.
$SWIF_DIR/jobs/$USER/$SWIF_JOB_NAME/$SWIF_JOB_ATTEMPT_ID     Staging / working directory for a job attempt.

NB: At Jefferson Lab, SWIF_DIR=/lustre/enp/swif2.
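
For example, with the JLab value above, the working directory for attempt 12345 of a job named recon_run_042 owned by user jdoe (all hypothetical names) would be:

  /lustre/enp/swif2/jobs/jdoe/recon_run_042/12345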

Note that swif has two views of $SWIF_DIR. The first is the local path as seen by a user logged in to the compute site; the second is the externally visible path used to access the directory via globus online. When files are moved to a compute site they are sent via globus to $SWIF_DIR/input. Once an input file has been transferred to a compute site, it will remain there until all dependent workflow jobs have either succeeded or been abandoned.

When a job attempt completes, its staging directory is scanned for expected output files. All such files are then transferred to their respective destinations, via globus if necessary, and to tape if requested.

Once a job attempt has successfully completed or been abandoned, then its staging directory and all of its contents will be deleted. Note that this will only happen once all specified outputs have been moved to their appropriate destinations.

Execution: When a job attempt is launched via slurm, its working directory will be $SWIF_DIR/jobs/$USER/$SWIF_JOB_NAME/$SWIF_JOB_ATTEMPT_ID. At Jefferson Lab, however, unless the user has explicitly requested otherwise, swif will create a wrapper script (sketched conceptually after this list) that

  1. copies input files with non-absolute paths into a directory on the batch job node's scratch disk,
  2. executes user code in this directory, and
  3. copies output files back to the staging directory.
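
A conceptual sketch of such a wrapper is shown below. This is an illustration of the three steps above, not the actual script swif generates; the scratch location, the environment variable names, and the file names are placeholders:

  #!/bin/bash
  # Illustrative wrapper: stage in, run, stage out.
  STAGE_DIR="$SWIF_DIR/jobs/$USER/$SWIF_JOB_NAME/$SWIF_JOB_ATTEMPT_ID"   # staging directory for this attempt
  SCRATCH="/scratch/$USER/$SWIF_JOB_ATTEMPT_ID"                          # placeholder path on the node's scratch disk
  mkdir -p "$SCRATCH" && cd "$SCRATCH"
  cp "$STAGE_DIR"/* .                  # 1. copy the (non-absolute-path) inputs to local scratch
  /path/to/my_job_script.sh run042     # 2. execute the user's command in this directory
  cp run042.root "$STAGE_DIR"/         # 3. copy the expected output(s) back to the staging directory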