swif2 analyze

Analyze the performance characteristics of a workflow.

Usage

swif2 analyze <workflow> [options]

Produces a statistical analysis of a workflow's execution, covering timing breakdowns, resource usage, transfer performance, error patterns, and retry behavior. Use it to understand why a workflow took as long as it did and whether its resource requests matched actual usage.

Arguments

Flag      Value            Comment
-display  xml|json|simple  Specify format for output.
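
For scripted use, the invocation can be wrapped in a small helper. The following is a minimal sketch, not part of swif2 itself: it assumes swif2 is on the PATH, that options follow the workflow name as in the usage line above, and that -display json writes a single JSON document to standard output. The workflow name my_workflow and both function names are hypothetical, and no particular JSON layout is assumed.

```python
import json
import subprocess

# Formats listed for the -display flag in the table above.
DISPLAY_FORMATS = ("xml", "json", "simple")


def build_analyze_command(workflow: str, display: str = "json") -> list[str]:
    """Build the argument list for `swif2 analyze <workflow> -display <format>`."""
    if display not in DISPLAY_FORMATS:
        raise ValueError(f"unsupported display format: {display!r}")
    return ["swif2", "analyze", workflow, "-display", display]


def run_analyze(workflow: str):
    """Run the command and decode its output.

    Assumption: stdout holds exactly one JSON document. The structure of
    that document is not documented here, so it is returned as-is.
    """
    result = subprocess.run(
        build_analyze_command(workflow, "json"),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # "my_workflow" is a placeholder name used only for illustration.
    print(build_analyze_command("my_workflow"))
```

Keeping the command construction separate from the subprocess call makes the argument list easy to inspect or log before anything is executed.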

Reports

Overview

Top-level summary metrics for the workflow.

Key               Type       Value
jobs              integer    Total number of jobs.
attempts          integer    Total number of job attempts.
attempts_per_job  decimal    Average number of attempts per job.
succeeded         integer    Number of attempts that completed successfully.
failed            integer    Number of attempts that ended with a problem.
abandoned         integer    Number of jobs that were abandoned.
first_activity    timestamp  Earliest recorded activity.
last_activity     timestamp  Most recent recorded activity.

Stage Timing

Average, maximum, and standard deviation of time spent in each job attempt lifecycle stage.

Key         Type     Value
stage       string   Lifecycle stage: preparing, ready, queued, running, or reaping.
attempts    integer  Number of attempts observed in this stage.
avg_time    string   Average time in this stage (e.g. 1d 10h, 3h 25m, 0h 09m).
max_time    string   Maximum time in this stage.
stdev_time  string   Standard deviation.

Stages: preparing means input files are being staged and transferred; ready means the attempt is waiting to be submitted to Slurm; queued means the Slurm job has been submitted and is waiting to start; running means the job is executing on a compute node; reaping means outputs are being collected and archived.
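
The time columns are reported as strings, so comparing or sorting stages numerically requires converting them first. The following is a minimal sketch that assumes every duration is a whitespace-separated series of number-plus-unit tokens, as in the examples 1d 10h, 3h 25m, and 0h 09m. The function name parse_duration is hypothetical, and the s (seconds) unit is an assumption, since it does not appear in the examples.

```python
import re
from datetime import timedelta

# Seconds per unit. "d", "h", and "m" appear in the documented examples;
# "s" is an assumption added for completeness.
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([dhms])")


def parse_duration(text: str) -> timedelta:
    """Convert a string such as '1d 10h' or '0h 09m' to a timedelta."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty duration string")
    total_seconds = 0
    for token in tokens:
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized duration token: {token!r}")
        amount, unit = match.groups()
        total_seconds += int(amount) * _UNIT_SECONDS[unit]
    return timedelta(seconds=total_seconds)


if __name__ == "__main__":
    for sample in ("1d 10h", "3h 25m", "0h 09m"):
        print(sample, "->", parse_duration(sample).total_seconds(), "seconds")
```

Because the displayed values carry only two units each, the parsed result is no more precise than the string it came from.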

Slurm Execution

Aggregate compute resource usage for successfully completed attempts.

Key             Type     Value
jobs            integer  Number of completed attempts with Slurm data.
avg_wall        string   Average wall clock time (e.g. 1d 10h, 3h 25m, 0h 09m).
max_wall        string   Maximum wall clock time.
stdev_wall      string   Standard deviation of wall clock time.
avg_cpu         string   Average total CPU time.
cpu_efficiency  decimal  Fraction of allocated CPU actually used.
avg_maxrss_gb   decimal  Average peak memory usage in GB.
ram_efficiency  decimal  Fraction of allocated RAM actually used.

cpu_efficiency is total CPU time divided by the product of elapsed time and the number of allocated cores. A value near 1.0 indicates full utilization; a low value suggests the job is I/O-bound or is not using all allocated cores. ram_efficiency is the ratio of peak RSS to the requested RAM allocation.
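
The two ratios can be reproduced for a single attempt as follows. This is a sketch of the arithmetic only: the function names are hypothetical, and how swif2 aggregates the values across attempts (for example, a mean of per-attempt ratios versus a ratio of totals) is not specified here.

```python
def cpu_efficiency(cpu_seconds: float, elapsed_seconds: float, cores: int) -> float:
    """Total CPU time divided by (elapsed time times number of cores)."""
    if elapsed_seconds <= 0 or cores <= 0:
        raise ValueError("elapsed_seconds and cores must be positive")
    return cpu_seconds / (elapsed_seconds * cores)


def ram_efficiency(maxrss_gb: float, requested_gb: float) -> float:
    """Peak RSS divided by the requested RAM allocation (both in GB)."""
    if requested_gb <= 0:
        raise ValueError("requested_gb must be positive")
    return maxrss_gb / requested_gb


if __name__ == "__main__":
    # Illustrative numbers, not real measurements: a 4-core attempt that
    # ran for one hour of wall time and consumed two hours of CPU time.
    print(cpu_efficiency(cpu_seconds=7200, elapsed_seconds=3600, cores=4))  # 0.5
    # The same attempt peaked at 1.5 GB against a 6 GB request.
    print(ram_efficiency(maxrss_gb=1.5, requested_gb=6.0))  # 0.25
```

In this illustration the attempt kept the equivalent of two of its four cores busy and used a quarter of its memory request.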

Resource Requests

Distribution of resource configurations requested across all attempts.

Key       Type     Value
cores     integer  Number of CPU cores requested.
hours     decimal  Wall time requested in hours.
ram_gb    decimal  RAM requested in GB.
disk_gb   decimal  Disk space requested in GB.
attempts  integer  Number of attempts with this configuration.

Input Timing

Transfer performance for input files of completed attempts.

Key       Type     Value
files     integer  Number of input files transferred.
size      string   Total data volume with adaptive units (e.g. 45M, 2G, 1T).
avg_xfer  string   Average transfer time per file (e.g. 3h 25m, 0h 09m).
max_xfer  string   Maximum transfer time for a single file.

Transfer time is measured from when the file was located or cached to when it was delivered (linked or pushed) to the compute site.
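
The size column uses adaptive units, so it has to be converted to bytes before it can be combined with the file count. The sketch below makes several assumptions that should be checked against real output: that a size is a number followed by a single-letter suffix, that suffixes are binary multiples (1024) unless base=1000 is passed, and that B, K, and P may occur in addition to the documented M, G, and T. The names parse_size and mean_file_size are hypothetical.

```python
import re

# Power of the base for each suffix. Only M, G, and T appear in the
# documented examples; B, K, and P are assumptions.
_SUFFIX_POWER = {"B": 0, "K": 1, "M": 2, "G": 3, "T": 4, "P": 5}
_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*([BKMGTP])")


def parse_size(text: str, base: int = 1024) -> int:
    """Convert a string such as '45M' or '2G' to a byte count.

    Assumption: binary multiples by default; pass base=1000 if the
    report turns out to use decimal units.
    """
    match = _SIZE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, suffix = match.groups()
    return int(float(number) * base ** _SUFFIX_POWER[suffix])


def mean_file_size(size: str, files: int, base: int = 1024) -> float:
    """Average bytes per file, from the 'size' and 'files' columns."""
    if files <= 0:
        raise ValueError("files must be positive")
    return parse_size(size, base) / files


if __name__ == "__main__":
    for sample in ("45M", "2G", "1T"):
        print(sample, "->", parse_size(sample), "bytes")
```

Since the reported size is already rounded to its displayed unit, any figure derived from it is an estimate.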

Output Timing

Archival performance for output files of completed attempts.

Key          Type     Value
files        integer  Number of output files archived.
size         string   Total data volume with adaptive units (e.g. 45M, 2G, 1T).
avg_archive  string   Average archival time per file (e.g. 3h 25m, 0h 09m).
max_archive  string   Maximum archival time for a single file.

Archival time is measured from when the file was pulled from the compute site to when it was written to tape or copied to its final destination.

Problems

Error categorization across all attempts.

Key            Type     Value
problem        string   Problem type.
occurrences    integer  Total number of attempts with this problem.
jobs_affected  integer  Number of distinct jobs affected.
unresolved     integer  Number of occurrences still unresolved.

Retries

Distribution of attempt counts per job.

Key           Type     Value
num_attempts  integer  Number of attempts made for a job.
jobs          integer  Number of jobs with this many attempts.
succeeded     integer  How many of those jobs completed successfully.
abandoned     integer  How many were abandoned.
in_progress   integer  How many are still in progress.
pending       integer  How many have not yet been attempted.
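
The distribution can be collapsed back into a single average as a sanity check. The sketch below assumes each row has been loaded as a dictionary keyed by the column names above; the function name and the sample rows are hypothetical and purely illustrative. Whether the result matches attempts_per_job in the Overview report depends on both reports covering the same set of jobs, which is an assumption.

```python
def mean_attempts_per_job(rows: list[dict]) -> float:
    """Weighted mean of num_attempts, weighted by the number of jobs."""
    total_jobs = sum(row["jobs"] for row in rows)
    if total_jobs == 0:
        raise ValueError("no jobs in the distribution")
    total_attempts = sum(row["num_attempts"] * row["jobs"] for row in rows)
    return total_attempts / total_jobs


if __name__ == "__main__":
    # Made-up rows for illustration only: 90 jobs needed one attempt,
    # 8 needed two, and 2 needed three.
    example_rows = [
        {"num_attempts": 1, "jobs": 90},
        {"num_attempts": 2, "jobs": 8},
        {"num_attempts": 3, "jobs": 2},
    ]
    print(round(mean_attempts_per_job(example_rows), 2))  # 1.12
```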

Exit Codes

Distribution of Slurm exit states and codes.

Key          Type     Value
state        string   Slurm job state (e.g. COMPLETED, FAILED, TIMEOUT).
exitcode     integer  Process exit code.
exitsignal   integer  Signal number if the process was killed.
occurrences  integer  Number of attempts with this exit state.
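
Signal numbers are easier to read as names. The following sketch uses Python's standard signal module to label a row; it assumes that an exitsignal of 0 means the process was not killed by a signal, and that the numbering on the machine running the script matches the compute nodes. The function name describe_exit is hypothetical.

```python
import signal


def describe_exit(state: str, exitcode: int, exitsignal: int) -> str:
    """Return a one-line description of an Exit Codes row."""
    if exitsignal:
        try:
            # Look up the symbolic name on the local platform.
            name = signal.Signals(exitsignal).name
        except ValueError:
            # Number not defined locally; fall back to the raw value.
            name = f"signal {exitsignal}"
        return f"{state}: killed by {name}"
    return f"{state}: exit code {exitcode}"


if __name__ == "__main__":
    print(describe_exit("COMPLETED", 0, 0))
    print(describe_exit("FAILED", 0, int(signal.SIGTERM)))
```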