swif2 diagnose

Diagnose why a workflow's jobs aren't progressing.

Usage

swif2 diagnose <workflow> [options]

Examines the workflow state from top to bottom and reports, in plain language, what is blocking progress. It checks scheduling blockers, input staging, Slurm dispatch status, output transfers, and unresolved problems. Sections with nothing to report are omitted to keep the output focused.

Arguments

Flag      Value            Comment
-display  xml|json|simple  Specify format for output.
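
For scripted use, the command can be wrapped in a small helper. The following is a minimal sketch, assuming the flag and its value are passed as two separate arguments after the workflow name, as the usage line suggests. The helper names and the workflow name my_workflow are hypothetical, and the output is returned as raw text because its schema is not described here.

```python
import subprocess

# The three formats listed for -display in the table above.
DISPLAY_FORMATS = ("xml", "json", "simple")


def diagnose_command(workflow: str, display: str = "json") -> list[str]:
    """Build the argument list for `swif2 diagnose <workflow> -display <format>`."""
    if display not in DISPLAY_FORMATS:
        raise ValueError(f"display must be one of {DISPLAY_FORMATS}, got {display!r}")
    return ["swif2", "diagnose", workflow, "-display", display]


def run_diagnose(workflow: str, display: str = "json") -> str:
    """Run the command and return its standard output as text (unparsed)."""
    result = subprocess.run(
        diagnose_command(workflow, display),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# "my_workflow" is a placeholder name used only for illustration.
print(diagnose_command("my_workflow"))
# ['swif2', 'diagnose', 'my_workflow', '-display', 'json']
```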

Reports

Workflow Health

Single-row overview of the workflow's current state.

Key             Type     Value
suspended       boolean  Whether the workflow is paused.
frozen          boolean  Whether the workflow is frozen.
phase           integer  Current workflow phase.
phase_limit     integer  Maximum phase number.
max_concurrent  integer  Maximum concurrent job attempts allowed.
job_limit       integer  Maximum total job attempts allowed.
error_limit     integer  Maximum unresolved problems before pausing.
jobs            integer  Total number of jobs.
pending         integer  Jobs waiting to be scheduled.
active          integer  Jobs with active (non-done, non-problem) attempts.
problems        integer  Unresolved problem attempts.
done            integer  Successfully completed jobs.
abandoned       integer  Abandoned jobs.
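
A script reading this row might start with a quick triage of the most obvious flags. The sketch below assumes the row has already been loaded into a dictionary keyed by the names in the table, and that a missing or non-positive error_limit means no limit is set. Both are assumptions, and health_flags is a hypothetical helper, not part of swif2.

```python
def health_flags(row: dict) -> list[str]:
    """Return plain-language notes for the most obvious workflow-level blockers."""
    flags = []
    if row.get("suspended"):
        flags.append("workflow is suspended")
    if row.get("frozen"):
        flags.append("workflow is frozen")
    # Assumption: a missing or non-positive error_limit means "no limit".
    error_limit = row.get("error_limit") or 0
    if error_limit > 0 and row.get("problems", 0) >= error_limit:
        flags.append("unresolved problems have reached error_limit")
    return flags


# Made-up values for illustration only.
example = {"suspended": True, "frozen": False, "error_limit": 10, "problems": 12}
print(health_flags(example))
# ['workflow is suspended', 'unresolved problems have reached error_limit']
```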

Active Attempts

Groups active job attempts by lifecycle stage with age statistics.

Key       Type     Value
stage     string   Attempt status: preparing, ready, dispatched, canceling, or reaping.
attempts  integer  Number of attempts in this stage.
avg_age   string   Average time since last status update (e.g. 1d 10h, 3h 25m, 0h 09m).
max_age   string   Maximum time since last status update.

A large avg_age for a stage indicates a bottleneck at that stage. For example, many attempts stuck in preparing for hours usually means that their input files are waiting for tape retrieval.
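
To compare ages programmatically, the age strings first need to be converted to durations. The following sketch parses the formats shown in the examples above (1d 10h, 3h 25m, 0h 09m) and lists stages whose average age exceeds a threshold. The row layout (dictionaries keyed by the table's names), the helper names, and the two-hour threshold are assumptions chosen for illustration.

```python
import re
from datetime import timedelta

# Matches components such as "1d", "10h", or "09m".
_AGE_PART = re.compile(r"(\d+)([dhm])")
_UNITS = {"d": "days", "h": "hours", "m": "minutes"}


def parse_age(text: str) -> timedelta:
    """Convert an age string like '1d 10h' or '0h 09m' to a timedelta."""
    parts = _AGE_PART.findall(text)
    if not parts:
        raise ValueError(f"unrecognized age string: {text!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def slow_stages(rows, threshold=timedelta(hours=2)):
    """Return the stages whose avg_age exceeds the (arbitrary) threshold."""
    return [row["stage"] for row in rows if parse_age(row["avg_age"]) > threshold]


# Made-up rows for illustration only.
rows = [
    {"stage": "preparing", "attempts": 40, "avg_age": "3h 25m", "max_age": "1d 10h"},
    {"stage": "dispatched", "attempts": 12, "avg_age": "0h 09m", "max_age": "0h 31m"},
]
print(slow_stages(rows))  # ['preparing']
```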

Scheduling Blockers

Explains why pending jobs haven't been scheduled yet, checking reasons in the same priority order as the scheduler.

Key     Type     Value
reason  string   Why the jobs are blocked (see below).
jobs    integer  Number of pending jobs blocked for this reason.

Possible reasons, in priority order:

Only shown when there are pending jobs.

Input Staging

Breakdown of input file staging progress for attempts in the preparing stage.

Key       Type     Value
stage     string   Staging stage: locating, waiting for tape, transferring, done, or error.
files     integer  Number of input files in this stage.
attempts  integer  Number of distinct job attempts affected.
size      string   Total data volume with adaptive units (e.g. 45M, 2G, 1T).

waiting for tape means files need to be retrieved from the tape archive before they can be transferred to the compute site.

Only shown when there are attempts in the preparing stage.
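
Because size is reported as a string with adaptive units, totals must be converted before they can be compared or summed. The sketch below handles the forms shown in the examples (45M, 2G, 1T). Whether the units are powers of 1024 or of 1000 is not stated here, so the base is a parameter and the 1024 default is an assumption. The helper names and sample rows are hypothetical.

```python
import re

# A number followed by an optional unit letter, e.g. "45M", "2G", "1T".
_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*([KMGTP]?)")
_POWERS = {"": 0, "K": 1, "M": 2, "G": 3, "T": 4, "P": 5}


def parse_size(text: str, base: int = 1024) -> int:
    """Convert a size string to bytes. The base (1024 vs 1000) is an assumption."""
    match = _SIZE.fullmatch(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size string: {text!r}")
    number, unit = match.groups()
    return int(float(number) * base ** _POWERS[unit])


def bytes_by_stage(rows, base: int = 1024) -> dict:
    """Map each staging stage to its total size in bytes."""
    return {row["stage"]: parse_size(row["size"], base) for row in rows}


# Made-up rows for illustration only.
rows = [
    {"stage": "waiting for tape", "files": 120, "attempts": 30, "size": "2G"},
    {"stage": "transferring", "files": 8, "attempts": 4, "size": "45M"},
]
totals = bytes_by_stage(rows)
print(max(totals, key=totals.get))  # waiting for tape
```

The same conversion applies to the size column of the Output Transfers report.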

Input Errors

Top errors encountered during input file staging.

Key       Type     Value
error     string   Error message.
files     integer  Number of files with this error.
attempts  integer  Number of job attempts affected.

Limited to the top 10 errors. Only shown when input staging errors exist.

Slurm Status

Distribution of Slurm job states for dispatched attempts.

Key          Type     Value
slurm_state  string   Slurm state (e.g. PENDING, RUNNING) or no sacct yet.
attempts     integer  Number of attempts in this state.
avg_age      string   Average time since dispatch (e.g. 1d 10h, 3h 25m, 0h 09m).
max_age      string   Maximum time since dispatch.

no sacct yet indicates recently submitted jobs whose Slurm accounting data has not been polled yet.

Only shown when there are dispatched attempts.

Output Transfers

Breakdown of output file transfer progress for attempts in the reaping stage.

Key       Type     Value
stage     string   Transfer stage: pulling, archiving, done, or error.
files     integer  Number of output files in this stage.
attempts  integer  Number of distinct job attempts affected.
size      string   Total data volume with adaptive units (e.g. 45M, 2G, 1T).

Only shown when there are attempts in the reaping stage.

Problems

Unresolved problem attempts grouped by error type.

Key      Type       Value
problem  string     Problem type.
count    integer    Number of unresolved attempts with this problem.
oldest   timestamp  When the first occurrence was recorded.
newest   timestamp  When the most recent occurrence was recorded.

Only shown when there are unresolved problem attempts.

Recent Log

Recent warn and error log entries from the last 24 hours, condensed by message. Identical messages are grouped together with a count and the timestamps of the first and last occurrence.

Key      Type       Value
level    string     Log level: warn or error.
message  string     Log message (truncated to 120 characters).
count    integer    Number of times this message appeared.
first    timestamp  When this message first appeared (within the 24-hour window).
last     timestamp  When this message most recently appeared.

Errors are listed before warnings. Limited to the top 15 distinct messages. Only shown when warn/error entries exist in the last 24 hours.
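
The condensing described above can be illustrated with a short sketch. It is not the tool's implementation: it assumes entries arrive as (timestamp, level, message) tuples, groups on the level and the full message, truncates only for display, and ranks messages within each level by count. All of these choices are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Errors sort before warnings; other levels are ignored.
LEVEL_ORDER = {"error": 0, "warn": 1}


def condense(entries, now, window=timedelta(hours=24), limit=15, max_len=120):
    """Group warn/error entries inside the window by (level, message)."""
    groups = {}
    for ts, level, message in entries:
        if level not in LEVEL_ORDER or ts < now - window:
            continue
        row = groups.setdefault(
            (level, message),
            {"level": level, "message": message[:max_len],
             "count": 0, "first": ts, "last": ts},
        )
        row["count"] += 1
        row["first"] = min(row["first"], ts)
        row["last"] = max(row["last"], ts)
    # Assumption: within a level, more frequent messages rank higher.
    ordered = sorted(
        groups.values(),
        key=lambda r: (LEVEL_ORDER[r["level"]], -r["count"]),
    )
    return ordered[:limit]


# Made-up log entries for illustration only.
now = datetime(2024, 1, 2, 12, 0)
entries = [
    (now - timedelta(hours=1), "warn", "disk slow"),
    (now - timedelta(hours=2), "warn", "disk slow"),
    (now - timedelta(hours=3), "error", "transfer failed"),
    (now - timedelta(hours=30), "error", "outside the window"),
]
for row in condense(entries, now):
    print(row["level"], row["count"], row["message"])
```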