swif2 diagnose-job

Diagnose why a specific job is in its current state.

Usage

swif2 diagnose-job <workflow> -name <job-name> [options]
swif2 diagnose-job -jid <job-id> [options]

Examines a single job and reports what's determining its current state. Checks scheduling blockers, attempt history, input staging, Slurm dispatch status, output transfers, problem history, antecedents, and conditions. Sections that don't apply to the job's current state are skipped.

Arguments

Flag      Value            Comment
-name     job name         Select job by name (requires the workflow to be given; see Usage).
-jid      swif job id      Select job by id.
-display  xml|json|simple  Specify format for output.
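
When the command is called from a script, the two selection forms shown under Usage can be assembled programmatically. The following is a minimal Python sketch, assuming swif2 is on the PATH. The helper name build_diagnose_cmd is hypothetical, and only the flags listed above are used.

```python
from typing import List, Optional


def build_diagnose_cmd(
    workflow: Optional[str] = None,
    name: Optional[str] = None,
    jid: Optional[int] = None,
    display: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `swif2 diagnose-job` (hypothetical helper)."""
    if jid is not None:
        # Rejecting mixed selectors is a choice of this sketch, not documented behavior.
        if workflow is not None or name is not None:
            raise ValueError("select by jid, or by workflow and name, not both")
        cmd = ["swif2", "diagnose-job", "-jid", str(jid)]
    elif workflow is not None and name is not None:
        # Follows the first Usage form: the workflow is given positionally.
        cmd = ["swif2", "diagnose-job", workflow, "-name", name]
    else:
        raise ValueError("need either jid, or both workflow and name")

    if display is not None:
        if display not in ("xml", "json", "simple"):
            raise ValueError("display must be xml, json, or simple")
        cmd += ["-display", display]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, capture_output=True, text=True). The layout of the xml and json output is not described here, so inspect it before parsing.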

Reports

Job Overview

Always shown. Single-row summary of the job's identity and resource requests.

Key         Type     Value
job_id      integer  SWIF job identifier.
job_name    string   Job name within the workflow.
job_status  string   Current status: pending, attempting, done, or abandoned.
job_phase   integer  Job phase number (null if unphased).
attempts    integer  Total number of attempts made.
cores       integer  CPU cores requested.
hours       decimal  Wall time requested in hours.
ram_gb      decimal  RAM requested in GB.
disk_gb     decimal  Disk space requested in GB.
account     string   Slurm account/project.
partition   string   Slurm partition/track.

Scheduling Blocker

Only shown for pending jobs. Explains why this job hasn't been scheduled, using the same priority order as the scheduler.

Key     Type    Value
reason  string  Why the job is blocked (see swif2 diagnose for the full list).

Attempt Timeline

Shown if any attempts exist. One row per attempt showing its status and total duration.

Key         Type       Value
attempt_id  integer    Job attempt ID.
status      string     Attempt status (e.g. done, problem, preparing).
slurm_id    integer    Slurm job ID (null if not yet dispatched).
started     timestamp  When the attempt began.
age         string     Total duration of the attempt (e.g. 1d 10h, 3h 25m, 0h 09m).
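
Scripts that need to compare or threshold durations have to convert the age strings (and the similarly formatted time_in_stage values) into numbers. The following is a minimal Python sketch, assuming the only units that occur are the d, h, and m seen in the examples above. The function name parse_age is hypothetical.

```python
import re
from datetime import timedelta

# Seconds per unit letter; d/h/m are the only units seen in the examples.
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60}


def parse_age(text: str) -> timedelta:
    """Convert a duration such as '1d 10h' or '0h 09m' into a timedelta."""
    parts = re.findall(r"(\d+)([dhm])", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    seconds = sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)
    return timedelta(seconds=seconds)
```

For example, parse_age("3h 25m") > timedelta(hours=3) can be used to flag attempts that have run longer than three hours.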

Current Attempt

Only shown if there is an active (non-done, non-problem) attempt. Focuses on the most recent active attempt.

Key            Type       Value
attempt_id     integer    Job attempt ID.
status         string     Current lifecycle stage.
slurm_id       integer    Slurm job ID.
entered_at     timestamp  When the attempt entered its current stage.
time_in_stage  string     How long it's been in this stage (e.g. 2h 15m, 0h 09m).

Input Staging

Only shown when the current attempt is in the preparing stage. Per-file breakdown of input file staging progress.

Key         Type    Value
local_path  string  Local path where the file will be placed.
source      string  Source location (scheme:path).
stage       string  locating, waiting for tape, transferring, done, or error.
size        string  File size with adaptive units (e.g. 45M, 2G).
error       string  Error message (truncated to 80 chars), if any.

The stage waiting for tape means the file must be retrieved from the tape archive before it can be transferred.
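
For jobs with many inputs, a per-stage tally is often easier to read than the per-file listing. The following is a minimal Python sketch, assuming the rows of this report have already been loaded as dictionaries keyed by the column names above. How the rows are obtained from the xml or json output is not covered here, and the function name summarize_staging is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def summarize_staging(rows: List[Dict[str, Optional[str]]]) -> dict:
    """Tally input files by stage and pick out the ones needing attention."""
    counts = Counter(row["stage"] for row in rows)
    # Files that must come back from the tape archive before transfer.
    waiting = [row["local_path"] for row in rows if row["stage"] == "waiting for tape"]
    # Map each failed file to its (possibly truncated) error message.
    errors = {row["local_path"]: row.get("error") for row in rows if row["stage"] == "error"}
    return {"counts": dict(counts), "waiting_for_tape": waiting, "errors": errors}
```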

Slurm Detail

Only shown when the current attempt is dispatched. Reports sacct data for the attempt's Slurm job. Fields are null if sacct has not been polled yet.

Key               Type     Value
slurm_id          integer  Slurm job ID.
state             string   Slurm state (e.g. PENDING, RUNNING, COMPLETED).
reason            string   Slurm reason code (e.g. Priority, Resources).
nodelist          string   Compute node(s) assigned.
elapsed           string   Wall time elapsed.
cputime           string   CPU time consumed.
maxrss            string   Peak memory usage.
requested_ram_gb  decimal  RAM requested for this job (for comparison with maxrss).
exitcode          integer  Process exit code.
exitsignal        integer  Signal number if killed.
maxdiskread       string   Peak disk read.
maxdiskwrite      string   Peak disk write.
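
The comparison between maxrss and requested_ram_gb can be automated. The following is a minimal Python sketch under two assumptions that this page does not confirm: maxrss is a number with an optional K, M, G, or T suffix (for example 1048576K), and both that suffix and the GB in requested_ram_gb are powers of 1024. The function names are hypothetical.

```python
import re

# Assumed multipliers: binary (1024-based) units; no suffix means bytes.
_SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(text: str) -> float:
    """Convert a size string such as '45M' or '1048576K' to bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(match.group(1)) * _SUFFIX[match.group(2).upper()]


def ram_fraction(maxrss: str, requested_ram_gb: float) -> float:
    """Return peak memory as a fraction of the requested RAM."""
    return size_to_bytes(maxrss) / (requested_ram_gb * 1024**3)
```

A fraction close to 1.0 suggests the job is near its memory request, while a very small fraction suggests the request could be lowered.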

Output Transfers

Only shown when the current attempt is in the reaping stage. Per-file breakdown of output file archival progress.

Key          Type    Value
output_path  string  Path of the output file on the compute node.
target_path  string  Archival destination path.
stage        string  pulling, archiving, done, or error.
size         string  File size with adaptive units (e.g. 45M, 2G).
error        string  Error message (truncated to 80 chars), if any.

Problem History

Only shown if any attempts ended with problems. Includes both resolved and unresolved problems to show the full failure history.

Key         Type       Value
attempt_id  integer    Job attempt ID.
problem     string     Problem type (e.g. SWIF_INPUT_FAIL, SLURM_TIMEOUT).
details     string     Problem details (truncated to 120 chars).
resolution  string     How the problem was resolved (null if unresolved).
ts          timestamp  When the problem was recorded.

Antecedents

Only shown if the job has upstream dependencies.

Key              Type     Value
antecedent_id    integer  Job ID of the dependency.
antecedent_name  string   Job name of the dependency.
job_status       string   Current status of the dependency.

Conditions

Only shown if the job has external conditions. Unsatisfied conditions are listed first.

Key           Type       Value
uri           string     Condition URI (file:// or http://).
status        string     valid or not valid.
error         string     Error from the most recent check, if any.
last_checked  timestamp  When the condition was last evaluated.
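
The ordering described for this report (unsatisfied conditions first) can be reproduced on rows collected elsewhere, for example when merging conditions from several jobs. The following is a minimal Python sketch, assuming each row is a dictionary keyed by the column names above. The function name unsatisfied_first is hypothetical.

```python
from typing import Dict, List


def unsatisfied_first(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return rows with 'not valid' conditions ahead of 'valid' ones."""
    # False sorts before True, so rows whose status is not 'valid' come first.
    # sorted() is stable, so rows keep their relative order within each group.
    return sorted(rows, key=lambda row: row["status"] == "valid")
```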

Recent Log

Warn and error log entries for this job, condensed by message. Identical messages are grouped together with a count and the timestamps of the first and last occurrence.

Key      Type       Value
level    string     Log level: warn or error.
message  string     Log message (truncated to 120 chars).
count    integer    Number of times this message appeared.
first    timestamp  When this message first appeared.
last     timestamp  When this message most recently appeared.

Errors are listed before warnings. Limited to the top 15 distinct messages. Only shown when warn/error log entries exist for this job.
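
The condensing rule described above can be illustrated on raw log entries. The following is a minimal Python sketch, not the tool's actual implementation. It assumes entries arrive as (timestamp, level, message) tuples with mutually comparable timestamps, that grouping happens before truncation, and that "top" means most frequent within each level. The function name condense_log is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def condense_log(entries: Iterable[Tuple[str, str, str]], limit: int = 15) -> List[dict]:
    """Group identical warn/error messages and report count, first, and last."""
    groups: Dict[Tuple[str, str], dict] = {}
    for ts, level, message in entries:
        if level not in ("warn", "error"):
            continue  # only warn and error entries are reported
        key = (level, message)
        group = groups.get(key)
        if group is None:
            groups[key] = {
                "level": level,
                "message": message[:120],  # truncated to 120 chars for display
                "count": 1,
                "first": ts,
                "last": ts,
            }
        else:
            group["count"] += 1
            group["first"] = min(group["first"], ts)
            group["last"] = max(group["last"], ts)
    # Errors before warnings; within a level, most frequent first (assumed).
    ordered = sorted(groups.values(), key=lambda g: (g["level"] != "error", -g["count"]))
    return ordered[:limit]
```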