Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. It is used at many supercomputing sites and data centers around the world. The JLab farm deployed Slurm in early 2019, but it was initially hidden from users behind the Auger system. Users can now access Slurm directly from the farm interactive nodes.
Submitting Batch Jobs:
You can submit jobs from one of the interactive nodes or from within a running batch script. Batch jobs are submitted using the Slurm sbatch command with a valid project account. You can specify options on the command line, or (recommended) put all of them into your batch script file. See the sample scripts in one of the following sections, as well as the minimal sketch after the list below. In your batch script, please specify at least the following, plus any other options useful to your workflow.
- account, using -A, --account=<account>
- partition, which contains a set of nodes and serves as a queue, using -p, --partition=<partition_names>
- resources needed (number of nodes, type of nodes, cores, memory, etc.), using -C, --constraint=<list> for the set of features of the desired nodes; -N, --nodes=<num_nodes> for multi-node jobs; -n, --ntasks=1 --cpus-per-task=<num_cores> for a single-node job using multiple cores; and --mem-per-cpu=<size> for the memory per core, in megabytes
- wall time (specifying this more tightly than the default will improve your throughput), using -t, --time=<time>
- Note: a computing core denotes a virtual processing core (a hyper-thread), not a physical core. A typical physical core provides two virtual processing cores.
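As an illustration, here is a minimal batch script sketch that puts these options together. The account name (myproject), job name (myjob), executable (my_analysis), and resource values are placeholders; substitute the values appropriate to your project.

    #!/bin/bash
    #SBATCH --account=myproject       # project account (placeholder; use your own)
    #SBATCH --partition=production    # partition/queue to submit to
    #SBATCH --ntasks=1                # single task on a single node
    #SBATCH --cpus-per-task=4         # four virtual cores for that task
    #SBATCH --mem-per-cpu=1024        # 1024 MB of memory per core
    #SBATCH --time=01:00:00           # wall time limit (hh:mm:ss)
    #SBATCH --job-name=myjob          # job name (placeholder)

    # The commands to run go below the #SBATCH header.
    ./my_analysis                     # my_analysis is a placeholder executable

If this script were saved as myjob.sh, it would be submitted with: sbatch myjob.sh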
Three partitions are currently configured: general, production, and priority. Please use the Scicomp portal Job page for the status of active and recently finished jobs, as well as the most current partition information.
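The partition list can also be viewed from the command line with the standard Slurm sinfo command (a quick sketch; the exact output columns depend on site configuration):

    sinfo                    # summary of all partitions and node states
    sinfo -p production      # limit the listing to one partition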
Job Status and Other Information:
Once jobs are submitted to Slurm, you can use the squeue command to check the status of jobs, the scancel command to cancel one or a list of jobs, and the scontrol command to hold jobs. For detailed information about Slurm commands, please consult the Slurm official documentation.
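For example (the job IDs 12345 and 12346 are placeholders):

    squeue -u $USER          # list your own pending and running jobs
    squeue -j 12345          # status of one specific job
    scancel 12345            # cancel a single job
    scancel 12345 12346      # cancel a list of jobs
    scontrol hold 12345      # hold a pending job
    scontrol release 12345   # release a previously held job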