How to run a parallel MPI job which needs a large amount of memory?
When running an MPI job that needs a large amount of memory, we recommend using exclusive mode and limiting the job to one host. To do this, add the Slurm options below and do not request memory explicitly; the job will then land on a node that has a large amount of memory. At the moment farm1976 has 512 GB of memory and 128 cores, and more memory may be added to other nodes if needed. For example, if a 64-core job lands on this node exclusively, all of the node's memory is available to the job, instead of the job running on ten or more nodes with each requesting 40 GB of memory. With these options the job gets scheduled faster and leaves other nodes free for other jobs. (A complete script sketch follows the option list below.)
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --exclusive
#SBATCH --constraint=bigmem
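For reference, a minimal batch-script sketch combining these options might look like the following; the partition name, account name, time limit, and executable path are placeholders to adjust for your own job, and srun is one common way to launch the MPI tasks.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --exclusive
#SBATCH --constraint=bigmem
#SBATCH --partition=production
#SBATCH --account=your-account-name
#SBATCH --time=24:00:00
srun path_to_executable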
What partition should I specify when submitting a Slurm job?
The most common Slurm partitions are priority and production. The priority partition has the highest priority and is used to run test or debug jobs, but each user can only run 16 jobs at the same time and the maximum walltime is only 1 day. The production partition, on the other hand, has a 4-day maximum walltime and no maximum job limit (other than in some special circumstances). The ifarm partition is for interactive jobs; the osg partition is for OSG jobs; the jupyter partition is for Jupyter notebooks; the gpu partition is for jobs using the sciml GPU nodes. Please see the
Slurm Partition page for the complete partition list.
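As an example, a short debug job could be submitted to the priority partition with options like the following sketch (the account name and executable path are placeholders); switching to the production partition only requires changing the --partition line and, if needed, the --time limit.
#!/bin/bash
#SBATCH --partition=priority
#SBATCH --account=your-account-name
#SBATCH --time=01:00:00
path_to_executable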
What does the 'Invalid account or account/partition combination specified' error mean?
If you use the --account=my-account Slurm option and your user does not belong to that account, the error "sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified" will be printed. Use the
Slurm User Account page to find out which accounts you belong to, or send a ServiceNow request to be added to a specific account.
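In addition to the Slurm User Account page, one way to list the accounts associated with your username from the command line is sacctmgr, shown here as a sketch, and then resubmit with the correct account name (the script name is a placeholder):
sacctmgr show associations user=$USER format=Account
sbatch --account=correct-account-name your-job-script.sh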
How to limit Slurm job notification emails?
If you submit many Slurm jobs and want to receive Slurm notifications only when certain event types occur, add the --mail-type option when submitting a job, for example --mail-type=END or --mail-type=FAIL. The most commonly used type values are BEGIN, END, FAIL, REQUEUE, and TIME_LIMIT. See the sbatch man page for the complete list.
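For example, to be emailed only when a job ends or fails (the address below is a placeholder), the job script could include:
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your-email@example.org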
Why did my job disappear without any log message?
If --output or --error are not set in the Slurm option file, the logs will be written to the directory from which sbatch is called. Use --output=/path/to-log-file/log-file-name and --error=/path/to-log-file/error-file-name to set the log file locations you want. Please note that the directory /path/to-log-file must be created before calling sbatch to submit the job; if you fail to do so, the job will die immediately without any output or error file being created.
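For example, assuming the log directory /path/to-log-file, create it first:
mkdir -p /path/to-log-file
and then point the log options at it in the job script (%j in the file name is replaced by the Slurm job ID):
#SBATCH --output=/path/to-log-file/job-%j.out
#SBATCH --error=/path/to-log-file/job-%j.err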
How to kill my Slurm job?
Run the Slurm command scancel slurm-job-id to kill your own jobs. Use man scancel to see its options and examples.
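For example (the job ID below is a placeholder):
scancel 1234567                         # cancel one specific job
scancel --user=$USER --state=PENDING    # cancel all of your own pending jobs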
How do I find the status of a Slurm job?
Use the Slurm command scontrol show job slurm-job-id to get detailed job information. Please note this command can only be used for outstanding jobs (pending and running) and jobs finished within the last hour. To get an older finished job, use sacct -j slurm-job-id. Check the sacct man page for more useful options and examples. The following sacct options will print the most useful information about a completed job.
sacct --format=JobID%18,User,JobName%20,State%15,Partition,Account%13,Nnodes,NCPUS,ReqMem,ExitCode,Submit,Start,End,Elapsed,Nodelist -j slurm-job-id
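For a quick overview, squeue also lists your own jobs; the job ID below is a placeholder:
squeue -u $USER              # list your pending and running jobs
scontrol show job 1234567    # detailed info while the job is pending, running, or recently finished
sacct -j 1234567             # accounting info for an older completed job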
How to submit a job array to Slurm?
Add "--array=<indexes>" to sbatch to submit a job array, which span multiple jobs to be executed with identical parameters. The indexes specification identifies what array index values should be used. Multiple values may be specified using a comma separated list and/or a range of values with a "-" separator. For example, "--array=0-15" or "--array=0,6,16-32". Refernce sbatch man page for completed information.
How to request an exclusive node for a job?
Add the Slurm options "--exclusive and --nodes=1" to request a whole node (without requesting any memory). Use the following slurm-option-file to request one whole farm18 node to run a 24-hour job; submit it with sbatch as shown after the script.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --partition=production
#SBATCH --account=your-account-name
#SBATCH --constraint=farm18
#SBATCH --time=24:00:00
path_to_executable
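Assuming the script above is saved as slurm-option-file, submit it with:
sbatch slurm-option-file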
Comments
memory use calculation
How do I find out how much memory my job needs? For example, say I need an array of 200,000 doubles. How much memory does that correspond to? When I request, say, mem=250M, is this in units of bytes? Where is it listed the number of bytes for a given variable type?
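A rough back-of-the-envelope answer, assuming the usual 8 bytes per double on the farm's 64-bit Linux nodes:
echo $((200000 * 8))    # 1600000 bytes, i.e. about 1.6 MB
So an array of 200,000 doubles is only about 1.6 MB, and --mem=250M (Slurm memory requests use the units given by the suffix, here megabytes, not bytes) is far more than enough for that array alone. The byte size of a given variable type is reported by sizeof in C/C++ or by your compiler's documentation.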