Specifying resources in Slurm
When submitting a job request to Slurm, you need to specify which resources your job will need. The full list of options can be found on the Slurm sbatch page, but there are many options that you will never need. This page summarises the key options you may want to adjust on M3.
Quick example
Here is an example sbatch command to run a script called my-script.slurm with 1 hour, 16 GB of memory, and 8 CPUs:
sbatch --time=1:00:00 --mem=16G --cpus-per-task=8 my-script.slurm
In general, an sbatch command always looks like:
sbatch [OPTIONS...] SCRIPT
where you can provide as many valid options as you want before placing the script's name at the very end.
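Rather than typing every option on the command line, you can also embed options directly in your submission script using #SBATCH directives. As a minimal sketch (the job name and the program it runs are just placeholders), my-script.slurm might look like:
#!/bin/bash
#SBATCH --job-name=my-job
#SBATCH --time=1:00:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=8

# Your actual commands go here
./my-program
You can then submit it with just sbatch my-script.slurm. Any options given on the command line override the #SBATCH directives in the script.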
Change your job's name
--job-name="Some interesting name"
Maximum time limit for your job
Specify the maximum time that your job might need. You must set this high enough that your job is guaranteed to finish within the limit: if your job is still running when the limit is reached, it will be automatically killed. However, a longer --time may leave your job waiting in the queue for longer. It is up to you to choose a --time that minimises your queuing time while guaranteeing that your job is not terminated early.
Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
Ask for 15 minutes:
--time=15
Ask for 1 hour, 30 minutes:
--time=1:30:00
Ask for 2 days, 12 hours:
--time=2-12:00:00
Memory
Slurm offers multiple options for configuring memory. The only ones you'll need are:
- --mem: memory per node.
- --mem-per-cpu: memory per CPU.
If your job only uses one node, then use --mem to specify the total memory. For example, to ask for 64 GB of memory (per node):
--mem=64G
To ask for 512 MB of memory (per node):
--mem=512M
Perhaps you are using MPI to run multiple tasks in parallel. In this case, you may not care so much about the total memory available, but rather how much memory each individual CPU should have. To ensure each CPU has 4 GB memory:
--mem-per-cpu=4G
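With --mem-per-cpu, your job's total memory allocation scales with the number of CPUs you request. As a rough worked example (assuming one CPU per task), asking for 8 tasks with 4 GB per CPU reserves 8 × 4 GB = 32 GB in total:
sbatch --ntasks=8 --mem-per-cpu=4G my-script.slurm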
Number of CPUs
Slurm distinguishes between CPUs, cores, and sockets. You are probably best off only ever thinking in terms of CPUs. If you're curious, see this StackOverflow post for a quick summary of these terms.
If you just want 8 CPUs for a particular job:
--cpus-per-task=8
Note you should only ask for multiple CPUs if the program you are running will actually use them! Some programs do not have any multithreading or multiprocessing and so cannot use more than one CPU at a time.
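If your program takes the number of threads as an argument, a common pattern is to pass it the SLURM_CPUS_PER_TASK environment variable that Slurm sets inside the job, so the program uses exactly the CPUs you asked for. A sketch, where my-program and its --threads flag are hypothetical placeholders:
# Inside your submission script: run the program with one thread per allocated CPU
./my-program --threads=$SLURM_CPUS_PER_TASK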
If you are using MPI, then Slurm lets you specify the number of MPI tasks and the number of CPUs per task. For example, the below asks for 6 tasks, with 2 CPUs per task, giving a total of 12 CPUs for your job:
--ntasks=6 --cpus-per-task=2
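Putting this together, a sketch of a submission script for that request might look like the following (my-mpi-program is a placeholder, and the time and memory values are purely illustrative; srun is Slurm's standard launcher for MPI tasks):
#!/bin/bash
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=2
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=4G

# srun launches one copy of the program per task (6 copies here),
# each with access to 2 CPUs
srun ./my-mpi-program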
Be careful when asking for lots of CPUs. If you ask for more CPUs than a single node can provide, you will also need to specify --nodes=2 (or greater!). Not only will a multi-node job usually wait longer in the queue before it starts, but communication between nodes is slower than communication within a single node, so it may actually be faster to simply ask for all of the CPUs on a single node.
GPUs
You can specify both the number and (optionally) the type of GPUs you want. This generally involves setting --gres and --partition. See GPUs on M3 for details.
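As a rough sketch, requesting a single GPU looks something like the following, where <gpu-partition> is a placeholder; the actual GPU types and partition names available are listed on the GPUs on M3 page:
--gres=gpu:1 --partition=<gpu-partition>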
Partitions and Quality of Service (QoS)
See Partitions and Quality of Service (QoS).
Getting emails
You can ask for an email to be sent at different stages of your job's lifecycle. Set --mail-user to your email address:
--mail-user=my-email@monash.edu
You also need to specify which events should trigger an email. See the --mail-type documentation for all the options, but for simplicity, you can just ask for emails for all events with:
--mail-type=ALL
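Putting the two options together as #SBATCH directives in your submission script:
#SBATCH --mail-user=my-email@monash.edu
#SBATCH --mail-type=ALL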
Changing the output files
By default, Slurm saves all of the output of your job (that would ordinarily be printed to the terminal) to a single file called:
slurm-<JOBID>.out
where <JOBID> is your job's ID number. This file will be placed in whichever directory you ran sbatch from. You can change this by specifying --output. You can optionally redirect error output (stderr) to a separate file using --error. Read the linked Slurm docs for more details.
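For example, Slurm replaces %j in these filenames with the job's ID, so the following (with an illustrative "my-job" prefix) sends regular output and error output to separate, uniquely named files:
--output=my-job-%j.out --error=my-job-%j.err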
Despite its name, the --error option does not guarantee that all error messages from a program will be directed to the specified file. If you suspect errors have occurred, be sure to check all of your job's output files for warnings and error messages.