Running jobs on MonARCH
Launching jobs on MonARCH is controlled by Slurm (the Slurm Workload Manager), which allocates the compute nodes, resources, and time requested by the user through command-line options and batch scripts. Submitting and running batch jobs on the cluster is a straightforward procedure with three basic steps:
- Setup and launch
- Monitor progress
- Retrieve and analyse results
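For example, the whole cycle might look like the session below. The script name is a placeholder, the job ID is whatever Slurm reports at submission, and the output file follows Slurm's default slurm-<jobID>.out naming:

```bash
# 1. Setup and launch: submit the batch script to Slurm
sbatch my_job.sh
# Submitted batch job 792412

# 2. Monitor progress: check the queue for that job
squeue -j 792412

# 3. Retrieve and analyse results: by default Slurm writes output to slurm-<jobID>.out
less slurm-792412.out
```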
The following topics are covered in this guide:
- Simple Batch Jobs
- Partitions available on MonARCH
- How to use QOS parameters
- Checking the status of the cluster
- MPI on MonARCH
- Multithreaded Jobs
- Array Jobs
- Interactive Jobs
- GPU Jobs
- X11 Jobs
- How to diagnose problems
- Advanced Slurm
- Selecting Particular Hardware
- Fair Share in Slurm
Slurm: Useful Commands
What | Slurm command | Comment |
---|---|---|
Job submission | sbatch <jobScript> | Slurm directives in the job script can also be set via command-line options to sbatch. |
Check the queue | squeue or aliases (sq) | You can also examine an individual job, e.g. squeue -j 792412. |
Check the cluster | show_cluster | A nicely formatted description of the current state of the machines in our cluster, built on top of the sinfo -lN command. |
Delete an existing job | scancel <jobID> | jobID is the Slurm job number. |
Show a running job | scontrol show job <jobID> or mon_sacct <jobID> or show_job <jobID> | Info on a pending or running job. jobID is the Slurm job number. mon_sacct and show_job are our local helper scripts. |
Show a finished job | sacct -j <jobID> or mon_sacct <jobID> or show_job <jobID> | Info on a finished job. |
Suspend a job | scontrol suspend <jobID> | |
Resume a job | scontrol resume <jobID> | |
Delete parts 5 to 10 of a job array | scancel <jobID>_[5-10] | This deletes the array tasks whose indices are 5 to 10. |
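As a quick sketch of how a few of these commands fit together (the job ID is a placeholder):

```bash
# Detailed information on a pending or running job
scontrol show job 792412

# Accounting summary once the job has finished
sacct -j 792412

# Cancel elements 5 to 10 of a job array
scancel 792412_[5-10]
```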
Here are some sample submission scripts to get you started.
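A minimal example is sketched below; the resource requests, module name, and partition are illustrative placeholders, so adjust them to your own work and to the partitions available on MonARCH:

```bash
#!/bin/bash
#SBATCH --job-name=my_first_job        # name shown by squeue
#SBATCH --ntasks=1                     # number of tasks (processes)
#SBATCH --cpus-per-task=1              # CPU cores per task
#SBATCH --mem=4G                       # memory for the job
#SBATCH --time=01:00:00                # wall-time limit (hh:mm:ss)
#SBATCH --output=my_first_job-%j.out   # %j expands to the job ID
# Uncomment to request a specific partition (see "Partitions available on MonARCH"):
##SBATCH --partition=<partitionName>

# Load any software modules the job needs (module name is an example)
# module load python

# The actual work goes here
echo "Running on $(hostname)"
```

Submit it with sbatch <scriptName>; any #SBATCH directive in the script can also be overridden on the sbatch command line.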