Running GPU Jobs
MonARCH is equipped with the following GPU cards:
- A100 cards
- A40 cards
Please use the tool show_cluster to view the current state of our systems. GPUs are in the gpu partition.
When requesting an NVIDIA A40 GPU, you need to specify --gres=gpu:A40:<no of cards>
as well as the partition:
#SBATCH --gres=gpu:A40:1
#SBATCH --partition=gpu
When requesting an A100 GPU, you need to specify --gres=gpu:A100:<no of cards> as well as the partition:
#SBATCH --gres=gpu:A100:1
#SBATCH --partition=gpu
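To confirm from inside a running job which GPUs were actually granted, a short check along these lines may help. It is a minimal sketch, assuming Slurm on the cluster exports CUDA_VISIBLE_DEVICES for jobs that request a GPU (common, but it depends on site configuration). count_gpus is a hypothetical helper, not a cluster tool.

```shell
#!/bin/bash
#SBATCH --gres=gpu:A100:1
#SBATCH --partition=gpu

# Hypothetical helper: count the entries in a comma-separated device list.
count_gpus() {
    local list="$1"
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    # Split the list on commas and count the resulting fields.
    local IFS=','
    local -a ids
    read -r -a ids <<< "$list"
    echo "${#ids[@]}"
}

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES when a GPU gres is requested.
echo "Visible GPU ids: ${CUDA_VISIBLE_DEVICES:-<none>}"
echo "GPU count: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"

# nvidia-smi -L prints one line per GPU, including the card model.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not query a GPU"
fi
```

If the reported card model does not match the one named in --gres, the request line is the first thing to re-check.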
Sample GPU Slurm scripts
To submit a job that needs 1 node with 3 cores and 1 A40 GPU, the Slurm submission script should look like:
#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:A40:1
#SBATCH --partition=gpu
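The directives above only reserve resources; the commands to run follow them in the same file. The sketch below shows one way the rest of such a script might look. The executable name ./my_gpu_program and the launcher helper are hypothetical, and the guards are there only so the script also behaves sensibly when run outside a Slurm job.

```shell
#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:A40:1
#SBATCH --partition=gpu

# Hypothetical executable name; replace it with your own program.
APP="./my_gpu_program"

# Load the CUDA module when the module command is available
# (it is normally only defined in the cluster environment).
if type module >/dev/null 2>&1; then
    module load cuda
fi

# Hypothetical helper: pick srun inside a Slurm job, a direct call otherwise.
launcher() {
    if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
        echo "srun"
    else
        echo "direct"
    fi
}

if [ -x "$APP" ]; then
    if [ "$(launcher)" = "srun" ]; then
        # srun starts one copy of the program per task (3 here).
        srun "$APP"
    else
        "$APP"
    fi
else
    echo "Executable $APP not found; nothing to run."
fi
```

Note that with --ntasks=3 and a single GPU, all three copies started by srun share that one card; whether that is desirable depends on the application.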
If you need 2 nodes, each with 4 CPU cores and 2 A40 GPUs, the Slurm submission script should look like:
#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:A40:2
#SBATCH --partition=gpu
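In Slurm, --gres is counted per node while --ntasks is counted across the whole job, so the request above amounts to 8 tasks and 4 GPUs in total. The sketch below prints that arithmetic from inside a job. It assumes the standard Slurm variables SLURM_JOB_NUM_NODES and SLURM_NTASKS are set, and falls back to the values from the example when they are not. GPUS_PER_NODE and total_gpus are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# Hypothetical value mirroring --gres=gpu:A40:2 in the script above.
GPUS_PER_NODE=2

# Slurm sets these inside a job; the defaults mirror the example request.
nodes="${SLURM_JOB_NUM_NODES:-2}"
tasks="${SLURM_NTASKS:-8}"

# Hypothetical helper: total GPUs = nodes x GPUs per node.
total_gpus() {
    echo $(( $1 * $2 ))
}

echo "Nodes: ${nodes}"
echo "Tasks in total: ${tasks} ($(( tasks / nodes )) per node)"
echo "GPUs in total: $(total_gpus "$nodes" "$GPUS_PER_NODE")"
```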
Compiling your own CUDA or OpenCL codes for use on M
MonARCH has been configured to allow CUDA (or OpenCL) applications to be compiled (device-independent code ONLY) on the login node (no GPUs installed) for execution on a compute node (with GPUs).
Login node
: can compile some CUDA (or OpenCL) source code (device-independent code ONLY) but cannot run it.
Compute node
: can compile all CUDA (or OpenCL) source code as well as execute it.
We strongly suggest you compile your code on a compute node. To do that, use an smux
session to gain access to a compute node:
smux new-session --gres=gpu:A40:1
Once your interactive session has begun, load the cuda module:
module load cuda
To check the GPU device information, run:
nvidia-smi
deviceQuery
Then you should be able to compile the GPU code. Once compilation has completed without error, you can execute your GPU code.
If you attempt to run a compiled CUDA (or OpenCL) application on the login node, a 'no CUDA device found' error may be reported. This is because no CUDA-enabled GPUs are installed on the login node. You must run GPU code on a compute node.
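A small wrapper can catch the login-node case before the program is started. The following is a minimal sketch, not a MonARCH-provided tool: the file names my_kernel.cu and ./my_kernel and the has_gpu helper are hypothetical, and it assumes nvcc is on the PATH once the cuda module is loaded.

```shell
#!/bin/bash
# Hypothetical names: replace with your own source file and executable.
SRC="my_kernel.cu"
APP="./my_kernel"

# Succeeds only when nvidia-smi exists and lists at least one GPU.
has_gpu() {
    command -v nvidia-smi >/dev/null 2>&1 &&
        nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

# Compile only when both the compiler and the source file are present.
if command -v nvcc >/dev/null 2>&1 && [ -f "$SRC" ]; then
    nvcc -o "$APP" "$SRC"
fi

if has_gpu; then
    if [ -x "$APP" ]; then
        "$APP"
    else
        echo "Executable $APP not found; compile it first."
    fi
else
    echo "No GPU detected on ${HOSTNAME:-this host}; run this on a compute node."
fi
```

On the login node the script prints the final message and exits without starting the program; inside an smux session with a GPU it compiles and runs as usual.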