squeue
squeue is a Slurm command for checking your queued and running jobs. See the official Slurm page for full details.
Usage
By default, squeue shows you all queued and running jobs on M3. You probably only care about your own jobs, so use the --me flag:
squeue --me
Some arguments you may find useful (with example invocations below) are:

- --start: show the estimated start time of a pending job.
- -O or --Format: specify which columns are shown in the output.
- -w or --nodelist: specify which nodes to show queued jobs for.
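For example, the invocations below are a rough sketch of how these flags can be combined. The -O field names used here (jobid, partition, name, statecompact, timeused, reasonlist) are standard squeue output fields, though the exact set available can vary between Slurm versions, and the node names are just placeholders taken from the example output further down:

# estimated start times for your pending jobs
squeue --me --start
# choose your own columns
squeue --me -O "jobid,partition,name,statecompact,timeused,reasonlist"
# only show your jobs queued or running on specific nodes
squeue --me -w m3j002,m3j003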
Output
See the example below:
[lexg@m3-login3 ~]$ squeue --me
   JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
42016968      comp Interest lexg  R 0:06     1 m3j002
42016967      comp  test.sh lexg  R 0:20     1 m3j002
42016966      comp  test.sh lexg  R 0:39     1 m3j003
42016970       gpu  Another lexg PD 0:00     1 (Resources)
Again, see the official squeue docs for full details, but the default fields are:
| Field | Meaning |
|---|---|
| JOBID | The Slurm job's ID. |
| PARTITION | Requested partition for the job. |
| NAME | The name of the job. |
| USER | The user who submitted the job. |
| ST | The state of the job. See job states. |
| TIME | The amount of time the job has run for. |
| NODES | Requested number of nodes for the job. |
| NODELIST(REASON) | If the job is running, the list of nodes it is running on. If it is not yet running, this instead shows the reason the job has not started. See Reasons for a job not starting. |
Job states
See the official squeue docs for a full list of possible job states. Generally, you will only see one of the following:
| State | Meaning |
|---|---|
| PD (PENDING) | Job has not yet started. |
| R (RUNNING) | Job has started and is currently running. |
| CG (COMPLETING) | Job is in the process of completing. This should generally finish within a few seconds or minutes at most. |
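If you only want to see jobs in a particular state, squeue also has a standard -t/--states flag (not covered in the list above) that filters on these state codes, for example:

# only your pending jobs
squeue --me -t PD
# only your pending and running jobs, using the full state names
squeue --me --states=PENDING,RUNNING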
Reasons for a job not starting
The REASON field of squeue will sometimes show a particular reason that your job has not yet started. See the Slurm list of reason codes for every possible reason code, but here are the most common ones you will see:
| Reason | Meaning | What should you do? |
|---|---|---|
| Priority | One or more higher-priority jobs exist for this partition or advanced reservation. | Nothing. The best you could do is request fewer resources. |
| Resources | The job is waiting for resources to become available. | Nothing. The best you could do is request fewer resources. |
| QOSMaxGRESPerUser | Your job requested more GPUs than are allowed by the QoS. | Use mon_qos to check the gpu value in MaxTRESPU, and never request more than that number of GPUs (across all of your jobs). |
| QOSMaxWallDurationPerJobLimit | Your job requested more walltime than is allowed by the QoS. | Use mon_qos to check MaxWall, and never request more walltime than this. |
| QOSMaxCpuPerUserLimit | Your job requested more CPUs than are allowed by the QoS. | Use mon_qos to check the cpu value in MaxTRESPU, and don't try to use more than this number of CPUs at once (across all of your jobs). |
| QOSMaxSubmitJobPerUserLimit | You already have the maximum number of submitted jobs for this QoS. E.g. the desktopq QoS limits this to 1 submitted job at a time. | Use mon_qos to check MaxSubmitPU, and don't submit more than this number of jobs at once. |
| MaxGRESPerAccount | Your job requested more GPUs than are allowed for your account. Note that your account represents your HPC ID project, i.e. this count is shared amongst your colleagues. | Use mon_qos to check MaxTRESPA, and either wait or ask your colleagues to request fewer GPUs. |
Some of these reasons are only temporary. For example, if you have a job using 4 GPUs in the normal QoS and then submit another job requesting 1 GPU, Slurm will report QOSMaxGRESPerUser for the new job, since starting it would mean using more than 4 GPUs at once, which the normal QoS forbids. If you simply wait for your first job to finish, though, your second job will eventually stop showing QOSMaxGRESPerUser and will be able to start.
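Since temporary reasons like this clear on their own once the blocking job finishes, one low-effort approach is to re-check your queue periodically rather than resubmitting. As a rough sketch (the 60-second interval is arbitrary, and watch is a standard Linux utility rather than part of Slurm):

# refresh your queue (with estimated start times) every 60 seconds
watch -n 60 "squeue --me --start"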