Diagnosing problems with jobs
Pending jobs
There are many reasons why a job may be stuck in Pending status. In
90% of cases, the show_job script can assist you in determining why
the job is stuck.
show_job
and
show_job [JOBID]
It may also be beneficial to check the cluster status:
show_cluster
CPU, Memory, and Desktop Job Limits
Cluster resources are shared by many users. To ensure that every user
gets their fair share, per-user limits are applied to CPU cores, memory,
and the number of desktop jobs. For instance, on MonARCH, each user can
consume at most 300 CPU cores at any one time. If this limit is hit, all
new jobs will be set to Pending. This appears in the output of show_job
as Reach User Job Limit (CPU).
If you encounter this situation, it is very likely that you are consuming a high number of CPU cores. As running jobs complete, your pending jobs will begin to run.
The show_job command reports how many CPU cores you are currently using:
$ show_job
*************************************************
* MY JOB SUMMARY *
* Cluster: m3 *
*************************************************
User Name Massive User
User ID masusr
-------------------------------------------------
Num of Submitted Jobs 0
Num of Running Jobs 0
Num of Pending Jobs 0
Num of CPU Cores 20
-------------------------------------------------
*************************************************
* Job Details on m3 *
*************************************************
JOBID JOB NAME Project QOS STATE RUNNING TOTAL NODE DETAILS
TIME TIME
------------------------------------------------------------------------------------------------------------------
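If you want to check your core usage from a script, the "Num of CPU Cores" line of the summary can be pulled out with awk. The following is a minimal sketch, assuming the show_job output keeps the layout shown in the example; the helper names cores_in_use and at_core_limit are hypothetical and not part of the cluster tooling, and the limit is passed as an argument because it differs between clusters.

```shell
#!/bin/sh
# Hypothetical helper: print the value from the "Num of CPU Cores" line
# of show_job output supplied on standard input.
cores_in_use() {
    awk '/Num of CPU Cores/ { print $NF; exit }'
}

# Hypothetical helper: succeed when the core count read from standard
# input has reached the given limit (for example, 300 on MonARCH).
at_core_limit() {
    limit="$1"
    used="$(cores_in_use)"
    [ -n "$used" ] && [ "$used" -ge "$limit" ]
}

# Usage on a login node (assumes show_job is on your PATH):
#   show_job | cores_in_use
#   show_job | at_core_limit 300 && echo "At the CPU core limit"
```

Both helpers read from standard input rather than calling show_job directly, so they can be tried against saved output before being used on the cluster.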