show_cluster

show_cluster is a custom script to quickly check the state of all nodes and partitions on M3, as well as summarise how many CPUs and GPUs are currently free.

Usage

Run show_cluster in a terminal. The example below omits many nodes for readability.

[lexg@m3-login3 ~]$ show_cluster
NODE    TYPE  PARTITION  CPU     Mem (GB)  GPU     STATUS
                         (Free)  (Free)    (Free)
m3a100  A40   desktop    2       167       0       Running
m3a101  A40   desktop    0       59        0       Busy
m3a102  A40   desktop    2       19        0       Running
m3a103  A40   OFFLINE REASON: Kill task failed     Offline
m3a104  A40   desktop    0       59        0       Busy
m3a105  A40   gpu        12      96        0       Running
m3a106  A40   gpu        6       580       0       Running
m3a107  A40   gpu        0       453       0       Busy
# and many more nodes...
m3d100  CPU   comp       3       5         0       Running
m3d101  CPU   comp       31      5         0       Running
m3d102  CPU   comp       31      5         0       Running
m3d103  CPU   comp       31      5         0       Running
# and many more nodes...

Summary:
+------------+------------+------------+-------------+------------+------------+------------+-------------+-------------+
| | CPUs | Nodes | V100 GPUs | P4 GPUs | T4 GPUs | A40 GPUs | A100 GPUs | H100 GPUs |
|------------+------------+------------+-------------+------------+------------+------------+-------------+-------------|
| Available | 5541 (45%) | 35 (19%) | 4 (44%) | 43 (65%) | 25 (21%) | 0 ( 0%) | 23 (27%) | 8 (100%) |
| In Use | 5759 (47%) | 134 (72%) | 2 (22%) | 17 (26%) | 23 (19%) | 56 (93%) | 57 (68%) | 0 ( 0%) |
| Down | 716 ( 6%) | 14 ( 7%) | 3 (33%) | 0 ( 0%) | 46 (39%) | 4 ( 7%) | 4 ( 5%) | 0 ( 0%) |
| Reserved | 204 ( 2%) | 4 ( 2%) | 0 ( 0%) | 6 ( 9%) | 24 (20%) | 0 ( 0%) | 0 ( 0%) | 0 ( 0%) |
| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Total | 12220 | 187 | 9 | 66 | 118 | 60 | 84 | 8 |
+------------+------------+------------+-------------+------------+------------+------------+-------------+-------------+

Output

Nodes

The top half of show_cluster's output is a table of nodes with the following fields:

Field            Meaning
NODE             Name of the node.
TYPE             CPU if the node is CPU-only; otherwise the type of GPU available on that node.
PARTITION        The partition this node belongs to.
CPU (Free)       Number of unused CPUs on this node.
Mem (GB) (Free)  Amount of unused memory on this node, in gigabytes (GB).
GPU (Free)       Number of unused GPUs on this node.
STATUS           Summary of the node's state. See "Possible STATUS values" below.
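
If you want to filter the node table programmatically, the fields above can be read with a short script. The following is a minimal sketch, assuming each regular row has exactly seven whitespace-separated fields with integer resource counts, as in the example output; the function name parse_node_row and the dictionary keys are hypothetical, and the real script's output format may differ in cases not shown here.

```python
from typing import Optional


def parse_node_row(line: str) -> Optional[dict]:
    """Parse one regular node row; return None for any other line.

    Assumes the layout seen in the example output:
    NODE TYPE PARTITION CPU MEM GPU STATUS
    """
    parts = line.split()
    if len(parts) != 7:
        return None  # headers, comments, and most offline rows
    node, node_type, partition, cpu, mem, gpu, status = parts
    if not (cpu.isdigit() and mem.isdigit() and gpu.isdigit()):
        return None  # e.g. an offline row that happens to have 7 words
    return {
        "node": node,
        "type": node_type,
        "partition": partition,
        "cpu_free": int(cpu),
        "mem_free_gb": int(mem),
        "gpu_free": int(gpu),
        "status": status,
    }


# Rows copied from the example output above.
sample = """\
m3a100 A40 desktop 2 167 0 Running
m3a103 A40 OFFLINE REASON: Kill task failed Offline
m3a105 A40 gpu 12 96 0 Running
m3d101 CPU comp 31 5 0 Running
"""

rows = [r for r in map(parse_node_row, sample.splitlines()) if r]
# Nodes with at least 12 free CPUs.
roomy = [r["node"] for r in rows if r["cpu_free"] >= 12]
print(roomy)  # ['m3a105', 'm3d101']
```

Offline rows are skipped here because they carry a reason string instead of the resource columns.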

Summary

Below the nodes table is a summary of how many CPUs, nodes, and GPUs of each type are available, in use, down, or reserved, along with the total for each column.
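
The percentages in the summary appear to be each count divided by its column total, rounded to the nearest whole percent. This is inferred from the example numbers, not from the script's source. A small sketch that reproduces the CPUs column from the example (the helper name share is hypothetical):

```python
def share(count: int, total: int) -> str:
    """Format a count with its rounded percentage of the column total."""
    return f"{count} ({round(100 * count / total)}%)"


# CPU counts copied from the example summary table.
cpus = {"Available": 5541, "In Use": 5759, "Down": 716, "Reserved": 204}
total = sum(cpus.values())  # 12220, matching the Total row

for state, count in cpus.items():
    print(f"{state:<10} {share(count, total)}")
# Available  5541 (45%)
# In Use     5759 (47%)
# Down       716 (6%)
# Reserved   204 (2%)
```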

Possible STATUS values

STATUS value  Meaning
Idle          Node is completely free. No jobs are running on it.
Running       Some jobs are running on the node, but it still has resources available for new jobs.
Busy          Node has no free resources. No new jobs can start on it.
Offline       Node is offline and unavailable due to a system issue.
Reserved      Node has been reserved by other users and is available only to them.
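
As a rough illustration of how these values might be used, the sketch below maps each STATUS to whether a new job from a general user could start on that node. The mapping follows the descriptions above; treating Reserved as unavailable assumes you are not one of the users holding the reservation. The names ACCEPTS_NEW_JOBS and can_accept_jobs are hypothetical.

```python
from collections import Counter

# True if a new job could start on a node with this STATUS, per the
# descriptions above. Reserved is False on the assumption that the
# reservation belongs to someone else.
ACCEPTS_NEW_JOBS = {
    "Idle": True,
    "Running": True,
    "Busy": False,
    "Offline": False,
    "Reserved": False,
}


def can_accept_jobs(status: str) -> bool:
    """Return whether a node with this STATUS can take new jobs."""
    # Unknown values are treated as unavailable to stay on the safe side.
    return ACCEPTS_NEW_JOBS.get(status, False)


# STATUS values of m3a100 to m3a107 in the example output.
statuses = ["Running", "Busy", "Running", "Offline",
            "Busy", "Running", "Running", "Busy"]

counts = Counter(statuses)
schedulable = sum(n for s, n in counts.items() if can_accept_jobs(s))
print(schedulable)  # 4
```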

Offline nodes

If a node is offline, its row in the show_cluster output may show an OFFLINE REASON: in place of the usual partition and resource columns. For example, the output above shows that m3a103 is offline with the reason "Kill task failed":

m3a103             A40 OFFLINE REASON:                 Kill task failed        Offline
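
To pull the reason out of such a row in a script, a regular expression is one option. This is a minimal sketch, assuming offline rows always follow the layout shown above (node, type, the literal text OFFLINE REASON:, the reason, then Offline); the pattern and the function name offline_reason are hypothetical, not part of show_cluster.

```python
import re
from typing import Optional, Tuple

# Assumed layout: NODE TYPE "OFFLINE REASON:" <reason text> "Offline"
OFFLINE_ROW = re.compile(
    r"^(?P<node>\S+)\s+(?P<type>\S+)\s+OFFLINE REASON:\s+"
    r"(?P<reason>.*?)\s+Offline\s*$"
)


def offline_reason(line: str) -> Optional[Tuple[str, str]]:
    """Return (node, reason) for an offline row, or None otherwise."""
    match = OFFLINE_ROW.match(line.strip())
    if match is None:
        return None
    return match["node"], match["reason"]


row = "m3a103             A40 OFFLINE REASON:                 Kill task failed        Offline"
print(offline_reason(row))  # ('m3a103', 'Kill task failed')
print(offline_reason("m3a100 A40 desktop 2 167 0 Running"))  # None
```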