show_cluster
show_cluster is a custom script to quickly check the state of all nodes and partitions on M3, and to summarise how many CPUs and GPUs are currently free.
Usage
Run show_cluster in a terminal. The example below omits many nodes for readability.
[lexg@m3-login3 ~]$ show_cluster
NODE TYPE PARTITION CPU Mem (GB) GPU STATUS
(Free) (Free) (Free)
m3a100 A40 desktop 2 167 0 Running
m3a101 A40 desktop 0 59 0 Busy
m3a102 A40 desktop 2 19 0 Running
m3a103 A40 OFFLINE REASON: Kill task failed Offline
m3a104 A40 desktop 0 59 0 Busy
m3a105 A40 gpu 12 96 0 Running
m3a106 A40 gpu 6 580 0 Running
m3a107 A40 gpu 0 453 0 Busy
# and many more nodes...
m3d100 CPU comp 3 5 0 Running
m3d101 CPU comp 31 5 0 Running
m3d102 CPU comp 31 5 0 Running
m3d103 CPU comp 31 5 0 Running
# and many more nodes...
Summary:
+------------+------------+------------+-------------+------------+------------+------------+-------------+-------------+
| | CPUs | Nodes | V100 GPUs | P4 GPUs | T4 GPUs | A40 GPUs | A100 GPUs | H100 GPUs |
|------------+------------+------------+-------------+------------+------------+------------+-------------+-------------|
| Available | 5541 (45%) | 35 (19%) | 4 (44%) | 43 (65%) | 25 (21%) | 0 ( 0%) | 23 (27%) | 8 (100%) |
| In Use | 5759 (47%) | 134 (72%) | 2 (22%) | 17 (26%) | 23 (19%) | 56 (93%) | 57 (68%) | 0 ( 0%) |
| Down | 716 ( 6%) | 14 ( 7%) | 3 (33%) | 0 ( 0%) | 46 (39%) | 4 ( 7%) | 4 ( 5%) | 0 ( 0%) |
| Reserved | 204 ( 2%) | 4 ( 2%) | 0 ( 0%) | 6 ( 9%) | 24 (20%) | 0 ( 0%) | 0 ( 0%) | 0 ( 0%) |
| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Total | 12220 | 187 | 9 | 66 | 118 | 60 | 84 | 8 |
+------------+------------+------------+-------------+------------+------------+------------+-------------+-------------+
Output
Nodes
The top half of show_cluster's output is a table of nodes with the following fields:
Field | Meaning |
---|---|
NODE | Name of the node. |
TYPE | Shows CPU if the node is CPU-only. Otherwise shows which type of GPU is available on that node. |
PARTITION | Which partition this node belongs to. |
CPU (Free) | How many unused CPUs are on this node. |
Mem (GB) (Free) | How much unused memory is on this node, in gigabytes. |
GPU (Free) | How many unused GPUs are on this node. |
STATUS | Summary of the node's state. See Possible STATUS values. |
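Because each node row is whitespace-separated, the table can be filtered with standard tools. The sketch below is not part of show_cluster: free_nodes is a hypothetical helper, and it assumes the column order matches the example output above (NODE, TYPE, PARTITION, free CPUs, ...). It lists the nodes in one partition that still have free CPUs.

```shell
# Hypothetical helper (not part of show_cluster): print "NODE FREE_CPUS" for
# every node in the given partition that has at least one free CPU.
# Assumes the column order of the example output: NODE TYPE PARTITION CPU ...
free_nodes() {
  awk -v part="$1" '$3 == part && $4 ~ /^[0-9]+$/ && $4 > 0 { print $1, $4 }'
}

# On M3 the real output would be piped in:  show_cluster | free_nodes comp
# Here, rows copied from the example output stand in for it.
printf '%s\n' \
  'm3a101 A40 desktop 0 59 0 Busy' \
  'm3d100 CPU comp 3 5 0 Running' \
  'm3d101 CPU comp 31 5 0 Running' | free_nodes comp
# prints:
# m3d100 3
# m3d101 31
```

Offline rows are skipped automatically, since their third column reads OFFLINE rather than a partition name.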
Summary
Below the nodes table, you'll see a summary of how many CPUs, nodes, and GPUs of each type are currently available, in use, down, or reserved, each with its percentage of the total.
Possible STATUS values
STATUS value | Meaning |
---|---|
Idle | Node is completely free. No jobs running on the node. |
Running | Some jobs are running on the node but it still has available resources for new jobs. |
Busy | Node is completely busy. There are no free resources on the node. No new jobs can start on this node. |
Offline | Node is offline and unavailable due to a system issue. |
Reserved | Node has been booked by other users and is available only to them. |
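To see how many nodes are in each state, the STATUS column can be tallied. This is a minimal sketch, not a feature of show_cluster: count_by_status is a hypothetical helper, and it assumes that node names start with m3 and that STATUS is the last field on each row, as in the example output.

```shell
# Hypothetical helper (not part of show_cluster): count node rows per STATUS.
# Assumes node names start with "m3" and STATUS is the last field on the row.
count_by_status() {
  awk '$1 ~ /^m3/ { n[$NF]++ } END { for (s in n) print s, n[s] }' | sort
}

# On M3 the real output would be piped in:  show_cluster | count_by_status
# Here, rows copied from the example output stand in for it.
printf '%s\n' \
  'm3a100 A40 desktop 2 167 0 Running' \
  'm3a101 A40 desktop 0 59 0 Busy' \
  'm3a103 A40 OFFLINE REASON: Kill task failed Offline' \
  'm3d100 CPU comp 3 5 0 Running' | count_by_status
# prints:
# Busy 1
# Offline 1
# Running 2
```

The name filter keeps header lines and the summary table out of the count.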
Offline nodes
If a node is offline, you may see an OFFLINE REASON: in its row in the show_cluster output. For example, the output above shows that m3a103 was offline because Kill task failed:
m3a103 A40 OFFLINE REASON: Kill task failed Offline
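To list only the offline nodes and their reasons, the rows containing OFFLINE REASON: can be extracted. The sketch below uses a hypothetical helper, offline_reasons, and assumes each such row has the shape shown above: node name first, then the reason text after the marker, then the STATUS column last, separated by spaces.

```shell
# Hypothetical helper (not part of show_cluster): print "NODE: reason" for
# each row that carries an OFFLINE REASON: marker.
# Assumes the row layout shown above, with space-separated fields.
offline_reasons() {
  awk '/OFFLINE REASON:/ {
    reason = $0
    sub(/^.*OFFLINE REASON: */, "", reason)  # drop everything up to the marker
    sub(/ +[^ ]+ *$/, "", reason)            # drop the trailing STATUS column
    print $1 ": " reason
  }'
}

# On M3 the real output would be piped in:  show_cluster | offline_reasons
# Here, rows copied from the example output stand in for it.
printf '%s\n' \
  'm3a102 A40 desktop 2 19 0 Running' \
  'm3a103 A40 OFFLINE REASON: Kill task failed Offline' | offline_reasons
# prints:
# m3a103: Kill task failed
```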