Hands-on: Checking GPU usage interactively using rocm-smi
Presenters: Samuel Antão and Paul Bauer (AMD)
Exercises on the course GitHub.
Q&A
- Why is `srun` needed? What happens if I forget to use it (while still being sure I am on a compute node)? I know that one has to use it for multi-node jobs (since the script actually runs only on the first node of the allocation), but what does it do in a single-node job?
  - To be on a compute node at all, you must have used `srun` at some point. The way to get to the compute nodes is `srun your_command` or `srun --overlap --jobid ### --pty bash`. The latter starts a bash session on a compute node of that job.
  - If you forget `srun` with `salloc`, you will be running on the login node. If you forget `srun` in a batch script, you will be running on the first node in the allocation only.
  Doesn't `salloc` without arguments already bring me to a compute node? Just checked: no (on other HPC systems it does, sorry).
  - Slurm can be configured differently. On LUMI, `salloc` doesn't start a session on the compute nodes; it only reserves the nodes for subsequent `srun` invocations.
  - Actually, LUMI uses the standard behaviour of Slurm. Those systems that do give you a shell on the compute node have modified the Slurm configuration. Often this is because those sites previously ran another resource manager and scheduler (e.g., Torque/MAUI or Torque/MOAB) whose equivalent of `salloc` launched a shell on the compute node by default.
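The difference between running a command with and without `srun` can be checked directly. The following is a minimal sketch, meant to be run from the shell that `salloc` returns; `compare_hosts` is a hypothetical helper name, and the `--nodes=1 --ntasks=1` options are an assumption chosen so that a single job step prints a single host name.

```shell
#!/usr/bin/env bash
# compare_hosts is a hypothetical helper used only for this sketch.
# It prints the node name seen by the current shell and, if the shell
# holds a Slurm allocation, the node name seen by a job step.
compare_hosts() {
    # uname -n prints the name of the node this shell is running on.
    echo "without srun: $(uname -n)"

    if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
        # One task on one node is enough to see where the job step lands.
        echo "with srun:    $(srun --nodes=1 --ntasks=1 uname -n)"
    else
        echo "with srun:    skipped (no Slurm allocation in this shell)"
    fi
}

compare_hosts
```

After `salloc` on a system with the default Slurm behaviour, the two lines should show different host names: the login node for the first and a compute node for the second. The same pattern, `srun rocm-smi`, runs the GPU monitor on the compute node.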
- There was a question whether `rocm-smi` can show memory usage in absolute numbers instead of a percentage. This can be done as follows:
  - For `rocm-smi`: `rocm-smi --showmeminfo vram`
  - With the newer `amd-smi` (available in the containers) you can use: `amd-smi metric -m`
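`rocm-smi --showmeminfo vram` reports values in bytes, which are hard to read at a glance. The sketch below converts them to GiB with `awk`. It assumes output lines of the form `GPU[0] : VRAM Total Memory (B): <bytes>`; that layout may differ between ROCm versions, so check it against the output on your system. `vram_in_gib` is a hypothetical helper name, not part of ROCm.

```shell
#!/usr/bin/env bash
# vram_in_gib is a hypothetical helper: it reads the output of
# `rocm-smi --showmeminfo vram` on stdin and prints the VRAM lines in GiB.
# Assumed input line layout: "GPU[0]<tabs>: VRAM Total Memory (B): <bytes>"
vram_in_gib() {
    awk -F': ' '
        /VRAM Total.*Memory \(B\)/ && NF >= 3 {
            gsub(/[ \t]+$/, "", $1)     # drop padding after the GPU label
            sub(/ \(B\)$/, "", $2)      # drop the "(B)" unit marker
            printf "%s %s: %.2f GiB\n", $1, $2, $3 / 1073741824
        }'
}

# Only query the GPUs if rocm-smi is available in this shell.
if command -v rocm-smi >/dev/null 2>&1; then
    rocm-smi --showmeminfo vram | vram_in_gib
fi
```

From a shell on the login node that holds an allocation, the query itself has to go through `srun`, for example `srun rocm-smi --showmeminfo vram | vram_in_gib`.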