
Your first training job on LUMI

Presenters: Mats Sjöberg (CSC) and Oskar Taubert (CSC)

Content:

  • Using LUMI via the command line
  • Submitting and running AI training jobs using the batch system

Extra materials

Q&A

  1. Question from the audience about dynamic job size

    • No, a running job cannot change its size. This is also to be expected: there is no guarantee that the extra resources would be immediately available, so your job would be stuck waiting while still holding the nodes it already has, which is very expensive.

      You can write dependent jobs, where you specify that one job should only start when another job has finished. So you could have a second job with more resources that can get scheduled as soon as the first job with fewer resources ends. See the description of the --dependency option in the sbatch manual.

    • Running an inference service on an HPC system with inelastic job scheduling is not an ideal fit, but it is increasingly asked for, and there are "overlay" solutions that try to provide a bridge at the cost of some efficiency.

    • An HPC system has advantages over cloud infrastructure (it can be cheaper), but then it also comes with restrictions that a cloud infrastructure does not have...
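
The dependent-job approach mentioned above can be sketched as follows. This is a minimal sketch, not taken from the session: the helper name submit_chain and the batch script names are hypothetical, while --parsable and --dependency=afterok are standard sbatch options.

```shell
# Hypothetical helper: submit two batch scripts so that the second one
# becomes eligible to start only after the first has finished successfully.
submit_chain() {
    first_script=$1
    second_script=$2

    # --parsable makes sbatch print just the job ID
    # (optionally followed by ";cluster", which is stripped below).
    first_id=$(sbatch --parsable "$first_script") || return 1

    # afterok: the second job may start only if the first ended with exit code 0.
    sbatch --dependency="afterok:${first_id%%;*}" "$second_script"
}
```

For example, `submit_chain small_job.sh large_job.sh`, where large_job.sh (a hypothetical script) requests more nodes in its #SBATCH lines than small_job.sh.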

  2. What does the flag --overlap do in srun?

    • It allows multiple job steps in Slurm to overlap resources such as CPU cores and GPUs.
    • But this is a very brief introduction to Slurm. In the regular intro courses, we have a 2-hour lecture, and even that is not enough to explain all of this in detail...
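
A common use of --overlap is to attach an extra job step to a job that is already running, for example to look at GPU usage. The sketch below assumes the job is yours and still running; the helper name attach_to_job is hypothetical.

```shell
# Hypothetical helper: open an interactive shell on the resources of a
# running job. Without --overlap, the new step would have to wait until
# the resources held by the job's existing step are released.
attach_to_job() {
    job_id=$1
    srun --overlap --jobid="$job_id" --pty bash
}
```

Call it as `attach_to_job JOBID`, with the job ID taken from `squeue --me`. Inside the shell you can then run a tool such as rocm-smi to inspect the GPUs.
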
  3. Info: Link to the HyperQueue presentation mentioned after a question in the room.

  4. I'm attempting to view the training stats on TensorBoard using the path to runs but it displays empty loss curves. Is this expected at this stage of the tutorial? EDIT: (It started showing some data)

    • TensorBoard is not the fastest.

    • Also: the training script should write the TensorBoard logs to scratch, not projappl (unless you changed that).

    • When you launch TensorBoard, you need to set the TensorBoard log directory to /scratch/project_465002178/USERNAME/runs/ (replace USERNAME with your actual username).
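
If the curves stay empty, it can help to check whether the training job has written any event files yet. A minimal sketch, assuming the logs use the usual TensorBoard file naming (events.out.tfevents.*); the helper name count_event_files is hypothetical.

```shell
# Hypothetical helper: count TensorBoard event files below a log directory.
count_event_files() {
    log_dir=$1
    find "$log_dir" -type f -name 'events.out.tfevents.*' | wc -l
}
```

Call it as `count_event_files /scratch/project_465002178/USERNAME/runs/`. A result of 0 means the job has not logged anything to that directory yet, so there is nothing for TensorBoard to display.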

  5. Got this error:

    + srun singularity exec /appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif /users/sharmavi/test.py --output-path /scratch/project_465002178/sharmavi/data/ --logging-path /scratch/project_465002178/sharmavi/runs/
    FATAL:   permission denied
    

    • This command tries to execute the script test.py directly inside the container, so:

      1. It has to be an executable script (execute permission set), and
      2. It has to start with the correct shebang line, i.e., the line that starts with #!, to tell Linux to use Python (and which Python).
    • Or you simply forgot the python after the sif-file, i.e., between the path to the .sif image and the path to the script.
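
Both fixes can be sketched in shell. The helper names check_script and run_with_python are hypothetical and not part of the course material. The srun/singularity invocation mirrors the one in the error message, with python added after the container image, and the shebang shown is only one common choice.

```shell
# Hypothetical helper: report why a script cannot be executed directly.
check_script() {
    script=$1
    status=0
    if [ ! -x "$script" ]; then
        echo "not executable (fix: chmod +x $script)"
        status=1
    fi
    if ! head -n 1 "$script" | grep -q '^#!'; then
        echo "no shebang line (for example: #!/usr/bin/env python3)"
        status=1
    fi
    return $status
}

# Alternative fix: let the container's python interpret the script,
# so the script needs neither execute permission nor a shebang line.
run_with_python() {
    sif=$1
    shift
    srun singularity exec "$sif" python "$@"
}
```

For example, `check_script /users/USERNAME/test.py` prints what is missing, and `run_with_python IMAGE.sif /users/USERNAME/test.py --output-path ...` runs the script through the interpreter instead (USERNAME and IMAGE.sif are placeholders).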