
Moving your AI training jobs to LUMI workshop - Copenhagen, May 29-30 2024

Course organisation

Setting up for the exercises

During the course

If you have an active project on LUMI, you should be able to do the exercises in that project. To reduce waiting times during the workshop, use the Slurm reservations we provide (see above).

You can find all exercises on our AI workshop GitHub page.

After the termination of the course project

Setting up for the exercises is a bit more elaborate now.

  • The containers used in some of the exercises are no longer available in /scratch/project_465001063/containers. Replace that directory with /appl/local/training/software/ai-20240529 wherever the scripts refer to it.

    Alternatively, you can download the containers as a tar file, untar it in a directory of your choice, and point the scripts to that directory where needed.

  • The exercises as they were during the course are available as the tag ai-202405291 in the GitHub repository. Whereas the repository could simply be cloned during the course, now you have to either:

    • download the content of the repository as a tar file or a bzip2-compressed tar file, or get it from the GitHub release, where you have a choice of formats,

    • or clone the repository and then check out the tag ai-202405291:

      git clone https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop.git
      cd Getting_Started_with_AI_workshop
      git checkout ai-202405291
      

Note also that any reference to a reservation in Slurm has to be removed.
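The two script changes described above (the new container directory and the removal of the reservation) can be sketched as a small helper. This is only an illustration under stated assumptions: it assumes the reservation is requested on its own `#SBATCH --reservation=...` line, and the function name `adapt_script` is made up for this sketch. Check the result by hand, since a reservation passed on an `sbatch` or `srun` command line is not touched.

```shell
#!/usr/bin/env bash
# Sketch: adapt course-era batch scripts after the course project has ended.
set -euo pipefail

# Directories as given in the text above.
OLD_DIR="/scratch/project_465001063/containers"
NEW_DIR="/appl/local/training/software/ai-20240529"

# adapt_script (hypothetical helper) rewrites one file in place:
#  - points container paths at the new directory
#  - deletes "#SBATCH --reservation..." lines (assumed form of the option)
adapt_script() {
    local file="$1" tmp
    tmp="$(mktemp)"
    sed -E \
        -e "s|${OLD_DIR}|${NEW_DIR}|g" \
        -e '/^#SBATCH[[:space:]]+--reservation([[:space:]=]|$)/d' \
        "$file" > "$tmp"
    cat "$tmp" > "$file"   # overwrite contents, keep the file's permissions
    rm -f "$tmp"
}

# Apply the helper to every file named on the command line.
for f in "$@"; do
    adapt_script "$f"
done
```

Saved under a name of your choice (for example `adapt_scripts.sh`, a hypothetical name), it could be run as `bash adapt_scripts.sh path/to/exercise/*.sh` on a copy of the exercise scripts.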

The exercises were thoroughly tested at the time of the course. LUMI is an evolving supercomputer, though, so some exercises are expected to fail over time. The modules that need to be loaded will also change: at every update we have to drop some versions of the LUMI module because the corresponding programming environment is no longer functional. Likewise, at some point the ROCm driver on the system may become incompatible with the ROCm versions used in the containers for the course.

Course materials

Note: Some links in the table below will remain invalid until after the course when all materials are uploaded.

Presentation | Slides | Recording
Welcome and course introduction | / | video
Introduction to LUMI | slides | video
Using the LUMI web-interface | slides | video
Hands-on: Run a simple PyTorch example notebook | / | video
Your first AI training job on LUMI | slides | video
Hands-on: Run a simple single-GPU PyTorch AI training job | / | video
Understanding GPU activity & checking jobs | slides | video
Hands-on: Checking GPU usage interactively using rocm-smi | / | video
Running containers on LUMI | slides | video
Hands-on: Pull and run a container | / | video
Building containers from Conda/pip environments | slides | video
Hands-on: Creating a conda environment file and building a container using cotainr | / | /
Extending containers with virtual environments for faster testing | slides | video
Scaling AI training to multiple GPUs | slides | video
Hands-on: Converting the PyTorch single GPU AI training job to use all GPUs in a single node via DDP | / | video
Hyper-parameter tuning using Ray | slides | video
Hands-on: Hyper-parameter tuning the PyTorch model using Ray | / | video
Extreme scale AI | slides | video
Demo/Hands-on: Using multiple nodes | / | video
Loading training data from Lustre and LUMI-O | slides | video
Coupling machine learning with HPC simulation | slides | video