
PyTorch

License information

The PyTorch license can be found in the LICENSE file in the PyTorch GitHub.

Note however that in order to use PyTorch you will also be using several other packages that have different licenses.

User documentation (singularity container)

BETA VERSION, problems may occur and may not be solved quickly.

The PyTorch container is developed by AMD specifically for LUMI and contains the necessary parts to run PyTorch on LUMI, including the plugin needed for RCCL when doing distributed AI, and a suitable version of ROCm for the version of PyTorch. The apex, torchvision, torchdata, torchtext and torchaudio packages are also included.

The EasyBuild installation with the EasyConfigs mentioned below will do three things:

  1. It will copy the container to your own file space. We realise containers can be big, but it ensures that you have complete control over when a container is removed again.

    We will remove a container from the system when it is not sufficiently functional anymore, but the container may still work for you. E.g., after an upgrade of the network drivers on LUMI, the RCCL plugin for the LUMI Slingshot interconnect may be broken, but if you run on only one node PyTorch may still work for you.

    If you prefer to use the centrally provided container, you can remove your copy with rm $SIF after loading the module and then reload the module. This is, however, at your own risk.

  2. It will create a module file. When loading the module, a number of environment variables will be set to help you use the module and to make it easy to swap the module with a different version in your job scripts (a sample job script using these variables is sketched after this list).

    • SIF and SIFPYTORCH both contain the name and full path of the singularity container file.

    • SINGULARITY_BINDPATH will mount all necessary directories from the system, including everything that is needed to access the project, scratch and flash file systems.

    • RUNSCRIPTS and RUNSCRIPTSPYTORCH contain the full path of the directory containing some sample run scripts that can be used to run software in the container, or as inspiration for your own variants.

  3. It creates 3 scripts in the $RUNSCRIPTS directory:

    • conda-python-simple: This initialises Python in the container and then calls Python with the arguments of conda-python-simple. It can be used, e.g., to run commands through Python that utilise a single task but all GPUs.

    • conda-python-distributed: Model script that initialises Python in the container and also creates the environment to run a distributed PyTorch session. At the end, it will call Python with the arguments of the conda-python-distributed command.

    • get-master: A helper command for conda-python-distributed.
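
As an illustration of how the environment variables and run scripts above are typically combined, the sketch below shows a minimal batch job script. It is only a sketch: the project account and the train.py script are placeholders that you need to replace with your own, and the resource requests are just one reasonable choice for a single full GPU node.

#!/bin/bash
#SBATCH --account=project_465000000   # placeholder: use your own project account
#SBATCH --partition=standard-g
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00

# Loading a different version of the module changes $SIF and $RUNSCRIPTS,
# so the rest of the script does not need to be edited when switching versions.
module load LUMI PyTorch/2.1.0-rocm-5.6.1-python-3.10-singularity-20231123

srun singularity exec $SIF /runscripts/conda-python-simple -u train.py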

The container uses a miniconda environment in which Python and its packages are installed. That environment needs to be activated in the container at run time, which can be done with the command stored in the environment variable WITH_CONDA inside the container (for this container, source /opt/miniconda3/bin/activate pytorch).

Example of the use of WITH_CONDA: Check the Python packages in the container in an interactive session:

module load LUMI PyTorch/2.1.0-rocm-5.6.1-python-3.10-singularity-20231123
singularity shell $SIF

which takes you into the container; then, at the Singularity> prompt:

$WITH_CONDA
pip list
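
As a further quick check in the same shell, you can query PyTorch itself. This assumes the usual attributes of a ROCm build of PyTorch, where torch.version.hip reports the HIP/ROCm version the wheels were built against:

python -c 'import torch; print(torch.__version__, torch.version.hip)'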

The container (when used with the SINGULARITY_BINDPATH set by the module) also provides several wrapper scripts to start Python from the conda environment in the container. After loading the module, those scripts are also available outside the container for inspection in the $RUNSCRIPTS subdirectory, and you can use them as inspiration for a script that more directly executes your commands or performs additional initialisations.

Example (in an interactive session):

salloc -N1 -pstandard-g -t 30:00
module load LUMI PyTorch/2.1.0-rocm-5.6.1-python-3.10-singularity-20231123
srun -N1 -n1 --gpus 8 singularity exec $SIF /runscripts/conda-python-simple \
    -c 'import torch; print("I have this many devices:", torch.cuda.device_count())'

This command will start Python and run PyTorch on a single CPU core with access to all 8 GPUs.
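
A distributed run would use the conda-python-distributed run script instead. The sketch below assumes one typical layout (one task per GPU, i.e., 8 tasks per node, across two nodes) and a placeholder script my_training.py that uses torch.distributed; check the copy of conda-python-distributed in $RUNSCRIPTS to see exactly which environment variables it sets before relying on this layout.

srun -N2 -n16 --gpus-per-node 8 \
    singularity exec $SIF /runscripts/conda-python-distributed \
    -u my_training.py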

After loading the module, the docker definition file used when building the container is available in the $EBROOTPYTORCH/share/docker-defs subdirectory. As it requires some licensed components from LUMI and some other files that are not included, it currently cannot be used to reconstruct the container and extend its definition.

Installation

To install the container with EasyBuild, follow the instructions in the EasyBuild section of the LUMI documentation, section "Software", and use the dummy partition container, e.g.:

module load LUMI partition/container EasyBuild-user
eb PyTorch-2.1.0-rocm-5.6.1-python-3.10-singularity-20231123.eb

To use the container after installation, neither the EasyBuild-user module nor the container partition is needed. The module will be available in all versions of the LUMI stack and in the CrayEnv stack (provided the environment variable EBU_USER_PREFIX points to the right location).
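
For example, a job script that only uses an already installed container module could look like the sketch below; the EasyBuild prefix path is a placeholder for your own installation location, and EBU_USER_PREFIX must be set before the LUMI (or CrayEnv) module is loaded:

export EBU_USER_PREFIX=/project/project_465000000/EasyBuild   # placeholder: your EasyBuild prefix
module load LUMI
module load PyTorch/2.1.0-rocm-5.6.1-python-3.10-singularity-20231123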

User-installable modules (and EasyConfigs)

Install with the EasyBuild-user module:

eb <easyconfig> -r

To access module help after installation, and to be reminded of the stacks and partitions in which the module is installed, use module spider PyTorch/<version>.
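
For the version used in the examples above, that would be:

module spider PyTorch/2.1.0-rocm-5.6.1-python-3.10-singularity-20231123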

EasyConfig:

Singularity containers with modules for binding and extras

Install with the EasyBuild-user module in partition/container:

module load LUMI partition/container EasyBuild-user
eb <easyconfig>

The module will be available in all versions of the LUMI stack and in the CrayEnv stack.

To access module help after installation use module spider PyTorch/<version>.

EasyConfig:

Technical documentation (user EasyBuild installation)

EasyBuild

Version 1.12.1

  • The EasyConfig is a LUST development. It is based on wheels rather than on compiling PyTorch ourselves, due to the difficulties of compiling PyTorch correctly. We do, however, use a version of the RCCL library installed through EasyBuild, together with the aws-ofi-rccl plugin that is needed to get good performance on LUMI (see the note after this list on verifying that the plugin is used).

  • A different version of NumPy was needed than the one in the Cray Python module that is otherwise used. It is also installed from a wheel and hence does not use the Cray Scientific Libraries for BLAS support.
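
If you want to verify at run time that RCCL actually picks up the aws-ofi-rccl plugin, one common approach (RCCL honours the NCCL_* environment variables) is to raise the debug level in your job script and look for messages from the OFI/libfabric network plugin in the job output; the exact wording of those messages depends on the RCCL and plugin versions:

# In the job script, before launching the distributed PyTorch run:
export NCCL_DEBUG=INFO            # RCCL follows the NCCL_* debug conventions
export NCCL_DEBUG_SUBSYS=INIT,NET # limit the output to initialisation and network messages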

Technical documentation (singularity container)

How to check what's in the container?

  • The Python, PyTorch and ROCm versions are included in the version of the module.

  • To find the versions of the Python packages, run

    singularity exec $SIF bash -c '$WITH_CONDA ; pip list'
    

    after loading the module. This can even be done on the login nodes. It will return information about all Python packages.

  • Deepspeed:

    • Leaves a script 'deepspeed' in /opt/miniconda3/envs/pytorch/bin

    • Leaves packages in /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/deepspeed

    • Finding the version:

      singularity exec $SIF bash -c '$WITH_CONDA ; pip list | grep deepspeed'
      

      or the clumsy way without pip:

      singularity exec $SIF bash -c \
        'grep "version=" /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/deepspeed/git_version_info_installed.py'
      

      (Test can be done after loading the module on a login node.)

  • flash-attention, installed from its ROCm fork (the ROCm port):

    • Leaves a flash_attn and a corresponding flash_attn-<version>.dist-info subdirectory in /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages.

    • To find the version:

      singularity exec $SIF bash -c '$WITH_CONDA ; pip list | grep flash-attn'
      

      or the clumsy way without pip:

      singularity exec $SIF bash -c \
        'grep "__version__" /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/flash_attn/__init__.py'
      

      (Test can be done after loading the module on a login node.)

    To run a benchmark:

    srun -N 1 -n 1 \
      --cpu-bind=mask_cpu:0xfe000000000000,0xfe00000000000000,0xfe0000,0xfe000000,0xfe,0xfe00,0xfe00000000,0xfe0000000000 \
      --gpus 8 \
      singularity exec $SIF /runscripts/conda-python-simple \
      -u /opt/wheels/flash_attn-benchmarks/benchmark_flash_attention.py
    
  • xformers:

    • Leaves an xformers and a corresponding xformers-<version>.dist-info subdirectory
      in /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages.

    • To find the version:

      singularity exec $SIF bash -c '$WITH_CONDA ; pip list | grep xformers'
      

      or the clumsy way without pip:

      singularity exec $SIF bash -c \
        'grep "__version__" /opt/miniconda3/envs/pytorch/lib/python3.10/site-packages/xformers/version.py'
      

      (Test can be done after loading the module on a login node.)

    • Checking the features of xformers:

      singularity exec $SIF bash -c '$WITH_CONDA ; python -m xformers.info'