
rocm

License information

ROCm™ is made available by Advanced Micro Devices, Inc. under the open source license identified in the top-level directory for the library in the repository on GitHub.com (Portions of ROCm are licensed under MITx11 and UIL/NCSA. For more information on the license, review the license.txt in the top-level directory for the library on GitHub.com).

User documentation (central installation)

There is a big disclaimer with these modules.

THIS IS ROCM INSTALLED IN A WAY IT IS NOT MEANT TO BE INSTALLED.

The ROCm installations outside of the Cray PE modules (i.e., the 5.2.5, 5.3.3, 5.4.6 and 5.6.1 modules) come without any warranty or support: they are not installed in the directories suggested by AMD and may therefore break paths encoded in the RPMs from which these packages were installed, and they are also not guaranteed to be compatible with modules from the Cray PE, as only HPE Cray can give that guarantee and as the inner workings and precise requirements of the Cray PE are not public.

  • The ROCm 5.2.5 and 5.3.3 modules have some PDF documentation in $EBROOTROCM/share/doc/rocgdb, $EBROOTROCM/share/doc/tracer (5.3.3 only), $EBROOTROCM/share/doc/rocm_smi and $EBROOTROCM/share/doc/amd-dbgapi. The EBROOTROCM environment variable is defined after loading the module.

  • The rocm/5.4.6 module can be used with PrgEnv-amd but comes without a matching amd/5.4.6 module. It is sufficient to load the rocm/5.4.6 module after the PrgEnv-amd module (or cpeAMD module) to enable this ROCm version also for the compiler wrappers in that programming environment (see the example after this list).

    The 5.4.6 module was created because the driver on the system at the time of writing (March 2024) is still the ROCm 5.2.3 driver, which only officially supports ROCm versions up to 5.4.x, and because we observed too many problems with ROCm 5.6.1 on that driver. Though supported by the driver, this is still not an official ROCm installation done by HPE, so even though we have run some test suites, we cannot fully exclude problems in combination with the HPE Cray PE (including its MPI libraries).

    The module is available in CrayEnv and in LUMI/23.09 partition/G and has been tested in combination with the 23.09 release of the programming environment.

  • rocm/5.6.1 module: This module comes with a matching amd/5.6.1 module for use with PrgEnv-amd or cpeAMD.

    The rocm/5.6.1 module is not officially supported by the ROCm 5.2.3 driver on the system at the time of writing (March 2024). Some software runs well, some software doesn't, and there is nothing the LUMI User Support Team can do about this until the system is upgraded.

    Known problems include:

    • Memory reporting is broken so programs that rely on ROCm calls, e.g., to determine how much memory they can use, may not function correctly.

    • We have observed very slow performance of GPU-aware MPI.

    • The "xnack" feature is also broken.

Note that using ROCm in containers is still subject to the same driver compatibility problems. Though containers solve the problem of ROCm being installed in a non-standard path (which was needed for the modules as the standard path is already occupied by a different ROCm version), they do not solve any problem caused by running a newer version of ROCm on a too old driver (and there may also be problems running an old version of ROCm on a too new driver).

User documentation (singularity container)

BETA VERSION, problems are possible and they may not be solved quickly.

The rocm container is developed by AMD specifically for LUMI and contains the necessary parts to explore ROCm. Its use is rather limited because, at the moment, the methods that can be used to build upon an existing container are restricted on LUMI due to security concerns with certain functionality needed for that. The container can however be used as a base image for cotainr, and it is also possible in some cases to extend it using the so-called SingularityCE "unprivileged proot build" process.

It is entirely normal that some features in some of the containers will not work. Each ROCm driver supports only particular versions of packages. E.g., the ROCm driver from ROCm 5.2.3 is only guaranteed to support ROCm versions up to and including 5.4, so problems can be expected with ROCm 5.5 and newer. There is nothing LUMI support can do about it. Only one driver version can be active on the system, and installing a newer version also depends on other software on the system and is not as trivial as it would be on a PC.

Use via EasyBuild-generated modules

The EasyBuild installation with the EasyConfigs mentioned below will do three things:

  1. It will copy the container to your own file space. We realise containers can be big, but it ensures that you have complete control over when a container is removed.

    We will remove a container from the system when it is not sufficiently functional anymore, but the container may still work for you.

    If you prefer to use the centrally provided container, you can remove your copy after loading the module with rm $SIF, followed by reloading the module. This is however at your own risk.

  2. It will create a module file. When loading the module, a number of environment variables will be set to help you use the module and to make it easy to swap the module with a different version in your job scripts.

    • SIF and SIFROCM both contain the name and full path of the singularity container file.

    • SINGULARITY_BINDPATH will mount all necessary directories from the system, including everything that is needed to access the project, scratch and flash file systems (see the usage example after this list).

  3. It will create the runscripts subdirectory in the installation directory that can be used to store scripts that should be available in the container, and the bin subdirectory for scripts that run outside the container.

    Currently there is one script outside the container: start-shell will start a bash session in the container, and can take arguments just as bash. It is provided for consistency with planned future extensions of some other containers, but really doesn't do much more than calling

    singularity exec $SIFROCM bash
    

    and passing it the arguments that were given to the command.

    Note that the installation directory is fully erased when you re-install the container module using EasyBuild. So if you choose to use it to add scripts, make sure you also store them elsewhere so that they can be copied again if you rebuild the container module for some reason.
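
As an illustration of how the module and the environment variables it sets are typically used (a sketch; the module version in the load command is an example and should match the EasyConfig you installed, and rocminfo needs to run on a GPU node to find the GPUs):

module load rocm/5.6.1-singularity-20231108   # example module name derived from the EasyConfig below
singularity exec $SIFROCM rocminfo            # SINGULARITY_BINDPATH, set by the module, takes care of the bindings
start-shell                                   # interactive bash session inside the container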

Installation via EasyBuild

To install the container with EasyBuild, follow the instructions in the EasyBuild section of the LUMI documentation, section "Software", and use the dummy container partition (partition/container), e.g.:

module load LUMI partition/container EasyBuild-user
eb rocm-5.6.1-singularity-20231108.eb

To use the container after installation, the EasyBuild-user module is not needed nor is the container partition. The module will be available in all versions of the LUMI stack and in the CrayEnv stack (provided the environment variable EBU_USER_PREFIX points to the right location).

Direct access

The ROCm containers are available in the following subdirectories of /appl/local/containers:

  • /appl/local/containers/sif-images: Symbolic link to the latest version of the container for each ROCm version provided. Those links can change without notice!

  • /appl/local/containers/tested-containers: Tested containers provided as a Singularity .sif file and a Docker-generated tarball. Containers in this directory are removed quickly when a new version becomes available.

  • /appl/local/containers/easybuild-sif-images: Singularity .sif images used with the EasyConfigs that we provide. They tend to be available for a longer time than in the other two subdirectories.

If you depend on a particular version of a container, we recommend that you copy the container to your own file space (e.g., in /project) as there is no guarantee the specific version will remain available centrally on the system for as long as you want.
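
For example, to keep your own copy of a specific image (the image name and project directory below are placeholders; use the image and location that apply to you):

mkdir -p /project/project_465XXXXXX/containers
cp /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif /project/project_465XXXXXX/containers/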

When using the containers without the modules, you will have to take care of the bindings yourself, as some system files are needed for, e.g., RCCL. The recommended minimal bindings are:

-B /var/spool/slurmd,/opt/cray/,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4

and the bindings you need to access the files you want to use from /scratch, /flash and/or /project.

Note that the list of recommended bindings may change after a system update.
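
Put together, a manual invocation could look like the sketch below (the image path and project directory are examples; rocminfo needs to run on a GPU node to detect the GPUs):

singularity exec \
    -B /var/spool/slurmd,/opt/cray/,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4 \
    -B /scratch/project_465XXXXXX \
    /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif \
    rocminfo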

Using the images as base image for cotainr

We recommend using these images as the base image for cotainr if you want to build a container with cotainr that needs ROCm. You can use the --base-image=<my base image> flag of the cotainr command to indicate the base image that should be used.

If you do so, please make sure that the GPU software you install from conda-forge or via pip with cotainr is compatible with the version of ROCm in the container that you use as the base image.

PyTorch with cotainr

To start, create a YAML file to tell cotainr which software should be installed. As an example, consider the file below, which we name py311_rocm542_pytorch.yml:

name: py311_rocm542_pytorch
channels:
- conda-forge
dependencies:
- certifi=2023.07.22
- charset-normalizer=3.2.0
- filelock=3.12.4
- idna=3.4
- jinja2=3.1.2
- lit=16.0.6
- markupsafe=2.1.3
- mpmath=1.3.0
- networkx=3.1
- numpy=1.25.2
- pillow=10.0.0
- pip=23.2.1
- python=3.11.5
- requests=2.31.0
- sympy=1.12
- typing-extensions=4.7.1
- urllib3=2.0.4
- pip:
    - --extra-index-url https://download.pytorch.org/whl/rocm5.4.2/
    - pytorch-triton-rocm==2.0.2
    - torch==2.0.1+rocm5.4.2
    - torchaudio==2.0.2+rocm5.4.2
    - torchvision==0.15.2+rocm5.4.2

Now we are ready to generate a new Singularity .sif file with this definition:

module load LUMI
module load cotainr
cotainr build my-new-image.sif --base-image=/appl/local/containers/sif-images/lumi-rocm-rocm-5.4.6.sif --conda-env=py311_rocm542_pytorch.yml

We don't have a container that matches ROCm 5.4.2, the version for which the Python packages above were generated (they are only available for a small selection of ROCm versions), but the 5.4.6 container should be close enough to work without problems.

You're now ready to use the new image with the direct access method. As we installed PyTorch in this example, the information on the PyTorch page in this guide is also very relevant. And if you understand very well what you're doing, you may even adapt one of the EasyBuild recipes for the PyTorch containers to use your new image and install the wrapper scripts etc. that those modules provide (pointing EasyBuild to your image with the --sourcepath flag of the eb command).
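
As a quick check of the resulting image (a sketch; run on a GPU node, using the minimal bindings from the direct access section above, and expect True to be printed when a GPU is visible):

singularity exec \
    -B /var/spool/slurmd,/opt/cray/,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4 \
    my-new-image.sif \
    python -c 'import torch; print(torch.cuda.is_available())'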

Pre-installed modules (and EasyConfigs)

To access module help and find out for which stacks and partitions the module is installed, use module spider rocm/<version>.

EasyConfig:

Singularity containers with modules for binding and extras

Install with the EasyBuild-user module in partition/container:

module load LUMI partition/container EasyBuild-user
eb <easyconfig>

The module will be available in all versions of the LUMI stack and in the CrayEnv stack.

To access module help after installation use module spider rocm/<version>.

EasyConfig:

Technical documentation (central installation)

Easybuild

ROCm 4.5.2 (archived)

The EasyConfig unpacks the official RPMs and copies them to the installation directory. This is a temporary setup so that the users that have access to the Early Access Platform can compile their code from the login node.

ROCm 5.2.5 and 5.3.3

  • Unpacked from RPMs like the previous version, but using an EasyBlock to ease the creation of the EasyConfigs.

  • ROCm 5.3.3 documentation

ROCm 5.4.6 and 5.6.1

  • Unpacked from RPMs, but with an additional step to set the RPATH of the libraries and avoid using the system ROCm libraries if the module is not loaded (see the verification sketch after this list).

  • The 5.4.6 module was developed later than the 5.6.1 module and was made to work around some problems we observed with 5.6.1. The 5.4.6 version was chosen as it was, at the time, the latest version of ROCm officially supported by the driver on the system.

    One difference with the 5.6.1 version is that there is no equivalent amd module. Instead, some additional environment variables are set in the rocm/5.4.6 module so that if you load it AFTER loading the PrgEnv-amd module, the compiler wrappers will still use the compilers from rocm/5.4.6.

  • Documentation:

    • ROCm 5.4.6 documentation: https://rocm.docs.amd.com/en/docs-5.4.3/
    • ROCm 5.6.1 documentation: https://rocm.docs.amd.com/en/docs-5.6.1/
    
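One way to verify the result (a sketch; the library picked is just one example from the ROCm installation):

module load rocm/5.6.1
readelf -d $EBROOTROCM/lib/libamdhip64.so | grep -E 'RPATH|RUNPATH'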

Archived EasyConfigs

The EasyConfigs below are additional EasyConfigs that are not directly available on the system for installation. Users are advised to use the newer ones; these archived ones are unsupported. They are still provided as a source of information should you need it, e.g., to understand the configuration that was used for earlier work on the system.