rocm
License information
ROCm™ is made available by Advanced Micro Devices, Inc. under the open source license identified in the top-level directory for the library in the repository on GitHub.com. (Portions of ROCm are licensed under MITx11 and UIL/NCSA. For more information on the license, review the license.txt file in the top-level directory for the library on GitHub.com.)
User documentation (central installation)
There is a big disclaimer with these modules.
THIS IS ROCM INSTALLED IN A WAY IT IS NOT MEANT TO BE INSTALLED.
The ROCm installations outside of the Cray PE (the 5.2.5, 5.3.3, 5.4.6, 5.6.1 and 6.2.2 modules) come without any warranty or support. They are not installed in the directories suggested by AMD, which may break paths encoded in the RPMs from which these packages were installed. They are also not guaranteed to be compatible with modules from the Cray PE, as only HPE Cray can give that guarantee and the inner workings and precise requirements of the Cray PE are not public.
- The only modules officially supported by the current AMD GPU driver at the time of writing (October 2024) are the `5.6.1` and `6.2.2` modules. Using the `5.6.1` module is recommended only if a performance regression is observed with the `6.0.3` or `6.2.2` modules. The use of the other modules (`5.2.5`, `5.3.3` and `5.4.6`) is strongly discouraged and no longer supported by the LUMI User Support Team.
- The ROCm modules have some PDF documentation in `$EBROOTROCM/share/doc/rocgdb`, `$EBROOTROCM/share/doc/tracer`, `$EBROOTROCM/share/doc/rocm_smi` and `$EBROOTROCM/share/doc/amd-dbgapi`. The `EBROOTROCM` environment variable is defined after loading the module.
- The `6.2.2` module can be used with `PrgEnv-amd` but comes without a matching `amd/6.2.2` module. It is sufficient to load the `rocm/6.2.2` module after the `PrgEnv-amd` module (or `cpeAMD` module) to enable this ROCm version also for the compiler wrappers in that programming environment (see the sketch after this list).
- The `6.2.2` module is not compatible with the CCE 17.0.0 and 17.0.1 compilers due to an incompatibility between LLVM 17, on which CCE is based, and LLVM 18 from ROCm 6.2. The only supported programming environments are `PrgEnv-gnu` (or `cpeGNU`) and `PrgEnv-amd` (or `cpeAMD`).
- Since ROCm 6.2, hipSOLVER depends on SuiteSparse. If an application depends on hipSOLVER, it is the user's responsibility to load the SuiteSparse module that corresponds to the CPE they wish to use (cpeAMD or cpeGNU). Note that the SuiteSparse module needs to be loaded before the `rocm/6.2.2` module, otherwise `rocm/6.0.3` will be used (also shown in the sketch after this list).
- In the `CrayEnv` environment, omniperf dependencies have been installed for all `cray-python` versions available at the time of the module installation (October 2024: Python 3.9, 3.10 and 3.11), but the `cray-python` module is not loaded as a dependency, to leave the choice of the Python version to the user. Therefore, if you want to use omniperf, you need to load a `cray-python` module yourself (see the sketch after this list). In the `LUMI` environment, the only supported version of Python is the one coming from the corresponding release of the CPE. For example, for `LUMI/24.03` the omniperf dependencies have been installed for Python 3.11. Omniperf is not compatible with the system Python (version 3.6).
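A minimal sketch combining the points above, assuming the module versions from October 2024; check `module avail rocm`, `module spider SuiteSparse` and `module avail cray-python` for what is actually available on your system:

```bash
# Use ROCm 6.2.2 with PrgEnv-amd (no matching amd/6.2.2 module exists):
module load PrgEnv-amd
module load rocm/6.2.2            # load AFTER PrgEnv-amd
ls $EBROOTROCM/share/doc/rocgdb   # browse the bundled PDF documentation

# For hipSOLVER users in a LUMI stack with the GNU environment:
module load LUMI/24.03 partition/G
module load cpeGNU
module load SuiteSparse           # must come BEFORE the rocm module
module load rocm/6.2.2

# For omniperf in CrayEnv, load a cray-python yourself:
module load CrayEnv
module load cray-python           # or an explicit 3.9/3.10/3.11 version
module load rocm/6.2.2
omniperf --help
```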
Note that using ROCm in containers is still subject to the same driver compatibility problems. Containers solve the problem of ROCm being installed in a non-standard path (which was needed for the modules, as the standard path is already occupied by a different ROCm version), but they do not solve any problem caused by running a newer version of ROCm on a too old driver (and there may also be problems running an old version of ROCm on a too new driver).
User documentation (singularity container)
BETA VERSION, problems are possible and they may not be solved quickly.
The ROCm container is developed by AMD specifically for LUMI and contains the necessary parts to explore ROCm. Its use is rather limited because at the moment the methods that can be used to build upon an existing container are restricted on LUMI, due to security concerns with certain functionality needed for that. The containers can however be used as a base image for cotainr, and it is also possible in some cases to extend them using the so-called SingularityCE "unprivileged proot build" process.
It is entirely normal that some features in some of the containers will not work. Each ROCm driver supports only particular versions of packages. E.g., the ROCm driver from ROCm 5.2.3 is only guaranteed to support ROCm versions up to and including 5.4, and hence problems can be expected with ROCm 5.5 and newer. There is nothing LUMI support can do about this. Only one driver version can be active on the system, and installing a newer version also depends on other software on the system and is not as trivial as it would be on a PC.
Use via EasyBuild-generated modules
The EasyBuild installation with the EasyConfigs mentioned below will do three things:
- It will copy the container to your own file space. We realise containers can be big, but this ensures that you have complete control over when a container is removed. We will remove a container from the system when it is no longer sufficiently functional, but the container may still work for you. If you prefer to use the centrally provided container, you can remove your copy after loading the module with `rm $SIF`, followed by reloading the module. This is however at your own risk.
- It will create a module file. When loading the module, a number of environment variables will be set to help you use the module and to make it easy to swap the module with a different version in your job scripts.
    - `SIF` and `SIFROCM` both contain the name and full path of the singularity container file.
    - `SINGULARITY_BINDPATH` will mount all necessary directories from the system, including everything that is needed to access the project, scratch and flash file systems.
- It will create the `runscripts` subdirectory in the installation directory that can be used to store scripts that should be available in the container, and the `bin` subdirectory for scripts that run outside the container. Currently there is one script outside the container: `start-shell` will start a bash session in the container and can take arguments just as bash does. It is provided for consistency with planned future extensions of some other containers, but really does not do much more than calling bash in the container and passing it the arguments that were given to the command.
Note that the installation directory is fully erased when you re-install the container module using EasyBuild. So if you choose to use it to add scripts, make sure you also store them elsewhere so that they can be copied again if you rebuild the container module for some reason.
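A short sketch of using such a module after installation (the module version is an example taken from the EasyConfig list further down; adjust it to the one you installed):

```bash
module load rocm/5.6.1-singularity-20240207

# $SIF holds the path to the image and the bind mounts are already set up:
singularity exec $SIF rocminfo    # run a command in the container (on a GPU node)

# Get an interactive shell in the container, or run a one-off command:
start-shell
start-shell -c 'hipcc --version'  # start-shell forwards its arguments to bash
```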
Installation via EasyBuild
To install the container with EasyBuild, follow the instructions in the EasyBuild section of the LUMI documentation, section "Software", and use the dummy partition `container`, e.g.:
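```bash
# A sketch using one of the EasyConfigs listed further down this page:
module load LUMI partition/container EasyBuild-user
eb rocm-5.6.1-singularity-20240207.eb
```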
To use the container after installation, the `EasyBuild-user` module is not needed, nor is the `container` partition. The module will be available in all versions of the LUMI stack and in the `CrayEnv` stack (provided the environment variable `EBU_USER_PREFIX` points to the right location).
Direct access
The ROCm containers are available in the following subdirectories of `/appl/local/containers`:
- `/appl/local/containers/sif-images`: symbolic links to the latest version of the container for each ROCm version provided. Those links can change without notice!
- `/appl/local/containers/tested-containers`: tested containers provided as a Singularity `.sif` file and a docker-generated tarball. Containers in this directory are removed quickly when a new version becomes available.
- `/appl/local/containers/easybuild-sif-images`: Singularity `.sif` images used with the EasyConfigs that we provide. They tend to be available for a longer time than those in the other two subdirectories.
If you depend on a particular version of a container, we recommend that you copy the container to your own file space (e.g., in `/project`) as there is no guarantee that the specific version will remain available centrally on the system for as long as you want.
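For example (the project number and image name below are placeholders; adapt them to your own project and the image you need):

```bash
mkdir -p /project/project_465000000/containers
cp /appl/local/containers/sif-images/lumi-rocm-rocm-5.6.1.sif \
   /project/project_465000000/containers/
```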
When using the containers without the modules, you will have to take care of the bindings yourself, as some system files are needed for, e.g., RCCL. The recommended minimal bindings are:
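```bash
# Reconstructed from the LUMI documentation; verify against the current docs,
# as this list can change after a system update. libcxi is needed by RCCL.
export SINGULARITY_BINDPATH='/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1'
```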
plus the bindings you need to access the files you want to use from `/scratch`, `/flash` and/or `/project`.
Note that the list of recommended bindings may change after a system update.
Using the images as base image for cotainr
We recommend using these images as the base image for cotainr if you want to build a container with cotainr that needs ROCm. You can use the `--base-image=<my base image>` flag of the `cotainr` command to indicate the base image that should be used.
If you do so, please make sure that the GPU software you install from conda-forge or via pip
with cotainr is compatible with the version of ROCm in the container that you use as the base
image.
PyTorch with cotainr
To start, create a YAML file that tells cotainr which software should be installed. As an example, consider the file below, which we name `py311_rocm542_pytorch.yml`:
```yaml
name: py311_rocm542_pytorch
channels:
  - conda-forge
dependencies:
  - certifi=2023.07.22
  - charset-normalizer=3.2.0
  - filelock=3.12.4
  - idna=3.4
  - jinja2=3.1.2
  - lit=16.0.6
  - markupsafe=2.1.3
  - mpmath=1.3.0
  - networkx=3.1
  - numpy=1.25.2
  - pillow=10.0.0
  - pip=23.2.1
  - python=3.11.5
  - requests=2.31.0
  - sympy=1.12
  - typing-extensions=4.7.1
  - urllib3=2.0.4
  - pip:
      - --extra-index-url https://download.pytorch.org/whl/rocm5.4.2/
      - pytorch-triton-rocm==2.0.2
      - torch==2.0.1+rocm5.4.2
      - torchaudio==2.0.2+rocm5.4.2
      - torchvision==0.15.2+rocm5.4.2
```
Now we are ready to generate a new Singularity `.sif` file with this definition:

```bash
module load LUMI
module load cotainr
cotainr build my-new-image.sif --base-image=/appl/local/containers/sif-images/lumi-rocm-rocm-5.4.6.sif --conda-env=py311_rocm542_pytorch.yml
```
We don't have a container that matches the ROCm 5.4.2 version for which the Python packages above are built (they are only available for a small selection of ROCm versions), but the 5.4.6 container should be close enough to work without problems.
You're now ready to use the new image with the direct access method. As in this example we installed PyTorch, the information on the PyTorch page in this guide is also very relevant. And if you understand very well what you're doing, you may even adapt one of the EasyBuild recipes for the PyTorch containers to use your new image and install the wrapper scripts etc. that those modules provide (pointing EasyBuild to your image with the `--sourcepath` flag of the `eb` command), as in the hypothetical sketch below.
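A hedged sketch of that last step; the recipe name here is purely hypothetical, so substitute an actual PyTorch container EasyConfig from the LUMI EasyBuild repositories:

```bash
module load LUMI partition/container EasyBuild-user
# --sourcepath tells eb where to find my-new-image.sif (recipe name is hypothetical):
eb --sourcepath=$HOME/containers PyTorch-2.0.1-rocm-5.4.6-singularity.eb
```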
Pre-installed modules (and EasyConfigs)
To access module help and find out for which stacks and partitions the module is installed, use `module spider rocm/<version>`.
EasyConfig:
Singularity containers with modules for binding and extras
Install with the `EasyBuild-user` module in `partition/container`:
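```bash
# e.g., for one of the EasyConfigs listed below:
module load LUMI partition/container EasyBuild-user
eb rocm-5.7.1-singularity-20240207.eb
```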
To access module help after installation, use `module spider rocm/<version>`.
EasyConfig:
- EasyConfig rocm-5.4.5-singularity-20231110.eb, will provide rocm/5.4.5-singularity-20231110 (with docker definition)
- EasyConfig rocm-5.4.5-singularity-20240124.eb, will provide rocm/5.4.5-singularity-20240124
- EasyConfig rocm-5.4.5-singularity-20240207.eb, will provide rocm/5.4.5-singularity-20240207
- EasyConfig rocm-5.4.6-singularity-20231110.eb, will provide rocm/5.4.6-singularity-20231110 (with docker definition)
- EasyConfig rocm-5.4.6-singularity-20240124.eb, will provide rocm/5.4.6-singularity-20240124
- EasyConfig rocm-5.4.6-singularity-20240207.eb, will provide rocm/5.4.6-singularity-20240207
- EasyConfig rocm-5.5.1-singularity-20231110.eb, will provide rocm/5.5.1-singularity-20231110 (with docker definition)
- EasyConfig rocm-5.5.1-singularity-20240124.eb, will provide rocm/5.5.1-singularity-20240124
- EasyConfig rocm-5.5.1-singularity-20240207.eb, will provide rocm/5.5.1-singularity-20240207
- EasyConfig rocm-5.5.3-singularity-20231108.eb, will provide rocm/5.5.3-singularity-20231108 (with docker definition)
- EasyConfig rocm-5.5.3-singularity-20240124.eb, will provide rocm/5.5.3-singularity-20240124
- EasyConfig rocm-5.5.3-singularity-20240207.eb, will provide rocm/5.5.3-singularity-20240207
- EasyConfig rocm-5.6.0-singularity-20240315.eb, will provide rocm/5.6.0-singularity-20240315
- EasyConfig rocm-5.6.1-singularity-20231108.eb, will provide rocm/5.6.1-singularity-20231108 (with docker definition)
- EasyConfig rocm-5.6.1-singularity-20240124.eb, will provide rocm/5.6.1-singularity-20240124
- EasyConfig rocm-5.6.1-singularity-20240207.eb, will provide rocm/5.6.1-singularity-20240207
- EasyConfig rocm-5.7.1-singularity-20240124.eb, will provide rocm/5.7.1-singularity-20240124
- EasyConfig rocm-5.7.1-singularity-20240207.eb, will provide rocm/5.7.1-singularity-20240207
Technical documentation (central installation)
EasyBuild
ROCm 4.5.2 (archived)
The EasyConfig unpacks the official RPMs and copies them to the installation directory. This is a temporary setup so that users who have access to the Early Access Platform can compile their code from the login node.
ROCm 5.2.5 and 5.3.3
- Unpacked from RPMs like the previous version, but using an EasyBlock to ease the creation of the EasyConfigs.
ROCm 5.4.6, 5.6.1 and 6.2.2
- Unpacked from RPMs, but with an additional step that sets the RPATH of the libraries, to avoid picking up the system ROCm libraries when the module is not loaded (a sketch of this step follows after this list).
- The 5.4.6 and 6.2.2 modules were developed later than the 5.6.1 module and were made to work around some problems we observed with 5.6.1 at that time. The 6.2.2 version was chosen as it was the latest version of ROCm officially supported by the driver on the system at that time. One difference with the 5.6.1 version is that there is no equivalent `amd` module. Instead some additional environment variables are set in the `rocm/5.4.6` and `6.2.2` modules so that if you load one of them AFTER loading the `PrgEnv-amd` module, the compiler wrappers will still use the compilers from `rocm/5.4.6` or `6.2.2`.
- The 6.2.2 version is not compatible with CCE 17.x due to an LLVM incompatibility.
- Documentation:
    - [ROCm 5.4.6 documentation](https://rocm.docs.amd.com/en/docs-5.4.3/)
    - [ROCm 5.6.1 documentation](https://rocm.docs.amd.com/en/docs-5.6.1/)
    - [ROCm 6.2.2 documentation](https://rocm.docs.amd.com/en/docs-6.2.2/)
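A minimal sketch of what the RPATH step mentioned above could look like; this is an illustration under the assumption that `patchelf` is run over the unpacked tree, not the actual EasyBlock code:

```bash
# Illustration only: make every shared library resolve its dependencies in
# the unpacked ROCm tree instead of the system /opt/rocm installation.
ROCM_ROOT=/path/to/unpacked/rocm    # placeholder for the installation prefix
find "$ROCM_ROOT" -name '*.so*' -type f | while read -r lib; do
    patchelf --set-rpath "$ROCM_ROOT/lib:$ROCM_ROOT/lib64" "$lib" || true
done
```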
Archived EasyConfigs
The EasyConfigs below are additional EasyConfigs that are not directly available on the system for installation. Users are advised to use the newer ones; these archived ones are unsupported. They are still provided as a source of information should you need it, e.g., to understand the configuration that was used for earlier work on the system.
- Archived EasyConfigs from LUMI-SoftwareStack - previously centrally installed software