PyTorch/2.6.0-rocm-6.2.4-python-3.12-singularity-20250326 (PyTorch-2.6.0-rocm-6.2.4-python-3.12-singularity-20250326.eb)
Install with the EasyBuild-user module in partition/container:
module load LUMI partition/container EasyBuild-user
eb PyTorch-2.6.0-rocm-6.2.4-python-3.12-singularity-20250326.eb
To access module help after installation use module spider PyTorch/2.6.0-rocm-6.2.4-python-3.12-singularity-20250326.
EasyConfig:
# Developed by Kurt Lust and Mihkel Tiks for LUMI
#DOC Contains PyTorch 2.6.0 with torchaudio 2.6.0, torchdata 0.9.0+cpu, torchtext 0.18.0+cpu,
#DOC torchvision 0.21.0 GPU version, DeepSpeed 0.15.1, flash-attention 2.7.3, transformers 4.50.1,
#DOC xformers 0.0.30+a0a401e4.d20250322 and vllm 0.7.2.post2+rocm624, on Python 3.12 and ROCm 6.2.4.
#DOC mpi4py 3.1.6 interfacing to Cray MPICH is also included.
#DOC The container also fully supports installing extra packages in a Python virtual environment.
#DOC
#DOC This version works with $WITH_CONDA, $WITH_VENV and $WITH_CONDA_VENV to initialise the
#DOC conda environment, the Python virtual environment, or both, respectively.
easyblock = 'MakeCp'
local_c_rocm_version = '6.2.4'
local_c_python_mm = '3.12'
local_c_PyTorch_version = '2.6.0'
local_c_dockerhash = '33fb441e91f5'
local_c_date = '20250326'
local_c_DeepSpeed_version = '0.15.1'
local_c_flashattention_version = '2.7.3'
local_c_transformers_version = '4.50.1'
local_c_xformers_version = '0.0.30+a0a401e4.d20250322'
local_c_vllm_version = '0.7.2.post2+rocm624'
local_c_mpi4py_version = '3.1.6'
local_conda_env = 'pytorch'
local_c_python_m = local_c_python_mm.split('.')[0]
name = 'PyTorch'
version = local_c_PyTorch_version
versionsuffix = f'-rocm-{local_c_rocm_version}-python-{local_c_python_mm}-singularity-{local_c_date}'
local_sif = f'lumi-pytorch-rocm-{local_c_rocm_version}-python-{local_c_python_mm}-pytorch-v{local_c_PyTorch_version}-dockerhash-{local_c_dockerhash}.sif'
#local_docker = f'lumi-pytorch-rocm-{local_c_rocm_version}-python-{local_c_python_mm}-pytorch-v2.2.0.docker'
homepage = 'https://pytorch.org/'
whatis = [
'Description: PyTorch, a machine learning package',
'Keywords: PyTorch, DeepSpeed, flash-attention, xformers, vllm'
]
description = f"""
This module provides a container with PyTorch %(version)s (with torchaudio,
torchdata, torchtext and torchvision) on Python {local_c_python_mm}. It also contains
DeepSpeed {local_c_DeepSpeed_version}, flash-attention {local_c_flashattention_version}, transformers {local_c_transformers_version},
xformers {local_c_xformers_version} and vllm {local_c_vllm_version}. mpi4py {local_c_mpi4py_version}
interfacing with Cray MPICH is also included.
The module defines a number of environment variables available outside the
container:
* SIF and SIFPYTORCH: The full path and name of the Singularity SIF file
to use with singularity exec etc.
* SINGULARITY_BIND: Mounts the necessary directories from the system,
including /users, /project, /scratch and /flash so that you should be
able to use your regular directories in the container.
* RUNSCRIPTS and RUNSCRIPTSPYTORCH: The directory with some sample
runscripts.
* CONTAINERROOT: Root directory of the container installation. Alternative
for EBROOTPYTORCH.
There are also a number of environment variables available inside the container.
These are not strictly needed though as the module already ensures that all
necessary environment variables are set to activate the Conda environment in
the container and on top of that the virtual environment for additional packages.
* WITH_CONDA: Command to execute to activate the Conda environment used for
the Python installation.
* WITH_VENV: Command to execute to activate the pre-created Python virtual
environment.
* WITH_CONDA_VENV: Command that can be used to initialise the Conda environment
and then on top of it the Python virtual environment.
Outside of the container, the following commands are available:
* start-shell: To start a bash shell in the container. Arguments can be used
to, e.g., tell it to start a command. Use the -c flag of bash if you want to
pass commands to that shell as otherwise the conda and virtual environments
are not properly initialised.
* make-squashfs: Make the user-software.squashfs file that would then be mounted
in the container after reloading the module. This will enhance performance if
the extra installation in user-software contains a lot of files.
* unmake-squashfs: Unpack the user-software.squashfs file into the user-software
subdirectory of $CONTAINERROOT to enable installing additional packages.
* python, python{local_c_python_m} and python{local_c_python_mm} are wrapper scripts to start Python in the
container, passing along all arguments.
They should work in the same way as those in the pytorch modules in the
local CSC software stack.
* pip, pip{local_c_python_m} and pip{local_c_python_mm} are wrapper scripts to start pip in the container,
passing along all arguments.
They should work in the same way as those in the pytorch modules in the
local CSC software stack.
* Other such wrappers are accelerate, huggingface-cli, ray and torchrun.
They should work in the same way as those in the pytorch modules in the
local CSC software stack.
Inside the container, the following scripts are available in /runscripts
(and can be checked or edited outside the container in $CONTAINERROOT/runscripts):
* conda-python-simple: Start Python in the conda + Python venv environment.
* conda-python-distributed: Example script that can be used to start Python
in a distributed way compatible with the needs of PyTorch. You should pass
the Python commands to be executed with the options that the python executable
would take.
* get-master: A script used by conda-python-distributed.
Note that these scripts are meant as examples and in no way do they cover all possible
use cases.
Note also that any change that you make to files in $CONTAINERROOT will be fully erased
whenever you reinstall the container with EasyBuild so backup all changes or
additions!
"""
docurls = [
'DeepSpeed web site: https://www.deepspeed.ai/',
'Latest LUMI AI training: https://lumi-supercomputer.github.io/AI-latest',
]
toolchain = SYSTEM
sources = [
{
'filename': local_sif,
'extract_cmd': '/bin/cp -L %s .'
},
# {
# 'filename': local_docker,
# 'extract_cmd': '/bin/cp -L %s .'
# },
]
skipsteps = ['build']
files_to_copy = [
([local_sif], '.'),
# ([local_docker], 'share/docker-defs/')
]
#
# Code for scripts in the bin subdirectory
#
local_bin_start_shell = """
#!/bin/bash -e
# Run application
if [ -f "/.singularity.d/Singularity" ]
then
# In a singularity container, just in case a user would add this to the path.
exec bash "$@"
else
# Not yet in the container
if [ -z "$SIFPYTORCH" ] || [ ! -f "$SIFPYTORCH" ]
then
>&2 echo "SIFPYTORCH is undefined or wrong, use this command with the PyTorch module properly loaded!"
exit 1
fi
singularity exec $SIFPYTORCH bash "$@"
fi
""".replace( '$', '\\$' )
local_bin_python = """
#!/bin/bash
#
# Python wrapper script, also used for some other commands.
#
# This will start python, or whatever the name of the link to this script is,
# in the PyTorch container.
#
if [ -z "$SIFPYTORCH" ] || [ ! -f "$SIFPYTORCH" ]
then
>&2 echo "SIFPYTORCH is undefined or wrong, use this command with the PyTorch module properly loaded!"
exit 1
fi
REAL_PYTHON="${BASH_SOURCE[0]}"
EXEC_BIN=$(basename "$0")
if [ -d /.singularity.d/ ]; then
# In a singularity container, just in case a user would add this to the path.
exec -a $REAL_PYTHON $EXEC_BIN "$@"
else
# The second variant comes from CSC, might be better if we try to mimic
# the behaviour of their wrapper, but then we need to set PYTHONPATH
# as Python doesn't notice it is starting from the virtual environment.
singularity exec $SIFPYTORCH $EXEC_BIN "$@"
#singularity exec $SIFPYTORCH bash -c "exec -a $REAL_PYTHON $EXEC_BIN $( test $# -eq 0 || printf " %q" "$@" )"
fi
""".replace( '$', '\\$' )
local_bin_make_squashfs = """
#!/bin/bash -e
if [[ -f "/.singularity.d/Singularity" ]]
then
# In a singularity container, just in case a user would add this to the path.
>&2 echo 'The make-squashfs command should not be run in the container.'
exit 1
fi
cd "%(installdir)s"
if [[ ! -d "user-software" ]]
then
>&2 echo -e 'The $CONTAINERROOT/user-software subdirectory does not exist, so there is nothing to put into the SquashFS file.'
exit 2
fi
if [[ -f "user-software.squashfs" ]]
then
>&2 echo -e '$CONTAINERROOT/user-software.squashfs already exists. Please remove the file by' \\\\
'\\nhand if you are sure you wish to proceed and re-run the make-squashfs command.'
exit 3
fi
mksquashfs user-software user-software.squashfs -processors 1 -no-progress |& grep -v Unrecognised
echo -e '\\nCreated $CONTAINERROOT/user-software.squashfs from $CONTAINERROOT/user-software.' \\\\
'\\nYou need to reload the PyTorch module to ensure that the software is now mounted' \\\\
'\\nfrom $CONTAINERROOT/user-software.squashfs. Note that /user-software in the' \\\\
'\\ncontainer will then be a read-only directory.' \\\\
'\\nAfter reloading the module, you can also remove the $CONTAINERROOT/user-software' \\\\
'\\nsubdirectory if you so wish.\\n'
""".replace( '$', '\\$' )
local_bin_unmake_squashfs = """
#!/bin/bash -e
if [[ -f "/.singularity.d/Singularity" ]]
then
# In a singularity container, just in case a user would add this to the path.
>&2 echo 'The unmake-squashfs command should not be run in the container.'
exit 1
fi
cd "%(installdir)s"
if [[ ! -f "user-software.squashfs" ]]
then
>&2 echo -e '$CONTAINERROOT/user-software.squashfs does not exist so cannot uncompress it.'
exit 2
fi
if [[ -d "user-software" ]]
then
>&2 echo -e 'The $CONTAINERROOT/user-software subdirectory already exists. Please remove this directory by hand' \\\\
'(rm -r $CONTAINERROOT/user-software) if you are sure you wish to proceed and re-run the unmake-squashfs command.'
exit 3
fi
unsquashfs -d ./user-software user-software.squashfs
echo -e '\\nCreated $CONTAINERROOT/user-software subdirectory from $CONTAINERROOT/user-software.squasfs.' \\\\
'\\nYou need to reload the PyTorch module to ensure that the software is now mounted from the' \\\\
'\\n$CONTAINERROOT/user-software directory and can now write to /user-software in the container.' \\\\
'\\nYou can then also remove the $CONTAINERROOT/user-software.squashfs file if you so wish.\\n'
""".replace( '$', '\\$' )
#
# Code for scripts in the runscript subdirectory
#
local_runscript_init_conda_venv=f"""
#
# Source this file to initialize both the Conda environment and
# predefined virtual environment in the container.
#
# This script is still useful to initialise the environment when the
# module is not loaded, e.g., to execute commands in the `postinstallcmds` section.
#
source /opt/miniconda3/bin/activate {local_conda_env}
source /user-software/venv/{local_conda_env}/bin/activate
"""
local_runscript_python_simple="""
#!/bin/bash -e
# Start conda environment inside the container
# eval "$WITH_CONDA_VENV"
# Run application
python "$@"
""".replace( '$', '\\$' )
local_runscript_python_distributed="""
#!/bin/bash -e
# Make sure GPUs are up
if [ $SLURM_LOCALID -eq 0 ] ; then
rocm-smi
fi
sleep 2
# MIOPEN needs some initialisation for the cache as the default location
# does not work on LUMI as Lustre does not provide the necessary features.
export MIOPEN_USER_DB_PATH="/tmp/$(whoami)-miopen-cache-$SLURM_NODEID"
export MIOPEN_CUSTOM_CACHE_DIR=$MIOPEN_USER_DB_PATH
# Set MIOpen cache to a temporary folder.
if [ $SLURM_LOCALID -eq 0 ] ; then
rm -rf $MIOPEN_USER_DB_PATH
mkdir -p $MIOPEN_USER_DB_PATH
fi
sleep 2
# Set interfaces to be used by RCCL.
# This is needed as otherwise RCCL tries to use a network interface it has
# no access to on LUMI.
export NCCL_SOCKET_IFNAME=hsn
export NCCL_NET_GDR_LEVEL=3 # Not really needed anymore for ROCm 6.2 as this is now the default
# Set ROCR_VISIBLE_DEVICES so that each task uses the proper GPU
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
# Report affinity to check
echo "Rank $SLURM_PROCID --> $(taskset -p $$); GPU $ROCR_VISIBLE_DEVICES"
# The usual PyTorch initialisations (also needed on NVIDIA)
# Note that since we fix the port ID it is not possible to run, e.g., two
# instances via this script using half a node each.
export MASTER_ADDR=$(/runscripts/get-master "$SLURM_NODELIST")
export MASTER_PORT=29500
export WORLD_SIZE=$SLURM_NPROCS
export RANK=$SLURM_PROCID
# Run application
python "$@"
""".replace( '$', '\\$' )
local_runscript_get_master="""
#!/usr/bin/env python3
# This way of starting Python should work both on LUMI and in the container, though
# this script is really meant to be used in the container.
import argparse
def get_parser():
parser = argparse.ArgumentParser(description="Extract master node name from Slurm node list",
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("nodelist", help="Slurm nodelist")
return parser
if __name__ == '__main__':
parser = get_parser()
args = parser.parse_args()
first_nodelist = args.nodelist.split(',')[0]
if '[' in first_nodelist:
a = first_nodelist.split('[')
first_node = a[0] + a[1].split('-')[0]
else:
first_node = first_nodelist
print(first_node)
"""
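The nodelist parsing done by get-master can be condensed into a single function; this sketch mirrors the script's logic (take the first comma-separated entry, and if it contains a bracketed range, keep the prefix plus the first index):

```python
def get_master(nodelist: str) -> str:
    """Mirror of /runscripts/get-master: first node of a Slurm nodelist."""
    first = nodelist.split(',')[0]
    if '[' in first:
        prefix, rest = first.split('[')
        return prefix + rest.split('-')[0]
    return first

print(get_master("nid[001234-001237]"))   # nid001234
print(get_master("nid000001,nid000002"))  # nid000001
```

Note that this handles the common compact forms but not every Slurm nodelist (e.g. comma-separated ranges inside one bracket pair such as `nid[001234,001240]` would keep the trailing `]`); for the master-node use case the first simple range is what matters.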
#
# Now install those scripts and do further preparations of the container.
#
local_singularity_bind = '/var/spool/slurmd,/opt/cray,/usr/lib64/libcxi.so.1,/usr/lib64/libjansson.so.4,' + \
'%(installdir)s/runscripts:/runscripts,' + \
'/pfs,/scratch,/projappl,/project,/flash,/appl'
postinstallcmds = [
# Install the scripts in the bin subdirectory
'mkdir -p %(installdir)s/bin',
f'cat >%(installdir)s/bin/start-shell <<EOF {local_bin_start_shell}EOF',
'chmod a+x %(installdir)s/bin/start-shell',
f'cat >%(installdir)s/bin/python <<EOF {local_bin_python}EOF',
'chmod a+x %(installdir)s/bin/python',
f'ln -s ./python %(installdir)s/bin/python{local_c_python_m}',
f'ln -s ./python %(installdir)s/bin/python{local_c_python_mm}',
'ln -s ./python %(installdir)s/bin/pip',
f'ln -s ./python %(installdir)s/bin/pip{local_c_python_m}',
f'ln -s ./python %(installdir)s/bin/pip{local_c_python_mm}',
'ln -s ./python %(installdir)s/bin/accelerate',
'ln -s ./python %(installdir)s/bin/huggingface-cli',
'ln -s ./python %(installdir)s/bin/ray',
'ln -s ./python %(installdir)s/bin/torchrun',
f'cat >%(installdir)s/bin/make-squashfs <<EOF {local_bin_make_squashfs}EOF',
'chmod a+x %(installdir)s/bin/make-squashfs',
f'cat >%(installdir)s/bin/unmake-squashfs <<EOF {local_bin_unmake_squashfs}EOF',
'chmod a+x %(installdir)s/bin/unmake-squashfs',
# Install the runscripts
'mkdir -p %(installdir)s/runscripts',
f'cat >%(installdir)s/runscripts/init-conda-venv <<EOF {local_runscript_init_conda_venv}EOF',
'chmod a-x %(installdir)s/runscripts/init-conda-venv',
f'cat >%(installdir)s/runscripts/conda-python-simple <<EOF {local_runscript_python_simple}EOF',
'chmod a+x %(installdir)s/runscripts/conda-python-simple',
f'cat >%(installdir)s/runscripts/conda-python-distributed <<EOF {local_runscript_python_distributed}EOF',
'chmod a+x %(installdir)s/runscripts/conda-python-distributed',
f'cat >%(installdir)s/runscripts/get-master <<EOF {local_runscript_get_master}EOF',
'chmod a+x %(installdir)s/runscripts/get-master',
# Create the virtual environment and space for other software installations that
# can then be packaged.
'mkdir -p %(installdir)s/user-software/venv',
# For the next command, we don't need all the bind mounts yet, just the user-software one is enough.
f'singularity exec --bind %(installdir)s/user-software:/user-software %(installdir)s/{local_sif} bash -c \'$WITH_CONDA ; cd /user-software/venv ; python -m venv --system-site-packages {local_conda_env}\'',
]
sanity_check_paths = {
# We deliberately don't check for local_sif as the user is allowed to remove that file
# but may still want to regenerate the module which would then fail in the sanity check.
#'files': [f'share/docker-defs/{local_docker}'],
'files': [],
'dirs': ['runscripts'],
}
sanity_check_commands = [
# Full syntax check of bash scripts
'echo "Syntax check of start-shell" ; bash -n start-shell',
'echo "Syntax check of python" ; bash -n python',
'echo "Syntax check of make-squashfs" ; bash -n make-squashfs',
'echo "Syntax check of unmake-squashfs" ; bash -n unmake-squashfs',
'echo "Syntax check of conda-python-simple" ; bash -n %(installdir)s/runscripts/conda-python-simple',
'echo "Syntax check of conda-python-distributed" ; bash -n %(installdir)s/runscripts/conda-python-distributed',
# Check python wrapper script and reported version
('echo "Testing Python wrapper script and version" ; '
f'python --version | sed -e \'s|.* \([[:digit:]]\.[[:digit:]]\+\).*|\\1|\' | grep -q "{local_c_python_mm}"'),
# Check pythonMAJOR wrapper script and reported version
(f'echo "Testing python{local_c_python_m} wrapper script and version" ; '
f'python{local_c_python_m} --version | sed -e \'s|.* \([[:digit:]]\.[[:digit:]]\+\).*|\\1|\' | grep -q "{local_c_python_mm}"'),
# Check pythonMAJOR.MINOR wrapper script and reported version
(f'echo "Testing python{local_c_python_mm} wrapper script and version" ; '
f'python{local_c_python_mm} --version | sed -e \'s|.* \([[:digit:]]\.[[:digit:]]\+\).*|\\1|\' | grep -q "{local_c_python_mm}"'),
# Check pip and deepspeed version
(f'echo "Testing pip wrapper script and DeepSpeed version (expected {local_c_DeepSpeed_version})" ; '
f'pip freeze | grep deepspeed | sed -e \'s|.*=\(.*\)|\\1|\' | grep -q "{local_c_DeepSpeed_version}"'),
# Check pipMAJOR and transformers version
(f'echo "Testing pip{local_c_python_m} wrapper script and transformers version (expected {local_c_transformers_version})" ; '
f'pip{local_c_python_m} freeze | grep transformers | sed -e \'s|.*=\(.*\)|\\1|\' | grep -q "{local_c_transformers_version}"'),
# Check pipMAJOR.MINOR and deepspeed version
(f'echo "Testing pip{local_c_python_mm} wrapper script and DeepSpeed version (expected {local_c_DeepSpeed_version})" ; '
f'pip{local_c_python_mm} freeze | grep deepspeed | sed -e \'s|.*=\(.*\)|\\1|\' | grep -q "{local_c_DeepSpeed_version}"'),
# Check pip and xformers version
(f'echo "Testing pip wrapper script and xformers version (expected {local_c_xformers_version})" ; '
f'pip list | grep xformers | awk \'{{ print $2}}\' | grep -q "{local_c_xformers_version}"'),
# Check pip and flashattention version
(f'echo "Testing pip wrapper script and flash-attention version (expected {local_c_flashattention_version})" ; '
f'pip list | grep flash_attn | awk \'{{ print $2}}\' | grep -q "{local_c_flashattention_version}"'),
# Check pip and vllm version
(f'echo "Testing pip wrapper script and vllm version (expected {local_c_vllm_version})" ; '
f'pip list | grep vllm | awk \'{{ print $2}}\' | grep -q "{local_c_vllm_version}"'),
# Check pip and torch version
(f'echo "Testing pip wrapper script and torch version (expected {local_c_PyTorch_version})" ; '
f'pip list | egrep "^torch " | awk \'{{ print $2}}\' | sed -e \'s|\\+ro.*||\' | grep -q "{local_c_PyTorch_version}"'),
# Check if the accelerate wrapper script can run accelerate
'echo "Checking if the accelerate wrapper can run accelerate" ; accelerate -h',
# Check if the huggingface-cli wrapper script can run huggingface-cli
'echo "Checking if the huggingface-cli wrapper can run huggingface-cli" ; huggingface-cli version',
# Check if the ray wrapper script can run ray
'echo "Checking if the ray wrapper can run ray" ; ray --version',
# Check if the torchrun wrapper script can run torchrun
'echo "Checking if the torchrun wrapper can run torchrun" ; torchrun -h',
]
modextravars = {
# SIF variables currently set by a function via modluafooter.
#'SIF': '%(installdir)s/' + local_sif,
#'SIFPYTORCH': '%(installdir)s/' + local_sif,
'CONTAINERROOT': '%(installdir)s',
'RUNSCRIPTS': '%(installdir)s/runscripts',
'RUNSCRIPTSPYTORCH': '%(installdir)s/runscripts',
#'SINGULARITY_BIND': local_singularity_bind,
#
# The following two lines inject the environment variables WITH_VENV and WITH_CONDA_VENV into
# the container that have a similar function as WITH_CONDA: the first one is the command to
# activate the Python virtual environment defined by default and the second one activates
# both the conda and Python virtual environments, with the latter environment built on top of the
# former.
#
'SINGULARITYENV_WITH_VENV': f'source /user-software/venv/{local_conda_env}/bin/activate',
'SINGULARITYENV_WITH_CONDA_VENV': 'source /runscripts/init-conda-venv',
#
# The following lines inject environment variables into the container that
# basically have the same effect as activating the conda environment and Python
# virtual environment. When these are defined in the module, WITH_CONDA, WITH_VENV or
# WITH_CONDA_VENV are not really needed.
#
'SINGULARITYENV_PREPEND_PATH': '/runscripts:/user-software/venv/pytorch/bin:/opt/miniconda3/envs/pytorch/bin:/opt/miniconda3/condabin',
'SINGULARITYENV_CONDA_DEFAULT_ENV': 'pytorch',
'SINGULARITYENV_CONDA_EXE': '/opt/miniconda3/bin/conda',
'SINGULARITYENV_CONDA_PREFIX': '/opt/miniconda3/envs/pytorch',
#'SINGULARITYENV_CONDA_PYTHON_EXE': '/opt/miniconda3/bin/python', # This Python should not be used as-is. Instead the wrapper from the Python venv should be used.
'SINGULARITYENV_VIRTUAL_ENV': '/user-software/venv/pytorch',
# Typical NCCL environment variables
'NCCL_SOCKET_IFNAME': 'hsn',
'NCCL_NET_GDR_LEVEL': '3', # Not really needed anymore for ROCm 6.2 as this is now the default
}
modluafooter = f"""
conflict( 'singularity-AI-bindings' )
-- Call a routine to set the various environment variables.
create_container_vars( '{local_sif}', 'PyTorch', '%(installdir)s', '{local_singularity_bind}' )
"""
moduleclass = 'devel'