Software on LUMI
How to find the available software, how to install software, and tutorials for specific software.
Tutorial: Finding available & installable software
How to find available software in the LUMI software collection, how to see what is pre-installed and what is user-installable, and what other things to pay attention to (referring to the previous chapter).
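As a quick illustration (assuming the usual Lmod tooling on LUMI; the stack version shown here is just an example and may differ from what is currently available):
$ module load LUMI/22.08       # select a LUMI software stack
$ module spider PyTorch        # search the module tree for a package and its versions
$ module avail                 # list what can be loaded right now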
Tutorial: Installing user-installable software with EasyBuild
An easy example: a general hands-on tutorial on how to do this.
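As a rough sketch of the typical EasyBuild-user workflow (the easyconfig name SomeSoftware-1.0-cpeGNU-22.08.eb is a made-up placeholder; the surrounding commands mirror the steps used in the tutorials below):
$ module load LUMI/22.08 partition/G      # pick a software stack and partition
$ module load EasyBuild-user              # enable user-level installations with EasyBuild
$ eb --search SomeSoftware                # look for a matching easyconfig (placeholder name)
$ eb SomeSoftware-1.0-cpeGNU-22.08.eb -r  # build and install it, resolving dependencies
After the build finishes, the new module should show up in your user module tree and can be loaded with module load as usual.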
Tutorial: PyTorch with DDP
Setup
Run through the following three steps to initialize a PyTorch environment on LUMI. This only needs to be done once.
1st Step
Install the ROCm Communication Collectives Library (RCCL) if you want to leverage multiple GPUs:
$ module load LUMI/22.08 partition/G
$ module load EasyBuild-user
$ eb aws-ofi-rccl-66b3b31-cpeGNU-22.08.eb -r
2nd Step
Get the ROCm PyTorch Docker container:
$ SINGULARITY_TMPDIR=/tmp/tmp-singularity SINGULARITY_CACHEDIR=/tmp/tmp-singularity singularity pull docker://rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
3rd Step
Create a virtual environment (venv) for each of your projects and add additional modules you might need:
$ singularity exec pytorch_rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1.sif bash
Singularity> python -m venv my_project_env --system-site-packages
Singularity> . my_project_env/bin/activate
(my_project_env) Singularity> pip install ...    # install whatever else you need
Create more virtual environments
You might want to create more virtual environments, one for each of your projects, or for experimenting with different module versions. Simply repeat the commands above, providing an individual virtual environment name each time.
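For example, a second environment could be created in exactly the same way, just with a different (illustrative) name:
Singularity> python -m venv my_other_env --system-site-packages
Singularity> . my_other_env/bin/activate
(my_other_env) Singularity> pip install ...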
Use it
We use the following job script (run_sbatch.sh):
#!/bin/bash -l
#SBATCH --job-name="PyTorch MNIST"
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --partition=standard-g
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --time=1:00:00
#SBATCH --account=project_<YOURID>
#SBATCH --gpus-per-node=8
module load LUMI/22.08 partition/G
module load singularity-bindings
module load aws-ofi-rccl
export SCRATCH=$PWD
export NCCL_SOCKET_IFNAME=hsn
export NCCL_NET_GDR_LEVEL=3
export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_DISABLE_CQ_HUGETLB=1
export SINGULARITYENV_LD_LIBRARY_PATH=${EBROOTAWSMINOFIMINRCCL}/lib:/opt/cray/xpmem/2.5.2-2.4_3.50__gd0f7936.shasta/lib64:${SINGULARITYENV_LD_LIBRARY_PATH}
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export SINGULARITYENV_MASTER_ADDR="$master_addr"
export SINGULARITYENV_MASTER_PORT=6200
echo "MASTER_ADDR="$SINGULARITYENV_MASTER_ADDR "MASTER_PORT="$SINGULARITYENV_MASTER_PORT
srun --mpi=cray_shasta singularity exec --pwd /work --bind $SCRATCH:/work pytorch_rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1.sif /work/my_project_env/bin/python PyTorch_MNIST-DDP.py
This job script launches 16 processes across two GPU nodes, i.e. one process for each of the eight GPUs on a node. The first node acts as the master node for NCCL/RCCL communication management, using port 6200.
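If you want to verify that the aws-ofi-rccl plugin is actually picked up, one optional addition (not required for this tutorial) is to enable RCCL/NCCL debug output in the job script and then look for ofi/libfabric-related lines in error.txt:
export NCCL_DEBUG=INFO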
Using the project scratch volume
Replace project_<YOURID> with your project ID. Consider using the working directory /scratch/project_<YOURID> for running your experiments. Your virtual environment should also be located there.
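As an illustration, once everything from this tutorial is in place, the working directory on scratch could look roughly like this (project_<YOURID> remains the placeholder for your project ID):
$ cd /scratch/project_<YOURID>
$ ls
PyTorch_MNIST-DDP.py  my_project_env  pytorch_rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1.sif  run_sbatch.sh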
Place the following PyTorch_MNIST-DDP.py script into your working directory:
import os
from tqdm import tqdm
import torch
from torch import nn, optim
from torch.nn.parallel.distributed import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms
batch_size = 128
epochs = 5
use_gpu = True
dataset_loc = './mnist_data'
#rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
#local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])
#world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
rank = int(os.environ['SLURM_PROCID'])
local_rank = int(os.environ['SLURM_LOCALID'])
world_size = int(os.environ['SLURM_NTASKS'])
#os.environ["MASTER_ADDR"] = "127.0.0.1"
#os.environ["MASTER_PORT"] = "6108"
print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
if rank == 0:
    print(torch.__version__)

print(local_rank, rank, world_size)

torch.cuda.set_device(local_rank)
torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size)

transform = transforms.Compose([
    transforms.ToTensor()  # convert and scale
])

if rank == 0:
    train_dataset = datasets.MNIST(dataset_loc,
                                   download=True,
                                   train=True,
                                   transform=transform)
    test_dataset = datasets.MNIST(dataset_loc,
                                  download=True,
                                  train=False,
                                  transform=transform)

torch.distributed.barrier()

if rank != 0:
    train_dataset = datasets.MNIST(dataset_loc,
                                   download=False,
                                   train=True,
                                   transform=transform)
    test_dataset = datasets.MNIST(dataset_loc,
                                  download=False,
                                  train=False,
                                  transform=transform)

train_sampler = DistributedSampler(train_dataset,
                                   num_replicas=world_size,  # number of all GPUs
                                   rank=rank,  # (global) ID of GPU
                                   shuffle=True,
                                   seed=42)
test_sampler = DistributedSampler(test_dataset,
                                  num_replicas=world_size,  # number of all GPUs
                                  rank=rank,  # (global) ID of GPU
                                  shuffle=False,
                                  seed=42)

train_loader = DataLoader(train_dataset,
                          batch_size=batch_size,
                          shuffle=False,  # sampler does it
                          num_workers=4,
                          sampler=train_sampler,
                          pin_memory=True)
val_loader = DataLoader(test_dataset,
                        batch_size=batch_size,
                        shuffle=False,
                        num_workers=4,
                        sampler=test_sampler,
                        pin_memory=True)

def create_model():
    model = nn.Sequential(
        nn.Linear(28*28, 128),  # Input: 28x28(x1) pixels
        nn.ReLU(),
        nn.Dropout(0.2),
        nn.Linear(128, 128),
        nn.ReLU(),
        nn.Linear(128, 10, bias=False)  # Output: 10 classes
    )
    return model

model = create_model()
if use_gpu:
    device = torch.device("cuda:{}".format(local_rank))
    model = model.to(device)
model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

optimizer = optim.SGD(model.parameters(), lr=0.01)
loss = nn.CrossEntropyLoss()

# Main training loop
for i in range(epochs):
    model.train()
    #train_loader.sampler.set_epoch(i)

    # Training steps per epoch
    epoch_loss = 0
    pbar = tqdm(train_loader)
    for x, y in pbar:
        if use_gpu:
            x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)
        x = x.view(x.shape[0], -1)  # flatten
        optimizer.zero_grad()
        y_hat = model(x)
        batch_loss = loss(y_hat, y)
        batch_loss.backward()
        optimizer.step()
        batch_loss_scalar = batch_loss.item()
        epoch_loss += batch_loss_scalar / x.shape[0]
        pbar.set_description(f'training batch_loss={batch_loss_scalar:.4f}')

    # Run validation at the end of each epoch
    with torch.no_grad():
        model.eval()
        val_loss = 0
        pbar = tqdm(val_loader)
        for x, y in pbar:
            if use_gpu:
                x = x.to(device, non_blocking=True)
                y = y.to(device, non_blocking=True)
            x = x.view(x.shape[0], -1)  # flatten
            y_hat = model(x)
            batch_loss = loss(y_hat, y)
            batch_loss_scalar = batch_loss.item()
            val_loss += batch_loss_scalar / x.shape[0]
            pbar.set_description(f'validation batch_loss={batch_loss_scalar:.4f}')

    if rank == 0:
        print(f"Epoch={i}, train_loss={epoch_loss:.4f}, val_loss={val_loss:.4f}")

if rank == 0:
    torch.save(model.state_dict(), 'model.pt')
Execute the job script in your project's working directory, which contains the virtual environment:
$ sbatch run_sbatch.sh
Refer to the output.txt and error.txt files for the results.
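While the job is queued or running, the usual Slurm commands can be used to follow its progress; replace <jobid> with the ID that sbatch printed:
$ squeue -u $USER                                  # show your pending and running jobs
$ sacct -j <jobid> --format=JobID,State,Elapsed    # state and runtime of the job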
Tutorial: TensorFlow with Horovod
Setup
Run through the following four steps to initialize a TensorFlow environment on LUMI. This only needs to be done once.
1st Step
Install the ROCm Communication Collectives Library (RCCL) and OpenMPI if you want to leverage multiple GPUs:
$ module load LUMI/22.08 partition/G
$ module load EasyBuild-user
$ eb aws-ofi-rccl-66b3b31-cpeGNU-22.08.eb -r
$ eb OpenMPI-4.1.3-cpeGNU-22.08.eb -r
2nd Step
Get the ROCm TensorFlow Docker container:
$ eb singularity-bindings-system-cpeGNU-22.08.eb -r
$ SINGULARITY_TMPDIR=/tmp/tmp-singularity SINGULARITY_CACHEDIR=/tmp/tmp-singularity singularity pull docker://rocm/tensorflow:rocm5.5-tf2.11-dev
3rd Step
Install ensurepip first by executing the following script (download_ensurepip.sh)
base_url=https://raw.githubusercontent.com/python/cpython/3.9/Lib

files=("__init__.py" \
       "__main__.py" \
       "_uninstall.py" \
       "_bundled/__init__.py" \
       "_bundled/pip-23.0.1-py3-none-any.whl" \
       "_bundled/setuptools-58.1.0-py3-none-any.whl")

for _f in ${files[@]}; do
    f=ensurepip/${_f}
    if test ! -f "$f"; then
        wget -q --show-progress ${base_url}/${f} -P $(dirname $f);
    else
        echo -e "'${f}' already exists. Nothing to do."
    fi
done
with:
$ cd $HOME
$ bash download_ensurepip.sh
Adjust for your versions
Depending on the Python environment you chose to use (as part of the TensorFlow Docker container), you might need to adjust the file names. E.g., the version of pip can be different. Please look at https://github.com/python/cpython in the Lib/ensurepip directory for the Python version (branch) used in the Docker container.
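One simple way to see which Python version the container actually ships, and hence which cpython branch to take the ensurepip files from, is for example:
$ singularity exec tensorflow_rocm5.5-tf2.11-dev.sif python3 --version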
Create a virtual environment (venv) for each of your projects and add additional modules you might need:
$ singularity exec -B $HOME/ensurepip:/usr/lib/python3.9/ensurepip tensorflow_rocm5.5-tf2.11-dev.sif bash
Singularity> python -m venv my_project_env --system-site-packages
Singularity> . my_project_env/bin/activate
(my_project_env) Singularity> pip install tensorflow-datasets # Needed for later
(my_project_env) Singularity> pip install ...    # install whatever else you need
Create more virtual environments
You might want to create more virtual environments, one for each of your projects, or for experimenting with different module versions. Simply repeat the commands above, providing an individual virtual environment name each time.
4th Step
Build and install Horovod. Use a clean environment for this!
Use a clean environment
Do not load the aws-ofi-rccl or singularity-bindings modules before building Horovod, as they interfere with the build process!
$ module load LUMI/22.08 partition/G
$ module load EasyBuild-user
$ singularity exec tensorflow_rocm5.5-tf2.11-dev.sif bash
Singularity> . ./my_project_env/bin/activate
(my_project_env) Singularity> export HOROVOD_WITHOUT_MXNET=1
(my_project_env) Singularity> export HOROVOD_WITHOUT_PYTORCH=1
(my_project_env) Singularity> export HOROVOD_WITH_TENSORFLOW=1
(my_project_env) Singularity> export HOROVOD_GPU=ROCM
(my_project_env) Singularity> export HOROVOD_GPU_OPERATIONS=NCCL
(my_project_env) Singularity> export HOROVOD_ROCM_PATH=/opt/rocm
(my_project_env) Singularity> export HOROVOD_RCCL_HOME=/opt/rocm/rccl
(my_project_env) Singularity> export RCCL_INCLUDE_DIRS=/opt/rocm/rccl/include
(my_project_env) Singularity> export HOROVOD_RCCL_LIB=/opt/rocm/rccl/lib
(my_project_env) Singularity> export HCC_AMDGPU_TARGET=gfx90a
(my_project_env) Singularity> export HIP_PATH=/opt/rocm
(my_project_env) Singularity> pip install --no-cache-dir --force-reinstall horovod==0.28.1
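To confirm that the build picked up TensorFlow as well as the ROCm/RCCL and MPI support, you can run Horovod's built-in check (still inside the container, with the venv activated); TensorFlow should be listed under the available frameworks and NCCL under the available tensor operations:
(my_project_env) Singularity> horovodrun --check-build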
Use it
We use the following job script (run_sbatch.sh):
#!/bin/bash -l
#SBATCH --job-name="TensorFlow MNIST"
#SBATCH --output=output.txt
#SBATCH --error=error.txt
#SBATCH --partition=standard-g
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --time=1:00:00
#SBATCH --account=project_<YOURID>
#SBATCH --gpus-per-node=8
module load LUMI/22.08 partition/G
module load singularity-bindings
module load aws-ofi-rccl
module load OpenMPI/4.1.3-cpeGNU-22.08
export SCRATCH=$PWD
export NCCL_SOCKET_IFNAME=hsn
export MIOPEN_USER_DB_PATH=/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}
export MIOPEN_CUSTOM_CACHE_DIR=${MIOPEN_USER_DB_PATH}
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_DISABLE_CQ_HUGETLB=1
export SINGULARITYENV_LD_LIBRARY_PATH=/openmpi/lib:/opt/rocm/lib:${EBROOTAWSMINOFIMINRCCL}/lib:/opt/cray/xpmem/2.5.2-2.4_3.50__gd0f7936.shasta/lib64:$SINGULARITYENV_LD_LIBRARY_PATH
scontrol show hostname $SLURM_JOB_NODELIST > hostfile
mpirun -np 16 -hostfile hostfile singularity exec --pwd /work --bind $SCRATCH:/work tensorflow_rocm5.5-tf2.11-dev.sif /work/my_project_env/bin/python keras_horovod_example.py
This job script launches 16 processes across two GPU nodes, i.e. one process for each of the eight GPUs on a node. MPI is used to manage and communicate across the processes and nodes.
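If you later change the number of nodes or tasks, -np has to match the Slurm allocation. One optional way to keep them in sync (a tweak, not part of the script above) is to let Slurm provide the count:
mpirun -np ${SLURM_NTASKS} -hostfile hostfile singularity exec --pwd /work --bind $SCRATCH:/work tensorflow_rocm5.5-tf2.11-dev.sif /work/my_project_env/bin/python keras_horovod_example.py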
Using the project scratch volume
Replace project_<YOURID> with your project ID. Consider using the working directory /scratch/project_<YOURID> for running your experiments. Your virtual environment should also be located there.
Place the following keras_horovod_example.py script into your working directory:
import os
import time
import datetime
import tensorflow as tf
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds
import horovod.tensorflow.keras as hvd
batch_size = 32
epochs = 4
# Initialize Horovod
hvd.init() # HVD
physical_devices = tf.config.list_physical_devices()
print(physical_devices)
tfds.disable_progress_bar()
tf.enable_v2_behavior()
gpus = tf.config.experimental.list_physical_devices('GPU')
print("Num GPUs available: ", len(gpus))
local_gpu = hvd.local_rank() #HVD
if gpus:
    try:
        tf.config.set_visible_devices(gpus[local_gpu], 'GPU')
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
ds_train.apply(tf.data.experimental.assert_cardinality(ds_info.splits['train'].num_examples))
ds_test.apply(tf.data.experimental.assert_cardinality(ds_info.splits['test'].num_examples))

num_steps_train = ds_info.splits['train'].num_examples // batch_size
num_steps_test = ds_info.splits['test'].num_examples // batch_size
print("Total number of training steps for training/testing: {}/{}".format(num_steps_train, num_steps_test))

def normalize_img(image, label):
    """Normalizes images: `uint8` -> `float32`."""
    return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(batch_size)
ds_train = ds_train.shard(hvd.size(), hvd.rank())  # HVD
ds_train = ds_train.cache()
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE).repeat()

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(batch_size)
ds_test = ds_test.shard(hvd.size(), hvd.rank())  # HVD
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE).repeat()

def create_model():
    inp = tf.keras.Input(shape=[28, 28, 1])
    flat = tf.keras.layers.Flatten(input_shape=(28, 28, 1))(inp)
    dense1 = tf.keras.layers.Dense(128, activation='relu')(flat)
    dense2 = tf.keras.layers.Dense(10, activation='softmax')(dense1)
    return tf.keras.Model(inp, dense2)

fmodel = create_model()

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),  # HVD
]

opt = tf.optimizers.Adam(0.001 * hvd.size())  # HVD
opt = hvd.DistributedOptimizer(opt)  # HVD

# Horovod: Specify `experimental_run_tf_function=False` to ensure TensorFlow
# uses hvd.DistributedOptimizer() to compute gradients.
fmodel.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy'],
    experimental_run_tf_function=False  # HVD
)

# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:  # HVD
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

if hvd.rank() == 0:
    print("Number of processes: {}".format(hvd.size()))

time_s = time.time()
fmodel.fit(
    ds_train,
    steps_per_epoch=num_steps_train // hvd.size(),  # HVD
    epochs=epochs,
    validation_data=ds_test,
    validation_steps=num_steps_test // hvd.size(),  # HVD
    callbacks=callbacks,
    verbose=1 if hvd.rank() == 0 else 0)

if hvd.rank() == 0:
    print("Runtime: {}".format(time.time() - time_s))
Execute the job script in your project's working directory, which contains the virtual environment:
$ sbatch run_sbatch.sh
Refer to the output.txt and error.txt files for the results.
Tutorial: …
A software tutorial
Tutorial: …
A software tutorial