Questions session 8 February 2024

The questions of the hedgedoc document have been reordered according to topics, and some questions that are not relevant after the course have been omitted.

Icebreaker question: What kind of software do/will you run on LUMI?

  • Hope to run pytorch(pyro) or tensorflow probability
  • NequIP and Allegro (pytorch based) and run LAMMPS with pair_allegro (MPI + pytorch)
  • AI/ML Stuff, PyTorch 1
  • Pytorch + huggingface and stuff (deepspeed etc) +1
  • PyTorch, other NMT stuff
  • PyTorch, JAX and DL Megatron-DeepSpeed on AMD ROCm GPUs, using containers and SLURM to run distributed GPU training
  • Tensorflow, pytorch, cpu and gpu jobs with high I/O
  • Spark
  • Conda environments for Python
  • OpenFoam
  • CPU jobs with high I/O
  • MPI/OpenMPI
  • MPI - OpenACC
  • Fortran + OpenMP offloading to GPUs

LUMI Hardware

  1. What does CCD stand for?

    • Core complex dies. There are 8 CCDs per processor with 8 cores each. LUMI-C has 2 processors (CPUs) per node while LUMI-G nodes have one.
  2. What is the use of NUMA?

    • It is a way of designing CPUs. It means that not all cores have the same memory access time, e.g. with regard to the L3 cache. Data stored in one L3 cache (shared by 8 cores) can be accessed very efficiently by those 8 cores but takes longer to be accessed by the other 56 cores in that CPU.

      Wikipedia article on NUMA

  3. Can you say something about storage to GPU data path...

    • Can you elaborate a bit what you want to know?

    To get data to the GPU, should it be read into RAM first, or is there some magic like the Slingshot GPU connection?

    • Is your question: "is there an AMD equivalent of NVIDIA's GPU direct storage?"

    yes

    • Unfortunately, no.

    Are there any benchmarking results on reading data to a GPU from HDD vs. sending data from one GPU to another GPU in another machine via Slingshot, i.e. what is the best way to distribute 128 GB across GPUs?

    • I don't think we did any benchmarks of data loading from the file system to the GPU memory, but GPU-to-GPU communication will always be faster than file-system-to-GPU data loading.
  4. What is the reasoning behind choosing AMD GPUs vs NVIDIA GPUs? Are we going to get AMD's MI300 GPUs at LUMI as well? Is it because of cheaper and environmental reasons?

    • The AMD offer was better compared to the NVIDIA offer during the procurement of LUMI. NVIDIA knows they are in a quasi-monopoly position with their proprietary CUDA stack and tries to exploit this to raise prices...

    • MI300: Not at the moment but it can't be excluded if an extension of LUMI occurs at some point

  5. Is it possible to visit LUMI supercomputer in Kajaani?

    • Rather not, but it of course depends on the reason and context. Send us a ticket with some more info and we will come back to you. https://lumi-supercomputer.eu/user-support/need-help/

Programming Environment & modules

  1. The GNU compilers do not have OpenMP offload to GPUs, ok. But can we use them with HIP?

    • Not to compile HIP code, but we have built applications mixing HIP and Fortran using the Cray/AMD compilers for the HIP part and GNU gfortran for the Fortran part. HIP code can only be compiled using an LLVM/clang-based compiler like the AMD ROCm compilers or the Cray C/C++ compilers.

      But this is precisely why you have to load the rocm module when using the GNU or Cray compilers to compile for the GPUs...
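
      A minimal sketch of such a mixed build (the module names, file names and the link line here are assumptions, not a recipe from the course; check the documentation for your exact stack):

      module load PrgEnv-gnu rocm                                       # gfortran + ROCm, as described above
      hipcc --offload-arch=gfx90a -c kernels.hip.cpp -o kernels.o       # HIP part, compiled with the ROCm clang
      ftn -c main.f90 -o main.o                                         # Fortran part, gfortran via the ftn wrapper
      ftn main.o kernels.o -L${ROCM_PATH}/lib -lamdhip64 -lstdc++ -o myapp   # link against the HIP runtime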

  2. Are the modules LUMI/22.08 (S,D), LUMI/22.12 (S), LUMI/23.03... toolchains?

    • They are software stacks. Kurt will discuss them in the software stacks session.
  3. What are differences between GNU GCC compiler and Cray compilers?

    • For general differences between the compilers there are many sources on the internet. On LUMI, there are pages in our docs for the Cray and GNU compilers:

    • They are totally different code bases to do the same thing. They are as different as Chrome and Firefox are: Just as these are two browsers that can browse the same web pages, the Cray and GNU compilers are two sets of compilers that can compile the same code but have nothing in common otherwise.

      The Cray compilers are based on Clang and LLVM technology. Most vendors are actually moving to that code base for commercial compilers also. All GPU compilers are currently based on LLVM technology, also for GPUs from NVIDIA and Intel. The new Intel compilers are also based on Clang and LLVM (and just as Cray they use their own frontend due to lack of an open source one of sufficient quality).

  4. Are these craype-... modules loaded automatically when you load the software stack?

    • By default, when you log in, PrgEnv-cray is loaded. It includes the Cray compilers, cray-mpich and cray-libsci (BLAS, LAPACK, ...)

    • I'll come back to that in the software stack presentation.

  5. How are software stacks, programming environments and toolchains related to each other conceptually?

    • Basically, a programming environment is a compiler (C, C++, Fortran), its runtime libraries and an entire set of libraries built against that compiler (the AMD environment lacks a Fortran compiler). A software stack is an entire application collection built with possibly all programming environments of a given release version (toolchains). A toolchain is the technical concept for a specific programming environment version plus a fixed set of related libraries.

    • A software stack can be CrayEnv (the native Cray Programming Environment), LUMI or Spack.

    • In practice you can select a programming environment with either the PrgEnv- modules (gnu, cray, amd; Cray's native way) or with cpeGNU, cpeCray, cpeAMD; these are equivalent, but the latter ones are used in LUMI toolchains.

    • A toolchain is a concept used with the LUMI software stack; the toolchains are cpeGNU/x.y, cpeCray/x.y or cpeAMD/x.y, where x.y stands for a specific LUMI/x.y release, which in turn follows the x.y release of the Cray Programming Environment.

  6. What kind of support is there for Julia-based software development? Do I need to install julia and Julia ML libraries like Flux.jl locally?

    • We have some info in our docs here: https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/julia/

    • Alternatively, you can use a julia module provided by CSC: https://docs.lumi-supercomputer.eu/software/local/csc/

    • Setting up a proper Julia development environment might be quite complex on LUMI. One possible way is to use Spack (which is available as an alternative LUMI software stack).

    • Basically, the Julia developers themselves advise not to try to compile Julia yourself and give very little information on how to do it properly. They advise using their binaries...
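
      A minimal sketch of the CSC-module route (the module path follows the local software documentation linked above; the exact version and available packages may differ):

      module use /appl/local/csc/modulefiles/
      module load julia
      julia -e 'using Pkg; Pkg.add("Flux")'   # packages land in ~/.julia by default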

Modules

  1. module av seems to be quite slow, am I missing something?

    • It happened to me as well the first time, but subsequent calls are faster, maybe some caching? (let me try... yes, that's right)
    • The Lmod cache is purged every day. The first module av of the day will always be slow but subsequent commands should be way faster.
  2. Is there any guide to help to quickly find a desired module (e.g. LAMMPS)? It seems that module av | grep -i lammps or module spider LAMMPS cannot help.

    • There is the Software Library page from which you can at least easily see what modules are available

    • We have very few modules preinstalled, but as Kurt will explain soon, it is very easy to install them yourself using EasyBuild, based on the recipes listed in the above-mentioned Software Library.
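
      A quick sketch for locating an installable recipe from the command line (assuming the EasyBuild-user setup discussed in the next presentation):

      module load LUMI/23.09 partition/C EasyBuild-user
      eb --search LAMMPS        # lists easyconfig recipes EasyBuild can find in its configured repositories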

LUMI Software Stacks

  1. Are you going to install more scientific packages in the future, or is it on users to install them via EasyBuild?

    • You can see from the LUMI Software Library what is pre-installed or installable with EasyBuild. More EasyBuild recipes are constantly being developed by the user support team. Does this answer your question?

    I think so, from the link I see that it is mostly on users to install their own packages if possible.

    • Yes, the collection of pre-installed software is kept small for a reason. The current presentation explains why.
  2. Do you encourage users to use conda even for installing non-Python packages, given the (large) storage space they probably take in the user home directory (e.g. ~/.conda)?

    • You can use Conda, but not natively as you are used to from your laptop and maybe other clusters. We do not encourage native conda installations (just using conda create) as this creates many tens to hundreds of thousands of files and puts quite some pressure on the file system. Instead we offer two tools to create a conda environment inside a container. One of them is cotainr (see the sketch at the end of this question).

    Do you mean using a Singularity container?

    • Yes

    I'm not quite sure that using a Singularity env works well for all cases. For example, what if a user develops code on Open OnDemand/Jupyter and wants to use his/her own Singularity-based conda env as a custom kernel?

    • Open OnDemand is actually heavily based on containerised software itself...

    • We really encourage users to use software installed properly for the system via EasyBuild or Spack, but that is not always possible, because sometimes the dependency chains, especially of bioinformatics software, are too long. For PyTorch and TensorFlow we advise building on top of the containers provided by AMD and discussed in the LUMI Software Library.

      The size of a software installation in terms of gigabytes is not a problem for a computer like LUMI. What is a problem is the number of files, and in particular the number of files that are read while starting/using the package; that determines whether it is better to put it in a container.
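
      A minimal cotainr sketch (the environment file and script names are hypothetical; check the LUMI documentation for the exact options):

      module load LUMI/23.09 cotainr
      cotainr build my-conda-env.sif --system=lumi-g --conda-env=my-env.yml
      srun singularity exec my-conda-env.sif python my_script.py   # run inside the container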

  3. How does EasyBuild manage versions of our custom software?

    • Do you mean EB recipes that you install from LUMI software stack, or EB recipes that you have developed/modified yourself?

    The ones that I develop myself

    • The ones you have developed yourself are managed the same way as the ones from the LUMI software stack, if you just place your own recipes in the correct location. This is documented briefly on the EasyBuild page of the LUMI documentation (see also the sketch at the end of this question).

    Thanks

    • I'm not sure what to write about this without just repeating the documentation, but please ask if something is unclear

    I understand it now; I'm not very used to using EB.
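
      A minimal sketch of that workflow (the recipe name here is hypothetical): EasyBuild picks up an easyconfig file that sits in the directory you run eb from, and installs it under $EBU_USER_PREFIX like any other user installation.

      module load LUMI/23.09 partition/C EasyBuild-user
      cd /path/to/my/easyconfigs            # contains MyCode-1.0-cpeGNU-23.09.eb (hypothetical)
      eb MyCode-1.0-cpeGNU-23.09.eb -r      # -r resolves dependencies from the configured repositories
      module load MyCode/1.0-cpeGNU-23.09   # the new module shows up in the user stack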

  4. I will need netcdf-c and netcdf-fortran compiled with the GNU toolchain (my application only works with that, not with other compilers). Are those available as modules already, or will I have to install them myself with EasyBuild?

    • The cray-netcdf modules (part of the Cray Programming Environment) are recommended unless another specific version is required. They combine the C and Fortran interfaces in a single module, not in 3 different modules like some default EasyBuild installations do.

    OK, so I found a combination of modules which seems "compatible": module load LUMI/22.08 partition/C gcc/12.2.0 craype cray-mpich/8.1.27 cray-hdf5-parallel/1.12.1.5 cray-netcdf-hdf5parallel/4.8.1.5, but it does not have pnetcdf.

    • Parallel netCDF is served by another module called cray-parallel-netcdf

    There is this combination: module load LUMI/22.08 partition/C gcc/12.2.0 cray-mpich/8.1.25 cray-parallel-netcdf/1.12.2.5, but it still has not got pnetcdf (--has-pnetcdf -> no). Parallel netCDF and pnetcdf are two different things.

    • cray-parallel-netcdf/1.12.2.5 does not have the nc-config command so you likely have some other module loaded that provides that command. All I can find in that module is pnetcdf-config that declares it is "PNetCDF 1.12.2".

    It would be great if there were a netcdf-c/netcdf-fortran that was built with it, is there? All I need is a netcdf-c/netcdf-fortran built with pnetcdf in the gcc "family", so maybe

    It is netcdf-c and netcdf-fortran I need; my application does not use pnetcdf directly, but netcdf has to be built with pnetcdf, otherwise the performance is very bad.

    module keyword netcdf pnetcdf finds 3 matches:

    cray-netcdf: cray-netcdf/4.8.1.5
    cray-netcdf-hdf5parallel: cray-netcdf-hdf5parallel/4.8.1.5
    cray-parallel-netcdf: cray-parallel-netcdf/1.12.2.5
    

    and none of them has both netcdf and pnetcdf, strange, no?

    • Not so strange I think. Isn't PNetCDF a rather old backend?

    No, it is maintained, and still used a lot (all the latest releases of netcdf use it)

    • The other two netCDF modules provided by Cray use HDF5 in different configurations (one of them parallel) as the backend. That should also give very good parallel I/O performance when used in the proper way.

      But it shows the point Kurt made in the talk: A central software stack is not practical anymore as too many users want specialised configurations that are different from others... You'll probably have to compile your own versions if the C and Fortran interface provided by cray-parallel-netcdf is different.

    Maybe I should build it myself, if there is an EasyBuild recipe available?

    • There is none at the moment as so far the 3 Cray-provided configurations have been enough for everybody. There is also none with the common EasyBuild toolchains. It is just as the Cray modules: Either netCDF-C etc. with HDF5 backend, or PnetCDF as a separate package comparable in configuration to cray-parallel-netcdf.

      Spack seems to support building netCDF-C/-Fortran with PnetCDF but it is also not the default configuration.

    OK, to start with I will try module load LUMI/22.08 partition/C gcc/12.2.0 craype cray-mpich/8.1.27 cray-hdf5-parallel/1.12.1.5 cray-netcdf-hdf5parallel/4.8.1.5 (that is, without pnetcdf)

  5. I wanted to install some modules in EasyBuild. I did this:

    module load LUMI/23.09 partition/C
    module load EasyBuild-user
    eb ncview-2.1.9-cpeCray-23.09.eb -r
    eb CDO-2.3.0-cpeCray-23.09.eb -r
    eb NCO-5.1.8-cpeCray-23.09.eb -r
    

    and then I loaded everything and it worked, but when I try it in a new tab it does not work. Does anyone know why?

    jelealro@uan01:~> module load ncview/2.1.9-cpeCray-23.09
    
    Lmod has detected the following error:  The following module(s) are unknown:
    "ncview/2.1.9-cpeCray-23.09"
    
    Please check the spelling or version number. Also try "module spider ..."
    It is also possible your cache file is out-of-date; it may help to try:
      $ module --ignore_cache load "ncview/2.1.9-cpeCray-23.09"
    
    • You have to load the same version of the software stack that you used to compile. I.e. module load LUMI/23.09 partition/C then you can find the modules with module avail. Alternatively, module spider NCO will still list the package and show you how to load it.

    I opened a new tab and did this:

    module purge
    module load LUMI/23.09 partition/C
    module load ncview/2.1.9-cpeCray-23.09
    module load CDO/2.3.0-cpeCray-23.09 
    module load NCO/5.1.8-cpeCray-23.09
    

    but the error still remained

    • Did you add export EBU_USER_PREFIX=... to your .bashrc? Otherwise Lmod doesn't know where your modules are.

    When I built with EasyBuild I did it in my home just for testing, here: export EBU_USER_PREFIX=/users/jelealro/my_easybuild. And no, I don't have it in my .bashrc.

    • Try logging in again, then do

      export EBU_USER_PREFIX=/users/jelealro/my_easybuild
      module load LUMI/23.09 partition/C
      module av NCO
      

    It worked! Thank you. I was missing the first line, export EBU_USER_PREFIX=...

    • As discussed, it is best to have EasyBuild install into your project (but if you only have the training project now, your home is also okay for testing). Put the line in your .bashrc and it will always find your installed modules (see the sketch below).

    Noted, I will just redo it in the project folder. Thank you!
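
      For reference, the persistent setup is a single line in ~/.bashrc (the project number below is a placeholder):

      # in ~/.bashrc, so Lmod and EasyBuild always find the user installation
      export EBU_USER_PREFIX=/project/project_465XXXXXX/EasyBuild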

Exercise session 1

/

Running jobs

  1. I am a bioinformatician and don't really understand all of the computer science behind LUMI. I have used PBS job submission at Oak Ridge National Lab, so I have some background to begin with (not entirely lost); however, I have no idea where to start with LUMI to download my program and submit jobs. Is this covered at a beginner level in this section about Slurm submission?

    • I hope you will find it helpful to start. But you may need a more elementary course like the ones that the local organisation should give to train beginners. This course is too fast-paced for beginners.

    • And what system at ORNL still uses PBS, or do you mean Slurm?

    • If you are familiar with Slurm, I'd suggest looking at some of the LUMI-specific examples in the documentation. If you are not familiar with Slurm, a basic Slurm tutorial could be helpful at first; e.g. DeiC (Denmark) has developed a Slurm learning tutorial. As for what to do on LUMI to get your software in use, it depends on what software you are using. If you can't find your software in the LUMI Software Library or in the local stack by CSC, or you otherwise have questions about how to proceed in practice, you can also open a ticket: https://www.lumi-supercomputer.eu/user-support/need-help/

  2. Should we reserve 8 GPUs per node when submitting a SLURM job, considering that 4 GPUs act like 8?

    • Yes, Slurm treats one GCD (Graphics Compute Die) as one GPU (each MI250X consists of two GCDs). So ask for 8 GPUs if you want to book the whole node.

    Does this apply for LUMI C, LUMI G, and so on?

    • Only LUMI-G nodes have GPUs, so it only applies to slurm partitions on LUMI-G (standard-g, dev-g, small-g)
  3. Follow-up to the previous question. I got the following error when booking 8 GPUs per node: `Node 0: Incorrect process allocation input`. Am I missing something?

    • Can you show me what slurm parameters you use? Which partition?

    Sure:

    #SBATCH --partition=standard-g
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --gpus-per-node=8
    

    • Strange, but it also doesn't look like a slurm error. Probably best to open a ticket. https://lumi-supercomputer.eu/user-support/need-help/

    Aha, OK - just wanted to make sure I am not doing something wrong when booking.

    • You may need to limit GPU visibility to each task if your application expects one GPU per task (MPI rank); see the wrapper sketch at the end of this question.

    Thanks for the suggestion. I am doing it by setting ROCR_VISIBLE_DEVICES=$SLURM_LOCALID at runtime.

    • So it is likely not the case.

    I guess if the error is not related to Slurm, then I must look into application configuration parameters. Thanks.
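
      For reference, a common way to apply that setting per task is a small wrapper script around the application (a sketch; my_app is a placeholder). The wrapper, saved as select_gpu:

      #!/bin/bash
      # select_gpu: expose one GCD per task, based on the local task ID
      export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
      exec "$@"

      and then the application is launched through it:

      chmod +x select_gpu
      srun ./select_gpu ./my_app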

  4. Is it possible to run and debug GPU-dependent code without submitting it as a batch job, during the development and small testing phase?

  5. Why use salloc instead of just providing all the options to srun?

    • About what usage scenario are you thinking? Interactive runs or job scripts?

    Interactive runs. I'm not used to running salloc first and then using srun to reach a compute resource. Usually I provide everything as options to srun: nodes, cores, memory, time, partitions, projects, etc.

    • You can do it both ways, and to some extent this is a matter of preference, I think. I've understood that salloc would be more useful in some more complex cases, though.

    • salloc is a command to create an allocation. The srun command is meant to create a job step in an allocation. It has a side effect though: if it is run outside an allocation, it will create one. However, some options have a different meaning for creating an allocation than for creating a job step, and this can lead to unexpected side effects when you use srun to create the allocation and start a job step with a single command.

      srun is particularly troublesome if you want an interactive session in which you can then start a distributed memory application.
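
      A sketch of the two approaches for an interactive run (the account and partition values are placeholders):

      # allocation first, then job steps inside it
      salloc --account=project_465XXXXXX --partition=small-g --nodes=1 --gpus-per-node=1 --time=00:30:00
      srun --ntasks=1 hostname          # runs on the allocated compute node

      # srun creating the allocation itself; works, but allocation and job-step options can interact in surprising ways
      srun --account=project_465XXXXXX --partition=small-g --nodes=1 --gpus-per-node=1 --time=00:30:00 --pty bash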

  6. If we submit a Slurm script with --partition=standard-g but without requesting any GPUs, which resources are billed? The CPU or GPU hours?

    • You will be billed GPU hours, and in the case of standard-g you effectively get the whole node, whether you use it or not, so you will be billed 4 GPU hours for every hour you use the node. It is only normal: On LUMI you are billed for resources that others cannot use because of your request, whether you use them or not. Likewise, if you would ask for resources on small-g you will be billed based on the amount of cores, amount of GPUs and amount of memory you request. If you request a disproportional amount of one resource, you'll be billed for a similar amount of the other resources. So if you would ask for half of the cores or half of the memory, you'd still be billed for 4 GCDs (so 2 GPU hours per hour use) as you effectively make 4 GCDs unusable for others.

    The output of lumi-allocations command is:

    Project             |                    CPU (used/allocated)|               GPU (used/allocated)|           Storage (used/allocated)
    --------------------------------------------------------------------------------------------------------------------------------------
    project_465000961   |         12/10000000   (0.0%) core/hours|          0/1000   (0.0%) gpu/hours|             0/10   (0.0%) TB/hours
    

    which means so far we only used CPU-resources (?)

    • Maybe you've not done what you think, but also, lumi-allocations is not updated immediately. The data of a job has to be processed offline first, and the tables that lumi-allocations shows are updated only a few times per day. :thumbsup:

    • According to the billing pages in the documentation this will be billed in GPU hours, even if you only use the CPUs.

  7. Can you run a CPU-GPU hybrid code on GPU partition?

    • Sure. You have 56 cores available on each G node. You could also do heterogeneous Slurm jobs with some parts (some MPI ranks) running on C nodes and some on G nodes, but this is a bit more advanced.
  8. Do we need to have "module load" things in the job file?

    • That's a matter of preference: you can load the necessary modules before sending your job script to the queue, or in the job script itself.

    • I would recommend putting all module loads into the job script to make it more obvious what is happening and more reproducible. We get enough tickets from users claiming that they ran exactly the same job as before and that it used to work but now doesn't work, and often it is because the job was launched from a different environment and does not build the complete environment it needs in the job script.
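
      A minimal sketch of a job script that builds its own environment (account, partition and module names are placeholders):

      #!/bin/bash
      #SBATCH --account=project_465XXXXXX
      #SBATCH --partition=small
      #SBATCH --ntasks=1
      #SBATCH --cpus-per-task=4
      #SBATCH --mem-per-cpu=2G
      #SBATCH --time=00:30:00

      module purge                               # start from a clean environment
      module load LUMI/23.09 partition/C
      module load MySoftware/1.0-cpeGNU-23.09    # hypothetical module installed with EasyBuild

      srun ./my_app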

  9. So if I'm running a job on 100 nodes with --exclusive and I want to use all the memory on the nodes, can --mem=0 lead to strange behaviour?

    • There have been some problems in the past with nodes that had less memory available than expected due to memory leaks in the OS. By asking explicitly for nodes with 224G (LUMI-C) or 480G (LUMI-G) you ensure that you don't get nodes where less is available due to a memory leak.
  10. How do I run gpu_check? I loaded

    module load LUMI/23.09
    module load lumi-CPEtools
    

    and allocated resources with salloc and when I do srun gpu_check -l (as shown in the slides) I get slurmstepd: error: execve(): gpu_check: No such file or directory

    • At least you seem to be missing partition/G?

    Indeed, I was missing partition/G. thanks!

    • As gpu_check can only work on LUMI-G nodes, I did not include it in the other versions of lumi-CPEtools.
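
      So, in short, the working combination is:

      module load LUMI/23.09 partition/G lumi-CPEtools
      srun gpu_check -l          # inside a LUMI-G allocation (e.g. obtained with salloc)
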
  11. Can I get a different thread count for different tasks in the same job with one binary?

    • Heterogeneous jobs can do that. Or you take the largest number that you want for any task and use, e.g., OpenMP functions in your code to limit threads depending on the process, but that may be hard. Is there a good use case for that? A single binary that takes an input argument to behave differently depending on the value of that argument?
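
      A sketch of the heterogeneous-job route in plain Slurm syntax (account, partition and binary names are placeholders; each component gets its own resource options, and the cores available per task then determine how many threads an OpenMP runtime starts by default):

      #!/bin/bash
      #SBATCH --account=project_465XXXXXX
      #SBATCH --partition=small
      #SBATCH --ntasks=2 --cpus-per-task=16   # component 0: 2 tasks, 16 cores each
      #SBATCH hetjob
      #SBATCH --ntasks=8 --cpus-per-task=4    # component 1: 8 tasks, 4 cores each

      # one srun line launching the same binary on both components
      srun ./my_app : ./my_app
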
  12. What's the difference between ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES?

    So, in principle, can one use them interchangeably for HIP application?

    • I wouldn't do so because Slurm already uses ROCR_VISIBLE_DEVICES. If they get conflicting values you may be in for some painful debugging...
  13. To be safe is it better to not bind to closest and do it explicitly? I'm not sure if, e.g., for PyTorch, there's direct communication between GPUs.

    • It is safer indeed. PyTorch uses RCCL as far as I know, so yes, it will do direct communication between GPUs. Given that many GPU configurations used for AI have much slower communication via the CPU than directly between GPUs (NVIDIA links between GPUs are really fast compared to PCIe, and the external bandwidth between LUMI GPU packages is also 250 GB/s compared to 72 GB/s to the CPU), having good direct communication may be essential for performance.
  14. If I submit a 256 cores job on 2 nodes without hyperthreading, and if I use the multi_prog option of srun, what should my program configuration file look like ? I want to be sure that my tasks are on both nodes, and I am confused by the numbering (does it change depending on the hyperthreading option?).

    0    ./prog_1.exe
    ...
    127 ./prog_2.exe
    128 ./prog_2.exe
    ...
    255 ./prog_2.exe
    

    or

    0    ./prog_1.exe
    ...
    127 ./prog_2.exe <--- Sure? Shouldn't it be prog_1? No. Well, the distribution of the programs among the tasks is another question, but for starter I just want to be sure that I use the 2 nodes
    256 ./prog_2.exe
    ...
    383 ./prog_2.exe
    
    • If you want to be sure, I recommend using the tools in the lumi-CPEtools module to check how tasks and threads are allocated... That's what we also do before we give answers to such questions, as we are not a dictionary that knows all the ins and outs of Slurm without checking things.
  15. When I bind the CPU using these hex values, do I always use the same mask? This assumes allocation to a full node? In case I'm not using the full node, should I use bindings?

    • All binding parameters only work with the --exclusive flag set (which is done implicitly on standard-g). You can't affect the binding on small-g (except if you set --exclusive).

    • The mask uses 7 cores and one GPU per task, with 8 tasks; if you want to use fewer cores or fewer GPUs you have to adapt it.

      But if your program uses OpenMP threads on the CPU side, you can still use the "large" mask and further restrict with the OpenMP environment variables (OMP_NUM_THREADS).
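
      For reference, a sketch of the full-node case this answer refers to (the mask values assume the GCD-to-CCD mapping shown on the slides, with 7 cores per task and the first core of each CCD left out; always verify the result with gpu_check from lumi-CPEtools):

      #SBATCH --partition=standard-g
      #SBATCH --nodes=1
      #SBATCH --ntasks-per-node=8
      #SBATCH --gpus-per-node=8

      CPU_BIND="mask_cpu:fe000000000000,fe00000000000000"   # tasks 0-1 -> CCDs 6-7
      CPU_BIND="${CPU_BIND},fe0000,fe000000"                # tasks 2-3 -> CCDs 2-3
      CPU_BIND="${CPU_BIND},fe,fe00"                        # tasks 4-5 -> CCDs 0-1
      CPU_BIND="${CPU_BIND},fe00000000,fe0000000000"        # tasks 6-7 -> CCDs 4-5
      srun --cpu-bind=${CPU_BIND} ./my_app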

  16. Referring to slide 36 here: is there a reason why NUMA and GPU numbering are completely independent? Wouldn't it make more sense, for simpler usability, to have similar numbering, or for the default binding to be the optimal one?

    • Yes, that is quite annoying but there seems to be some HW reason for that. I don't know why it is not possible to map it, so that you don't see it as a user.

    • CCDs get their numbering from the position in the CPU package. GCDs in a package get their numbering from their position in the GPU packages, and between GPUs I think some order in communication links will determine a numbering when booting the node.

      Now the problem is really to lay all the connections on the circuit board. I'm sure there would be an ordering so that they are numbered in the same way, but that may not be physically possible or would require a much more expensive circuit board with more layers to make all the connections between GCDs and between GCDs and CCDs.

  17. Probably this depends on the application, but roughly, how much worse is the performance if one does not do the correct CPU --> GPU binding ?

    • I believe the most spectacular difference we have seen is almost a factor of two. It is probably more important for HIP codes and GPU-to-GPU communication.

    • The heavier the traffic between CPU and GPU, the larger the difference will be...

  18. What if I want to modify one of these provided containers to add some application? How should we do it?

    • One possible approach is with singularity overlays https://docs.sylabs.io/guides/3.11/user-guide/persistent_overlays.html
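
      A minimal sketch of the overlay route (names and size are placeholders; note that writing into system-owned paths inside the container may additionally require --fakeroot):

      singularity overlay create --size 1024 my_overlay.img          # 1 GiB writable ext3 overlay image
      singularity shell --overlay my_overlay.img provided.sif        # changes outside bind mounts land in the overlay
      singularity exec --overlay my_overlay.img:ro provided.sif ...  # later: attach the same overlay read-only
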
  19. Is there any way to measure the energy/power consumed by the application? +1

    • No. In theory it should be possible at the node level, but even that is not implemented at the moment. On a shared node it is simply impossible.

    • ROCm tools can report some numbers but they are known to be unreliable.

    Are node-level measurements also not possible on --exclusive booked node?

    • We simply don't have the software that could be called with user rights to gather the data from the counters in the node and service modules. And even then the data is very coarse and hard to reproduce as on modern computers there is a lot of variability between nodes.

      To get as good a result as possible on the Linpack benchmark for the Top500, they actually needed to play with individual power caps for nodes and GPUs to make all of them about as fast as the slowest GPU, which determines the progress of the parallel benchmark, while they also had to stay within a certain power consumption limit per rack to avoid overheating.

      If you could measure, don't be surprised that when your application runs on a different node, power consumption could differ by 20% or more...

  20. Is there a way (for example, a script) to get the CPU and memory performance of a finished job?

    • There is some very coarse information stored in the Slurm accounting database that you can request via sacct. But this is only overall use of memory and overall consumed CPU time.

    When I use sacct --account=..., it basically prints the headings but no information related to the job. May I know what I am missing?

    • If you want to give a job ID, the option is -j or --jobs, not --account. Moreover, you'll have to specify the output that you want with -o or --format. There is a long list of possible output fields and some examples in the sacct manual page. Often sacct only searches in a specific time window, so depending on the options you use you may have to specify a start and end time for the search.
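
      A sketch of such a query (the job ID and dates are placeholders):

      sacct -j 1234567 --starttime=2024-02-01 --endtime=2024-02-09 \
            --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,MaxVMSize,State
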
  21. We are supposed to use Cray MPI, but when working with containers we need the singularity-bindings, correct? I have an open ticket regarding these bindings, and apparently they are not working. Do we have an ETA for when they'll be available again?

    • This should be an easy fix. Can you provide the ticket number?

    Sure: LUMI #3552

    • Oh, OK, your ticket was in the hands of a member of the team who quit recently, so it was not progressing. I will take it.

Exercises 2

Introduction to Lustre

  1. How do you deal with hierarchical file formats such as zarr (which have many subfolders and small files) on LUMI?

    • I don't know for sure for zarr specifically and how it works with the file system, so the answer may not be entirely accurate.

      If those subfolders are INSIDE a big file Lustre only has to deal with the big file and it should work well. If it is one of those things that thinks that it should simply dump those files and folders as regular files and folders, then it is not a technology that is suitable for HPC clusters with parallel file systems. If my quick googling returned the right information then it is doing the latter and simply not made for HPC systems. It compares itself with netCDF and HDF5 but these are proper technologies for HPC that do the work themselves INSIDE a big file rather than letting the regular file system deal with it.

      From all the information I have at the moment, zarr is the perfect example of something mentioned in the architecture presentation of the course: Not all technologies developed for workstations or for cloud infrastructures, work well on HPC systems (and vice-versa). Zarr is an example of a technology built for a totally different storage model than that used on the LUMI supercomputer. It may be perfect for a typical cloud use case, where you would be using a fat node as a big workstation or a small virtual cluster with its own local file system, but at first it looks terrible for a parallel file system shared across a large HPC cluster like LUMI.

      On systems the size of LUMI you have no other solution than to work with hierarchies. It is the case for the job system: Slurm cannot deal with hundreds of thousands of minute-sized jobs; you need to use a hierarchical scheduling system for that. And it is the case for data formats: Lustre cannot deal with hundreds of thousands of small files; you need a hierarchical approach with a file system inside a big file. You'd need a file system that costs several times more per PB to deal with those things at the scale of LUMI.

  2. What block size do you have on the Lustre file system? I want to generate one billion 2-byte files.

    • You're simply not allowed to generate one billion 2 byte files and will never get the file quota for that. On the contrary, this will be considered as a denial-of-service attack on the file system and abuse of LUMI with all consequences that come with that.

      2 billion 2 byte numbers belong in a single file on an HPC cluster, and you read that file as a whole in memory before using the data.

      If the reason to use 1B files is that you want to also run 1B small processes that generate those files, and that therefore you cannot use a single file: that is also a very, very bad idea, even if you use a sub-scheduler such as HyperQueue, as just starting those 1B small processes may stress the metadata service a lot.

      Scaling software is not just starting more copies of it, and just starting more copies of a program is not what a supercomputer like LUMI is built for. You need a different and far more expensive type of infrastructure for that. Scaling would be turning that program into a subroutine that you can call in a loop to generate a lot of those 2-byte numbers in a single run and store them intelligently in a single file.

  3. How can we specify a location such as scratch to store the experiment results (>20 GB) generated during the execution?

    • The output usually goes automatically to the directory from which you submit the job.

    • Hopefully you're not pumping 20 GB of output via printf to stdout? That is not a good strategy to get good I/O bandwidth. You should write such files properly with proper C/Fortran library calls. And then it is your program, or probably the start directory of your program, that determines where the files end up.

    They are HDF5 files. Could you please specify which #SBATCH option you mentioned above to redirect them?

    • No, we can't, because it is your specific application that determines where the files will land, not Slurm. Maybe they will land in the directory where the application is started (so going to that directory with cd before launching will do the job), maybe your application does something different. You cannot redirect arbitrary files in Slurm; you can only redirect the stdout and stderr streams of Linux.

    • About redirecting stdout and stderr, please see the sbatch manual page (e.g. #SBATCH -o /your/chosen/location/output.%a.out) but indeed this doesn't actually redirect the output created by the application

  4. Let’s assume I have one HDF5 file (~300 GB) which stores my entire dataset, consisting of ~80k videos. I store each video as a single HDF5 dataset, where each element consists of the bytes of the corresponding video frame. I spawn multiple threads (pinned to each processor core) which randomly access the videos. What would be some rules of thumb to optimise the Lustre striping for better performance?

    • I think we need to ask HPE for advice on that and even they may not know.

    Besides going for sequential access (e.g., webdataset), is there anything a user can do to limit the I/O bottleneck involving random access (i.e., typical machine learning workflow)?

    • Random I/O in HDF5 will already be less of a bottleneck for the system than random access to data in lots of individual files on the file system. I'd also expect the flash file system to perform better than the hard-disk-based file systems. It is charged at 10 times the rate of the hard-disk-based ones, but there is a good reason for that: it was also 10 times as expensive per PB...

    • I think the general rule is to use a high stripe-count value for such large dataset files. For instance, -1 will use all OSTs. There are 12 OSTs.
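
      A sketch of how that is done in practice (the directory path is a placeholder; striping settings only affect files created after the setting is applied):

      lfs setstripe -c -1 /flash/project_465XXXXXX/my_dataset   # stripe new files in this directory over all OSTs
      lfs getstripe /flash/project_465XXXXXX/my_dataset         # verify the striping settings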

LUMI support

General Q&A