
Questions session 21 September 2023

LUMI Hardware

  1. When we run sbatch with --cpus-per-node=X are we allocating X cores or X CCDs or X NUMA nodes ...?

    • You allocate cores (not hardware threads). But admittedly the Slurm nomenclature is really confusing.

    • Slurm has threads (hardware threads), cores (physical cores) and CPUs. A CPU is the smallest individually allocatable unit, which on LUMI is configured to be the core. Unfortunately, Slurm does not fully understand the hierarchy of a modern cluster, which makes it difficult to request all cores in a single CCD or a single NUMA domain.
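
    • A minimal job-script sketch of this (account and program names are placeholders): since a Slurm "CPU" on LUMI is configured to be a physical core, the request below reserves 16 cores in total.

      ```bash
      #!/bin/bash
      #SBATCH --account=project_XXXXXXXXX
      #SBATCH --partition=small
      #SBATCH --nodes=1
      #SBATCH --ntasks=2          # 2 tasks (processes)
      #SBATCH --cpus-per-task=8   # 8 cores per task; note that Slurm does not
                                  # guarantee that these land on a single CCD

      srun ./my_program
      ```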

  2. I have been experiencing very long queueing times recently and have been warned that I am out of storage hours, although I have more or less cleaned my output (I am aware disk space and storage hours are not the same thing). I am wondering if these long queueing times are related to storage hours.

    • It's not related. Removing files when you run out of storage allocation (in TB/hours) does not give the TB/hours back: each file stored on LUMI consumes TB/hours from your allocation for as long as it is present on the system. When you delete a file, it stops being billed, but the TB/hours already consumed are definitively gone.

    Thanks. Then what is the way forward, as I have only spent 30% of my CPU hours so far? :)

    • Is your allocation granted by a consortium country or EuroHPC JU?

    Consortium country

    • Contact the resource allocator of your country to request additional TB/hours

    Will do so, thanks a lot :)

    • For the queue time, we are sorry, it's caused by three things:

      1. An oversubscription factor that was way too high.
      2. Lots of huge jobs in the queue. To start a big job, the scheduler has to collect nodes that in the meantime can only be used for backfill by small jobs, but we don't always have enough of those small jobs. So you may notice a lot of idle nodes while your job still does not start quickly.
      3. A lot of CPU nodes have been down lately, which, combined with the large oversubscription factor, creates long queueing times.
        The oversubscription factor will be reduced next year and the new LUMI-C nodes will be put into production at the end of October. This should improve the situation. Resource allocators have also been asked not to place extreme-scale projects on LUMI-C anymore, i.e. projects that require running lots of very large jobs (large in terms of node count).

    I see, thanks, then we need to live with it for now :)

  3. Out of curiosity, if LUMI is a GPU-first system, why offer (what remains a quite large amount of) CPU-only nodes?

    • I think there are many answers to that question. I guess that some are political, but the idea is also to support heterogeneous jobs, with some parts of a workflow running on CPU nodes and others running on GPUs.

    • Additionally, LUMI-C accounts for only about 2% of LUMI's compute performance, and LUMI-G for 98%.

  4. Same question as 1. What about when we run sbatch with --gpus-per-node=X, what are we allocating?

    • One GCD, so you can ask for a maximum of 8 per LUMI-G node.
  5. I've been told that GPUs process batches of 32 items in 1 cycle on NVIDIA (i.e. using a batch size of 33 first does 32 items in one cycle and then 1 item in a separate cycle). Is this the same on AMD? And is this a hardware feature, as one could assume?

    • AMD compute GPUs use 64-wide wavefronts, but there is a catch. In practice, a wavefront is divided into 4x16 work-items, matching the compute unit (CU) architecture, which features four 16-wide SIMD units. Each of these units is assigned a wavefront. Once assigned to a SIMD unit, a wavefront is processed in 4 cycles (16 work-items per cycle). As there are 4 SIMD units per CU, 4 wavefronts can be active at the same time in a CU, and the total throughput of a CU can be seen as one 64-wide wavefront per cycle per CU.

    • See the documentation for more details about the MI250x GPUs.

  6. How are GPU hours billed on standard-g and on small-g? Is it the number of GPU hours that you request for a job, using the argument #SBATCH --time, or is it the actual GPU usage per job, which is usually less than the requested hours?

    • For GPU compute, your project is allocated GPU-core-hours that are consumed when running jobs on the GPU nodes https://docs.lumi-supercomputer.eu/runjobs/lumi_env/billing/#gpu-billing

    • For the standard-g partition, where full nodes are allocated, the 4 GPU modules are billed. For the small-g and dev-g Slurm partitions, where allocation can be done at the level of Graphics Compute Dies (GCDs), you are billed at a rate of 0.5 per GCD allocated (a worked example follows after this item).

    Thanks! I understand this, but if e.g. I request #SBATCH --time=24h and my job fails after 2 hours, am I billed for 24h or for 2h?

    • You should be billed for 2 hours if your job is killed. Beware that there is a possibility that your job hangs when it fails; you would then be billed for the time it hangs as well.
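
    • To make the billing concrete, a hypothetical small-g request (account and program names are placeholders); the arithmetic in the comments follows the 0.5-per-GCD rate described above.

      ```bash
      #!/bin/bash
      #SBATCH --account=project_XXXXXXXXX
      #SBATCH --partition=small-g
      #SBATCH --nodes=1
      #SBATCH --gpus-per-node=2   # 2 GCDs (each GCD is half an MI250X module)
      #SBATCH --time=03:00:00

      # Maximum cost: 2 GCDs * 0.5 * 3 h = 3 GPU-hours.
      # If the job ends (or is killed) after 1 hour, only 1 GPU-hour is billed.
      srun ./my_gpu_program
      ```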

Programming Environment & modules

  1. The slide mentioned the cray-*-ofi modules for OFI. Do we still need to use the AWS OFI plugin mentioned in some compiling instructions?

    • Do you mean the AWS OFI plugin for RCCL? That is different from the cray-*-ofi modules. The craype-network-ofi module is meant to select the Cray MPI network backend. With the Slingshot-11 network, only libfabric/OFI is supported. In the past we had Slingshot-10 interconnects and the craype-network-ucx module could be used to select the UCX backend, but that is no longer supported. However, the craype-network-* modules are still useful, as they are basically a way to switch MPI on and off in the compiler wrapper (a short example follows after this item):

      • module load craype-network-none: disable MPI
      • module load craype-network-ofi: enable MPI
    • The AWS OFI plugin is a plugin for RCCL (the AMD GPU collective communication library, a replacement for NVIDIA's NCCL). This plugin is used so that RCCL can use the Slingshot-11 interconnect, which it does not support out of the box.
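
    • The MPI on/off switch mentioned above as a short sketch (source file names are placeholders):

      ```bash
      module load PrgEnv-gnu            # pick a programming environment
      module load craype-network-none   # the cc/CC/ftn wrappers no longer link cray-mpich
      cc serial_code.c -o serial_code

      module load craype-network-ofi    # MPI back on (libfabric/OFI backend)
      cc mpi_code.c -o mpi_code         # now linked against cray-mpich
      ```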

  2. If we need to compile a library with gcc to generate our executables with support for MPI, do we have to load all the corresponding Cray modules, or one of the PrgEnvs and the Cray MPI module?

    • Most of the time only loading PrgEnv-gnu is sufficient as the MPI module is loaded by default. The Cray compiler wrapper will automatically link to the correct MPI library for the Programming Environment you selected by loading a PrgEnv-* module.
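
    • A minimal sketch of that workflow (the source file names are placeholders):

      ```bash
      module load PrgEnv-gnu                        # GCC-based environment; the MPI module stays loaded
      cc  mpi_library_test.c -o mpi_library_test    # the cc wrapper adds the MPI include and link flags
      ftn mpi_program.f90 -o mpi_program            # same for Fortran; use CC for C++
      ```
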
  3. What does it mean that a module is hidden? It seems that it would be silently skipped; how can we change that state?

    • It means that the module is not listed in any searches by default because it might have problems or incompatibilities. You can display all modules, including the hidden ones, by loading the ModulePowerUser/LUMI module.
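
    • For example (a sketch; cray-mpich is just one module you could search for):

      ```bash
      module load ModulePowerUser/LUMI   # make hidden modules visible
      module avail cray-mpich            # hidden versions now show up in the listing
      module spider cray-mpich           # and in the spider output as well
      ```
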
  4. Do we have a PyTorch module?

    • Yes, as a user-installable package with EasyBuild (https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/) or in the CSC software collection (https://docs.lumi-supercomputer.eu/software/local/csc/).
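
    • A rough sketch of the EasyBuild-user route (the stack version and the exact easyconfig name are placeholders; check the LUMI Software Library page linked above for the current ones):

      ```bash
      module load LUMI/22.12 partition/G   # pick a software stack and the GPU partition
      module load EasyBuild-user           # user installation under $EBU_USER_PREFIX
      eb PyTorch-<version>.eb -r           # build and install, resolving dependencies
      module load PyTorch/<version>        # the module then shows up in "module avail"
      ```
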
  5. I just want to point out that the slides of today seem to be inaccessible, as are the ones from previous training days. E.g. https://lumi-supercomputer.github.io/LUMI-training-materials/1day-20230509/ and clicking on any "slides" link fails.

    • Works for me. Both the old and new slides load as expected.

    • I get network errors (PR_CONNECT_RESET_ERROR) in both Firefox and Chrome. Not a big issue in any case.

  6. Same issue as above. Can the slides be hosted somewhere else so that they are accessible to everyone?

    • You could download them on LUMI with wget and then copy them to your computer from there.

    This worked for me, but accessing them with three different browsers did not.

    • I also guess that it is something firewall-related, but it is weird that you can access LUMI via ssh but not LUMI-O.

    • I read that some browser extensions and some proxies can also cause this problem. Most of these connection problems are actually not caused by the server (LUMI-O for the slides and recordings) but by something between the server and your browser, or by the browser itself. It's a bit weird; we've been using LUMI-O as the source for big files for a while now and never got complaints or tickets.

      Sysadmins did note some random problems with LUMI-O recently and are investigating; it may or may not be related to that. But I (KL) have been pushing lots of data to and pulling lots of data from LUMI-O in the past days, from different computers and browsers, while preparing this course and the upcoming 4-day course, without issues.

LUMI Software Stacks

  1. What is the difference between lumi-container-wrapper/cotainr and Singularity containers?

    • Both of our tools use Singularity in the background but help you with creating the containers, so you don't have to build the container yourself.
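
    • As a rough sketch of how the two helpers are invoked (environment-file and directory names are placeholders; check the LUMI documentation for the exact options):

      ```bash
      # cotainr: build a container image directly from a conda environment file
      module load LUMI cotainr
      cotainr build my_env.sif --system=lumi-g --conda-env=my_env.yml

      # lumi-container-wrapper: wrap a pip/conda environment; you then call thin
      # wrapper executables from <install_dir>/bin while the files live inside an image
      module load LUMI lumi-container-wrapper
      pip-containerize new --prefix <install_dir> requirements.txt
      ```
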
  2. Let's say I want to build PyTorch (with GPU support of course). Am I understanding correctly that I should load PrgEnv-amd?

    • PrgEnv-amd, -cray and -gnu all work with rocm, which GPU-enabled codes rely on. Basically, the first two environments give you Clang/LLVM-based compilers. I doubt PyTorch requires Clang as a host compiler to be compiled for AMD GPUs.

    • For PyTorch the way to go is to use GNU for the host (CPU) compilation, in conjunction with the rocm module to have access to hipcc for GPU code compilation (a minimal environment sketch follows after this item). Compiling PyTorch with PrgEnv-cray or PrgEnv-amd is likely to fail because some packages use Intel-flavoured inline assembly that is not supported by Clang-based compilers.

      • Okay, got it. But the latest rocm module available is 5.3.3, which is quite old (the current release, which is very new, is 5.7.0). Do I need to compile my own ROCm as well?
    • The ROCm version is tied to the AMD GPU driver version. With the current SLES kernel version that the Cray OS is based on, ROCm versions > 5.4 are unfortunately not supported. A major system update with a new AMD GPU driver will happen at the end of the year at the earliest.

    • You can try other ROCm versions in containers. It does turn out that some newer versions work for some users even with the current driver; e.g., one of our AMD support people has used PyTorch with ROCm 5.5 in a container. The problem with newer ROCm versions is that (a) they are based around gcc 12, and the gcc 12 on the system is broken, so we cannot fully support it, and (b) the ROCm libraries are also used by, e.g., GPU-aware MPICH, and newer versions cause problems with cray-mpich.

      Recent ROCm versions show improvements, in particular in the support libraries for AI, so I can understand why, as a user of PyTorch or TensorFlow, you would like a newer version.

      We realise many users are frustrated by this. The problem is that installing ROCm on a supercomputer is not as simple as it is on a workstation. On LUMI, it triggers a chain of events: we need a newer GPU driver, which in turn needs a newer OS kernel. However, software on a supercomputer is managed through a management environment, and that management environment also needs an update to support the newer kernel. So in the end, updating ROCm requires updating the full software stack on LUMI, likely a week or more of downtime, and extensive testing before starting the process.

      There are only a few systems like LUMI in the world, which also implies that the installation procedures are not thoroughly tested and that whenever a big update of LUMI is done, problems show up. Nowadays we have a test system to detect most of the problems before they are rolled out on the big system, but in the early days we didn't, and the first big update, which was scheduled to take 3 weeks, took 7 weeks in the end due to problems... So I hope you can understand why a big machine like LUMI is not the best environment if you want the very latest. It is only a pity that there is no second, smaller development machine on which we could take more risks, as it wouldn't matter as much if that one were down for a few weeks.

    • Our AMD support person has also been building a properly set-up container for PyTorch. I'd have to check in the LUMI documentation where to find it, but that may be an easier way; compiling PyTorch can be tricky.
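
    • If you do want to try the GNU + ROCm combination yourself, the environment setup is roughly the following sketch (module versions and file names are assumptions, not a tested recipe):

      ```bash
      module load PrgEnv-gnu    # GCC for the host (CPU) code
      module load rocm          # provides hipcc and the ROCm libraries
      hipcc --offload-arch=gfx90a -c my_kernels.hip -o my_kernels.o    # MI250X is gfx90a
      CC -c my_host_code.cpp -o my_host_code.o                         # host code via the wrapper
      CC my_host_code.o my_kernels.o -L${ROCM_PATH}/lib -lamdhip64 -o my_app   # link the HIP runtime
      ```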

  3. Regarding the long queue times I asked about in question 2: would using the small partition instead of the standard partition help? Some of the runs literally take only 2 minutes; they prepare the model for the actual production run, which should run on standard nodes.

    • It may or may not, and the answer is time-dependent; the partitions are scheduled fairly independently. There was actually a mail to users after the last downtime asking to use small/small-g more for jobs that can run in those partitions. I'd have to check what the partition sizes are today, but the sysadmins at some point also moved some nodes from small to standard to try to balance the waiting times. If those 2-minute jobs are also very small in node count (which I assume, as you want to run them in small) and if the time you request is also very low (like 10 minutes or so, to be safe), they are ideal for backfill and may actually start quickly on standard/standard-g, using only time that would otherwise have been wasted. I've been rather successful with that strategy when preparing this and another course I am teaching about LUMI ;-) Sometimes my waiting times on standard/standard-g also became longer, and I assume an overloaded scheduler was partly to blame.

    So instead of asking for standard with a 48-hour request, asking for small with, say, 1 hour (or less?) does not really change anything? Another reason I use standard is that the model is hard-coded to run with 88 CPUs, whether it is a preparation run or a production run.

    • The element that really matters if you want your job to start quickly is to be realistic with the time you request. If you know that it is something that finishes quickly, don't request the maximum allowed time. On standard and standard-g there is usually quite some room to run programs as backfill: the scheduler will schedule lower-priority jobs on idle nodes that it is saving for a big job if it knows that they will have finished before it expects to have collected enough nodes for that big highest-priority job. If you always request the maximum wall time, even when you know it will not be needed, the job will never be used as backfill. But if you know a job will end in 5 minutes and then only request, say, 10 minutes to have some safety margin, there is a high chance that the scheduler will select it to fill up holes in the machine (see the sketch after this item). Nothing is worse for a scheduler than all users requesting the default maximum wall time rather than a realistic one, as it then has no room to play with to fill up gaps in the machine.

    • And there is yet another reason to be realistic with wall times. There are ways in which a job can crash where the scheduler fails to detect that the job has terminated and so keeps the allocation. It looks like, in particular on LUMI, some MPI crashes can remain undetected, probably because MPI fails to kill all processes involved. You will then be billed for the whole time that the job keeps the allocation, not just for the time before it crashed.
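
    • The backfill advice above as a concrete sketch (account and program names are placeholders): a 2-minute preparation run with a 10-minute limit is a good backfill candidate, while the same job submitted with --time=48:00:00 is not.

      ```bash
      #!/bin/bash
      #SBATCH --account=project_XXXXXXXXX
      #SBATCH --partition=standard
      #SBATCH --nodes=1
      #SBATCH --ntasks=88          # the 88 cores mentioned above (adjust tasks vs. threads to your model)
      #SBATCH --time=00:10:00      # realistic limit with some safety margin

      srun ./prepare_model
      ```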

Exercise session 1

/

Running jobs

  1. Why does the small partition allow allocating 256G of memory per node while debug allows only 224G?

    • It's because the high-memory nodes (512GB and 1TB) are in the small partition. Standard nodes in both small and debug have the same amount of memory available (224GB). If you go above that in the small partition, you will get an allocation on one of the high-memory nodes instead of a standard one (see the sketch after this item). Note that if you go above 2GB/core you will be billed for this extra memory usage. See here for details of the billing policy.

    I see, thanks. So from this I understand that when I request --partition=small --mem=256G, the job will not be assigned to a standard node; only high-memory nodes will be considered. It is not made clear on the CPU nodes - LUMI-C page that not all of the memory can be allocated. I assumed that I could request all 256G from a standard node.

    • It's explained here, but you are right that it's not stated in the hardware section. The reason is that in reality the nodes do have 256GB of memory, but part of it is reserved for the operating system. LUMI nodes are diskless, so we have to reserve quite a big chunk of memory to make sure the OS has enough space.

    How much memory can be allocated on 512G and 1TB nodes?

    • On all nodes of LUMI it is the physical amount of RAM minus 32 GB (so 480 GB and 992 GB for the high-memory nodes). For the LUMI-G nodes: 512 GB installed, 480 GB available.
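
    • A sketch of such a request (account and program names are placeholders): asking for more than 224G on small can only be satisfied by a high-memory node, and memory above 2 GB/core is billed extra.

      ```bash
      #!/bin/bash
      #SBATCH --account=project_XXXXXXXXX
      #SBATCH --partition=small
      #SBATCH --ntasks=1
      #SBATCH --cpus-per-task=16
      #SBATCH --mem=256G           # above 224G, so only a 512G or 1TB node qualifies
      #SBATCH --time=02:00:00

      srun ./memory_hungry_app
      ```
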
  2. Is the --mail-user functionality already working for LUMI slurm jobs? It is working on our national clusters, but so far hasn't worked for me on LUMI (with --mail-type=all)

    • Unfortunately, it's not active. The LUMI User Support Team has raised the issue multiple times (since the start of the LUMI-C pilot, actually), but the sysadmins have never made the necessary configuration changes. I understand it can be frustrating for users, as it's a very basic feature that should just work.
  3. Does --ntasks=X signify the number of srun calls, i.e. number of steps in a job?

    • No. It controls the number of tasks in a job step: srun creates one job step with multiple tasks, each task basically being a copy of a process that is started. It is possible, though, to ask sbatch for, e.g., 5 tasks with 4 cores each and then use multiple srun commands, each creating 1 task with 4 cores. Unfortunately, we do see problems with network configuration when trying to run multiple job steps simultaneously with multiple srun commands (by starting them in the background with an & and then waiting until all have ended); see the sketch after this item.

      You would use --ntasks=X, e.g., to start an MPI job with X ranks.

    I am confused about what happens when you define, say, --ntasks=8 and --cpus-per-task=2. Are we then allocating 16 cores or 8 cores?

    • 16 cores. Each task can then use 2 cores, which would be the case for a hybrid MPI/OpenMP job. It also guarantees that these cores come in groups of 2; on small you would otherwise have no guarantee that all cores are on a single node, as it may instead use cores from several nodes.
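
    • The sketch referred to above (program names are placeholders): the allocation is 8 tasks x 2 cores = 16 cores, used either as one job step or split over simultaneous job steps.

      ```bash
      #!/bin/bash
      #SBATCH --account=project_XXXXXXXXX
      #SBATCH --partition=small
      #SBATCH --ntasks=8
      #SBATCH --cpus-per-task=2

      # One job step using all 8 tasks (e.g. a hybrid MPI/OpenMP run):
      srun ./hybrid_app

      # Or several simultaneous job steps, each using part of the allocation
      # (the pattern that sometimes hits the network-configuration issues mentioned
      # above; --exact asks Slurm to give each step only what it requests):
      srun --exact --ntasks=4 --cpus-per-task=2 ./part_a &
      srun --exact --ntasks=4 --cpus-per-task=2 ./part_b &
      wait
      ```
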
  4. I just want to comment on the question about --mail-user. In Norway, on the Saga HPC cluster, it was in use until a user directed the emails from a large job array to the Saga helpdesk, which filled the helpdesk mail server. It was then decided to disable the functionality.

    • Even without users redirecting them to the help desk, there are potential problems. Not so much on LUMI, as we have a very low limit on the number of jobs, but no mail administrator would be happy with a user running a large array job on the cluster, as each job in the array is a separate job and would send a mail. Imagine a user doing throughput computing and starting a few thousand jobs in a short time... It might actually lead to mail systems thinking they are being spammed.

      Another problem is failures of the receiver etc.

      And another problem on LUMI is what to do if no mail address is given in the job script. Due to the way user accounts are created on LUMI (there are several channels), it is not as easy as on some university systems to link a mail address to a userID.

  5. There was a comment that using --gpus-per-task was tricky. Perhaps I missed it, what was the pitfall of using it?

    • The problem is the way in which Slurm does GPU binding, which is not compatible with GPU-aware MPI. I'm not sure how technical you are, and hence whether my answer will make sense, but let's try.

      For CPUs, Linux has two mechanisms to limit which cores a process or its threads can use. One is so-called control groups and the other is affinity masks. Control groups really limit what a process can see, and Slurm uses them at the job step level, with one control group shared by all tasks of a job step on a node. That means that the tasks (processes) on a node can, e.g., share memory, which is used to communicate through memory. At the task/process level, affinity masks are used, which do not block sharing memory, etc.

      For GPU binding there are also two mechanisms. One is a Linux mechanism, again control groups. The other is via the ROCm runtime, through the ROCR_VISIBLE_DEVICES variable mentioned during the presentation. You can compare this a little bit with affinity masks, except that it is not OS-controlled and hence can be overridden. The problem with --gpus-per-task is that Slurm uses both mechanisms, and uses them both at the task level. The consequence is that two tasks cannot see each other's memory, and hence communication via shared GPU memory is no longer possible. The funny thing is that Slurm will in some cases still set ROCR_VISIBLE_DEVICES as well. So it is basically a bug or feature in the way Slurm works with AMD GPUs: it should use control groups only at the job step level, not at the task level, and then things could work.

    I don't use MPI, I only have ML applications in Python. Is this still a relevant problem?

    • Yes, if you have multiple tasks. I gave MPI as the example, but it holds for all communication mechanisms that go via shared memory for efficiency; RCCL, for instance, will also be affected. If you have something with --ntasks=1 it should not matter, though.
  6. I am using torch.distributed.run() for starting my multi-GPU computation. I provide --ntasks=1 (I use only a single node). Then, as a parameter to torch.distributed.run, I give --nproc_per_node=#NUM_GPUS. AFAIK, torch.distributed.run then starts #NUM_GPUS processes. Does this cause binding problems? If so, can I somehow provide a custom mapping for this setup?

    • Torch will have to do the binding itself if it starts the processes. Our PyTorch expert is not in the call, though, so I'm not sure about the right answer.

      Does it start #NUM_GPUS processes because it has that many GPUs or because it has that many cores? If it is the former, I would actually consider giving Torch access to all CPU cores.

      I suspect Torch could benefit from a proper mapping, not only a proper CPU-to-GPU mapping but even a correct ordering of the GPUs. I understand that RCCL often communicates in a ring fashion, so it would be useful to exploit the fact that there are rings hidden in the topology of the node. But I don't think that anybody in our team has ever experimented with that.

    One process per GPU. Thanks! Something that I will have to look into..

  7. What do the numbers in the srun --cpu-bind option represent, e.g. fe00?

    • These are hexadecimal numbers where each bit represents a core (actually a hardware thread, to be precise), with the lowest-order bit representing core 0. So for fe00: do not use cores 0-7 (the last two 0's, i.e. 8 zero bits); then the e corresponds to the bit pattern 1110, so do not use core 8 but use cores 9, 10 and 11; and f corresponds to the bit pattern 1111, which means use cores 12, 13, 14 and 15. So effectively this concrete example means: use CCD 1 (they are numbered from 0), except for the first core of that CCD, which cannot be used because it is set aside for the OS and not available to Slurm.
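
    • As a sketch of how such masks are used (the masks and the program name are just an example), one mask is given per task, in task order:

      ```bash
      # Task 0 gets cores 9-15 (CCD 1 minus its first core), task 1 gets cores 25-31
      # (CCD 3 minus its first core):
      srun --ntasks=2 --cpu-bind=mask_cpu:0xfe00,0xfe000000 ./my_app
      ```
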
  8. Adding to the previous question: would this specific example of a CPU-binding scheme also work for jobs on the small-g partition?

    • Only if you request the whole node with --exclusive. Manual binding is only possible if you have access to the full node, as otherwise you cannot know which subset of cores is assigned to your application, and currently Slurm is not capable of making sure that you get a reasonable set of cores and matching GPUs on the small-g partition. That is one of the reasons why the small-g partition is so small: it is not a very good way to work with the GPUs.
  9. To which NUMA domain should I bind which GCD? I remember it was not NUMA domain 0 to GCD 0, etc.
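
    • Without asserting the exact mapping here: you can query it yourself on a GPU node, e.g. with rocm-smi (a sketch; run it inside a GPU allocation):

      ```bash
      module load rocm
      rocm-smi --showtopo    # the topology section lists, per GCD, its NUMA node / CPU affinity
      ```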

Exercises 2

  1. Will we get some information today on how to (in practice) profile (and afterwards, improve) existing code for use on LUMI?

    • No. Profiling is a big topic in our 4-day courses, which we run two or three times a year. However, if you have a userid on LUMI, you have access to recordings of previous presentations. Check the material from the course in Tallinn in June 2023; we will soon have new material after the course in Warsaw in two weeks. That course runs October 3-6, but I'm not sure if it is still possible to join: on-site is full, I believe, and online is also pretty full.

      There is also some material from a profiling course in April 2023, but especially the HPE part there was a bit more "philosophical", discussing how to interpret data from profiles and how to use that to improve your application.

    Thank you very much, that material will be very useful!

    • If you are interested in GPU profiling, there are also some examples on Rocprof and Omnitrace here: https://hackmd.io/@gmarkoma/lumi_training_ee#Rocprof (part of the materials from the course in Tallinn).

Introduction to Lustre

  1. What is the default striping behaviour if we write a file without calling lfs setstripe?

    • By default only a single OST will be used. This is to avoid problems with users who don't understand Lustre and create lots of small files. The more OSTs a file is spread over, the more servers the metadata server has to talk to when opening and closing the file, and if those OSTs are not needed anyway, this is a waste of resources. It may not seem logical, though, on a system that is built for large applications and large files...

      However, I'm sure there are plenty of people in the course who in practice dump a dataset on LUMI as thousands of 100 kB files and refuse to make the effort to use a structured file format to host the whole dataset in a single file. And then there are those Conda installations with 100k small files.
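
    • For reference, the commands involved look roughly like this (paths and stripe counts are placeholders):

      ```bash
      lfs getstripe /scratch/project_XXXXXXXXX/my_output.h5      # show the current layout of a file
      lfs setstripe -c 4 /scratch/project_XXXXXXXXX/big_output/  # new files in this directory use 4 OSTs
      lfs setstripe -c -1 /scratch/project_XXXXXXXXX/huge_files/ # -1 = stripe over all available OSTs
      ```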

  2. Does striping matter only if I/O is the bottleneck?

    • Mostly. But then we have users who write files that are literally hundreds of GB, and then it really matters. One user reported 50 GB/s on LUMI-P after optimising the striping for the file...
  3. I just checked: my Python venv seems to contain ~6k files. I did not know about the possibility to containerize it before today. Is it worth doing in this case, and if not, how many files should I have before containerizing?

    • It's hard to put a hard number on it, as it also depends on how the files are used. We tend to consider 6k as still acceptable, though it is a lot. It also depends on how you use them: if you run jobs that would start that Python process on hundreds of cores simultaneously, it is of course a bigger problem than if you have only one instance of Python running at a time.

      But as a reference: during the course, HPE Cray mentioned that one LUMI-P file system is capable of roughly 200k metadata operations per second, which is not much and surprisingly little if you compare it to what you can do on a local SSD in a laptop. IOPS don't scale well when you try to build larger storage systems.

      If your venv works well with lumi-container-wrapper, it may not be much work to test whether it is worth it (see the sketch after this item).

    • It is also not just a LUMI thing but a problem on all large supercomputers. When I worked on a PRACE project on a cluster at BSC, I had a Python installation that took 30 minutes to install on the parallel file system but installed in 10s or so on my laptop...
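
    • The sketch referred to above: turning an existing venv into a containerized installation with lumi-container-wrapper (paths are placeholders; check the LUMI documentation for the exact workflow).

      ```bash
      # Export the package list of the existing venv
      source /path/to/venv/bin/activate
      pip freeze > requirements.txt
      deactivate

      # Rebuild it inside a container; thin wrapper executables appear under <prefix>/bin
      module load LUMI lumi-container-wrapper
      pip-containerize new --prefix /projappl/project_XXXXXXXXX/my-python-env requirements.txt
      export PATH=/projappl/project_XXXXXXXXX/my-python-env/bin:$PATH
      # Lustre now sees one image instead of ~6k small files.
      ```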

LUMI support

  1. Does LUMI support Jupyter notebooks or have a JupyterHub? One of the tasks of my project is to create catalogs / Jupyter notebooks for the data generated in the project.

    • No official support, but we know that users have gotten notebooks to work. Something will come with Open OnDemand, but no date has been set yet for its availability. After all, LUMI is first and foremost a system to process large batch jobs, not a system for interactive work or a workstation replacement, so this does not have a high priority for us.

    • Since the data cannot stay on LUMI after your project ends - LUMI is not a data archive solution - I wonder whether LUMI is even the ideal machine to develop those notebooks on, or whether that should be done on the machine where the data will ultimately land.

    • And if the data is really important to you: Please be aware that there are no backups on LUMI!

  2. Does LUMI have interactive nodes through VNC (as in the exercise) to use the visualization nodes interactively?

    • VNC is available through the lumi-vnc module, which contains help on how it works. But otherwise the visualisation nodes are still pretty broken; we are not sure if vglrun actually works. As with the previous question, it is not a very high priority at the moment, neither for LUST nor for CSC, who has to do much of the basic setup. Support should improve when Open OnDemand becomes available, which CSC is working on.

      Light VNC sessions can run on the login nodes, but you can always start an interactive job on a compute node and start VNC there; the start-vnc script will tell you how to connect to the VNC server from the outside, using either a VNC client (the server is TurboVNC) or a web browser (less efficient, though, for heavy graphics).
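
      A minimal sketch of getting a session going (the module and the start-vnc script are the ones mentioned above; the connection details are printed by the script itself):

      ```bash
      module load lumi-vnc
      module help lumi-vnc   # usage notes for the VNC setup
      start-vnc              # starts TurboVNC and prints how to connect (VNC client or browser)
      ```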

  3. Are the tickets publicly viewable? Is there any plan to add some public issue tracker system? We have something like this on our local cluster, and it's quite nice for seeing what the current problems are and what is being done about them.

    • There are no such plans at the moment. Security and privacy are big issues, and since LUMI accounts come onto the system via so many routes, organising login for such a tracker would also not be easy, as we would have to interface with multiple IdM systems. We do have a status page at https://www.lumi-supercomputer.eu/lumi-service-status/ but it is also limited.

General Q&A

  1. One question regarding Slurm job scripts: on our clusters I use the command seff $SLURM_JOBID at the end of the job script to get output on the consumed resources. But I think seff is not available on LUMI?

    • It is not on LUMI. It is actually an optional command and not part of core Slurm. We've tested seff, and it turns out that the numbers it produces on LUMI are wrong because it doesn't deal correctly with the way we do hyperthreading and how that is reported in the Slurm database.

      If you really want to try: there is a seff package in the LUMI Software Library, but don't send us tickets about the wrong output. We know the output is wrong in most cases.
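
      If all you need is a quick post-mortem of a finished job, plain sacct (which is available on LUMI) gives part of what seff reports, e.g.:

      ```bash
      sacct -j <jobid> --format=JobID,JobName,Partition,Elapsed,TotalCPU,MaxRSS,State
      ```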