
Questions session 16 May 2023

LUMI Architecture

  1. The slides say the GPUs have 128 GB of memory, but when I queried the information with the HIP framework it only returned 64 GB. Does it differ on different partitions or something?

    • Kurt will go into more detail soon. Each GPU card has 128GB, but each card consists of 2 dies (basically independent GPUs), and each of those has 64GB. So in practice each LUMI-G node has 8 GPUs on 4 cards.
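
      As a quick check on a GPU node, rocm-smi lists 8 devices, each with roughly 64 GB of VRAM (a sketch; it assumes the rocm module puts the tool in your path):

      module load rocm
      rocm-smi --showmeminfo vram
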
  2. This is maybe a better question for when you will discuss software, but seeing the AMD hardware, my question is how compatible the GPU partitions/software are with DL frameworks such as Tensorflow and PyTorch. Are the systems fully ROCm compatible? Are there downsides compared to CUDA implementations?

    • Let's discuss that later in detail. But short answer: ROCm is not as mature as CUDA yet but most DL frameworks work already quite well.
  3. I believe the GPU nodes have 64 cores. But from the slides I understood that the nodes have 1 CPU with 8 cores. Just as a note: this is the output from HIP:

      Device 0:
      Total Global Memory: 63.9844 GB
      Compute Capability: 9.0
      Max Threads per Block: 1024
      Multiprocessor Count: 110
      Max Threads per Multiprocessor: 2048
      Clock Rate: 1700 MHz
      Memory Clock Rate: 1600 MHz
      Memory Bus Width: 4096 bits
      L2 Cache Size: 8 MB
      Total Constant Memory: 2048 MB
      Warp Size: 64
      Concurrent Kernels: Yes
      ECC Enabled: No
      Unified Memory: No
    

    • The slide was about the CPU in the node. The basic element of the GPU is the "Compute Unit" (CU), the "Multiprocessor Count" in the output above. One CU has 4 16-wide SIMD units and 4 matrix core units. AMD doesn't use the word "core" very often in the context of GPUs, as what NVIDIA calls a core corresponds to what would be called an Arithmetic and Logic Unit in a CPU, which is only a part of a core.

Cray Programming Environment

  1. What is called underneath when compiling with hipcc? The ROCm compiler, I assume? - @bcsj

    • Actually it is AMD's branch of Clang.

    • @bcsj: sidenote, I've had trouble compiling GPU code with the CC compiler, which I assume calls something else underneath. The code would run, but a profiler showed that a lot of memory was being allocated when it shouldn't be. Compiling with hipcc fixed this issue.

      • I believe CC uses a different Clang frontend, namely Cray's branch of Clang with a slightly different codebase. In the end it should use the same ROCm backend. It may require more debugging to understand the problem.

      • @bcsj: possibly ... it was some very simple kernels though, so I'm pretty confident it was not a kernel issue. The profiling was done using APEX and rocprofiler. Kevin Huck helped me, during the TAU/APEX course.

      • The frontend that CC uses depends on which modules you have loaded. If the cce compiler module is loaded, it will use the Cray Clang frontend, while if the amd compiler module is loaded it will use the AMD ROCm C++ compiler.
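
        A small sketch of switching the frontend behind the CC wrapper (standard Cray PE module names; versions omitted):

        module load PrgEnv-cray              # CC wraps the Cray (cce) Clang frontend
        CC --version
        module swap PrgEnv-cray PrgEnv-amd   # CC now wraps the AMD ROCm compiler
        CC --version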

    • @mszpindler Could you please post a support ticket so that we can investigate the memory issues, thanks.

    • @bcsj: It's been a while and I restructured my environment to avoid the issue, but I'll see if I can find the setup I used before.

  2. What is the policy on changes to these centrally installed and supported modules? Are versions guaranteed to be available for a certain period of time?

    • Unfortunately not at the moment. We hope to be able to provide LTS (long-term support) versions in the future, but these are not yet vendor supported. But we will always inform you about changes to the software stack.

Module System

  1. Are open source simulation software such as quantum espresso centrally installed on LUMI?

    • Please use the Software Library https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/q/QuantumESPRESSO/ to understand what the policy is.

    • Short answer: No, almost no software (except debuggers and profilers) is installed globally. But you will soon see that it is very easy to install software yourself with EasyBuild using our easyconfigs (basically build recipes).
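
      A minimal sketch of that workflow (the stack version and easyconfig name below are only illustrative; take the exact names from the Software Library page):

      module load LUMI/22.08 partition/C            # pick a LUMI stack and partition
      module load EasyBuild-user
      eb --search QuantumESPRESSO                   # list the available build recipes
      eb QuantumESPRESSO-7.1-cpeGNU-22.08.eb -r     # hypothetical recipe name
      module avail QuantumESPRESSO                  # the new module appears in your user stack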

  2. The program I am using and developing requires certain dependencies such as paralution, PETSc, Boost... Is it possible to manually install these dependencies if not available among the modules listed?

    • Yes, it is actually very easy and will be the topic of the next session (and we have some configurations of Boost on the system, try module spider Boost...)

    • We have an installation recipe for one configuration of PETSc also. And PETSc is one of those libraries that does strange things behind the scenes that can cause it to stop working after a system update... We have not tried PARALUTION. But there are an estimated 40,000 scientific software packages, not counting the 300,000+ packages on PyPI, R packages, etc., so there is no way any system can support them all.

    • @mszpindler Please try PETSc with the recipes from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PETSc/ If you see wrong behaviour then please open a support ticket.

  3. Can I use modules that are available on CSC supercomputers?

    • Basically we use a different set of modules and even a different primary system to manage the modules. The machines are different, the team managing the machine is different, and the licenses for software on Puhti or Mahti do not always allow use on LUMI, and certainly not for all users on LUMI, while we have no way to control who has access and who has not.
  4. If I find out that my application is able to work in different programming environments (different compilers), which should I prefer? Cray?

    • There is no answer that is always right; you have to benchmark to know.

      And even for packages for which we offer build recipes we cannot do that benchmarking, as it requires domain knowledge to develop proper benchmarks. And the answer may even differ depending on the test case.

      • OK, test with Cray first...
    • Again, if you look at the Software Library https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/q/QuantumESPRESSO/ you can see recipe (easyconfig) versions for specific programming environments. Those listed there are supported.

Using and installing software

  1. In our own tier-1 I've always used virtual envs successfully. The reason that I do not rely on containers is that I often work on editable installs that I pull from git. Looking at the documentation (https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/#plain-pip-installations) it seems that this is not supported. In other words: I would like to be able to create an environment, install an editable Python git repo, and when an update comes to the remote repo, just pull it and keep using the env. If I understand correctly I would need to rebuild the container after every git pull on LUMI?

    • It is actually a CSC development called Tykky https://docs.csc.fi/computing/containers/tykky/ and I believe you can try pip-containerize update ... to update an existing image, but you need to check whether it works with your specific environment.
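
      A hedged sketch of the Tykky workflow on LUMI (module, command and flag names as in the CSC/LUMI container-wrapper documentation at the time of writing; paths are hypothetical):

      module load LUMI lumi-container-wrapper
      pip-containerize new --prefix /projappl/project_465000XXX/myenv requirements.txt
      # later, to update the packages inside the existing wrapper installation:
      pip-containerize update /projappl/project_465000XXX/myenv --post-install post.sh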

    • A more involved option: you can spread a Python installation over multiple directories. So you could put all dependencies of your software in a container and mount a directory from outside in which you install the software you're developing. That would already reduce the load of small packages on the file system.

      The speed difference can be dramatic. On the installation side I've worked on 5 or 6 big clusters in Europe. On the slowest it took me 2 hours to install a bunch of Python packages that installed in under 30 seconds on the SSD of my laptop... Measuring the speed difference while running is more difficult, as I couldn't run the benchmarks on my PC and there are other factors in play also. But CSC reports that a 30% speedup is not uncommon from using Tykky.

    • I have successfully installed Python repos as editable by setting PYTHONUSERBASE to point to the project directory and then using both the --user and --editable flags with pip.
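
      A sketch of that approach (the paths are hypothetical):

      export PYTHONUSERBASE=/project/project_465000XXX/myenv   # hypothetical install location
      export PATH=$PYTHONUSERBASE/bin:$PATH                    # for any installed entry points
      pip install --user --editable /path/to/my-repo           # hypothetical git checkout
      # after a "git pull" in the repo, the environment keeps working without reinstalling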

    • There is some information on containers also in the "Additional Software on LUMI" talks in our 4-day course. The latest for which notes and a recording are available is the presentation during the February 2023 course.

  2. Sorry, probably you are going to speak about containers later. I'm just interested in the opportunity to use one particular piece of commercial software from within, say, a Singularity container on LUMI. So, if I sort out the license server issues, is it possible for virtually any kind of software, or are there limitations? If yes, is it, overall, a good idea? Does it make the performance of the software deteriorate?

    • Unless there is MPI involved you can try to run any container image, say from Docker Hub, on LUMI. For MPI-parallel software packages it may still work on a single node, but do not expect them to run on multiple compute nodes.
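
      A minimal sketch for a non-MPI container pulled from Docker Hub (the image is just an example):

      singularity pull docker://ubuntu:22.04
      # bind-mount your scratch or project directory so the container sees your data:
      singularity exec -B /scratch/project_465000XXX ubuntu_22.04.sif cat /etc/os-release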

    • There is some information on containers also in the "Additional Software on LUMI" talks in our 4-day course. The latest for which notes and a recording are available is the presentation during the February 2023 course.

Exercises

  1. Looking at the Exercise 1.1 solution, I don't really see any difference between the outputs from module spider Bison and module spider Bison/3.8.2, but from how the solution reads, it seems like they shouldn't be the same?

    • That's because there were two different versions on the system when I prepared the exercise. It found a version generated with Spack also and it looks like the person managing that stack disabled that module again. If module spider finds only a single version it produces the output it would produce when called with the full version as the argument.

      I wanted to use it as an example to show you how you can quickly see from the module spider output if a program comes from the LUMI stacks or from Spack...

      • ok, thanks
    • You can see the different behaviours with, for example, module spider buildtools

  2. Is there a module similar to cray-libsci for loading/linking hipBLAS? So far I've been doing that manually. I tried module spider hipblas but that returns nothing.

    • No, there is not. AMD doesn't polish its environment as much as HPE Cray does.

      Also, all of ROCm is in a single module, and as several of those modules are provided by HPE and AMD, they don't contain the necessary information to search for components with module spider.
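
      So linking stays a manual step; a sketch (it assumes the rocm module sets ROCM_PATH, and my_code.cpp is a placeholder):

      module load rocm
      hipcc my_code.cpp -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lhipblas -o my_code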

  3. Is there a similar tool to csc-projects for checking info on your projects?

    • Yes, please try the lumi-allocations command-line tool.
  4. Perhaps I missed this earlier, but what is LUMI-D?

    • It is the name that was used for the "Data Analytics and Visualisation partition", which basically turns out to be two partitions that need a different setup. It consists of 8 nodes with 4 TB of memory that have the same architecture as the login nodes, and hence can use software built for what we will later in the course call partition/L, and 8 nodes with 8 NVIDIA A30 GPUs each for visualisation.

Running jobs with Slurm

  1. Would it be possible to show an example of how to run R interactively on a compute node?

    • If you don't need X11, one way is

      srun -n 1 -c 1 -t 1:00:00 -p small -A project_465000XXX --pty bash
      module load cray-R
      R
      

      which would ask for 1 core for 1 hour. For some shared-memory parallel processing you'd use a larger value for -c, but you may have to set environment variables to tell R to use more threads.
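
      For example, for 8 cores with OpenMP-based R packages (which environment variable matters depends on the package; OMP_NUM_THREADS only covers the common OpenMP case):

      srun -n 1 -c 8 -t 1:00:00 -p small -A project_465000XXX --pty bash
      module load cray-R
      export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
      R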

  2. Does RStudio Server exist as a module on LUMI? Is it planned to be added at some point? Is it possible for a normal user to install it?

    • We expect to be able to offer RStudio with Open OnDemand, but no time is set for now. But as with any batch system, LUMI is not that well suited for interactive work as there is no guarantee that your interactive session will start immediately. Some resources will be set aside for Open OnDemand and I assume it will be possible to oversubscribe some resources. But LUST is not currently involved with the setup so there is not much we can say.
  3. Maybe a beginner question... When should one work with GPU nodes, and when is it less efficient to use them? Does it depend on the structure of the data or the software used? Is it possible to use GPU nodes with R scripts?

    • GPU compute requires software that explicitly supports compute on GPUs. GPUs are never used automatically. Base R does not use GPU compute. Some R packages may, as in principle some large linear algebra operations used in statistics can benefit from GPU acceleration. However, they will have to support AMD GPUs and not only NVIDIA GPUs.

      There is no simple answer when GPU compute offers benefits and when it does not as it depends a lot on the software that is being used also.

  4. How to monitor the progress / resources use (e.g. how much RAM / # cores are actually used) of currently running and finished batch jobs?

    • The Slurm sstat command can give information on running jobs. Once the job has finished, sacct can be used. Both commands have very customisable output, but they will not tell you core by core how each core was used. For that you need to do active profiling.
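
      A hedged example of both (the format fields are standard Slurm ones; replace the job ID):

      sstat -j 1234567 --format=JobID,AveCPU,AveRSS,MaxRSS           # while the job runs
      sacct -j 1234567 --format=JobID,Elapsed,TotalCPU,MaxRSS,State  # after it has finished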

    • You can also attach an interactive job step to a running job (as shown on slide 9). I'm not sure rocm-smi works reliably in that case (it didn't always in the past), but commands like top and htop (the latter provided by the systools module) should work to monitor the CPU use.
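
      One way to attach such a step (a sketch; --overlap lets the extra step share the resources already allocated to the job):

      srun --overlap --pty --jobid=1234567 bash   # replace with your own job ID
      module load systools
      htop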

  5. Do job arrays start the number of processes we specified with --ntasks=1, or will they open 16 independent jobs with 1 process for each job?

    • A job array with 16 elements in the job array is 16 jobs for the scheduler if that is what you mean. If you want multiple processes in one job you'd submit a single job for the combined resources of all 16 jobs and then use srun to start 16 processes in that job.

    What happens if we choose a job array 1-16 and --ntasks=2? It will use 16 jobs and each job has 2 tasks, right?

    • Yes. After all some people may want to use a job array for management where each element of the job array is an MPI program.
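
      A sketch of that second case as a batch script (my_mpi_program is a placeholder): each of the 16 array elements is a separate job that starts 2 tasks with srun.

      #!/bin/bash
      #SBATCH --partition=small
      #SBATCH --account=project_465000XXX
      #SBATCH --array=1-16
      #SBATCH --ntasks=2
      #SBATCH --time=00:10:00
      srun ./my_mpi_program ${SLURM_ARRAY_TASK_ID}   # 2 MPI ranks per array element
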
  6. Is the limit "total jobs in the queue" or "total concurrently running"?

    • There are two different limits, both shown in the LUMI documentation. There is a limit on the number of jobs running concurrently and a limit on the number of jobs in the queue (which includes running jobs). The latter limit is also rather low. LUMI is a very large cluster and the total number of jobs that a scheduler can handle for all users together is limited; schedulers don't scale well. That is why the limit on the number of jobs is considerably lower than it typically is on small clusters.

      The right way to deal with parallelism on big systems is to use a hierarchy, and this holds for job management also: use the scheduler to allocate bigger chunks of the machine and then run another tool to create parallelism within the job.

  7. Out of curiosity, how did the GPU-id and NUMA-id misalignment occur? It somehow surprises me that (if?) it is consistently wired in this same "weird" manner across all the nodes.

    • Luckily it is consistent. I'm pretty sure it is basically a result of the motherboard design. The connections between the GPUs determine which GPU considers itself GPU 0 during the boot procedure, while the connections between the GPU socket and CPU socket determine the mapping between CCDs and GPU dies.

      • Interesting, thanks for the answer!
  8. Regarding the containers and MPI... If I want to run my software on multiple nodes from within a Singularity container with decent scalability, I should use the host (LUMI's) MPI, right? Then which MPI should I make this software work with? Open MPI and Intel MPI don't work on LUMI, right? What should I aim for then?

    • Open MPI doesn't work very well at the moment. Some people got enough performance from it for CPU codes though, and I worked on a recent ticket where it worked well for a user within a node.

      HPE Cray is also working on a translation layer from Open MPI 4.1 to Cray MPICH.

    • We know that Intel internally must have an MPI that is compatible with LUMI as they are building the software environment for the USA Aurora supercomputer that uses the same interconnect as LUMI. But so far experiments with the versions distributed with oneAPI etc. have been a disappointment. It might be possible to try to force the application compiled with Intel MPI to use Cray MPICH as they should have the same binary interface.

    • But the MPI on LUMI is basically compatible with the ABI from MPICH 3.4.

    • The main advice though is to avoid containers and software that comes as binaries when you want to use MPI. It is often a pain to get the software to work properly. Containers are good for some level of portability between sufficiently similar machines with a close enough OS kernel, the same hardware and the same kernel modules, and were never meant to be portable in all cases. They work very well in, e.g., a cluster management environment (and they are used on the LUMI management nodes), but there you know that the containers will be moving between identical hardware (or very similar hardware if the vendor provides them ready-to-run).

  9. Is it possible to list all nodes with their resource occupation? I want to see how many GPUs / memory is available on different nodes.

    • All GPU nodes are identical; there is only one node type. And the majority of the GPU nodes are job-exclusive, so another job could not even start on them.

      • So if my application utilizes only 1 GPU, it will still hold the whole node with all GPUs?
    • It depends on what partition you use. If you use standard-g, yes, and you will pay for the full node. On small-g you are billed based on a combination of memory use, core use and GPU use (if, e.g., you ask for half the CPU memory of a node, you basically make it impossible for others to efficiently use half of the GPUs and half of the CPU cores, so you would be billed for half a node).

      But it makes no sense to try to be cleverer than the scheduler and think that "look, there are two GPUs free on that node so if I now submit a job that requires only 2 GPUs it will run immediately" as the scheduler may already have reserved those resources for another job for which it is gathering enough GPUs or nodes to run.

  10. Is it possible for us as users to see the current status of the LUMI nodes (e.g. using a dashboard or a command line)? I mean, how many nodes are in use or available? How many jobs are currently queuing (including those of other users)? I just need to know the expected waiting time for my jobs.

    • sinfo gives some information, but it tells nothing about the queueing time for a job. Any command that would give such information is basically a random number generator: other jobs can end sooner, making resources available earlier, or other users with higher-priority jobs may enter the queue and push your job further back.
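
      Some commands that give a rough picture (they still say nothing reliable about waiting times):

      sinfo -s -p standard-g                       # summary of node states in a partition
      squeue -h -p standard-g -t PENDING | wc -l   # rough count of pending jobs there
      squeue --me                                  # state of your own jobs
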
  11. Are nodes assigned exclusively to projects? I mean, if I am submitting a job for a particular project, would I have access to most of the resources or to specific nodes? Is there a quota per project, and how do I know how much of this quota is used?

    • Nodes are either shared among jobs or exclusively assigned to jobs depending on the partition.

    • Each job is also attached to a project for billing purposes. Your resource allocator should have given information about that and billing is documented well in the LUMI documentation. Maciej showed the command to check how much you have consumed (lumi-allocations).

    • And each job also runs as a user which determines what you can access on the system.

  12. How is priority determined? Does submitting more jobs, or jobs that consume more time or memory, result in lower priority? Is priority determined per user or per project?

    • We don't have the precise details, and it would make no sense to publish them either, as it is a complicated formula that can be adjusted if the need arises to ensure fair use of LUMI. Priority is a property of a job, not of a project or of a user, but it can be influenced by factors that are user- or project-dependent, like fair share. That is actually difficult on LUMI due to the very different sizes of allocations: some projects have 100 times more compute time than others, so the scheduler also has to make sure that users with huge projects can run jobs often enough.
  13. Is it possible to submit a job that may take more than 3 days?

    • No, and we make no exceptions at all. This is on the one hand to prevent monopolisation of a partition by a single user, and on the other hand because long jobs create a maintenance nightmare: a rolling update of the system can take as long as the longest running job.

      Moreover, it is also a protection against yourself as large systems are inherently less stable than smaller systems so there is a much higher chance that your long running job may fail an hour before it would have finished.

  14. If I allocate only a single GPU, will I automatically get assigned a GPU and a CPU which are "close" in the setup? (Assuming such an allocation is available)

    • Not sure about that and I even doubt it at the moment. It was definitely not the case before the upgrade, but it looks like the scheduler is still not doing a proper job. The only way to properly control binding is on exclusive nodes.

      You could try to run 8 subjobs in a job each using their own GPU, but then at the moment you may be hit with another scheduler bug for which we are still waiting for a fix from HPE.

  15. Maybe I'm missing a point, but re slide #32: why do we want to skip core 8, 16, 24, 32, ...?

    • For reasons of symmetry, as core 0 cannot be used since it is reserved. It would be a bit strange to use 7 threads on CCD 0 and then 8 threads on each of the others.

    Okay, so it is because you don't want other tasks to behave differently from the first one, which only got 7 cores. Am I understanding that right?

    • If these would be 8 fully independent programs it makes sense to use a different set of resources for each. But in an MPI program it is often easier if each rank has the same amount of resources. And after all the speed of your parallel process is determined by the slowest of the processes anyway.

    Okay, I think that makes sense to me.

    • On Frontier they actually block every 8th core by default, and we have asked AMD whether this may have an advantage, as the driver threads for each GPU could then run on a different core, which may be beneficial if you cannot really use these cores for applications anyway.
  16. Follow-up to question 14 above: If I reserve 2 GPUs, can I expect them to be the two GCDs on 1 device?

    • No, unfortunately, just as Slurm also cannot guarantee on the CPU partition that all your cores would be on a single CCD (or within a single cache domain). With machines like LUMI we really need new scheduling technology that is more aware of the hierarchy in resources for sub-node scheduling.

    Okay, so I'd need to reserve the whole node to gain that kind of control.

    • Unfortunately yes.

Exercises

  1. Rather general question: Does there exist a wiki or open document that would provide suggestions and hints for best practices when using different scientific software on LUMI? The guidelines provided in this course show that LUMI can support a wide range of programming environments and compilers, as well as specific optimization tools and methods. From my understanding, these considerations can vary greatly for each application or software. Therefore, any experience with specific programs could prove invaluable in saving a significant amount of resources.

    • No. There are basically too many different packages in use for it even to be feasible to put effort into this, and for some packages it really depends a lot on the actual problem you are solving. Sometimes completely different settings are needed depending on which parts of the package are being used (some packages support different algorithms that may be parallelised completely differently) and what problem you are trying to solve.

      That experience should come from the user communities of a package, as the same experience may be useful on several systems. And much of it may even transfer to systems that are not identical. I have seen manuals of software packages that do try to provide those insights. But, e.g., if your application is not too restricted by the communication network, then there is really no difference between running on LUMI or any other cluster with 64-core Milan CPUs, of which there are more and more out these days, and it may even carry over to the newer Genoa CPUs.

      For the GPUs it may be a bit different, as HPE Cray is the only company that can sell this particular version (the MI250X) and others can only sell the regular MI250, which connects to the CPU through PCIe. But even then in most cases that experience will be the same. There aren't that many MI200-family GPUs out in the field though; most companies seem to be waiting for their successor.

    Thanks. Alright, I take it that best practices should be sought from the community of the specific software.

Lustre I/O

  1. Just a note: I've been on the receiving end of file-system downtime. First my project was down, then it went up and my user folder went down. That felt rather annoying... D: Wish both were on the same partition.

    • We know but the maintenance was urgent. It was a bug that corrupts files. One system was hit with corruption issues and it turned out that there were more serious problems on a second one also. It would have been better though if they had taken the whole system down, and maybe they could have done the work in the time it took now to do two of the four file systems...

    • It is impossible to guarantee that the home directory and the project space are on the same partition. Users can have multiple projects and projects have multiple users, so there is no way we can organise that. In fact, several users come in via the course project before they even have a "real" project on LUMI...

    My colleague got lucky in this regard, since we only have one project :p

    • Most of us in user support have two userids, one with CSC for our LUST work and then we got another one, either from our local organisation or because we needed to have one in Puhuri also for other work (before it was possible to have a single ID for the two systems that are used for user management). One of us was blocked on both accounts simultaneously, then tried to get a third account via an appointment at another university and was unlucky to get that one also on the same file system...

    poor guy ...

    • Moreover, there is a simple formula that connects your userid and project id to the respective storage system it is on...
  2. What is the difference between these folders higher up in the hierarchy? For example, my user folder is located at /pfs/lustrep4/users. Each of these folders contains folders for users and projects.

    • They are all on the same file systems but with different quota policies and different retention policies.

    Then maybe my personal files and project files are located under different parent folders, aren't they?

    • Project and scratch will always be on the same file system, but it is assigned independently from the user home directory, basically because of what I explained above: there is a many-to-many mapping between users and projects.
  3. Are files removed from LUMI after a certain number of days of no activity?

    • The policies are an evolving thing but accounts that are not used for too long are blocked even if they have an active project because an unattended account is a security risk.

    • Files on scratch and flash will be cleaned in the future after 90 or 30 days respectively (likely access date though and not creation date). This is rather common on big clusters actually.

    • All files of your project are removed 90 days after the end of the project and not recoverable.

    • Not sure how long a userid can exist without a project attached to it, but we already have closed accounts on LUMI with all files removed.

    • And note that LUMI is not meant for data archiving. There is no backup, not even of the home directory. You are responsible for transferring all data to an external storage service, likely at your home institute.

LUMI User Support

  1. Re the Tallinn course: will it be recorded too, like this one? And where will this course recording be available?

    • The problem with the 4-day courses is that there is a lot of copyrighted material in them that we can only make available to users of LUMI. So far this has been done via a project that was only accessible to those who took the course, but we are looking to put the materials in a place where all users can access them on LUMI. The HPE lectures will never go on the web though. There is a chance that we will be allowed to put the AMD presentations on the web.

      There is actually material from previous courses already on the course archive web site that Jørn referred to in his introduction, on lumi-supercomputer.github.io/LUMI-training-materials.

    I'm mostly interested in the GPU-profiling, but I can't attend due to other events.

    • Part of that is HPE material for which we are looking for a better solution; part of it is AMD material, and so far we have been allowed to put their slides on the web, but we have to ask for the recordings. We only got access to a system capable of serving the recordings last week, so we are still working on that.
  2. Can we cancel our tickets ourselves, if the problem "magically" solves itself in the meantime :)?

    • I am not aware of such an option.

    • Simply mail that the issue is resolved and the owner of the ticket will be very happy to close it ;-)