Hands-on: Advancing your own project and general Q&A
Bring your own AI code that you want to run on LUMI and spend some time applying what you have learned during the workshop, with on-site support from LUST/AMD.
General Q&A
- ROCm 6.3 came out yesterday. What improvements can we expect for AI, and is there any chance to get it on LUMI?
    - I am actually unsure if it has been officially released. It has been made available for testing, but the actual release should come with release notes.
    - ROCm 6.3 has to be tested, as it falls outside the expected driver support. Hopefully, it can be made available in a container to start with.
        - Looks like someone at AMD has been a bit careless then: this article in an AMD blog was picked up by a lot of press last night.
- Is Megatron-LM a possibility on LUMI?
    - I think the TurkuNLP group (University of Turku, Finland) and Silo AI have been using it for training large language models on LUMI. So yes, although it might not work out of the box; they might be using a version that has been ported to LUMI.
    - Megatron-DeepSpeed is supported on AMD GPUs, and there are versions of Megatron-LM that have been successfully hipified for LUMI.
- Is ONNX supported?
    - I haven't tested it myself on LUMI, but I don't see why it wouldn't work.
    - ONNX Runtime should be supported; a quick check is sketched below.
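    - As a way to verify this yourself, here is a minimal sketch (assuming the `onnxruntime` package is installed in your container or environment) that lists which execution providers the installed build offers:

      ```python
      # List the execution providers of the installed ONNX Runtime build.
      # A ROCm-enabled build is expected to include entries such as
      # 'ROCMExecutionProvider' or 'MIGraphXExecutionProvider', while a
      # CPU-only build lists just 'CPUExecutionProvider'.
      import onnxruntime as ort

      print(ort.__version__)
      print(ort.get_available_providers())
      ```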
- If we use accelerate, do we still need to set the torch multiprocessing start method to spawn in our script?
    - Yes; a minimal sketch is below.
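    - A minimal sketch of what that typically looks like, assuming the question refers to the start method used for DataLoader worker processes (the dataset here is a placeholder):

      ```python
      import torch
      import torch.multiprocessing as mp
      from torch.utils.data import DataLoader, TensorDataset

      if __name__ == "__main__":
          # Force the 'spawn' start method so worker processes are started
          # fresh instead of forked; force=True avoids an error if a start
          # method was already set elsewhere (e.g. by a launcher).
          mp.set_start_method("spawn", force=True)

          dataset = TensorDataset(torch.randn(64, 8))  # placeholder data
          # The start method can also be set per DataLoader:
          loader = DataLoader(dataset, batch_size=8, num_workers=2,
                              multiprocessing_context="spawn")
          for (batch,) in loader:
              pass
      ```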
- Does it make sense / is there any advantage to using torch.compile with FSDP?
    - It's difficult to give general guidance on this. It may make sense for your application; it is a matter of testing and seeing if it works as expected (a minimal test sketch is below).
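    - If you want to experiment with the combination, here is a minimal sketch to be launched with `torchrun` or `srun` (the model and its dimensions are placeholders, and compiling after the FSDP wrap is just one possible ordering):

      ```python
      import os
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      # Assumes the usual torchrun/srun environment variables are set.
      dist.init_process_group(backend="nccl")  # maps to RCCL on ROCm
      local_rank = int(os.environ["LOCAL_RANK"])
      torch.cuda.set_device(local_rank)

      # Placeholder model; replace with your own.
      model = torch.nn.Sequential(
          torch.nn.Linear(1024, 1024),
          torch.nn.ReLU(),
          torch.nn.Linear(1024, 1024),
      ).cuda()

      model = FSDP(model)           # shard parameters across ranks
      model = torch.compile(model)  # then try compiling the wrapped model

      x = torch.randn(8, 1024, device="cuda")
      model(x).sum().backward()

      dist.destroy_process_group()
      ```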
- I am using a dynamic batch data collator and am seeing 100% VRAM usage on some GPUs in rocm-smi. Will I be fine? Is there a reasonably involved way I can attribute which part of VRAM is used by which part of my PyTorch script, to better understand how the datasets affect the memory requirements?
    - Maybe having a look at PyTorch's "Understanding CUDA Memory Usage" page will help you (see the sketch below). -Lukas
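    - The tool behind that page is the allocator's memory snapshot. Here is a minimal sketch of how it is typically used (the recording/dump calls are the ones described in the PyTorch documentation, and the `torch.cuda` namespace also covers ROCm builds; the workload here is a placeholder for your training steps):

      ```python
      import torch

      # Start recording allocator events, including stack traces per allocation.
      torch.cuda.memory._record_memory_history(max_entries=100000)

      # Placeholder workload; run a few training steps of your script instead.
      x = torch.randn(4096, 4096, device="cuda")
      y = x @ x

      # Dump a snapshot and open it at https://pytorch.org/memory_viz to see
      # which tensors/call sites account for the VRAM usage.
      torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
      torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
      ```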
- When using FSDP with accelerate and `fsdp_cpu_ram_efficient_loading: true`, I still get an OOM at the end of training when setting the full state dict to save the model, despite the model only being 70B (~140 GB with bpw=2). Any ideas what to try? Logs:

      slurmstepd: error: Detected 1 oom_kill event in StepId=8538204.0. Some of the step tasks have been OOM Killed.
      slurmstepd: error: Failed to destroy CXI Service ID 5 (cxi0): Device or resource busy
      slurmstepd: error: Failed to destroy CXI Service ID 5 (cxi0): Device or resource busy
      slurmstepd: error: Failed to destroy CXI Service ID 5 (cxi0): Device or resource busy
      slurmstepd: error: Failed to destroy CXI Service ID 5 (cxi0): Device or resource busy
      slurmstepd: error: Failed to destroy CXI Service ID 5 (cxi0): Device or resource busy
      slurmstepd: error: switch_g_job_postfini: Device or resource busy
      srun: error: nid006024: task 0: Out Of Memory
      srun: Terminating StepId=8538204.0
      srun: error: nid006900: task 1: Terminated
      srun: error: nid007125: task 2: Terminated
      srun: error: nid007405: task 3: Terminated
      srun: Force Terminated StepId=8538204.0
      Finished at: 2024-11-27T15:37:59 EET
    - It looks like it runs out of CPU memory (not GPU memory). It's hard to say what the problem is. If your program puts a lot of files in `/tmp`, keep in mind that that is a ramdisk, so it also uses up CPU memory (a sketch for checking and redirecting temporary files is below).
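    - If temporary files turn out to be the issue, here is a minimal sketch for checking and redirecting where Python-level code writes them (the target directory is a placeholder; point it at a disk-backed filesystem you have access to, and note that this only affects code going through Python's `tempfile` module or otherwise honouring `TMPDIR`):

      ```python
      import os
      import tempfile

      # Redirect temporary files away from the node-local /tmp ramdisk.
      # The path below is a placeholder.
      os.environ["TMPDIR"] = "/path/on/a/disk/backed/filesystem"
      tempfile.tempdir = None           # make tempfile re-read TMPDIR
      print(tempfile.gettempdir())      # verify where temp files will go
      ```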