COSMA8 has 2 login nodes, accessed via login8.cosma.dur.ac.uk
COSMA8 has 360 compute nodes, each of which has 1TB RAM and 128 cores (2x AMD EPYC 7H12 processors)
There are 2 high RAM (4TB) fat nodes, which should be accessed via the cosma8-shm queue.
There are a number of GPU-enabled servers (see below), and a 1TB AMD Milan test node.
There are 3 relevant SLURM queues:
cosma8: provides exclusive access to nodes (the underlying nodes are shared with the cosma8-serial queue)
cosma8-serial: provides non-exclusive access to nodes. Use this if you want less than 128 cores (and remember to specify your memory requirement too)
cosma8-shm: access to the mad04 and mad05 servers, with 4TB RAM. This queue is also non-exclusive, so nodes may be shared with other users if you don't require all 128 cores or all 4TB RAM.
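The queue choices above can be sketched as a minimal batch script (the job name, project account and executable below are placeholders, not COSMA8-specific values):

```shell
#!/bin/bash -l
# Minimal COSMA8 batch-script sketch. Account and executable are placeholders.
#SBATCH --job-name=my_job
#SBATCH --partition=cosma8        # or cosma8-serial / cosma8-shm as appropriate
#SBATCH --account=dpXXX           # replace with your project account
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128     # full 128-core nodes on the cosma8 queue
#SBATCH --time=01:00:00

mpirun -np $SLURM_NTASKS ./my_code
```

On the non-exclusive queues (cosma8-serial, cosma8-shm), request cores and memory explicitly instead, e.g. `--ntasks=16 --mem=64G`, since the node may be shared with other users.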
MKL: The Intel Math Kernel Library is known to be hobbled on AMD systems. There is a fix that must be applied: please click here.
MKL is available via the intel_comp and oneAPI modules.
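One widely reported workaround for MKL's behaviour on AMD CPUs is shown below; whether this is the fix the link above refers to should be checked there, and note that it only affects older MKL releases:

```shell
# Widely reported workaround: make older MKL releases (e.g. 2019 and earlier)
# use the optimised AVX2 code paths on AMD CPUs. Newer MKL versions ignore
# this variable, so check the linked fix for the currently supported method.
export MKL_DEBUG_CPU_TYPE=5
```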
OpenBLAS is available via the openblas modules.
The Gnu Scientific Library can be accessed via the gsl modules.
The best compiler depends on your application. See the PDFs below for recommended compiler options on AMD systems. Wisdom about the best compilers for particular codes is collected here. The available compilers are:
intel_comp/2018 - generally stable
intel_comp/latest - possibly better optimisations
oneAPI - The newest versions of the Intel compiler, aliased to intel_comp
gnu_comp/ - versions 10.2 and 11.1 know about the Zen 2 architecture, so can produce better-optimised code
aocc/ - the AMD Optimised Compiler Collection - performance is generally lower
llvm - the LLVM compilers, available via the llvm modules
pgi - the PGI compilers, available via the pgi modules
Usually best to use the newest openmpi module. A version of this with .no-ucx (e.g. openmpi/4.1.1.no-ucx) may offer more stable performance in some cases.
Large jobs may suffer from performance issues. This can sometimes be resolved by selecting the UD protocol over the newer DC (dynamically connected) protocol by setting:
in the job script. See discussion here.
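As a sketch of what such a setting might look like (the exact transport list to use on COSMA8 is an assumption here, not taken from the text):

```shell
# Restrict UCX to the UD transport (plus shared-memory and self transports)
# instead of the newer DC protocol. The transport list below is an assumption;
# check the transports actually available with:
#   /cosma/local/ucx/1.10.1/bin/ucx_info -d
export UCX_TLS=self,sm,ud
```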
If openmpi is complaining about running out of resources (memory pools being empty), the following may help:
(or some larger value).
UCX settings can be seen with: /cosma/local/ucx/1.10.1/bin/ucx_info -f
For Gadget-4, setting export UCX_UD_MLX5_RX_QUEUE_LEN=16384 has also been shown to help.
The intel_mpi/2018 module is the fallback option for SWIFT.
Later Intel MPI versions use UCX underneath, and initially suffered from stability issues. However, the newest versions are much improved.
The mvapich module can sometimes offer improved performance, although in some cases RAM usage can increase.
A number of GPU servers are accessible - please ask if you are unsure how to use these:
gn001: 10x NVIDIA V100 GPUs
ga003: 6x AMD MI50 GPUs
ga004: 1x AMD MI100 GPU, 2x 64 core AMD Milan processors.
login8b, mad04, mad05: between 0 and 3 NVIDIA A100 GPUs (reconfigurable/moveable as required; please ask if you require a particular setup)
The current recommended setup (July 2021) is this:
module load intel_comp/2021.1.0 compiler
module load intel_mpi/2018
module load ucx/1.8.1
module load fftw/3.3.9epyc parallel_hdf5/1.10.6 parmetis/4.0.3-64bit gsl/2.5
You can swap in OpenMPI 4.0.5 instead of intel_mpi, at the cost of slightly lower performance.
--bind-to none is required to use all the cores correctly.
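Put together, the recommended environment might look like this inside a job script (a sketch; the executable name is a placeholder):

```shell
#!/bin/bash -l
# Recommended COSMA8 environment (July 2021), assembled as a job-script sketch.
module load intel_comp/2021.1.0 compiler
module load intel_mpi/2018
module load ucx/1.8.1
module load fftw/3.3.9epyc parallel_hdf5/1.10.6 parmetis/4.0.3-64bit gsl/2.5

# If swapping in OpenMPI for intel_mpi, --bind-to none is required so that
# all cores are used correctly:
#   mpirun --bind-to none -np $SLURM_NTASKS ./my_code
mpirun -np $SLURM_NTASKS ./my_code   # ./my_code is a placeholder
```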
Allinea Arm Forge and MAP (used for code profiling) are available using the allinea/ddt/20.2.1 module.
Profiles collected during the commissioning period are available in the commissioning report.
SLURM batch scripts
See examples here.
The COSMA8 FAQ details some of the known issues or peculiarities related to COSMA8. Please let us know if there is something you would like added.
Known code issues
Collective wisdom is available for running particular codes on COSMA8.