From Rob Crain:
Since the switch to Intel MPI 2018, I've found a repeatable hang in a GADGET3 run with a particular set of ICs. I traced this to a call of MPI_Allgather in mymalloc.c. Note this is *not* the MPI_Allgatherv call where Intel MPI gripes over the in-place issue. I found that 64 of 128 tasks returned from the operation fine, and the results stored in their output array were what I expected (which of course requires that *all* of the NTasks successfully communicate), so I am still at a loss as to why this happens.
However out of desperation I tried manually fixing which gather algorithm the routine uses (normally it picks from up to 5 depending on the message size and number). I quickly tried forcing to each one and got a mixed bag of results, which weren't entirely repeatable. But using method 1 (recursive doubling) seems to have resolved the issue (watch this space). To do this I added the following to my slurm script:
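The snippet from the original message is not reproduced on this page. As a hedged sketch: Intel MPI lets you select collective algorithms through its I_MPI_ADJUST_<opname> environment variables, so forcing algorithm 1 for MPI_Allgather would plausibly look like the lines below. The choice of variable is an assumption, not a quote from the message, so check it against the documentation for your Intel MPI version.

```shell
# Assumption: I_MPI_ADJUST_ALLGATHER is the Intel MPI control meant here;
# the original slurm snippet is not preserved on this page.
# Value 1 selects the recursive doubling algorithm for MPI_Allgather.
export I_MPI_ADJUST_ALLGATHER=1
```

The export needs to appear in the batch script before the mpirun line so that every task inherits it.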
Illegal MPI call in forcetree.c
Gadget contains an MPI_Allgatherv call where the input and output buffers overlap. This is not allowed by the MPI standard. Intel MPI detects this error and stops the program with a message about aliased buffers. E.g.:
Fatal error in PMPI_Allgatherv: Invalid buffer pointer, error stack: PMPI_Allgatherv(1452): MPI_Allgatherv(sbuf=0x2b6383018c14, scount=352, MPI_BYTE, rbuf=0x2b6383016568, rcounts=0x2b6383042568, displs=0x2b63830425e8, MPI_BYTE, MPI_COMM_WORLD) failed
This can be fixed by changing the following line in forcetree.c from
MPI_Allgatherv(&DomainMoment[DomainStartList[ThisTask * MULTIPLEDOMAINS + m]], recvcounts[ThisTask], MPI_BYTE, &DomainMoment, recvcounts, recvoffset, MPI_BYTE, MPI_COMM_WORLD);
to
MPI_Allgatherv(MPI_IN_PLACE, recvcounts[ThisTask], MPI_BYTE, &DomainMoment, recvcounts, recvoffset, MPI_BYTE, MPI_COMM_WORLD);
Some versions of Gadget have a preprocessor macro which can be enabled to make this change (e.g. USE_MPI_IN_PLACE, or FIX_FOR_BLUEGENE_MPI).
Crashes due to MaxMemSize too large
In the Gadget parameter file there is a parameter MaxMemSize which specifies how much memory Gadget should use. Gadget allocates MaxMemSize megabytes of memory on each MPI task at startup. Setting this value too high leaves no memory for MPI to use and will cause runs to crash with various internal MPI errors.
On larger runs MaxMemSize may need to be no more than about 70% of the memory available to each core. MaxMemSize=16500 seems to be a reasonable choice on Cosma-7 when using Intel MPI.
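The 70% rule of thumb can be turned into a quick calculation. The helper below is a minimal sketch: the function name and the example node size are made up for illustration and are not COSMA specifications.

```shell
# Hypothetical helper: suggest a MaxMemSize (in MB) as a percentage of
# the memory available to each MPI task on a node.
# Arguments: node memory in MB, MPI tasks per node, percentage (default 70).
suggest_max_mem_size() {
    node_mem_mb=$1
    tasks_per_node=$2
    percent=${3:-70}
    # Integer arithmetic, so the result is rounded down.
    echo $(( node_mem_mb * percent / 100 / tasks_per_node ))
}

# Made-up example: a node with 256 GB (262144 MB) running 16 tasks.
suggest_max_mem_size 262144 16
```

Treat the result as a starting point only; the value that actually works depends on how much memory the MPI library needs on a given run.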
Limits on MPI message sizes
The MPI interface limits messages to (2**31)-1 elements on systems with 32-bit ints. Additionally, some MPI implementations fail if the total size of a message is more than 2 GB.
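As a rough illustration of the two limits, the sketch below checks a message against both the element-count limit and a 2 GB total size. The helper name and the example numbers are hypothetical, "2 GB" is taken to mean 2**31 bytes, and the shell is assumed to use 64-bit integer arithmetic.

```shell
# Largest count that fits in a signed 32-bit int: 2**31 - 1.
MPI_INT_MAX=2147483647

# Hypothetical helper: succeeds (exit status 0) only if a message of
# `count` elements of `elem_bytes` bytes each has a count that fits in
# a 32-bit int and a total size below 2**31 bytes.
# Assumes the shell evaluates $(( )) with 64-bit integers.
message_fits() {
    count=$1
    elem_bytes=$2
    total_bytes=$(( count * elem_bytes ))
    [ "$count" -le "$MPI_INT_MAX" ] && [ "$total_bytes" -lt 2147483648 ]
}

# Made-up example: 30 million elements of 100 bytes each (about 3 GB).
if message_fits 30000000 100; then
    echo "fits"
else
    echo "too large: split the message"
fi
```

Note that a message can exceed the byte limit long before the element count becomes a problem, which is why limiting the size of each transfer matters for runs with many particles per task.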
Some versions of Gadget contain preprocessor macros to limit message sizes. These should be set to some value < 2000 in either the Makefile or Config.sh. E.g.
depending on the version of Gadget. Some older versions of Gadget have neither of these options and will need to be updated before they will work with large numbers of particles on each MPI task.
Intel MPI issue with large messages
Messages larger than 2 GB can cause problems with Intel MPI. The workaround is to set the following environment variables in your batch script:
export I_MPI_DAPL_CHECK_MAX_RDMA_SIZE=enable
export I_MPI_DAPL_MAX_MSG_SIZE=1073741824
HDF5 library version
Gadget (at least up to Gadget-3) uses the HDF5 1.6 API. Cosma-7 only has HDF5 1.8 and 1.10 libraries installed, so Gadget should be compiled with the compatibility macro H5_USE_16_API enabled - i.e. add -DH5_USE_16_API to the compiler flags.
Gadget3-type codes: how-to for Intel MPI
If you allocate more than 70% of the available RAM per core in MaxMemSize, your code might crash with strange MPI errors. On COSMA7, please allocate no more than 20000 MB:
If you are using the same buffer for input and output in MPI_Allgatherv, you will have to pass MPI_IN_PLACE as the send buffer in the MPI_Allgatherv call (some versions of Gadget enable this with a macro such as USE_MPI_IN_PLACE).
It is also advised to set I_MPI_DAPL_CHECK_MAX_RDMA_SIZE=enable and I_MPI_DAPL_MAX_MSG_SIZE=1073741824,
which you can do on the command line as
mpirun -genv I_MPI_DAPL_CHECK_MAX_RDMA_SIZE=enable -genv I_MPI_DAPL_MAX_MSG_SIZE=1073741824 -n number-of-slots ....
As the interconnect is Mellanox InfiniBand, it is highly recommended to pin the processes, as a minimum with the options:
Gadget-3/GIZMO on COSMA8
The following has been found to work with GIZMO (Romeel Dave):
module load intel_comp/2021.3.0 compiler mpi
module load ucx/1.10.1
module load fftw/3.3.9epyc
module load gsl
module load hdf5/1.12.0
export HDF5_DISABLE_VERSION_CHECK=1