MPI on Rocky MonARCH
Message Passing Interface (MPI) is a standardized, portable message-passing system designed to function on parallel computing architectures.
Our Rocky compute nodes use a network technology called ROCE (RDMA over Converged Ethernet). This protocol is similar to InfiniBand, which is used in many HPC centres.
Traffic for MPI (and to our Lustre file servers) should use the ROCE interface, as this provides a high-bandwidth, low-latency connection.
Unfortunately, some extra configuration is needed to use ROCE on our system, as we have a heterogeneous cluster with different sorts of networking cards. Instructions for doing this are given below.
Please follow these instructions if you are running MPI between servers. They may not apply if you are running your program entirely within a server.
Load the MPI Compiler​
The command:
module load hpcx/2.14-redhat9.2
will load the hpcx module. This contains the Mellanox HPC-X networking software as well as the Open MPI compiler that comes with it.
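If you want to confirm that the module has put the MPI toolchain on your path, a quick check along these lines should work (the exact paths and version strings depend on the HPC-X installation):
module load hpcx/2.14-redhat9.2
which mpicc        # path to the Open MPI compiler wrapper shipped with HPC-X
mpirun --version   # version of the bundled Open MPI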
Compile the software​
A typical example would be:
mpicc -o mpi.exe mpi.c
mpicc is a wrapper around the GNU compiler. Other commands include:
- mpiCC
- mpifort
- mpic++
To find out what compiler options the wrapper uses, type:
mpicc --showme
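The Open MPI wrappers can also show the compile and link flags separately, which is handy if you need to pass them to another build system:
mpicc --showme:compile   # flags used at compile time (include paths, etc.)
mpicc --showme:link      # flags used at link time (MPI libraries)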
Run the program in a parallel environment on a single server (Rocky OS)
You can run MPI programs on a single server, with one Unix process being created for every Slurm task that you request. You have two options to invoke the parallel program:
- mpirun or mpiexec
- srun
In previous versions of MPI, you had to specify the number of processes with mpirun, i.e.
#SBATCH --ntasks=32
#SBATCH --nodes=1
mpirun -np 32 myMPI.exe
This code will spawn 32 processes of myMPI on a single node. You can ssh into the node and view them with top.
If you want to make the Slurm submission script more robust, you can use some of the Slurm Bash environment variables that are created when a job is running. See man sbatch.
#SBATCH --ntasks=16
#SBATCH --nodes=1
mpirun -np $SLURM_NTASKS myMPI.exe
This will run 16 MPI processes on a single node, but if you want to change the number of tasks, you only need to specify it once. Note that this will only work in a parallel Slurm environment. If you run the Slurm script as a bash script on a node, i.e. bash mySubmit.slm, then $SLURM_NTASKS will be undefined.
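If you would like the same script to also run outside Slurm (for example, as a quick test with bash), one option is to fall back to a default when $SLURM_NTASKS is unset. This is only a sketch; the default of one process is an arbitrary choice:
NTASKS=${SLURM_NTASKS:-1}   # use 1 process when SLURM_NTASKS is undefined
mpirun -np "$NTASKS" myMPI.exe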
If you use srun, you do not need to specify the number of tasks.
#SBATCH --ntasks=16
#SBATCH --nodes=1
srun myMPI.exe
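Putting this together, a complete single-node submission script might look something like the sketch below. The job name, time limit and executable name are placeholders; adjust them for your own job:
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --ntasks=16
#SBATCH --nodes=1
#SBATCH --time=01:00:00
module load hpcx/2.14-redhat9.2
# srun starts one MPI process per Slurm task
srun myMPI.exe
Submit it with sbatch mySubmit.slm (or whatever you have named the file).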
Using additional flags with mpirun/srun​
Unfortunately MPI will not run out of the box for some programs. If you get an error like this:
Error message: hwloc_set_cpubind returned "Error" for bitmap "0"
Then you need to pass an additional flag to mpirun: --bind-to none
#SBATCH --ntasks=16
#SBATCH --nodes=1
mpirun --bind-to none -np $SLURM_NTASKS myMPI.exe
If you get an error like this:
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
PMIX stopped checking at the first component that it did not find.
Host: mk13
Framework: psec
Component: munge
Then you need to set the PMIX_MCA_psec environment variable.
export PMIX_MCA_psec=^munge
mpirun -np 32 myMPI.exe
An alternative to the --bind-to none flag is to set the following environment variable:
export OMPI_MCA_hwloc_base_binding_policy=none
If you see this error:
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: m3q000
Framework: pml
Component: yalla
You can resolve this either by passing an extra option to mpirun:
mpirun --mca pml ucx
or by setting the environment variable:
export OMPI_MCA_pml=ucx
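These workarounds can be combined by exporting the relevant variables near the top of your Slurm script, before the mpirun line. The fragment below is only a sketch; set only the variables that correspond to errors you actually see:
export PMIX_MCA_psec=^munge                       # work around the munge/psec error
export OMPI_MCA_hwloc_base_binding_policy=none    # equivalent to --bind-to none
export OMPI_MCA_pml=ucx                           # equivalent to --mca pml ucx
mpirun -np $SLURM_NTASKS myMPI.exe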
You may see this error with srun:
srun myMPI.exe
[mk06:133233] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
In this case, please specify the --mpi=pmix flag with srun:
srun --mpi=pmix myMPI.exe
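To check which PMI types your srun supports, you can list them; pmix should appear in the output if the flag above is going to work:
srun --mpi=list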
Run the program in a parallel environment between servers (Rocky OS)
The code for running MPI on two or more servers should look the same in your Slurm script. The only difference should be your Slurm request, i.e.
#SBATCH --nodes=2
srun myProg.exe
If this works (and on homogeneous servers it is likely to do so), all is good.
However, as we have different network cards on different nodes, it might be necessary to take some extra steps to ensure the right settings are used for ROCE connections:
- use srun, not mpirun or mpiexec
- use the following flag with srun: --task-prolog=/usr/local/hpcx/ucx_net_devices.sh
For example, here is a complete script:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=1
srun hostname # prints out the two node names
echo "About to call srun"
module load hpcx/2.14-redhat9.2
srun --task-prolog=/usr/local/hpcx/ucx_net_devices.sh mpi.exe
echo "Finish"
Behind the scenes​
When you load the module file, it sets UCX environment variables to point to the correct ROCE network interface. When running across heterogeneous network cards, we have to do this at run time, when the MPI program runs, hence the --task-prolog= flag.
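If you want to see what the prolog has selected, one way (a sketch, assuming the prolog exports UCX variables into the task environment) is to run env through the same prolog inside a job and look for the UCX settings:
srun --task-prolog=/usr/local/hpcx/ucx_net_devices.sh env | grep '^UCX_'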
If you find the automatic method does not work, you may want to set the environment variable yourself. Please contact the help desk before you do this, but the command will look like
export UCX_NET_DEVICES=mlx5_0:1,mlx5_bond_0:1
If you do not set this environment variable, MPI may use the slower default TCP/IP interface. In the worst-case scenario, your program will lock up and make no progress. If possible, it is useful to monitor the progress of the MPI program within your code to verify that this is not happening.
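If you do end up setting UCX_NET_DEVICES manually, the UCX tools shipped with HPC-X can list the devices that are actually present on a node, which helps in choosing the right value (this assumes ucx_info is on your path after loading the hpcx module):
ucx_info -d | grep -i device   # list the RDMA devices/transports UCX can see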
Profiling MPI​
The easiest way to check that your program is working is to log in to the node(s) running the code and verify that it is using the resources you requested.
- Find out what nodes your program is using with scontrol show job <jobid> or squeue -u $USER
- Log in to one or more of the nodes with ssh (you can only do this if you have a running job on the node)
- Verify the program is running the correct number of processes with top or htop, as sketched below
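As a quick sketch of those steps (the job ID and node name below are placeholders; use your own):
scontrol show job 123456 | grep NodeList   # find the allocated node(s)
squeue -u $USER                            # or list all of your jobs
ssh mk13                                   # log in to one of the allocated nodes
top -u $USER                               # check the number of MPI processes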
There are a number of tools that can be used to profile and debug MPI programs. Please contact the help desk for information on them.
Intel MPI​
For completeness, we have included the Intel implementation of the MPI protocol in our software stack. This is not the same as Open MPI. It has not been tested on our ROCE interfaces, so it is only suitable for software that runs on a single server. Please contact the help desk if you encounter any issues with it.
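The exact Intel MPI module name depends on the installed stack; one way to look for it is to search the module tree (module avail prints to stderr, hence the redirection). This is only a sketch:
module avail 2>&1 | grep -i -E 'intel|impi'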