MPI on Rocky MonARCH
Message Passing Interface (MPI) is a standardized, portable message-passing system designed to function on parallel computing architectures.
Our Rocky compute nodes use a network technology called ROCE (RDMA over Converged Ethernet). This protocol is similar to InfiniBand, which is used in many HPC centres.
Traffic for MPI (and to our Lustre file servers) should use the ROCE interface, as this provides a high-bandwidth, low-latency connection.
Unfortunately, some extra configuration is needed to use ROCE on our system, as we have a heterogeneous cluster with different sorts of networking cards. Instructions for doing this are given below.
Please follow these instructions if you are running MPI between servers. They may not apply if you are running your program entirely within a server.
Load the MPI Compiler​
The command:
module load hpcx/2.14-redhat9.2
will load the hpcx module. This contains the Mellanox HPC-X networking software as well as the Open MPI compiler that comes with it.
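If you want to confirm that the module has put the MPI toolchain on your path, a quick check along these lines should work (the exact paths and version strings depend on the HPC-X installation):
module load hpcx/2.14-redhat9.2
which mpicc        # path to the Open MPI compiler wrapper shipped with HPC-X
mpirun --version   # version of the bundled Open MPI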
Compile the software​
A typical example would be:
mpicc -o mpi.exe mpi.c
mpicc is a wrapper around the GNU compiler. Other commands include:
- mpiCC
- mpifort
- mpic++
To find out what compiler options the wrapper uses, type:
mpicc --showme
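The Open MPI wrappers can also show the compile and link flags separately, which is handy if you need to pass them to another build system:
mpicc --showme:compile   # flags used at compile time (include paths, etc.)
mpicc --showme:link      # flags used at link time (MPI libraries)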
Run the program in a parallel environment on a single server (Rocky OS)
You can run MPI programs on a single server, with one Unix process being created for every Slurm task that you request. You have two options to invoke the parallel program:
- mpirun or mpiexec
- srun
In previous versions of MPI, you had to specify the number of processes with mpirun, i.e.
#SBATCH --ntasks=32
#SBATCH --nodes=1
mpirun -np 32 myMPI.exe
This code will spawn 32 processes of myMPI on a single node. You can ssh into the node and view them with top.
If you want to make the Slurm submission script more robust, you can use some of the Slurm Bash environment variables that are created when a job is running. See man sbatch.
#SBATCH --ntasks=16
#SBATCH --nodes=1
mpirun -np $SLURM_NTASKS myMPI.exe
This will run 16 MPI processes on a single node, but if you want to change the number of tasks, you only need to specify it once. Note that this will only work in a parallel Slurm environment. If you run the Slurm script as a bash script on a node, i.e. bash mySubmit.slm, then $SLURM_NTASKS will be undefined.
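If you would like the same script to also run outside Slurm (for example, as a quick test with bash), one option is to fall back to a default when $SLURM_NTASKS is unset. This is only a sketch; the default of one process is an arbitrary choice:
NTASKS=${SLURM_NTASKS:-1}   # use 1 process when SLURM_NTASKS is undefined
mpirun -np "$NTASKS" myMPI.exe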
If you use srun, you do not need to specify the number of tasks.
#SBATCH --ntasks=16
#SBATCH --nodes=1
srun myMPI.exe
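Putting this together, a complete single-node submission script might look something like the sketch below. The job name, time limit and executable name are placeholders; adjust them for your own job:
#!/bin/bash
#SBATCH --job-name=mpi_test
#SBATCH --ntasks=16
#SBATCH --nodes=1
#SBATCH --time=01:00:00
module load hpcx/2.14-redhat9.2
# srun starts one MPI process per Slurm task
srun myMPI.exe
Submit it with sbatch mySubmit.slm (or whatever you have named the file).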
Using additional flags with mpirun/srun​
Unfortunately MPI will not run out of the box for some programs. If you get an error like this:
Error message: hwloc_set_cpubind returned "Error" for bitmap "0"
Then you need to pass an additional flag to mpirun: --bind-to none
#SBATCH --ntasks=16
#SBATCH --nodes=1
mpirun --bind-to none -np $SLURM_NTASKS myMPI.exe
If you get an error like this:
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
PMIX stopped checking at the first component that it did not find.
Host: mk13
Framework: psec
Component: munge
Then you need to set the PMIX_MCA_psec environment variable.
export PMIX_MCA_psec=^munge
mpirun -np 32 myMPI.exe
An alternative to the --bind-to none flag is to set the following environment variable:
export OMPI_MCA_hwloc_base_binding_policy=none
If you see this error:
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: m3q000
Framework: pml
Component: yalla
You can resolve this either by passing an extra option to mpirun:
mpirun --mca pml ucx
or by setting the environment variable:
export OMPI_MCA_pml=ucx
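These workarounds can be combined by exporting the relevant variables near the top of your Slurm script, before the mpirun line. The fragment below is only a sketch; set only the variables that correspond to errors you actually see:
export PMIX_MCA_psec=^munge                       # work around the munge/psec error
export OMPI_MCA_hwloc_base_binding_policy=none    # equivalent to --bind-to none
export OMPI_MCA_pml=ucx                           # equivalent to --mca pml ucx
mpirun -np $SLURM_NTASKS myMPI.exe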
You may see this error with srun:
srun myMPI.exe
[mk06:133233] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
In this case, please specify the --mpi=pmix flag with srun:
srun --mpi=pmix myMPI.exe
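To check which PMI types your srun supports, you can list them; pmix should appear in the output if the flag above is going to work:
srun --mpi=list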
Run the program in a parallel environment between servers (Rocky OS)
The code for running MPI on two or more servers should look the same in your Slurm script. The only difference should be your Slurm request, i.e.
#SBATCH --nodes=2
srun myProg.exe
If this works (and on homogeneous servers it is likely to do so), all is good.
However, as we have different network cards on different nodes, it might be necessary to take some extra steps to ensure the right settings are used for ROCE connections:
- use srun, not mpirun or mpiexec
- use the following flag with srun: --task-prolog=/usr/local/hpcx/ucx_net_devices.sh
For example, here is a complete script:
#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=1
srun hostname # prints out the two node names
echo "About to call srun"
module load hpcx/2.14-redhat9.2
srun --task-prolog=/usr/local/hpcx/ucx_net_devices.sh mpi.exe
echo "Finish"
Behind the scenes​
When you load the module file, it sets UCX environment variables to point to the correct ROCE network interface. When running across heterogeneous network cards, we have to do this at run time, when the MPI program runs, hence the --task-prolog= flag.
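If you want to see what the prolog has selected, one way (a sketch, assuming the prolog exports UCX variables into the task environment) is to run env through the same prolog inside a job and look for the UCX settings:
srun --task-prolog=/usr/local/hpcx/ucx_net_devices.sh env | grep '^UCX_'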
If you find the automatic method does not work, you may want to set the environment variable yourself. Please contact the help desk before you do this, but the command will look like
export UCX_NET_DEVICES=mlx5_0:1,mlx5_bond_0:1
If you do not set this environment variable, MPI may use the slower default TCP/IP interface. In the worst-case scenario, your program will lock up and make no progress. If possible, it is useful to monitor the progress of the MPI program within your code to verify that this is not happening.
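If you do end up setting UCX_NET_DEVICES manually, the UCX tools shipped with HPC-X can list the devices that are actually present on a node, which helps in choosing the right value (this assumes ucx_info is on your path after loading the hpcx module):
ucx_info -d | grep -i device   # list the RDMA devices/transports UCX can see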
Profiling MPI​
The easiest way to check that your program is working is to log in to the node(s) running the code and verify that it is using the resources you requested.
- Find out what nodes your program is using with scontrol show job <jobid> or squeue -u $USER
- Log in to one or more of the nodes with ssh (you can only do this if you have a running job on the node)
- Verify the program is running the correct number of processes with top or htop, as sketched below
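As a quick sketch of those steps (the job ID and node name below are placeholders; use your own):
scontrol show job 123456 | grep NodeList   # find the allocated node(s)
squeue -u $USER                            # or list all of your jobs
ssh mk13                                   # log in to one of the allocated nodes
top -u $USER                               # check the number of MPI processes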
There are a number of tools that can be used to profile and debug MPI programs. Please contact the help desk for information on them.
Intel MPI​
For completeness, we have included the Intel implementation of the MPI protocol in our software stack. This is not the same as Open MPI. It has not been tested on our ROCE interfaces, so it is only suitable for software that runs on a single server. Please contact the help desk if you encounter any issues with it.
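The exact Intel MPI module name depends on the installed stack; one way to look for it is to search the module tree (module avail prints to stderr, hence the redirection). This is only a sketch:
module avail 2>&1 | grep -i -E 'intel|impi'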