MPI Considerations in Multi-Core and Heterogeneous environments

Multi-Core and Heterogeneous Environments

Current systems are comprised of multi-core nodes and contain accelerators. One effect of this is that users may desire to run in so called hybrid threaded MPI mode. Typically this results in fewer MPI processes per node than there are cores in the node. Examples of this are as follows:

  • A node has 8 cores distributed accross 2 sockets (i.e. 2x4 cores). A user may wish to run 2 MPI processes with one running on each socket. Each MPI process may contain up to 8 threads (4 cores with 2 HT threads on each).
  • A node has 8 cores distributed accross 2 sockets (i.e. 2x4 cores) with 4 GPUs. A user may wish to run 4 MPI processes on the node, with each process bound to 1 GPU. The MPI process itself may or may not use multiple OpenMP threads

The feature of binding particular MPI processes to nodes, cores or sockets is referred to as processor affinity. In general each MPI implementation deals with this in its own way. For example the process affinity interface in MVAPICH2-1.8 is described here. OpenMPI on the other hand uses host files and rank files as described: here.

Running on multi-node systems presents typically (at least) two immediately practical questions to the user:

  • How to find out what the core-geometry is for a node
  • How to tell MPI what core binding to use

We describe how to answer these questions below. Please be aware that these are highly system specific. Core and thread mappings vary even between different cluster nodes.

Basics about MPI and Threads

MPI presents some levels of thread safety. In order to use threads safely within MPI, one needs to start the MPI process using MPI_Init_thread rather than the default MPI_Init. MPI offers 4 levels of thread safety:

  • MPI_THREAD_SINGLE : Each MPI process will be single threaded
  • MPI_THREAD_FUNNELED: Each MPI process may be multi-threaded, but the threads will be funneled down to a single thread before calling MPI.
  • MPI_THREAD_SERIALIZED: Multiple threads in each MPI process may call MPI, but calls to MPI will be serialized so no two single threads will make MPI calls concurrently
  • MPI_THREAD_MULTIPLE: The MPI processes are multi-threaded, and can call MPI simultaneously.

MPI_THREAD_SINGLE offers the lowest level of safety, whereas MPI_THREAD_MULTIPLE offers the highest level, at the cost of some overhead.

Basics About OpenMP

OpenMP is an interface to threading. The code is annotated with special #pragma statements to indicate parallel regions, reductions, synchronization etc.

OpenMP typically needs special compiler flags to turn it on. To turn on OpenMP for GCC/G++ set of compilers please use the -fopenmp flag. For the intel compiler please use -openmp.

How to find out about core-geometry

One can find out about core-geometry using commands such as numactl. For example on a node numactl displays:

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65519 MB
node 0 free: 33278 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 27611 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
showing that there are
  • 2 sockets
  • Each socket contains 16 hardware threads (virtual CPUs)
  • These are typically displayed as 'real threads' (0-7 for node 0 in this case) followed by 'hyperthread partners' (16-23 for node 0 in this case)
For example, for a 4 GPU job, a sensible distribution may be to distribute the 'real threads' between the GPUs, e.g. as follows:
  • MPI Process 0 could use node 0 threads: 0 1 2 3 and their HT partners: 16 17 18 19
  • MPI Process 1 could use node 0 threads: 4 5 6 7 and their HT partners: 20 21 22 23
  • MPI process 2 could use node 1 threads: 8 9 10 11 and their HT partners: 24 25 26 27
  • MPI process 3 could use node 1 threads: 12 13 14 15 and their HT partners: 28 29 30 31
with each process using up to potentially 8 OpenMP threads.

Specifying Affinity to MPI

The Host-file

Both MVAPICH and OpenMPI rely on using host-files generated by the PBS system at job launch. On our clusters we request nodes with the "-lnodes=X" argument to qsub, where X is the number of cores (on unaccelerated) nodes, OR the number of GPUs (on accelerated nodes) respectively. This generates a hostfile with 1 entry for each "core" or "accelerator". The name of the hostfile is encoded in the environment variable $PBS_NODEFILE. We need to tell the MPI job to use our nodefile. This can be done as follows in MVAPICH2-1.8:

  • Using the -hostfile argument if using mpirun_rsh (see mpirun_rsh --help)
  • Using the -f argument if using mpiexec.hydra (see mpiexec.hydra --help)

For OpenMPI the hostfile is specified as: using either the -hostfile or --hostfile option (they are equivalent).

Specifying Core Binding

Both MVAPICH and OpenMPI allow core binding in a variety of ways.

Specifying explicit core binding in MVAPICH2-1.8

To get the most full control, in MVAPICH one can turn off MVAPICH's own processor affinity, by setting the environment variable

MV2_ENABLE_AFFINITY=0
Then instead of launching an executable, one can launch a shell script. The shell script can find the process's local MPI rank on the node and bind it to threads appropriately using numactl. An example of such a script is below, which launches a launcher script called script. This launcher script then uses 4 OpenMP threads per MPI task (ie all 16 hyperthreads on a 2x4 socket). The job is specific to a 9G node (as the mapping of processors is system dependent).
#!/bin/bash
#PBS -lnodes=4:c9g
#PBS -A xyz
#PBS -lwalltime=10:00

# Set up the environment
export PATH=/usr/mpi/gcc/mvapich2-1.8/bin:/usr/local/cuda-5.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/gcc/mvapich2-1.8/lib:/usr/local/cuda-5.0/lib64:/usr/local/cuda-5.0/lib:$LD_LIBRARY_PATH

# Pick the executable 
EXECUTABLE=./chroma_single_sm35
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

mpiexec.hydra \
        -np 4 \
    -genv MV2_ENABLE_AFFINITY 0 \
    -genv QUDA_RESOURCE_PATH /home/bjoo/tuning_sm30 \
    -f $PBS_NODEFILE \
    -launcher rsh \
    ./script $EXECUTABLE $ARGS
where the actual executable is launched using the script script shown below:
#!/bin/bash

ARGS=$*

case $MV2_COMM_WORLD_LOCAL_RANK in
0)
 CPUS="0,8,1,9"
 ;;
1)
 CPUS="2,10,3,11"
 ;;
2)
 CPUS="4,12,5,13"
 ;;
3)
 CPUS="6,14,7,15"
 ;;
*)
  echo LOCAL Rank cannot be bigger than 3
  exit 1
  ;;
esac

export OMP_NUM_THREADS=4
/usr/bin/numactl --physcpubind=${CPUS} ${ARGS}
NOTE: the above script should scale to more nodes since the binding on each node is based on the local rank of the MPI task on a given node.

Specifying Mapping in OpenMPI

In OpenMPI one can specify mapping in a variety of ways, for example using rank files. Please see OpenMPI documentation for details.

By default node binding is not enabled in OpenMPI and it can be explicitly disabled using the --bind-to-none flag. In this case one can use the script based approach outlined above for MVAPICH2-1.8 however the environment variable MV2_COMM_WORLD_LOCAL_RANK needs to be replaced with OMPI_COMM_WORLD_LOCAL_RANK, which is OpenMPI's version of the local rank.

MPI Gotcha-s

Below is a list of common MPI gotcha's:

Launching with mpirun from a different MPI (e.g. OpenMPI) than was used to compile the code (e.g. MVAPICH2)
The typical effect is that instead of a multi-node job starting, many single node jobs start. These may or may not run to completion depending on the overall resource size (e.g. memory) needed. If using Chroma one can inspect the logical CPU geometry in the output. It typically looks like:
Lattice initialized:
  problem size = 32 32 32 64
  layout size = 16 32 32 64
  logical machine size = 1 1 1 4
  subgrid size = 32 32 32 16
  total number of nodes = 4
  total volume = 2097152
  subgrid volume = 524288
If the logical machine size shows
  logical machine size = 1 1 1 1
it indicates the job thinks of itself as a single node job.
Forgetting to launch the job with an RSH based launcher
Most MPIs nowadays launch using SSH by default and special steps need to be taken to launch jobs via RSH as is done in our environment.
  • MVAPICH2 can launch using the command mpirun_rsh. The format of the command is something like:
    mpirun_rsh -rsh -np < number of MPI tasks > -machinefile < machine file >  < env vars >  < executable > < arguments >
    
    Please see help for mpirun_rsh
  • MVAPICH2's hydra interface can request an RSH launch using mpiexec.hydra. The format of the command is like:
      mpiexec.hydra -np < Number of Tasks > -launcher rsh ... < executable > < args >
    
    Please see help for mpiexec.hydra
  • OpenMPI needs to set an MCA parameter as below:
    mpirun --mca orte_rsh_agent rsh -np < Number of Tasks > -hostfile  < hostfile > ...
    
Forgetting to specify the hostfile
This can have the effect of launching all the processes requested on one node. This can cause a node to run out of memory and crash, taking the job with it. To a user this can appear as a hardware failure.