Current systems consist of multi-core nodes and contain accelerators. One effect of this is that users may wish to run in so-called hybrid threaded MPI mode. Typically this results in fewer MPI processes per node than there are cores in the node. Examples of this are as follows:
The feature of binding particular MPI processes to nodes, cores or sockets is referred to as processor affinity. In general each MPI implementation deals with this in its own way. For example, the process affinity interface of MVAPICH2-1.8 is described in its user guide, whereas OpenMPI uses host files and rank files, as described in the OpenMPI documentation.
Running on multi-node systems typically presents (at least) two immediate practical questions to the user:
We describe how to answer these questions below. Please be aware that the answers are highly system-specific; core and thread mappings can vary even between different cluster nodes.
MPI provides several levels of thread safety. In order to use threads safely within MPI, one needs to start the MPI process using MPI_Init_thread rather than the default MPI_Init. MPI offers 4 levels of thread safety: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED and MPI_THREAD_MULTIPLE.
MPI_THREAD_SINGLE offers the lowest level of safety, whereas MPI_THREAD_MULTIPLE offers the highest level, at the cost of some overhead.
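As a minimal illustration (not taken from any particular application), a hybrid MPI+OpenMP code might request MPI_THREAD_FUNNELED, which suffices when only the master thread makes MPI calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided;

    /* Request a thread-support level instead of calling MPI_Init();
       'provided' reports the level the library actually grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not provide the requested thread level\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work goes here ... */

    MPI_Finalize();
    return 0;
}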
OpenMP is an interface to threading: the code is annotated with special #pragma statements to indicate parallel regions, reductions, synchronization, etc.
OpenMP typically needs special compiler flags to turn it on. For the GCC/G++ family of compilers use the -fopenmp flag; for the Intel compiler use -openmp.
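As an illustrative (hypothetical) example, a minimal OpenMP program with a parallel loop and a reduction looks as follows; it can be compiled with, e.g., gcc -fopenmp example.c -o example (or icc -openmp example.c -o example with the Intel compiler):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int N = 1000000;
    double sum = 0.0;
    int i;

    /* Parallel loop with a reduction; the number of threads is
       controlled at run time via OMP_NUM_THREADS. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++) {
        sum += 1.0 / (double)(i + 1);
    }

    printf("max threads = %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}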
One can find out about the core geometry using commands such as numactl. For example, on one of our nodes numactl displays:
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65519 MB
node 0 free: 33278 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 27611 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
showing that there are two NUMA domains, each with 16 logical CPUs (CPUs 0-7 and 16-23 on node 0, CPUs 8-15 and 24-31 on node 1) and roughly 64 GB of memory, and that accesses to the remote NUMA node are more expensive (distance 20 versus 10).
For example, for a 4-GPU job, a sensible distribution may be to divide the 'real threads' between the GPUs, e.g. as follows:
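One possible (illustrative) mapping for the 32-CPU node shown in the numactl output above is to give each GPU half of a NUMA domain; the exact CPU numbers and GPU affinities are system specific:

MPI rank 0 -> GPU 0, CPUs 0-3,16-19   (NUMA node 0)
MPI rank 1 -> GPU 1, CPUs 4-7,20-23   (NUMA node 0)
MPI rank 2 -> GPU 2, CPUs 8-11,24-27  (NUMA node 1)
MPI rank 3 -> GPU 3, CPUs 12-15,28-31 (NUMA node 1)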
with each process potentially using up to 8 OpenMP threads.
Both MVAPICH and OpenMPI rely on host files generated by the PBS system at job launch. On our clusters we request nodes with the "-lnodes=X" argument to qsub, where X is the number of cores (on unaccelerated nodes) or the number of GPUs (on accelerated nodes), respectively. This generates a hostfile with one entry for each "core" or "accelerator". The name of the hostfile is stored in the environment variable $PBS_NODEFILE, and we need to tell the MPI job to use this nodefile. This can be done as follows in MVAPICH2-1.8:
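With the mpiexec.hydra launcher the hostfile is passed via the -f option (as in the job script further below):

mpiexec.hydra -np <Number of Tasks> -f $PBS_NODEFILE ... <executable> <args>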
For OpenMPI the hostfile is specified using either the -hostfile or --hostfile option (they are equivalent).
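For example, to use the PBS-generated hostfile:

mpirun -np <Number of Tasks> -hostfile $PBS_NODEFILE ... <executable> <args>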
Both MVAPICH and OpenMPI allow core binding in a variety of ways.
For the fullest control in MVAPICH, one can turn off MVAPICH's own processor affinity by setting the environment variable

MV2_ENABLE_AFFINITY=0
Then, instead of launching an executable directly, one can launch a shell script. The shell script can find the process's local MPI rank on the node and bind the process to the appropriate CPUs using numactl. An example is shown below: a PBS job script that launches a wrapper script called script, which then uses 4 OpenMP threads per MPI task (i.e. all 16 hyperthreads of a 2-socket, 4-core-per-socket node). The job is specific to a 9G node, as the mapping of processors is system dependent.
#!/bin/bash
#PBS -lnodes=4:c9g
#PBS -A xyz
#PBS -lwalltime=10:00

# Set up the environment
export PATH=/usr/mpi/gcc/mvapich2-1.8/bin:/usr/local/cuda-5.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/mpi/gcc/mvapich2-1.8/lib:/usr/local/cuda-5.0/lib64:/usr/local/cuda-5.0/lib:$LD_LIBRARY_PATH

# Pick the executable
EXECUTABLE=./chroma_single_sm35
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

mpiexec.hydra \
  -np 4 \
  -genv MV2_ENABLE_AFFINITY 0 \
  -genv QUDA_RESOURCE_PATH /home/bjoo/tuning_sm30 \
  -f $PBS_NODEFILE \
  -launcher rsh \
  ./script $EXECUTABLE $ARGS
where the actual executable is launched via the wrapper script, called script, shown below:
#!/bin/bash

ARGS=$*

# Choose the CPU binding based on the MPI task's local rank on this node
case $MV2_COMM_WORLD_LOCAL_RANK in
0)
  CPUS="0,8,1,9"
  ;;
1)
  CPUS="2,10,3,11"
  ;;
2)
  CPUS="4,12,5,13"
  ;;
3)
  CPUS="6,14,7,15"
  ;;
*)
  echo "LOCAL Rank cannot be bigger than 3"
  exit 1
  ;;
esac

# Run the executable with 4 OpenMP threads, bound to the chosen CPUs
export OMP_NUM_THREADS=4
/usr/bin/numactl --physcpubind=${CPUS} ${ARGS}
NOTE: the above script should scale to more nodes, since the binding is based on the local rank of the MPI task on its own node.
In OpenMPI one can specify the mapping in a variety of ways, for example using rank files. Please see the OpenMPI documentation for details.
By default processor binding is not enabled in OpenMPI, and it can be explicitly disabled using the --bind-to-none flag. In this case one can use the script-based approach outlined above for MVAPICH2-1.8; however, the environment variable MV2_COMM_WORLD_LOCAL_RANK needs to be replaced with OMPI_COMM_WORLD_LOCAL_RANK, which is OpenMPI's version of the local rank.
Below is a list of common MPI gotchas. A useful first check is the application's startup output; for example, a Chroma job run over 4 MPI tasks with -geom 1 1 1 4 (as in the job script above) should report something like:
Lattice initialized:
  problem size = 32 32 32 64
  layout size = 16 32 32 64
  logical machine size = 1 1 1 4
  subgrid size = 32 32 32 16
  total number of nodes = 4
  total volume = 2097152
  subgrid volume = 524288
If the logical machine size shows
logical machine size = 1 1 1 1
it indicates that the job thinks of itself as a single-node job.
mpirun_rsh -rsh -np <number of MPI tasks> -machinefile <machine file> <env vars> <executable> <arguments>
Please see the help output of mpirun_rsh for more details.
mpiexec.hydra -np <Number of Tasks> -launcher rsh ... <executable> <args>
Please see the help output of mpiexec.hydra for more details.
mpirun --mca orte_rsh_agent rsh -np <Number of Tasks> -hostfile <hostfile> ...