Intel MPI

In this chapter we note the main points about running Intel MPI on the Knights Landing cluster.

MPI Setup
Setting up MPI is straightforward: source the psxevars.sh script as detailed in the Compilers and Tools section.
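
A typical invocation looks like the following; the installation path and version here are illustrative, and the exact location on the cluster is given in the Compilers and Tools section.

source /opt/intel/parallel_studio_xe_2018/psxevars.sh intel64   # illustrative path; adjust to the installed version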
 
Running with mpirun
The simplest way to launch a job is to use the mpirun command. A list of useful switches is provided below:

-rsh rsh                   use RSH to connect and launch instead of SSH

-genvall                   make all current environment variables available to the MPI processes

-genv ENV value            set environment variable ENV to value in the MPI processes

-n numproc                 launch numproc processes in total

-ppn procpernode           run procpernode processes per node

-hosts host1,host2,...     specify the list of hosts to run on

-machinefile file          read the list of machines to use from file
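
For example, a combined invocation for 128 processes spread over two nodes (64 per node), with the full environment exported, might look like this (hostnames are placeholders):

mpirun -genvall -n 128 -ppn 64 -hosts HOST1,HOST2 ./executable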

mpirun is a wrapper around the more general mpiexec.hydra command. One can also use mpiexec.hydra directly, which allows a different executable to be launched on each host. The command has the following format:

mpiexec.hydra <global options> -host host1 <local options> executable1 : -host host2 <local options> executable2 : ...

Global options, such as -genvall and -genv ENV_VAR value, apply to all MPI processes. Local options apply only to one argument set: they include the host to run on (-host hostname), the number of processes to start there (-n), and environment variables set with -env ENV_VAR value, which are passed only to the processes on that host. It is also possible to run mpiexec.hydra with a configuration file, in which each line refers to a specific host together with its local options.
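
As a sketch (hostnames, executable names and the configuration file name are placeholders), the following launches 64 processes of prog_a on HOST1 and 64 processes of prog_b on HOST2, with one extra environment variable set only on HOST2:

mpiexec.hydra -genvall -host HOST1 -n 64 ./prog_a : -host HOST2 -n 64 -env OMP_NUM_THREADS 2 ./prog_b

The same argument sets can instead be placed one per line in a file and passed with -configfile, e.g. mpiexec.hydra -configfile ./hydra.conf, where each line of hydra.conf holds one argument set such as "-host HOST1 -n 64 ./prog_a".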

Selecting the OPA driver

One should set the environment variable

export I_MPI_FABRICS=shm:tmi

This makes Intel MPI use shared-memory transport within a node and the TMI driver for Omni-Path between nodes.
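
The same setting can also be passed on the command line rather than exported, for example (hostnames are placeholders):

mpirun -genv I_MPI_FABRICS shm:tmi -genvall -n 128 -ppn 64 -hosts HOST1,HOST2 ./executable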

Disabling OPA Host Fabric Interface Affinity

For hybrid MPI/OpenMP jobs using the tmi driver, you may find that threads still behave as if they were pinned to a single core even though I_MPI_PIN_DOMAIN is set to the whole node. This happens because the Omni-Path Host Fabric Interface (HFI) software sets its own CPU affinity; disable that behaviour with:

export HFI_NO_CPUAFFINITY=1

Process Pinning with Intel MPI

Intel MPI can pin (affinitize) its MPI processes. To enable this feature, set the environment variables I_MPI_PIN and I_MPI_PIN_DOMAIN. In particular, I_MPI_PIN_DOMAIN can be used to bind processes to nodes, cores or NUMA domains. Useful values are:

I_MPI_PIN                  0 = pinning is off, 1 = pinning is enabled

I_MPI_PIN_DOMAIN=node      the threads of a process are kept within a node; use this when running 1 process per node

I_MPI_PIN_DOMAIN=core      the threads of a process are kept within a core; use this when running 1 process per core

I_MPI_PIN_DOMAIN=numa      the threads of a process are kept within a NUMA domain; use this when running e.g. in SNC-4 mode

In addition to these pinning settings, one can further enforce OpenMP thread affinity by setting appropriate KMP_PLACE_THREADS and KMP_AFFINITY environment variables, as described in the OpenMP chapter.
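
As an illustration of the numa domain setting, the following sketch runs 4 MPI processes per node on 2 nodes, each process pinned to one quadrant with 32 threads (hostnames are placeholders; this assumes 64-core nodes in SNC-4 mode, i.e. 16 cores x 2 hardware threads per quadrant):

export OMP_NUM_THREADS=32
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=numa
mpirun -genvall -n 8 -ppn 4 -hosts HOST1,HOST2 ./executable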

Machine Files

The batch system will generate a default machine file whose name is kept in the environment variable $SLURM_JOB_NODELIST. In the past this file contained 64 copies of each node's name (one per core), because SLURM used a single core as the unit of a job resource request. At present SLURM uses a whole node as the unit, so $SLURM_JOB_NODELIST contains each node's name only once.
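
Such a machine file can be passed directly to mpirun with the -machinefile switch; a minimal sketch for a 2-node job with 64 tasks per node:

mpirun -genvall -machinefile $SLURM_JOB_NODELIST -n 128 -ppn 64 ./executable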

Example: 128 MPI tasks on a single node

export OMP_NUM_THREADS=1
mpirun -genvall  -n 128 -ppn 128 -host <hostname> executable

Example: 64 MPI tasks per node with 2 threads each

export OMP_NUM_THREADS=2
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
mpirun -genvall -n 64 -host <hostname> executable

Example: 1 MPI task on each of 2 nodes, with 128 threads each (2 threads per core)

export OMP_NUM_THREADS=128
export KMP_PLACE_THREADS=1s,64c,2t
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
mpirun -genvall -n 2 -hosts HOST1,HOST2 ./executable

Example: 64 MPI tasks per node, on 2 nodes, with 4 threads per task, using rsh

export KMP_AFFINITY=compact,granularity=thread
export OMP_NUM_THREADS=4
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh -genvall -n 128 -ppn 64 -hosts HOST1,HOST2 executable

Example: Forcing an MPI job to use MCDRAM on Flat-Quad nodes

Jobs on Flat-Quadrant nodes can make use of MCDRAM. If the application fits entirely into MCDRAM, one 'easy button' is to force the whole code to allocate from MCDRAM via numactl. This needs to be done for each MPI process, for example as follows: the command below runs 2 MPI processes on 2 nodes, each using 128 threads (2 threads per core), with every process forced to allocate memory from NUMA domain 1 (the MCDRAM on Flat-Quad nodes).

export KMP_HW_SUBSET=1s,64c,2t
export OMP_NUM_THREADS=128
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh  -genvall -n 2 -ppn 1 -hosts HOST1,HOST2 numactl -m 1 executable

Likewise, numactl -m 0 would bind allocations to DDR. On Cache-Quadrant nodes the MCDRAM is of course transparent and there is no need to use numactl.
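
To double-check the NUMA domain numbering on a given node, run numactl -H, which lists every domain with its CPUs and memory size; on a Flat-Quadrant node the MCDRAM appears as a domain with 16 GB of memory and no CPUs.

numactl -H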