Intel MPI on the Knights Landing Cluster

In this chapter we note the main points about running Intel MPI on the Knights Landing cluster.

MPI Setup

MPI is set up by sourcing the psxevars.sh script, as detailed in the Compilers and Tools section.

Running with mpirun

The simplest way to run is to use the mpirun command. A list of useful switches is provided below:

 -rsh rsh
use RSH rather than SSH to connect and launch
 -genvall
make all current environment variables available in the MPI processes
 -genv ENV value
set environment variable ENV to value in the MPI processes
 -n numproc
launch numproc processes
 -ppn procpernode
run procpernode processes per node
 -hosts host1,host2,...
specify the list of hosts to run on
 -machinefile filename
specify a file listing the machines to use

mpirun is a wrapper around the more general mpiexec.hydra command. mpiexec.hydra can also be used directly; among other things, it allows different executables to be run on different hosts. The command has the following format:

 mpiexec.hydra <global options> -host <hostname> <local options> executable : -host <hostname> <local options> executable2 : ...

Here the global options include switches such as -genvall and -genv ENV_VAR value, which are passed to all MPI processes. The local options include the name of the host to run on (hostname) and switches such as -n to choose the number of processes to run on that host. In addition, one can pass local environment variables with -env ENV_VAR value, which are passed only to the processes on that host.

It is possible to run mpiexec.hydra with a configuration file, in which every line refers to a specific host, with its local host options.
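
As a rough sketch of the configuration-file approach, the snippet below writes a two-line file, one set of local options per line, and prints the launch command instead of running it. The host names node01 and node02, the programs ./solver and ./io_server, and the file name are placeholders, not names from this cluster. The -configfile switch is the Hydra option for reading such a file; check it against the installed Intel MPI version before relying on it.

```shell
# Placeholder names throughout: node01/node02, ./solver, ./io_server.
conf_file="${TMPDIR:-/tmp}/hydra.conf.$$"

# One line per host: local options (-host, -n, -env) followed by the executable.
cat > "$conf_file" <<'EOF'
-host node01 -n 64 -env OMP_NUM_THREADS 1 ./solver
-host node02 -n 1 -env OMP_NUM_THREADS 64 ./io_server
EOF

# Dry run: show the command rather than launching anything.
launch_cmd="mpiexec.hydra -configfile $conf_file"
echo "$launch_cmd"
```

Because -env is a local option, each line can carry its own environment, which is the main reason to prefer this form over a single mpirun command line.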

Selecting the OPA driver

One should set the environment variable

I_MPI_FABRICS=shm:tmi

This should make MPI use shared-memory transport within a node and the TMI driver over Omni-Path (OPA) between nodes.

Disabling OPA Host Fabric Interface Affinity

For hybrid MPI/OpenMP jobs using the tmi driver, threads may still behave as if pinned to one core even though I_MPI_PIN_DOMAIN is set to the whole node. In this case one can disable the Host Fabric Interface CPU affinity with the environment variable:

export HFI_NO_CPUAFFINITY=1

Process Pinning with Intel MPI

Intel MPI can pin (affinitize) its MPI processes. To enable this feature, set the environment variables I_MPI_PIN and I_MPI_PIN_DOMAIN. In particular, I_MPI_PIN_DOMAIN can be used to bind processes to nodes, cores or NUMA domains. Useful values are:

 I_MPI_PIN
0 = pinning is off, 1 = pinning is enabled
 I_MPI_PIN_DOMAIN=node
The threads of a process are kept within a node; use this when running 1 process per node
 I_MPI_PIN_DOMAIN=core
The threads of a process are kept within a core; use this when running 1 process per core
 I_MPI_PIN_DOMAIN=numa
The threads of a process are kept within a NUMA domain; use this when running e.g. in SNC-4 mode

In addition to these pinning settings, one can further enforce OpenMP affinity by setting the KMP_PLACE_THREADS (called KMP_HW_SUBSET in newer Intel compiler releases) and KMP_AFFINITY environment variables appropriately, as described in the OpenMP chapter.
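
The mapping from process count to pin domain can be sketched as a small helper. This is only an illustration, not part of Intel MPI: the function name pin_domain_for is made up, and it assumes 64 cores per node and 4 NUMA domains in SNC-4 mode, so the numbers need adjusting for other configurations.

```shell
# pin_domain_for RANKS_PER_NODE
# Hypothetical helper: picks an I_MPI_PIN_DOMAIN value following the
# guidance above. Assumes 64 cores per node and 4 NUMA domains (SNC-4).
pin_domain_for() {
    case "$1" in
        1)  echo node ;;   # one rank owns the whole node
        4)  echo numa ;;   # one rank per NUMA domain in SNC-4 mode
        64) echo core ;;   # one rank per core
        *)  echo "no rule for $1 ranks per node" >&2; return 1 ;;
    esac
}

export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN="$(pin_domain_for 64)"
echo "I_MPI_PIN=$I_MPI_PIN I_MPI_PIN_DOMAIN=$I_MPI_PIN_DOMAIN"
```

Rank counts that match none of the three cases are rejected rather than guessed, since the right domain then depends on how the threads should be laid out.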

Machine Files

The PBS system generates a default machine file whose name is stored in the environment variable $PBS_NODEFILE. In the past, this file contained 64 copies of each node name (one per core), because PBS used a single core as the unit of a job resource request. At present, PBS uses a whole node as the unit, so $PBS_NODEFILE contains each node name only once.
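
Since the file may list each node once or many times depending on the PBS configuration, a job script can count distinct host names rather than lines. The helpers below are a minimal sketch with made-up function names (count_nodes, total_ranks); ./executable and the value 64 for processes per node are placeholders.

```shell
# count_nodes FILE: number of distinct host names in a machine file.
# sort -u collapses repeated entries, so the old per-core layout and the
# current one-line-per-node layout give the same answer.
count_nodes() {
    sort -u "$1" | grep -c .
}

# total_ranks FILE PPN: value to pass to -n when running PPN ranks per node.
total_ranks() {
    echo $(( $(count_nodes "$1") * $2 ))
}

# Inside a PBS job, print (rather than run) the resulting launch command.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo mpirun -genvall -machinefile "$PBS_NODEFILE" \
         -n "$(total_ranks "$PBS_NODEFILE" 64)" -ppn 64 ./executable
fi
```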

Example: 128 MPI tasks per node on a single node

export OMP_NUM_THREADS=1
mpirun -genvall -n 128 -ppn 128 -host <hostname> executable


Example: 64 MPI tasks per node with 2 threads each

export OMP_NUM_THREADS=2
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
mpirun -genvall -n 64 -host <hostname> executable


Example: 1 MPI task on each of 2 nodes, with 128 threads per task (2 threads per core)

export OMP_NUM_THREADS=128
export KMP_PLACE_THREADS=1s,64c,2t
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
mpirun -genvall -n 2 -ppn 1 -hosts HOST1,HOST2 ./executable


Example: 64 MPI tasks per node on 2 nodes, with 4 threads per task, using rsh

export KMP_AFFINITY=compact,granularity=thread
export OMP_NUM_THREADS=4
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh -genvall -n 128 -ppn 64 -hosts HOST1,HOST2 executable


Example: Forcing an MPI job to use MCDRAM on Flat-Quad nodes

Jobs on Flat-Quadrant nodes can make use of MCDRAM. If the application fits completely into MCDRAM, one 'easy button' is to force the entire code to allocate from MCDRAM via numactl. This needs to be done for each MPI process. The following example runs 2 MPI processes on 2 nodes (one per node), with each process using 128 threads (2 threads per core) and forced to allocate memory from NUMA domain 1 (MCDRAM on Flat-Quad nodes):

export KMP_HW_SUBSET=1s,64c,2t
export OMP_NUM_THREADS=128
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh -genvall -n 2 -ppn 1 -hosts HOST1,HOST2 numactl -m 1 executable

Likewise, numactl -m 0 would bind allocations to DDR. On Cache-Quadrant nodes the MCDRAM acts as a transparent cache, so there is no need to use numactl.
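
When the same job script may land on either node type, the numactl prefix can be chosen at run time. The sketch below is an assumption-laden illustration: the function name membind_prefix is made up, it relies on the usual Linux sysfs location for NUMA nodes, and it treats "a node1 directory exists" as meaning Flat-Quad, which only holds on nodes booted in Flat-Quad or Cache-Quad mode.

```shell
# membind_prefix [SYSFS_NODE_DIR]
# Hypothetical helper: prints "numactl -m 1" when a second NUMA node is
# visible (Flat-Quad, where node 1 is MCDRAM) and nothing otherwise
# (Cache-Quad, where only node 0 exists).
membind_prefix() {
    node_dir=${1:-/sys/devices/system/node}
    if [ -d "$node_dir/node1" ]; then
        echo "numactl -m 1"
    fi
}

# Dry run: print the command that would be launched from this host.
echo mpirun -genvall -n 2 -ppn 1 -hosts HOST1,HOST2 $(membind_prefix) ./executable
```

The check inspects the host it runs on, so in practice it belongs in a small wrapper script started by mpirun on each compute node rather than on the node where the job is submitted.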