In this chapter we summarize the main points about running Intel MPI on the Knights Landing cluster. The most commonly used mpirun options are:
-rsh rsh : use RSH to connect and launch rather than SSH
-genvall : make all current environment variables available to the MPI processes
-genv ENV value : set environment variable ENV to value in the MPI processes
-n numproc : launch numproc processes
-ppn procpernode : run procpernode processes per node
-hosts host1,host2,... : specify the list of hosts to run on
-machinefile filename : read the list of machines to use from the file filename
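As a small sketch tying several of these options together (the host names and executable are placeholders), the following launches 128 ranks in total, 64 per node, on two nodes, connecting over rsh and forwarding the current environment:
mpirun -rsh rsh -genvall -n 128 -ppn 64 -hosts HOST1,HOST2 ./executable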
mpirun is a wrapper around the more general mpiexec.hydra command. One can also invoke mpiexec.hydra directly, which allows a different executable to be launched on different hosts. The command has the following format:
mpiexec.hydra <global options> -host host1 <local options> executable1 : -host host2 <local options> executable2 : ...
Here the global options include things like -genvall and -genv ENV_VAR value, which are passed to all MPI processes. The local options include the name of the host to run on (-host hostname) and options such as -n to choose the number of processes to run on that host. In addition, one can pass local environment variables with -env ENV_VAR value, which are passed only to the processes on that host. It is also possible to run mpiexec.hydra with a configuration file, in which every line refers to a specific host together with its local options.
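As a sketch (host names, rank counts, environment variable names and executable names are placeholders), a heterogeneous launch on two nodes might look like this:
mpiexec.hydra -genvall -genv OMP_NUM_THREADS 4 -host HOST1 -n 32 ./exe_a : -host HOST2 -n 16 -env MY_VAR 1 ./exe_b
The same launch can be expressed through a configuration file, one host specification per line; on the Intel MPI versions we are aware of this file is passed with the -configfile option (check mpiexec.hydra -help if in doubt):
# contents of hydra.conf: one host specification per line, lines starting with # are ignored
-host HOST1 -n 32 ./exe_a
-host HOST2 -n 16 -env MY_VAR 1 ./exe_b
Then launch with:
mpiexec.hydra -genvall -genv OMP_NUM_THREADS 4 -configfile hydra.conf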
Selecting the OPA driver
One should set the environment variable
I_MPI_FABRICS=shm:tmi
This makes MPI use shared-memory transport within a node and the TMI driver for Omni-Path between nodes.
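To verify that the intended fabric was actually selected, a minimal sketch (the exact debug output format varies between Intel MPI versions): setting I_MPI_DEBUG to 2 or higher makes Intel MPI report the chosen transport at startup.
export I_MPI_FABRICS=shm:tmi
# I_MPI_DEBUG >= 2 reports the fabric(s) selected for intra- and inter-node traffic
export I_MPI_DEBUG=2
mpirun -genvall -n 2 -ppn 1 -hosts HOST1,HOST2 ./executable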
Disabling OPA Host Fabric Interface Affinity
For hybrid MPI/OpenMP jobs using the tmi driver you may find that, despite setting I_MPI_PIN_DOMAIN to the whole node, threads still behave as if they were pinned to a single core, because the Omni-Path Host Fabric Interface (HFI) driver applies its own CPU affinity. In this case one can set the environment variable HFI_NO_CPUAFFINITY=1, as used in the examples below.
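A minimal sketch of the workaround for a hybrid job on one node (the host name and thread count are placeholders):
export I_MPI_FABRICS=shm:tmi
export OMP_NUM_THREADS=64
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
# stop the Omni-Path HFI driver from restricting the threads' CPU affinity
export HFI_NO_CPUAFFINITY=1
mpirun -genvall -n 1 -host <hostname> ./executable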
Process Pinning with Intel MPI
Intel MPI allows one to affinitize its MPI processes. To enable this feature one should set the environment variables I_MPI_PIN and I_MPI_PIN_DOMAIN. In particular I_MPI_PIN_DOMAIN can be used to bind processes to nodes, cores or NUMA domains. Useful values are:
I_MPI_PIN : 0 = pinning is off, 1 = pinning is enabled
I_MPI_PIN_DOMAIN=node : the threads of a process are kept within a node -- use this when running 1 process per node
I_MPI_PIN_DOMAIN=core : the threads of a process are kept within a core -- use this when running 1 process per core
I_MPI_PIN_DOMAIN=numa : the threads of a process are kept within a NUMA domain -- use this when running e.g. in SNC-4 mode
In addition to these pinning settings, one can further enforce OpenMP thread affinity by setting appropriate KMP_PLACE_THREADS and KMP_AFFINITY environment variables, as described in the OpenMP chapter.
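To check where processes and threads actually land, one sketch (assuming the Intel OpenMP runtime; the output format differs between versions): the verbose modifier of KMP_AFFINITY prints each thread's binding, and I_MPI_DEBUG at level 4 or higher prints Intel MPI's process pinning map.
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
export OMP_NUM_THREADS=2
# 'verbose' makes the Intel OpenMP runtime print each thread's binding at startup
export KMP_AFFINITY=verbose,compact,granularity=thread
# level 4 adds the process pinning map to the Intel MPI startup output
export I_MPI_DEBUG=4
mpirun -genvall -n 64 -host <hostname> ./executable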
Machine Files
The batch system generates a default machine file whose name is kept in the environment variable $SLURM_JOB_NODELIST. In the past this file contained 64 copies of the name of each node (one per core), because SLURM used a single core as the unit of a job resource request. At present SLURM uses a whole node as the unit, so $SLURM_JOB_NODELIST contains each node's name only once.
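A sketch of using this file with the -machinefile option from the list above (assuming, as described here, that $SLURM_JOB_NODELIST names the machine file on this system; the rank counts are placeholders for a 2-node allocation):
export OMP_NUM_THREADS=1
# 2 nodes x 64 ranks per node = 128 ranks in total
mpirun -genvall -n 128 -ppn 64 -machinefile $SLURM_JOB_NODELIST ./executable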
Example: 128 MPI tasks on a single node
export OMP_NUM_THREADS=1
mpirun -genvall -n 128 -ppn 128 -host <hostname> executable
Example: 64 MPI tasks per node with 2 threads each
export OMP_NUM_THREADS=2
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
mpirun -genvall -n 64 -host <hostname> executable
Example: 1 MPI task on each of 2 nodes, with 128 threads each (2 threads per core)
export OMP_NUM_THREADS=128
export KMP_PLACE_THREADS=1s,64c,2t
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
mpirun -genvall -n 2 -hosts HOST1,HOST2 ./executable
Example: 64 MPI tasks per node on 2 nodes, with 4 threads per task, using rsh
export KMP_AFFINITY=compact,granularity=thread
export OMP_NUM_THREADS=4
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh -genvall -n 128 -ppn 64 -hosts HOST1,HOST2 executable
Example: Forcing an MPI job to use MCDRAM on Flat-Quad nodes
Jobs on Flat-Quadrant nodes can make use of MCDRAM. If the application fits completely into MCDRAM, a simple option is to force the entire code to allocate from MCDRAM via numactl. This must be done for each MPI process, for example as follows, which runs 2 MPI processes on 2 nodes, each MPI process using 128 threads (2 threads per core) and each process forced to allocate memory from NUMA domain 1 (the MCDRAM on Flat-Quadrant nodes):
export KMP_HW_SUBSET=1s,64c,2t
export OMP_NUM_THREADS=128
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh -genvall -n 2 -ppn 1 -hosts HOST1,HOST2 numactl -m 1 executable
Likewise, numactl -m 0 would bind allocations to DDR. On Cache-Quadrant nodes the MCDRAM acts as a transparent cache and there is no need to use numactl.
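To confirm which NUMA domain corresponds to MCDRAM before hard-coding -m 1, a quick check (the numbering can depend on the cluster and memory mode):
# list the NUMA nodes and their memory sizes; on a Flat-Quadrant node the
# CPU-less ~16 GB node is the MCDRAM (domain 1 in the example above)
numactl -H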