In this chapter we note the main points about running Intel MPI on the Knights Landing cluster.
It is straightforward to set up MPI by sourcing the psxevars.sh script, as detailed in the Compilers and Tools section.
The simplest way to run is to use the mpirun command. Useful switches include -n (total number of MPI ranks), -ppn (ranks per node), -host/-hosts (target node or comma-separated node list) and -genvall (export the full environment to all ranks); these are illustrated in the examples throughout this chapter.
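As a minimal sketch (the hostname and ./executable are placeholders), a single-node run with 64 ranks looks like:
mpirun -genvall -n 64 -ppn 64 -host <hostname> ./executable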
mpirun is a wrapper around the more general mpiexec.hydra command. One can also use mpiexec.hydra directly, which in addition allows, for example, different executables on different hosts. The command has the following format:
mpiexec.hydra <global options> -host <hostname1> <local options> <executable1> : -host <hostname2> <local options> <executable2> : ...
Here, global options include switches such as -genvall and -genv ENV_VAR value, which are applied to all MPI processes. The local options include the host to run on (-host hostname) and switches such as -n to choose the number of processes launched on that host. In addition, one can pass local environment variables with -env ENV_VAR value, which are applied only to the processes on that host.
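As an illustrative sketch (the node names and executables are placeholders), the following runs 64 ranks of one executable on one host and 64 ranks of a different executable on another:
mpiexec.hydra -genvall -host node01 -n 64 ./exe_a : -host node02 -n 64 -env OMP_NUM_THREADS 2 ./exe_b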
It is also possible to run mpiexec.hydra with a configuration file, in which each line specifies one host together with its local options.
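For example, assuming the standard Hydra -configfile switch and a file hosts.cfg containing one argument set per line (node names and executables are again placeholders):
-host node01 -n 64 ./exe_a
-host node02 -n 64 ./exe_b
the job can then be launched with:
mpiexec.hydra -genvall -configfile hosts.cfg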
One should set the environment variable
I_MPI_FABRICS=shm:tmi
This makes MPI use shared-memory transport within a node and the TMI driver for Omni-Path between nodes.
For hybrid MPI/OpenMP jobs using the tmi driver, threads may behave as if pinned to a single core even when I_MPI_PIN_DOMAIN is set to the whole node. In this case one can set the environment variable:
export HFI_NO_CPUAFFINITY=1
Intel MPI allows one to pin (affinitize) its MPI processes. To enable this feature one should set the environment variables I_MPI_PIN and I_MPI_PIN_DOMAIN. In particular, I_MPI_PIN_DOMAIN controls the domain each process is bound to; useful values are node, core and numa.
In addition to these pinning settings, one can further enforce OpenMP thread affinity by setting the appropriate KMP_PLACE_THREADS and KMP_AFFINITY environment variables, as described in the OpenMP chapter.
The PBS system generates a default machine file whose name is kept in the environment variable $PBS_NODEFILE. In the past this file contained 64 copies of each node's name (one per core), because PBS used a single core as the unit of a job resource request. At present PBS uses a whole node as the unit, so $PBS_NODEFILE contains each node's name only once.
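If the node list is needed explicitly it can be passed to the launcher; as a sketch, assuming the standard Hydra -machinefile switch and a two-node allocation:
mpirun -genvall -machinefile $PBS_NODEFILE -n 128 -ppn 64 ./executable
The examples below illustrate some typical rank/thread layouts. The first runs a pure MPI job with 128 ranks on a single node (two ranks per core, one OpenMP thread per rank):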
export OMP_NUM_THREADS=1
mpirun -genvall -n 128 -ppn 128 -host <hostname> executable
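The next example runs 64 ranks on a single node, each rank pinned to its own core and running 2 OpenMP threads: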
export OMP_NUM_THREADS=2
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
mpirun -genvall -n 64 -host <hostname> executable
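For a rank-per-node layout, the following runs one rank on each of two nodes, pins each rank to the whole node, and uses 128 OpenMP threads per rank (2 per core) placed via KMP_PLACE_THREADS and KMP_AFFINITY: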
export OMP_NUM_THREADS=128
export KMP_PLACE_THREADS=1s,64c,2t
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
mpirun -genvall -n 2 -hosts HOST1,HOST2 ./executable
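A hybrid run across two nodes: 128 ranks in total (64 per node), each rank pinned to a core and running 4 OpenMP threads, with the fabric and HFI affinity settings described above set explicitly: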
export KMP_AFFINITY=compact,granularity=thread
export OMP_NUM_THREADS=4
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh -genvall -n 128 -ppn 64 -hosts HOST1,HOST2 executable
Jobs on Flat-Quadrant nodes can make use of MCDRAM. One 'easy button', if the application fits completely into MCDRAM, is to force the entire code to allocate from MCDRAM via numactl. This needs to be done for each MPI process, for example as follows, which runs 2 MPI processes on 2 nodes, each MPI process using 128 threads (2 threads per core) and each process forced to allocate memory from NUMA domain 1 (MCDRAM on Flat-Quadrant nodes):
export KMP_HW_SUBSET=1s,64c,2t
export OMP_NUM_THREADS=128
export KMP_AFFINITY=compact,granularity=thread
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=node
export I_MPI_FABRICS=shm:tmi
export HFI_NO_CPUAFFINITY=1
mpirun -rsh rsh -genvall -n 2 -ppn 1 -hosts HOST1,HOST2 numactl -m 1 executable
Likewise, numactl -m 0 would bind allocations to DDR. On Cache-Quadrant nodes the MCDRAM is of course transparent and there is no need to use numactl.