KNL System Overview

System Overview

The cluster is made up of 192 nodes, each featuring an Intel(R) Xeon Phi(TM) 7230 CPU (code name Knights Landing, or KNL). Each CPU has 64 cores and runs at a nominal frequency of 1.30 GHz (before Turbo Boost). Each node has 192 GB of DDR4 memory and 16 GB of high-bandwidth memory (so-called MCDRAM).

The cores are grouped into tiles, with each tile containing 2 cores and 1 MB of shared L2 cache. Nominally this gives each core 512 KB of L2 cache, although a single core can potentially use the full 1 MB. Additionally, each core has 32 KB of L1 data cache and two 512-bit vector processing units (VPUs), which together can perform 32 double-precision floating-point operations (FLOPs) per cycle (2 VPUs x 2 FLOPs per FMA x 8 vector lanes). In single precision, 64 FLOPs per cycle are possible, since each vector then holds 16 lanes. The VPUs are programmed using AVX-512 instructions.

The nodes are connected by the Intel Omni-Path interconnect.
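The per-cycle figures above can be turned into a rough theoretical peak. The following is a minimal sketch that uses only the numbers quoted in this section; the function names are illustrative, and the result assumes every core sustains two FMA instructions per cycle at the nominal 1.30 GHz clock, which real applications will generally not reach.

```python
# Theoretical peak throughput derived from the figures quoted above.
# This is an upper bound at the nominal clock, not a measured value.
VPUS_PER_CORE = 2      # two 512-bit vector units per core
FLOPS_PER_FMA = 2      # a fused multiply-add counts as two FLOPs
DP_LANES = 8           # 512 bits / 64-bit doubles
SP_LANES = 16          # 512 bits / 32-bit floats
CORES_PER_NODE = 64
CLOCK_GHZ = 1.30       # nominal frequency, ignoring Turbo Boost
NODES = 192


def flops_per_cycle(lanes):
    """FLOPs one core can issue per cycle for a given vector width."""
    return VPUS_PER_CORE * FLOPS_PER_FMA * lanes


def node_peak_gflops(lanes):
    """Peak GFLOP/s of one node (all cores busy, nominal clock)."""
    return flops_per_cycle(lanes) * CORES_PER_NODE * CLOCK_GHZ


if __name__ == "__main__":
    for label, lanes in (("DP", DP_LANES), ("SP", SP_LANES)):
        per_node = node_peak_gflops(lanes)
        print(f"{label}: {flops_per_cycle(lanes)} FLOPs/cycle/core, "
              f"{per_node:.1f} GFLOP/s per node, "
              f"{per_node * NODES / 1000:.1f} TFLOP/s for {NODES} nodes")
```

With these inputs the double-precision peak works out to about 2662 GFLOP/s per node.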

Node MCDRAM and Cluster Configuration Modes

The high-bandwidth memory can be configured in three primary modes:

Flat Mode:
In this mode the 16 GB of MCDRAM is directly accessible. The operating system exposes MCDRAM as a separate NUMA domain: memory allocated in NUMA domain 0 is the regular (slower) DDR4 memory, and memory allocated in NUMA domain 1 goes to MCDRAM. There are several ways to place data in this mode, for example using numactl or hbw_malloc.

Cache Mode:
In this mode the 16 GB of MCDRAM is configured as a cache for the larger DDR memory space. Users do not allocate memory in MCDRAM directly. The cache is direct mapped, so in principle several DDR pages can map to the same MCDRAM page. Furthermore, nontemporal stores, which bypass the CPU L1 and L2 caches, do not bypass MCDRAM when it is used as a cache, so using MCDRAM as a cache necessarily generates read-for-write traffic. As a result, the effective memory bandwidth available in cache mode is somewhat lower than in flat mode. Whether this affects an application depends on that application's use of memory bandwidth.

Hybrid Mode:
In this mode part of the MCDRAM is directly accessible and part is maintained as a cache.
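To illustrate why a direct-mapped cache lets several DDR locations compete for the same MCDRAM location, here is a minimal sketch of a direct-mapped index calculation. The 64-byte granularity and the simple modulo mapping are assumptions made for illustration; the actual hardware mapping is not described in this document and may differ.

```python
# Illustrative model of a direct-mapped cache in front of DDR.
# ASSUMPTIONS: 64-byte cache granularity and a plain modulo index;
# the real MCDRAM mapping function is not documented here.
GIB = 1024 ** 3
MCDRAM_BYTES = 16 * GIB
DDR_BYTES = 192 * GIB
LINE_BYTES = 64                       # assumed granularity
NUM_SLOTS = MCDRAM_BYTES // LINE_BYTES


def cache_slot(phys_addr):
    """Slot a physical address would occupy in the direct-mapped cache."""
    return (phys_addr // LINE_BYTES) % NUM_SLOTS


def conflicts(addr_a, addr_b):
    """True if two different lines map to the same slot (they evict each other)."""
    different_lines = addr_a // LINE_BYTES != addr_b // LINE_BYTES
    return different_lines and cache_slot(addr_a) == cache_slot(addr_b)


if __name__ == "__main__":
    a = 0x1000
    print(conflicts(a, a + MCDRAM_BYTES))   # addresses 16 GiB apart share a slot
    print(conflicts(a, a + LINE_BYTES))     # neighbouring lines do not
    print(DDR_BYTES // MCDRAM_BYTES, "DDR locations compete for each slot")
```

Under this model, two arrays whose addresses differ by a multiple of 16 GiB would repeatedly evict each other, which is one reason cache-mode bandwidth can fall below flat-mode bandwidth.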

The on-chip network connecting the cores can, in turn, be configured in a variety of cluster modes: all-to-all mode, quadrant mode, and the sub-NUMA clustering modes (SNC-2 and SNC-4).

The primary difference between all-to-all and quadrant modes lies in how memory addresses are affinitized across the cores of the node. The recommended default mode, when all DDR DIMMs are of equal capacity, is the so-called quadrant mode. In this case memory is affinitized to virtual quadrants of the chip, and memory transactions are routed within a quadrant, which reduces latency compared to all-to-all mode. The quadrant-based affinitization is not visible to the user: the node still looks like a single-socket system. This is the default mode of operation for our cluster nodes.

The sub-NUMA clustering modes (SNC-2 and SNC-4) make the node look like a multi-socket system with 2 or 4 sockets, respectively. These modes may be beneficial when running, for example, multiple MPI tasks per node, since one task can be pinned to each of the 'virtual sockets'.
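As a rough illustration of pinning one task per virtual socket, the sketch below splits the 64 cores into equal groups. It assumes the cores divide evenly between the clusters and are numbered contiguously within each cluster; the actual core numbering on a node should be checked (for example with numactl --hardware) before relying on such ranges. The function name is hypothetical.

```python
# Sketch: divide a node's cores between SNC 'virtual sockets'.
# ASSUMPTIONS: even split and contiguous core numbering per cluster;
# verify the real layout on the node before using these ranges.
def core_ranges(total_cores=64, clusters=4):
    """Return one range of core IDs per virtual socket."""
    if total_cores % clusters != 0:
        raise ValueError("cores do not divide evenly between clusters")
    size = total_cores // clusters
    return [range(i * size, (i + 1) * size) for i in range(clusters)]


if __name__ == "__main__":
    for rank, cores in enumerate(core_ranges(64, 4)):
        # One MPI task per virtual socket, pinned to that socket's cores.
        print(f"task {rank}: cores {cores.start}-{cores.stop - 1}")
```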

Memory modes and cluster modes can be mixed and matched. For example, it is possible to have Flat-SNC-4 mode, in which each of the 4 virtual 'sockets' is made up of 2 NUMA domains (8 NUMA domains in total): one NUMA domain for DDR and one for MCDRAM per virtual socket.
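The way the two settings combine can be summarised with a small calculation. This is a sketch based on the description above; the mode names used as dictionary keys are illustrative labels, and treating hybrid mode like flat mode (its directly accessible part appearing as an extra NUMA domain) is an assumption.

```python
# Sketch: number of NUMA domains visible for a memory/cluster mode pair.
# ASSUMPTION: hybrid mode exposes its directly accessible MCDRAM part as
# a separate NUMA domain, in the same way flat mode does.
VIRTUAL_SOCKETS = {"all-to-all": 1, "quadrant": 1, "snc-2": 2, "snc-4": 4}
MCDRAM_VISIBLE = {"flat": True, "hybrid": True, "cache": False}


def numa_domains(memory_mode, cluster_mode):
    """Each virtual socket has a DDR domain, plus an MCDRAM domain if visible."""
    domains_per_socket = 2 if MCDRAM_VISIBLE[memory_mode] else 1
    return VIRTUAL_SOCKETS[cluster_mode] * domains_per_socket


if __name__ == "__main__":
    for mem, clus in (("flat", "snc-4"), ("flat", "quadrant"), ("cache", "quadrant")):
        print(f"{mem}-{clus}: {numa_domains(mem, clus)} NUMA domain(s)")
```

For Flat-SNC-4 this reproduces the 8 NUMA domains described above, while the cache-quadrant configuration presents a single domain.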

The default mode for our cluster nodes is cache-quadrant (PBS tag: cache-quad).