The cluster is made up of 192 nodes, each with an Intel(R) Xeon Phi(TM) 7230 CPU (code name Knights Landing, or KNL). Each CPU has 64 cores and runs at a nominal frequency of 1.30 GHz (before TurboBoost). Each node has 192 GB of DDR4 memory and 16 GB of high-bandwidth memory (so-called MCDRAM).
The cores are grouped into tiles, with each tile containing 2 cores and 1 MB of L2 cache. Nominally this gives each core 512 KB of L2 cache, although a single core can potentially use the full 1 MB. Additionally, each core has 32 KB of L1 data cache and two 512-bit vector processing units (VPUs), which together can perform 32 double-precision (DP) floating-point operations per cycle (2 VPUs x 2 FLOPs per FMA x 8 vector lanes). In single precision, 64 FLOPs per cycle are possible because each vector then holds 16 lanes. The VPUs are programmed using AVX-512 instructions.
The nodes are connected by the Intel Omni-Path interconnect.
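As a rough illustration of the per-cycle arithmetic above, the following sketch computes the theoretical peak floating-point rate of one core and of one node. The function names and constants are illustrative only, and the node figure assumes that all 64 cores sustain the nominal 1.30 GHz while issuing an AVX-512 FMA on both vector units every cycle, which real codes will generally not achieve.

```python
# Theoretical peak FLOPs derived from the figures quoted above.
# This is a back-of-the-envelope sketch, not a benchmark.

VPUS_PER_CORE = 2      # two 512-bit vector units per core
FLOPS_PER_FMA = 2      # a fused multiply-add counts as two FLOPs
VECTOR_BITS = 512      # AVX-512 register width
CORES_PER_NODE = 64
NOMINAL_GHZ = 1.30     # assumption: nominal clock, ignoring TurboBoost and any throttling


def peak_flops_per_cycle(element_bits: int) -> int:
    """Peak FLOPs per cycle for one core, for elements of the given width."""
    lanes = VECTOR_BITS // element_bits   # 8 lanes for DP, 16 for SP
    return VPUS_PER_CORE * FLOPS_PER_FMA * lanes


def peak_node_gflops(element_bits: int) -> float:
    """Peak GFLOP/s for one node under the assumptions stated above."""
    return peak_flops_per_cycle(element_bits) * CORES_PER_NODE * NOMINAL_GHZ


if __name__ == "__main__":
    print("DP FLOPs/cycle/core:", peak_flops_per_cycle(64))
    print("SP FLOPs/cycle/core:", peak_flops_per_cycle(32))
    print("DP peak per node (GFLOP/s): %.1f" % peak_node_gflops(64))
```

Under these assumptions the double-precision peak works out to roughly 2662 GFLOP/s per node; measured performance will depend on the actual clock and on how well the code vectorizes.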
The high-bandwidth memory can be configured in three primary modes: cache, flat, and hybrid.
In turn, the on-chip network connecting the cores can be configured in a variety of cluster modes: all-to-all mode, quadrant mode, and the sub-NUMA clustering modes (SNC-2 and SNC-4).
The primary difference between all-to-all and quadrant modes lies in how memory addresses are affinitized across the cores of the node. The recommended default mode, when all DDR DIMMs are of equal capacity, is the so-called quadrant mode. In this case memory is affinitized to virtual quadrants of the chip, and memory transactions are routed within a quadrant, giving a reduction in latency compared to all-to-all mode. The quadrant nature of the memory affinitization is not visible to the user: the node still looks like a single-socket system. This is the default mode of operation for our cluster nodes.
The sub-NUMA clustering modes (SNC-2 and SNC-4) make the node look like a multi-socket system with 2 or 4 sockets respectively. These modes may be beneficial if one is considering running, for example, multiple MPI tasks per node: one task could be pinned to each of the 'virtual sockets'.
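The idea of one MPI task per virtual socket can be sketched as a simple partition of the node's cores. The helper `cores_for_rank` below is hypothetical, and it assumes that the operating system numbers the cores of each virtual socket contiguously and that the cores divide evenly between sockets. Neither is guaranteed, so the real layout should be checked on a node (for example with `numactl --hardware` or `lscpu`) before relying on it.

```python
# Sketch: which cores would one MPI task own if it were pinned to a
# single 'virtual socket'? Hypothetical helper, not a site-provided tool.

def cores_for_rank(local_rank: int, ranks_per_node: int,
                   cores_per_node: int = 64) -> range:
    """Return the block of core IDs for a node-local MPI rank.

    Assumption: core IDs are contiguous within each virtual socket and
    the cores split evenly (e.g. 4 x 16 cores in SNC-4 on a 64-core CPU).
    """
    if ranks_per_node <= 0 or cores_per_node % ranks_per_node != 0:
        raise ValueError("cores must divide evenly between ranks")
    if not 0 <= local_rank < ranks_per_node:
        raise ValueError("local_rank out of range")
    per_rank = cores_per_node // ranks_per_node
    start = local_rank * per_rank
    return range(start, start + per_rank)


if __name__ == "__main__":
    # SNC-4: four virtual sockets, one MPI task on each.
    for rank in range(4):
        cores = cores_for_rank(rank, ranks_per_node=4)
        print("rank", rank, "-> cores", cores.start, "to", cores.stop - 1)
```

The resulting core IDs could then be handed to whatever pinning mechanism the MPI launcher or job script provides.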
It should be borne in mind that memory modes and cluster modes can be mixed and matched. For example, it is possible to have Flat-SNC-4 mode, where each of the 4 virtual 'sockets' is made up of 2 NUMA domains (8 NUMA domains in total), with each virtual socket having one NUMA domain for DDR and one for MCDRAM.
The default mode for our cluster nodes is cache-quadrant (PBS tag cache-quad).
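To make the mixing of memory and cluster modes concrete, the sketch below counts the NUMA domains a node would expose for a given combination. The function name and the lookup tables are illustrative. They cover only the cache and flat memory modes, and they assume that cache mode hides MCDRAM from the operating system while flat mode exposes it as one extra NUMA domain per virtual socket, consistent with the Flat-SNC-4 example above.

```python
# Sketch: number of NUMA domains visible for a memory-mode / cluster-mode pair.
# Illustrative tables only; verify on a real node with `numactl --hardware`.

# Virtual sockets presented by each cluster mode.
VIRTUAL_SOCKETS = {
    "all-to-all": 1,
    "quadrant": 1,   # quadrants are not visible to the user
    "snc-2": 2,
    "snc-4": 4,
}

# NUMA domains per virtual socket for each memory mode.
DOMAINS_PER_SOCKET = {
    "cache": 1,      # assumption: MCDRAM acts as a cache, only DDR is visible
    "flat": 2,       # one domain for DDR and one for MCDRAM
}


def numa_domains(memory_mode: str, cluster_mode: str) -> int:
    """Return the expected NUMA domain count for the given mode pair."""
    sockets = VIRTUAL_SOCKETS[cluster_mode.lower()]
    per_socket = DOMAINS_PER_SOCKET[memory_mode.lower()]
    return sockets * per_socket


if __name__ == "__main__":
    for memory_mode in ("cache", "flat"):
        for cluster_mode in ("quadrant", "snc-2", "snc-4"):
            count = numa_domains(memory_mode, cluster_mode)
            print(memory_mode + "-" + cluster_mode, "->", count, "NUMA domain(s)")
```

With these tables, Flat-SNC-4 gives the 8 domains described above, while the default cache-quadrant configuration gives a single domain.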