Experimental Physics Computing

The batch farm contains ~290 CentOS 7 nodes, with 16, 24, 36, or 40 cores each, for a total of ~8500 cores (see the short tally after the list):
  • 2018: 88 nodes, dual 20 core 2.4 GHz Xeon (Skylake), 96 GB memory, 480 GB SSD, FDR IB
  • 2016: 44 nodes, dual 18 core 2.3 GHz Xeon E5-2697V4 (Broadwell), 64 GB memory, 1 TB HDD, FDR IB
  • 2014: 104 nodes, dual 12 core 2.3 GHz Xeon (Haswell), 32 GB memory, dual 1 TB HDDs, QDR IB
  • 2013: 24 nodes, dual 8 core 2.6 GHz Xeon (Ivy Bridge), 32 GB memory, dual 1 TB HDDs, QDR IB
  • 2012: 32 nodes, dual 8 core 2.0 GHz Xeon (Sandy Bridge), 32 GB memory, dual 500 GB HDDs, DDR IB
  • 2011: 2 nodes, dual 16 core 2.0 GHz AMD, 64 GB memory, 1 TB disk, SDR IB - retired 06/2017
  • 2011: 18 nodes, dual 4 core 2.53 GHz Xeon (Westmere), 24 GB memory, dual 500 GB disk, SDR IB - retired 06/2017
  • 2010: 24 nodes, dual 4 core 2.4 GHz Xeon (Nehalem), 24 GB memory, 500 GB disk, SDR IB - retired 10/2016
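
As a quick check on the totals above, here is a short Python tally (an illustrative sketch, not an official inventory tool) that adds up the generations still in production, using the node counts and per-node figures from the list:

    # Illustrative sketch: sum cores and memory over the in-production
    # generations listed above.
    farm = [
        # (year, nodes, cores_per_node, memory_GB_per_node)
        (2018, 88, 40, 96),
        (2016, 44, 36, 64),
        (2014, 104, 24, 32),
        (2013, 24, 16, 32),
        (2012, 32, 16, 32),
    ]

    total_nodes = sum(nodes for _, nodes, _, _ in farm)
    total_cores = sum(nodes * cores for _, nodes, cores, _ in farm)

    for year, nodes, cores, mem in farm:
        print(f"{year}: {nodes * cores:5d} cores, {mem / cores:.2f} GB/core")
    print(f"total: {total_nodes} nodes, {total_cores} cores")

This yields 292 nodes and 8496 cores for the generations that have not been retired.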

The batch farm runs mostly serial jobs that spend part of their time waiting on file I/O, so the batch system oversubscribes the cores: the number of job slots equals the number of hardware threads. Memory per core is typically 2 GB on the older nodes and as little as 1.33 GB on the 2014 nodes, so memory per serial job slot ranges from about 0.7 GB to 1.1 GB. Jobs requiring more than 2 GB of memory can still run by declaring their memory requirements; the batch system then leaves other slots on the same node unused, effectively allocating that memory to the larger job.
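
The arithmetic behind that slot accounting looks roughly like the following Python sketch (illustrative only, not the farm's actual scheduler code; the 1 GB per slot default is simply the typical figure quoted above):

    import math

    def slots_reserved(requested_mem_gb, mem_per_slot_gb=1.0):
        """A job requesting more memory than one slot provides effectively
        consumes enough additional slots on its node to cover the request."""
        return max(1, math.ceil(requested_mem_gb / mem_per_slot_gb))

    # A 4 GB serial job on a node with ~1 GB per slot occupies one slot for its
    # thread and leaves three more slots idle to cover its memory request.
    print(slots_reserved(4.0))   # -> 4
    print(slots_reserved(0.7))   # -> 1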

The trend is towards less memory per job thread (physical core or hyperthread), so multi-threaded jobs with lower memory requirements per thread are strongly preferred.

Farm Networking

All farm nodes are connected to both an Ethernet fabric and an InfiniBand fabric; the IB fabric is used for high speed access to the file servers.  The oldest nodes have SDR cards (single data rate IB, 10 Gb/s) recycled from the decommissioned 2006 LQCD 6n cluster, or more commonly DDR cards (double data rate, 20 Gb/s) recycled from the decommissioned 2007 LQCD 7n cluster.  The 2013 and 2014 nodes use QDR cards (quad data rate, 40 Gb/s) recycled from the 2009 clusters, and the 2016 and 2018 nodes use FDR (fourteen data rate, 56 Gb/s).  Up to 22 nodes are connected to each SDR or DDR leaf switch, and up to 34 nodes to each QDR leaf switch. These "leaf" switches have dual uplinks to a farm "core" DDR or QDR switch, which in turn has a pair of uplinks to reach the file systems.
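
The dual uplinks imply a significant oversubscription at each leaf switch. A rough calculation (an illustrative sketch; it assumes the uplinks run at the same data rate as the node links, which is not stated above):

    def oversubscription(nodes, node_rate_gbps, uplinks, uplink_rate_gbps):
        """Ratio of aggregate node-facing bandwidth to aggregate uplink bandwidth."""
        return (nodes * node_rate_gbps) / (uplinks * uplink_rate_gbps)

    # Up to 34 QDR nodes (40 Gb/s each) behind a leaf with dual QDR uplinks:
    print(f"QDR leaf: {oversubscription(34, 40, 2, 40):.0f}:1")   # 17:1

    # Up to 22 DDR nodes (20 Gb/s each) behind a leaf with dual DDR uplinks:
    print(f"DDR leaf: {oversubscription(22, 20, 2, 20):.0f}:1")   # 11:1

This oversubscription is acceptable for a farm whose jobs are dominated by file I/O wait rather than sustained streaming from every node at once.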

By using InfiniBand, the total file system bandwidth for the farm can now exceed 4 GBytes/s, a number that will grow in Q4 of 2015 as additional file servers are put into production.  For additional information on the file system, see the next chapter on Disk Servers.