Thread Affinity for OpenMP programs and other Multi-core considerations

The Intel compiler provides several ways of setting thread and core affinity through the KMP_AFFINITY environment variable. There are several modes of binding threads, which are described in this article. Some common options are:

  • KMP_AFFINITY=compact: thread IDs run fastest over the SMT threads within a core, i.e. the OpenMP thread IDs are mapped as:
  tid=0: core=0, smt_thread=0
  tid=1: core=0, smt_thread=1
  tid=2: core=0, smt_thread=2
  tid=3: core=0, smt_thread=3
  tid=4: core=1, smt_thread=0
  tid=5: core=1, smt_thread=1
   ...
  • KMP_AFFINITY=scatter: thread IDs are assigned to successive cores first, then wrap around round robin to fill the remaining SMT thread slots.
  tid=0: core=0, smt_thread=0
  tid=1: core=1, smt_thread=0
  ... 
  tid=59: core=59, smt_thread=0
  tid=60: core=0, smt_thread=1
  tid=61: core=1, smt_thread=1
  ... 
  tid=120: core=0, smt_thread=2
  tid=121: core=1, smt_thread=2
  ...
  • KMP_AFFINITY=balanced: this mode sits between compact and scatter. Threads are distributed across cores first (like scatter), but the IDs still run fastest within a core (like compact). A description can be found here
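
The compact and scatter listings above follow simple integer arithmetic, which can be sketched as shell functions. This is only an illustration of the mapping, not something the OpenMP runtime provides: the function names are hypothetical, and the topology (60 cores with 4 SMT threads each) is assumed from the examples on this page. The balanced helper is a further simplification that only covers thread counts that are a multiple of the core count.

```shell
# Assumed topology, taken from the examples on this page.
NCORES=60   # cores per card
NSMT=4      # SMT threads per core

# compact: consecutive thread IDs fill the SMT slots of one core first.
compact_map() {
    echo "core=$(( $1 / $NSMT )), smt_thread=$(( $1 % $NSMT ))"
}

# scatter: consecutive thread IDs go to successive cores, then wrap around.
scatter_map() {
    echo "core=$(( $1 % $NCORES )), smt_thread=$(( $1 / $NCORES ))"
}

# balanced (simplified): usage is balanced_map <tid> <nthreads>.
# Only handles thread counts that are a multiple of NCORES.
balanced_map() {
    if [ $(( $2 % $NCORES )) -ne 0 ]; then
        echo "balanced_map: thread count must be a multiple of $NCORES" >&2
        return 1
    fi
    # Threads per core; consecutive IDs share a core.
    echo "core=$(( $1 / ($2 / $NCORES) )), smt_thread=$(( $1 % ($2 / $NCORES) ))"
}

for t in 0 1 4 5;     do echo "compact  tid=$t: $(compact_map $t)"; done
for t in 0 59 60 121; do echo "scatter  tid=$t: $(scatter_map $t)"; done
for t in 0 1 2 3;     do echo "balanced tid=$t: $(balanced_map $t 120)"; done
```

With 120 threads the balanced sketch places two consecutive thread IDs on each core, whereas scatter would place tid=0 and tid=60 on core 0.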

  • KMP_AFFINITY=explicit,proclist=[ <O/S thread list> ],< qualifier >: In this mode one provides an explicit mapping in terms of an O/S thread list.

The list contains the O/S thread IDs to use, and the position in the list determines the OpenMP thread ID. A granularity qualifier chooses whether the OpenMP thread is bound to exactly the listed O/S thread, or whether the runtime system may migrate it to other threads within the same core. granularity=core allows migration within the core of the listed O/S thread. granularity=thread does not allow such migration and binds solely to the listed O/S thread. Below is an example using granularity=thread:

  KMP_AFFINITY="explicit,proclist=[0,3,5,9],granularity=thread"
  tid=0: on H/W thread 0 -- (on core 59)
  tid=1: on H/W thread 3 -- (on core 0)
  tid=2: on H/W thread 5 -- (on core 1)
  tid=3: on H/W thread 9 -- (on core 2)

When one sets granularity=core, the OpenMP thread can be scheduled onto any H/W thread of the core that contains the listed H/W thread ID. An example is below:

   # Default granularity is per core
   KMP_AFFINITY=explicit,proclist=[0,3,5,9]  

   tid=0: on one of h/w threads (0,237,238,239)  -- this is core 59 which contains H/W thread 0 (first in our list)
   tid=1: on one of h/w threads (1,2,3,4)  -- this is core 0 which contains H/W thread 3 (second in our list)
   tid=2: on one of h/w threads (5,6,7,8)  -- this is core 1 which contains H/W thread 5 (third in our list)
   tid=3: on one of h/w threads (9,10,11,12) -- this is core 2 which contains H/W thread 9 (last in our list)
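
The relation between a listed H/W thread and the set of H/W threads it may run on under granularity=core can be sketched as follows. The helper names are hypothetical. The numbering (O/S proc 0 and 237-239 on core 59, every other core c holding O/S procs 4c+1 to 4c+4) is an assumption extrapolated from the examples on this page, so check it against the verbose output of your own system.

```shell
# core_of <proc>: logical core that hosts a given O/S proc (assumed numbering).
core_of() {
    if [ "$1" -eq 0 ] || [ "$1" -ge 237 ]; then
        echo 59
    else
        echo $(( ($1 - 1) / 4 ))
    fi
}

# core_procs <core>: the four O/S procs that share the given core.
core_procs() {
    if [ "$1" -eq 59 ]; then
        echo "0,237,238,239"
    else
        first=$(( $1 * 4 + 1 ))
        echo "$first,$(( first + 1 )),$(( first + 2 )),$(( first + 3 ))"
    fi
}

# Walk the proclist from the example; list position gives the OpenMP thread ID.
n=0
for p in 0 3 5 9; do
    c=$(core_of $p)
    echo "tid=$n: H/W thread $p is on core $c, may run on {$(core_procs $c)}"
    n=$(( n + 1 ))
done
```
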
  • KMP_AFFINITY=verbose,... : The verbose modifier prints both the mapping of hardware thread IDs to cores and SMT threads and the binding of the threads used by the application. E.g.
export OMP_NUM_THREADS=4
export KMP_AFFINITY="verbose,explicit,proclist=[0,3,5,9],granularity=core"

will print out a lot of information whenever one subsequently runs an OpenMP program. Some sample output (with our comments added) is below:

OMP: Info #156: KMP_AFFINITY: 240 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 1 packages x 60 cores/pkg x 4 threads/core (60 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 0     # COMMENT:  H/W thread 1 maps to core=0, thread=0
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 0 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 0 thread 2 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 0 thread 3 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 1 thread 0 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 1 thread 1 
...
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 59 thread 0       # COMMENT: H/W thread 0, is mapped to core 59, thread=0
OMP: Info #171: KMP_AFFINITY: OS proc 237 maps to package 0 core 59 thread 1 
OMP: Info #171: KMP_AFFINITY: OS proc 238 maps to package 0 core 59 thread 2 
OMP: Info #171: KMP_AFFINITY: OS proc 239 maps to package 0 core 59 thread 3 
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine  # COMMENT: Threads may migrate on core 
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,237,238,239}  # COMMENT: OMP thread 0 is bound to core 59
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1,2,3,4}              # COMMENT: OMP thread 1 is bound to core 0
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {5,6,7,8}              # COMMENT: OMP thread 2 is bound to core 1
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {9,10,11,12}        # COMMENT: OMP thread 3 is bound to core 2
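
Since the verbose output is long, it can help to filter out just the final thread bindings. The filter below is a minimal sketch that relies only on the "Internal thread ... bound to OS proc set {...}" wording shown above. The function name is hypothetical, and other runtime versions may word these messages differently.

```shell
# Binding lines copied from the sample output above, without the comments.
sample_log='OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,237,238,239}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1,2,3,4}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {5,6,7,8}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {9,10,11,12}'

# bindings: read runtime messages on stdin, print one "tid=<n>: <proc set>" per thread.
bindings() {
    sed -n 's/.*Internal thread \([0-9][0-9]*\) bound to OS proc set {\([0-9,]*\)}.*/tid=\1: \2/p'
}

printf '%s\n' "$sample_log" | bindings
```

On a real run one might pipe the program through the filter, e.g. `./my_openmp_program 2>&1 | bindings`, where the program name is a placeholder and `2>&1` merges both output streams so the filter works whichever stream the runtime writes to.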

One final consideration is that on these 60-core systems, core 59 is reserved for system functions. Explicitly scheduling threads on core 59 can slow down program execution. In principle, the scatter affinity will schedule threads onto core 59 once enough threads are requested (tid=59 in the scatter listing above). Your mileage may vary.
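
One way to keep a scatter-like placement while staying off core 59 is to generate an explicit proclist that only covers cores 0 to 58. The helper below is a sketch with a hypothetical name. It assumes the same O/S proc numbering as the examples on this page (core c holds O/S procs 4c+1 to 4c+4 for c = 0..58), which should be confirmed with the verbose modifier first.

```shell
# proclist_skip59 <nthreads>: comma-separated O/S procs, spread round robin
# over cores 0..58 only (core 59 is left to the system).
proclist_skip59() {
    if [ "$1" -lt 1 ] || [ "$1" -gt 236 ]; then
        echo "proclist_skip59: need between 1 and 236 threads" >&2
        return 1
    fi
    i=0
    list=""
    while [ "$i" -lt "$1" ]; do
        core=$(( i % 59 ))               # successive cores, wrapping at 59
        smt=$(( i / 59 ))                # next SMT slot after each full pass
        proc=$(( core * 4 + smt + 1 ))   # assumed O/S proc numbering
        list="${list:+$list,}$proc"
        i=$(( i + 1 ))
    done
    echo "$list"
}

export OMP_NUM_THREADS=4
export KMP_AFFINITY="explicit,proclist=[$(proclist_skip59 $OMP_NUM_THREADS)],granularity=thread"
echo "$KMP_AFFINITY"
```

Adding the verbose modifier to the resulting setting is a simple way to check that no thread ends up bound to O/S procs 0, 237, 238 or 239.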