Chroma GPU Example

GPU versions of Chroma

Pre-compiled versions of Chroma have been installed in the /dist/scidac directory.

The root of the directory is

/dist/scidac/chroma_gpu/centos62_cuda50/mvapich2-1.8_omp
Within this directory the installations are:
  • chroma_quda_sm_20 -- for builds supporting SM_20 (Fermi cards: GTX 480, GTX 580 and C2050/M2050/M2070), with 32 bit base precision for Chroma
  • chroma-double_quda_sm20 -- for builds supporting SM_20, but with 64 bit base precision for Chroma
  • chroma_quda_sm_30 -- for Kepler gaming cards (GTX 690s), with 32 bit base precision for Chroma
  • chroma-double_quda_sm_30 -- for Kepler gaming cards, but with 64 bit base precision for Chroma
  • chroma_quda_sm_35 -- for Kepler Tesla cards (K20s), with 32 bit base precision for Chroma
  • chroma-double_quda_sm_35 -- for Kepler Tesla cards, but with 64 bit base precision for Chroma

These builds use MVAPICH2-1.8 for MPI and have OpenMP threading enabled. If the environment variable OMP_NUM_THREADS is not set, the codes will by default attempt to use all available threads on the node. This may not be optimal, especially when running multiple MPI tasks per node, as it can result in serious oversubscription. For running hybrid OpenMP-MPI programs, please see the section on using MPI in a multi-core/heterogeneous environment.
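As a rough illustration of avoiding oversubscription, the thread count can be derived from the number of cores and the number of MPI tasks placed on a node. This is a minimal sketch: CORES_PER_NODE and RANKS_PER_NODE are assumed values for illustration, not settings provided by the installation.

```shell
# Assumed layout: adjust both numbers to the node type and the job.
CORES_PER_NODE=16   # assumption: CPU cores available on one node
RANKS_PER_NODE=4    # assumption: one MPI task per GPU

# Integer division; any leftover cores stay idle rather than being oversubscribed.
OMP_NUM_THREADS=$(( CORES_PER_NODE / RANKS_PER_NODE ))
export OMP_NUM_THREADS
echo "Using $OMP_NUM_THREADS OpenMP threads per MPI task"
```

The value still has to reach the MPI tasks through the launcher, for example as OMP_NUM_THREADS=value on the mpirun_rsh command line or with -genv for mpiexec.hydra.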

Note that SM_20 builds should also run on hardware supporting SM_30 and SM_35. This works because the CUDA driver recompiles the PTX embedded in the executables. For the QUDA library this process can take some time (as long as 10 minutes), during which the node will appear to have hung. This cost should only be paid on the first run, as the recompiled code is cached. Nonetheless, it is worth using the build that matches the hardware.
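One way to keep a job script on the matching build is to derive the build directory from the GPU type. The sketch below is illustrative only: build_for_gpu and GPU_TYPE are hypothetical names, the labels other than k20 are made up for this example, and the bin/chroma layout is assumed to be the same for every build as it is for the sm_35 one.

```shell
# Root of the pre-compiled builds, as given above.
CHROMA_ROOT=/dist/scidac/chroma_gpu/centos62_cuda50/mvapich2-1.8_omp

# Map a GPU type label to the 32 bit build directory.
# Hypothetical helper; labels other than k20 are illustrative only.
build_for_gpu() {
    case "$1" in
        gtx480|gtx580|c2050|m2050|m2070) echo "chroma_quda_sm_20" ;;
        gtx690)                          echo "chroma_quda_sm_30" ;;
        k20)                             echo "chroma_quda_sm_35" ;;
        *) echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

GPU_TYPE=k20
BUILD=$(build_for_gpu "$GPU_TYPE") || exit 1
EXECUTABLE=$CHROMA_ROOT/$BUILD/bin/chroma
echo "$EXECUTABLE"
```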

Example Run Script Using MVAPICH2-1.8 mpirun_rsh

Below is an example job script using mpirun_rsh. Remember to change the account name, and request as many CPU cores as needed to get the desired number of nodes: 16 gets you 2 older nodes and thus 8 GPUs, while 16 gets you 4 GPUs on one k20 node. If you use a number <= 8, you always get 1 node with 4 GPUs.

#PBS -l nodes=16:k20
#PBS -l walltime=0:30:00
#PBS -q testgpu
#PBS -A HPCSYS
#PBS -j oe

# cd to the directory where the  job was submitted
cd $PBS_O_WORKDIR

# Set up the environment
source /dist/scidac/chroma_gpu/centos62_cuda50/sm_35/env.sh

# Pick the executable 
EXECUTABLE=/dist/scidac/chroma_gpu/centos62_cuda50/mvapich2-1.8_omp/chroma_quda_sm_35/bin/chroma
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

# Launch a 4 process job via MPIRUN_RSH
mpirun_rsh -rsh -np 4 -hostfile $PBS_NODEFILE  MV2_ENABLE_AFFINITY=0 QUDA_RESOURCE_PATH=`pwd` $EXECUTABLE $ARGS

Notes

  • the #PBS -l nodes=16:k20 line specifies that 16 CPU cores are needed on a k20 GPU node (1 node)
  • the #PBS -q testgpu line selects the testgpu queue
  • the #PBS -A line selects a job account. You should use your project account here.
  • the #PBS -j oe line combines the stdout and stderr files into a single output file for the job
  • the cd $PBS_O_WORKDIR line changes to the directory from which the job was submitted
  • The env.sh script sets up the environment
  • The mpirun_rsh command runs the job as 4 processes, one per GPU
  • The order of command line arguments to mpirun_rsh is important
  • Environment variables are passed just before the executable as a space separated list of variables with values. The syntax is ENV_VAR=value
  • The MV2_ENABLE_AFFINITY=0 setting disables MVAPICH2's default processor binding. This is necessary for mixed OpenMP-MPI jobs

At this point you can supply your own binding mechanism. The QUDA library has a mechanism for binding to the right NUMA node for a given GPU. So by disabling MVAPICH2's binding we allow QUDA to work its magic.

  • The QUDA_RESOURCE_PATH points to the location where QUDA will look for its autotuning cache file. In this case it is the current directory.
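The relation between the core count in the nodes= request and the nodes and GPUs obtained can be sketched as simple ceiling arithmetic. The per-node core counts used here (16 for a k20 node, 8 for an older node) are inferred from the numbers quoted with the job script rather than stated explicitly, and nodes_for is a hypothetical helper.

```shell
# Nodes needed for a core request: ceiling of CORES / CORES_PER_NODE.
# Hypothetical helper; usage: nodes_for CORES CORES_PER_NODE
nodes_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

GPUS_PER_NODE=4   # each node provides 4 GPUs
CORES=16          # the value used in "#PBS -l nodes=16:k20"

# Assumed core counts: 16 per k20 node, 8 per older node.
K20_NODES=$(nodes_for "$CORES" 16)
OLD_NODES=$(nodes_for "$CORES" 8)

echo "k20:   $K20_NODES node(s), $(( K20_NODES * GPUS_PER_NODE )) GPUs"
echo "older: $OLD_NODES node(s), $(( OLD_NODES * GPUS_PER_NODE )) GPUs"
```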

Using mpiexec.hydra

Below is a script for running the same job using the mpiexec.hydra process manager. This script runs a hybrid OpenMP + MPI job.

#PBS -l nodes=16:k20
#PBS -l walltime=0:30:00
#PBS -q testgpu
#PBS -A HPCSYS
#PBS -j oe

# cd to the directory where the  job was submitted
cd $PBS_O_WORKDIR

# Set up the environment
source /dist/scidac/chroma_gpu/centos65_cuda50/sm_35/env.sh

# Pick the executable 
EXECUTABLE=/dist/scidac/chroma_gpu/centos65_cuda50/mvapich2-1.8_omp/chroma_quda_sm_35/bin/chroma
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

# Launch a 4 process job via mpiexec.hydra
mpiexec.hydra \
    -genv MV2_ENABLE_AFFINITY 0 \
    -genv QUDA_RESOURCE_PATH /home/bjoo/tuning_sm35 \
    -genv OMP_NUM_THREADS 4 \
    -f $PBS_NODEFILE \
    -launcher rsh \
    -n 4 \
    $EXECUTABLE $ARGS

Notes:

  • mpiexec.hydra is a complicated beast with lots of features. See http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
  • Global environment variables can be passed with the -genv option.
  • I used the executable from the OpenMP-enabled sm_35 build (under mvapich2-1.8_omp) for host side threading
  • I used the OMP_NUM_THREADS environment variable to specify the number of threads per process.
  • Launching via rsh is selected with the -launcher option
  • The host file is specified with the -f option.
  • The MV2_ENABLE_AFFINITY=0 setting disables MVAPICH2's default processor binding. This is necessary for mixed OpenMP-MPI jobs

At this point you can supply your own binding mechanism. The QUDA library has a mechanism for binding to the right NUMA node for a given GPU. So by disabling MVAPICH2's binding we allow QUDA to work its magic.
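Both scripts pass -geom 1 1 1 4 and launch 4 processes. Assuming, as those numbers suggest, that the product of the -geom entries has to equal the number of MPI processes, a job script could check this before launching. The variable names NP and GEOM are hypothetical and used only for this sketch.

```shell
NP=4             # number of MPI processes to launch
GEOM="1 1 1 4"   # the values passed after -geom

# Multiply the geometry entries together.
product=1
for n in $GEOM; do
    product=$(( product * n ))
done

# Stop early if the process grid does not match the process count.
if [ "$product" -ne "$NP" ]; then
    echo "geometry $GEOM needs $product processes, but NP=$NP" >&2
    exit 1
fi
echo "geometry $GEOM matches $NP processes"
```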

Submitting the job

You can submit the above jobs with

qsub job.sh

where job.sh is the name of the file. The PBS options in the job file can also be placed on the command line instead. For example, if all the lines starting with '#PBS' were removed from the job file, it could still be submitted with

qsub -l nodes=16:k20 -l walltime=00:30:00 -q testgpu -A HPCSYS -j oe job.sh

Selecting GPU generation specific versions of Chroma and/or QUDA

We make the following recommendations:

  • Explicitly request the node type (e.g. k20) with the -l nodes option, i.e. -l nodes=16:k20
  • Have separate autotuning directories set up for the various GPU types (K20, C2050, GTX580, GTX690), and use the appropriate one in the QUDA_RESOURCE_PATH environment variable.
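A minimal sketch of the second recommendation follows. The base directory, the lowercase type labels, and the helper names tuning_dir_for and use_tuning_dir are all hypothetical; any per-type directory layout would serve.

```shell
# Hypothetical layout: one autotuning cache directory per GPU type.
# TUNING_BASE and the directory names are assumptions, not site conventions.
TUNING_BASE=${TUNING_BASE:-$HOME/quda_tuning}

# Print the tuning directory for a GPU type label.
tuning_dir_for() {
    case "$1" in
        k20|c2050|gtx580|gtx690) echo "$TUNING_BASE/$1" ;;
        *) echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

# Create the directory if needed and point QUDA at it.
use_tuning_dir() {
    dir=$(tuning_dir_for "$1") || return 1
    mkdir -p "$dir" || return 1
    export QUDA_RESOURCE_PATH="$dir"
}

# In a job script for k20 nodes one would then call:
#   use_tuning_dir k20
```

The resulting QUDA_RESOURCE_PATH value is then passed to the launcher in the same way as in the example scripts.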