Chroma GPU Example

GPU versions of Chroma

Pre-compiled versions of Chroma have been installed in the /dist/scidac directory.

The root of the directory is


/dist/scidac/chroma_gpu/centos62_cuda50/mvapich2-1.8_omp

Within this directory the installations are:

  • chroma_quda_sm_20 -- for builds supporting SM_20 (Fermi cards: GTX 480/580 and C2050/M2050/M2070), with 32-bit base precision for Chroma
  • chroma-double_quda_sm20 -- for builds supporting SM_20, but with 64-bit base precision for Chroma
  • chroma_quda_sm_30 -- for Kepler gaming cards (GTX 690s), with 32-bit base precision for Chroma
  • chroma-double_quda_sm_30 -- for Kepler gaming cards, but with 64-bit base precision for Chroma
  • chroma_quda_sm_35 -- for Kepler Tesla cards (K20s), with 32-bit base precision for Chroma
  • chroma-double_quda_sm_35 -- for Kepler Tesla cards, but with 64-bit base precision for Chroma

These builds use MVAPICH2-1.8 for MPI and have OpenMP threading enabled. By default, if the environment variable OMP_NUM_THREADS is not set, the codes will attempt to use all available threads on the node. This may not be optimal, especially when running multiple MPI tasks per node, as it can result in serious oversubscription. For running hybrid OpenMP-MPI programs, please see the section on using MPI in a multi-core/heterogeneous environment.
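If you do set OMP_NUM_THREADS yourself, a minimal sketch is to divide the cores on a node evenly among the MPI tasks placed on it. The core and task counts below are illustrative assumptions, not a statement about any particular node type:

# Hypothetical example: 16 CPU cores and 4 GPUs per node, one MPI task per GPU
CORES_PER_NODE=16
TASKS_PER_NODE=4
OMP_NUM_THREADS=$(( CORES_PER_NODE / TASKS_PER_NODE ))   # 4 threads per task
# Pass the value via the launcher (e.g. -genv OMP_NUM_THREADS for mpiexec.hydra,
# or ENV_VAR=value on the mpirun_rsh command line) so every rank sees it.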

It should be noted that SM_20 builds should also run on hardware supporting SM_30 and SM_35. However, this works by having the CUDA driver recompile the PTX embedded in the executables, and for the QUDA library this process can take some time (as long as 10 minutes), during which the node will appear to have hung. This hit should only occur on the first run, as the recompiled code is cached. Nonetheless, it is worth trying to use the right build.
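If you are unsure which card type a node has, and hence which build to pick, a quick check on the node itself is enough; the model-to-build mapping in the comment simply restates the list above:

# List the GPUs on this node; the model name tells you which build to use, e.g.
#   Tesla C2050 / GTX 480/580  -> the sm_20 builds
#   GeForce GTX 690            -> the sm_30 builds
#   Tesla K20                  -> the sm_35 builds
nvidia-smi -L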

Example Run Script Using MVAPICH2-1.8 mpirun_rsh

Below is an example job script using mpirun_rsh; remember to change the account name, and request as many CPU cores as needed to get the desired number of nodes (16 cores gets you 2 older nodes and thus 8 GPUs, or 1 k20 node with 4 GPUs; a number <= 8 always gets you 1 node with 4 GPUs).

#PBS -l nodes=16:k20
#PBS -l walltime=0:30:00
#PBS -q testgpu
#PBS -A HPCSYS
#PBS -j oe

# cd to the directory where the  job was submitted
cd $PBS_O_WORKDIR

# Set up the environment
source /dist/scidac/chroma_gpu/centos62_cuda50/sm_35/env.sh

# Pick the executable 
EXECUTABLE=/dist/scidac/chroma_gpu/centos62_cuda50/mvapich2-1.8_omp/chroma_quda_sm_35/bin/chroma
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

# Launch a 4 process job via MPIRUN_RSH
mpirun_rsh -rsh -np 4 -hostfile $PBS_NODEFILE  MV2_ENABLE_AFFINITY=0 QUDA_RESOURCE_PATH=`pwd` $EXECUTABLE $ARGS

Notes

  • the #PBS -l nodes=16:k20 specifies that 16 CPU cores are needed on a k20 GPU node (1 node)
  • the #PBS -q testgpu selects the testgpu queue
  • the #PBS -A selects a job account. You should use your project account here.
  • the #PBS -j oe combines the stdout and stderr files in to just one output file from the job
  • the cd $PBS_O_WORKDIR goes to the directory in which the job was submitted
  • The env.sh script sets up the environment
  • The mpirun_rsh command runs the job as 4 processes, one per GPU
  • The order of command line arguments to mpirun_rsh is important
  • Environment variables are passed just before the executable as a space-separated list of env vars with values. The syntax is ENV_VAR=value
  • The MV2_ENABLE_AFFINITY=0 environment variable disables MVAPICH2's default processor binding. This is necessary for mixed OpenMP-MPI jobs.

At this point you can supply your own binding mechanism (a sketch of one is given after this list). The QUDA library has a mechanism for binding to the right NUMA node for a given GPU, so by disabling MVAPICH2's binding we allow QUDA to work its magic.

  • The QUDA_RESOURCE_PATH points to the location where QUDA will look for its autotuning cache file. In this case it is the current directory.
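As an illustration of supplying your own binding, here is a minimal sketch. The wrapper name bind.sh and the GPU-to-NUMA mapping inside it are assumptions made for illustration, not a description of the actual node layout; it relies on the MV2_COMM_WORLD_LOCAL_RANK variable that MVAPICH2 sets for each process, and it also shows the ENV_VAR=value placement just before the (wrapped) executable:

#!/bin/bash
# bind.sh -- hypothetical per-rank NUMA binding wrapper
lrank=${MV2_COMM_WORLD_LOCAL_RANK:-0}
# Assumed layout: GPUs 0,1 attached to NUMA node 0; GPUs 2,3 to NUMA node 1
node=$(( lrank / 2 ))
exec numactl --cpunodebind=$node --membind=$node "$@"

# Environment variables still go just before the wrapped executable:
mpirun_rsh -rsh -np 4 -hostfile $PBS_NODEFILE \
    MV2_ENABLE_AFFINITY=0 QUDA_RESOURCE_PATH=`pwd` ./bind.sh $EXECUTABLE $ARGS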


Using mpiexec.hydra

Below is a script for running the same job using the mpiexec.hydra process manager.
This time it is run as a hybrid OpenMP + MPI job.

#PBS -l nodes=16:k20
#PBS -l walltime=0:30:00
#PBS -q testgpu
#PBS -A HPCSYS
#PBS -j oe

# cd to the directory where the  job was submitted
cd $PBS_O_WORKDIR

# Set up the environment
source /dist/scidac/chroma_gpu/centos65_cuda50/sm_35/env.sh

# Pick the executable 
EXECUTABLE=/dist/scidac/chroma_gpu/centos65_cuda50/mvapich2-1.8_omp/chroma_quda_sm_35/bin/chroma
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

# Launch a 4 process job via mpiexec.hydra
mpiexec.hydra \
    -n 4 \
    -genv MV2_ENABLE_AFFINITY 0 \
    -genv QUDA_RESOURCE_PATH /home/bjoo/tuning_sm35 \
    -genv OMP_NUM_THREADS 4 \
    -f $PBS_NODEFILE \
    -launcher rsh \
    $EXECUTABLE $ARGS

Notes:

  • mpiexec.hydra is a complicated beast with lots of features. See http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager
  • global environment variables can be passed with the -genv option
  • the -n option sets the number of MPI processes (4 here, one per GPU)
  • I used the executable from the OpenMP-enabled build (mvapich2-1.8_omp/chroma_quda_sm_35) for host side threading
  • I used the OMP_NUM_THREADS environment variable to specify the number of threads per process
  • launching via rsh is selected with the -launcher option
  • The host-file is specified with the -f option
  • The MV2_ENABLE_AFFINITY=0 environment variable disables MVAPICH2's default processor binding. This is necessary for mixed OpenMP-MPI jobs.

As with mpirun_rsh, disabling MVAPICH2's binding lets the QUDA library bind each process to the right NUMA node for its GPU, or you can supply your own binding mechanism (e.g. the wrapper sketched earlier).
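As a further sketch of controlling the process layout with hydra, -n sets the total number of ranks and -ppn the ranks per node; the thread count should be chosen so that ranks per node times OMP_NUM_THREADS does not exceed the cores on a node. The hostfile name and per-node core/GPU counts below are assumptions for illustration:

# Hypothetical: 2 nodes x 4 ranks = 8 MPI processes, 4 OpenMP threads each,
# assuming a hostfile hosts.txt with one entry per node and 16 cores / 4 GPUs per node.
mpiexec.hydra \
    -f hosts.txt \
    -launcher rsh \
    -n 8 -ppn 4 \
    -genv MV2_ENABLE_AFFINITY 0 \
    -genv OMP_NUM_THREADS 4 \
    $EXECUTABLE $ARGS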


Submitting the job

You can submit the above jobs with

qsub job.sh

where job.sh is the name of the file. The PBS options in the job file can also be placed on the command line instead.
For example, if all the lines starting with '#PBS' were removed from the job file,
it could still be submitted with

qsub -l nodes=16:k20 -l walltime=00:30:00 -q testgpu -A HPCSYS -j oe job.sh

Selecting GPU generation specific versions of Chroma and/or QUDA

We make the following recommendations:

  • Explicitly request the node type (e.g. k20) with the -l nodes option, i.e. -l nodes=16:k20
  • Have separate autotuning directories set up for the various GPU types (K20, C2050, GTX580, GTX690), and use the appropriate one in the QUDA_RESOURCE_PATH environment variable (a sketch of such a layout is given below).
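As a minimal sketch of that recommendation (the directory names under $HOME are assumptions, not an installed convention):

# One autotuning cache directory per GPU architecture (hypothetical layout)
mkdir -p $HOME/quda_tune/k20 $HOME/quda_tune/c2050 $HOME/quda_tune/gtx580 $HOME/quda_tune/gtx690

# In a job submitted with -l nodes=16:k20, point QUDA at the matching cache,
# e.g. on the mpirun_rsh command line:
mpirun_rsh -rsh -np 4 -hostfile $PBS_NODEFILE \
    MV2_ENABLE_AFFINITY=0 QUDA_RESOURCE_PATH=$HOME/quda_tune/k20 $EXECUTABLE $ARGS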