Pre-compiled versions of Chroma have been installed in the /dist/scidac directory.
The root of the directory is
/dist/scidac/chroma_gpu/centos62_cuda50/mvapich2-1.8_omp
Within this directory you will find the individual installations (builds for the different GPU architectures, e.g. chroma_quda_sm_35).
These builds use MVAPICH2-1.8 for MPI and have OpenMP threading enabled. By default, if the environment variable OMP_NUM_THREADS is not set, the code will attempt to use all available threads on the node. This may not be optimal, especially when running multiple MPI tasks per node, as it can result in serious oversubscription. For running hybrid OpenMP + MPI programs, please see the section on using MPI in a multi-core/heterogeneous environment.
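If you are running multiple MPI tasks per node, it is safest to set OMP_NUM_THREADS explicitly in the job script. A minimal sketch, assuming (purely for illustration) 16 cores per node and 4 MPI tasks per node:

# Give each MPI rank 4 OpenMP threads so that
# 4 ranks x 4 threads = 16 cores per node (illustrative numbers)
export OMP_NUM_THREADS=4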
It should be noted that SM_20 builds should also run on hardware supporting SM_30 and SM_35. However, the way this works is that the CUDA driver recompiles the PTX embedded in the executables, and for the QUDA library this process can take some time (as long as 10 minutes), during which the node will appear to have hung. This hit should only occur on the first run, since the recompiled code is cached. Nonetheless, it is worth using the build that matches your hardware.
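The recompiled code is stored in the CUDA driver's JIT cache, which can be controlled through standard CUDA environment variables. A minimal sketch, assuming the default cache location is acceptable but the default size is too small for the QUDA kernels (the values shown are illustrative):

# Keep the driver's JIT cache in the home directory and enlarge it
# so the recompiled kernels survive between jobs (illustrative values)
export CUDA_CACHE_PATH=$HOME/.nv/ComputeCache
export CUDA_CACHE_MAXSIZE=1073741824   # 1 GB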
Below is an example job script using mpirun_rsh. Remember to change the account name and request as many CPU cores as needed to get the desired number of nodes (16 cores gets you either 2 older nodes and thus 8 GPUs, or one K20 node with 4 GPUs; if you request 8 or fewer cores you always get 1 node with 4 GPUs).
#PBS -l nodes=16:k20
#PBS -l walltime=0:30:00
#PBS -q testgpu
#PBS -A HPCSYS
#PBS -j oe

# cd to the directory where the job was submitted
cd $PBS_O_WORKDIR

# Set up the environment
source /dist/scidac/chroma_gpu/centos62_cuda50/sm_35/env.sh

# Pick the executable
EXECUTABLE=/dist/scidac/chroma_gpu/centos62_cuda50/mvapich2-1.8_omp/chroma_quda_sm_35/bin/chroma
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

# Launch a 4 process job via mpirun_rsh
mpirun_rsh -rsh -np 4 -hostfile $PBS_NODEFILE \
    MV2_ENABLE_AFFINITY=0 QUDA_RESOURCE_PATH=`pwd` \
    $EXECUTABLE $ARGS
Notes:
Setting MV2_ENABLE_AFFINITY=0 turns off MVAPICH2's own CPU binding. At this point you can supply your own binding mechanism. The QUDA library has a mechanism for binding to the right NUMA node for a given GPU, so by disabling MVAPICH2's binding we allow QUDA to work its magic.
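If you do want to supply your own binding instead, one possibility is a small wrapper around the executable using numactl. This is only a sketch: the wrapper name, the use of MVAPICH2's MV2_COMM_WORLD_LOCAL_RANK variable, and the rank-to-NUMA-node mapping are assumptions that would need to be checked against the actual node topology.

#!/bin/bash
# bind_wrapper.sh (hypothetical): bind each local MPI rank to a NUMA node.
# Assumes MVAPICH2 exports MV2_COMM_WORLD_LOCAL_RANK and the node has 2 NUMA domains.
LOCAL_RANK=${MV2_COMM_WORLD_LOCAL_RANK:-0}
NUMA_NODE=$(( LOCAL_RANK % 2 ))
exec numactl --cpunodebind=$NUMA_NODE --membind=$NUMA_NODE "$@"

The wrapper would then be launched in place of the bare executable, e.g. mpirun_rsh ... ./bind_wrapper.sh $EXECUTABLE $ARGS.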
Below is a script for running the same job using the mpiexec.hydra process manager. This version runs as a hybrid OpenMP + MPI job.
#PBS -l nodes=16:k20
#PBS -l walltime=0:30:00
#PBS -q testgpu
#PBS -A HPCSYS
#PBS -j oe

# cd to the directory where the job was submitted
cd $PBS_O_WORKDIR

# Set up the environment
source /dist/scidac/chroma_gpu/centos65_cuda50/sm_35/env.sh

# Pick the executable
EXECUTABLE=/dist/scidac/chroma_gpu/centos65_cuda50/mvapich2-1.8_omp/chroma_quda_sm_35/bin/chroma
ARGS="-i 4-bicgstab-wf.ini.xml -geom 1 1 1 4"

# Launch a 4 process job via mpiexec.hydra
mpiexec.hydra \
    -genv MV2_ENABLE_AFFINITY 0 \
    -genv QUDA_RESOURCE_PATH /home/bjoo/tuning_sm35 \
    -genv OMP_NUM_THREADS 4 \
    -f $PBS_NODEFILE \
    -launcher rsh \
    $EXECUTABLE $ARGS
Notes:
As in the mpirun_rsh example, MV2_ENABLE_AFFINITY is set to 0 so that QUDA can perform its own NUMA binding (or you can supply your own binding mechanism). In addition, OMP_NUM_THREADS is set to 4, so each of the 4 MPI ranks runs 4 OpenMP threads.
You can submit the above jobs with
qsub job.sh
where job.sh is the name of the file. The PBS options in the job file can also be placed on the command line instead.
For example, if all the lines starting with '#PBS' were removed from the job file,
it could still be submitted with
qsub -l nodes=16:k20 -l walltime=00:30:00 -q testgpu -A HPCSYS -j oe job.sh
We make the following recommendation: keep a separate QUDA tuning directory for each build (for example, one per GPU architecture) and use the appropriate one in the QUDA_RESOURCE_PATH environment variable.
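A minimal sketch of what this could look like (the directory names are purely illustrative):

# One tuning cache directory per build (illustrative layout)
mkdir -p $HOME/quda_tune/sm_20 $HOME/quda_tune/sm_35

# In an SM_35 job script, point QUDA at the matching directory
export QUDA_RESOURCE_PATH=$HOME/quda_tune/sm_35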