GPU Clusters

Tue, 08/17/2021 - 13:48 — amitoj

GPUs provide enormous floating-point capacity and memory bandwidth and can yield up to 10 times as much performance (science) per dollar, but they require specialized libraries or programming techniques.

Setting up the Environment (19g cluster)

To compile your code with CUDA, first check which CUDA versions are available on the login nodes (qcdi1401 or qcdi1402):

[@qcdi1402 ~]$ module use /dist/modulefiles/
[@qcdi1402 ~]$ module avail

----------------------------------------------------------- /dist/modulefiles/ -----------------------------------------------------------
anaconda2/4.4.0   anaconda3/5.2.0   cmake/3.21.1      curl/7.59         gcc/7.1.0         gcc/8.4.0         go/1.15.4
anaconda2/5.2.0   cmake/3.17.5      cuda/10.0         gcc/10.2.0        gcc/7.2.0         gcc/9.3.0         singularity/2.3.1
anaconda3/4.4.0   cmake/3.18.4      cuda/9.0          gcc/5.3.0         gcc/7.5.0         go/1.13.5         singularity/3.6.4

------------------------------------------------------------ /etc/modulefiles ------------------------------------------------------------
anaconda ansys18 gcc_4.6.3 gcc-4.9.2 gcc-6.2.0 gsl-1.15 mvapich2-1.8
anaconda2 ansys2020r1 gcc-4.6.3 gcc_5.2.0 gcc-6.3.0 hdf5-1.8.12 mvapich2-2.1
......

Load the desired CUDA version as follows:

[@qcdi1402 ~]$ module load cuda/10.0
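
As a rough sketch of what might follow the module load, the snippet below writes a tiny CUDA source file and compiles it with nvcc if the compiler is found on the PATH. The file name hello.cu, the kernel itself, and the assumption that loading cuda/10.0 puts nvcc on the PATH are illustrative and are not taken from the cluster documentation.

```shell
#!/bin/bash
# Write a minimal CUDA test program (hypothetical file name: hello.cu).
cat > hello.cu <<'EOF'
#include <cstdio>

__global__ void hello() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();          // one block of four threads
    cudaDeviceSynchronize();    // wait for the kernel and flush its output
    return 0;
}
EOF

# Compile only if nvcc is available (assumed to be provided by the cuda module).
if command -v nvcc >/dev/null 2>&1; then
    nvcc -o hello hello.cu && echo "built ./hello"
else
    echo "nvcc not found: run 'module load cuda/10.0' first"
fi
```

The compile step only needs the toolkit; running the resulting binary requires a node that actually has a GPU.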

Setting up the Environment (21g cluster)

Listed below are some useful tips on using the 21g cluster. This is a compilation of tips from current users of this cluster and your mileage may vary. 21g hardware details are listed here. Should you have questions on 21g please use the following support web page.

Posted 3/22/2022

Running the Grid Benchmark_ITT code on a single MI50, we get 530 GFlop/s at the "comparison point", slightly better than a P100. Running the same test on a single RTX-2080 card gives 634.5 GFlop/s.

Posted 11/23/2021

To launch, for example, 2 processes that each use 4 GPUs, the following environment variables were important:

CUDA_VISIBLE_DEVICES either has to be unset or has to be set to 0,1,2,3,4,5,6,7,8.
ROCR_VISIBLE_DEVICES dictates which GPUs are used (see the 11/12/2021 post below for more details).
HIP_VISIBLE_DEVICES does not seem to make any difference.
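
One possible way to derive the per-process device lists is sketched below. The helper name gpus_for_rank and the placeholder binary ./mybinary are hypothetical, and the sketch assumes that GPUs are numbered from 0 and that each process takes a contiguous block of them.

```shell
#!/bin/bash
# Hypothetical helper: print the comma-separated GPU indices for one process.
#   $1 = process index (0, 1, ...), $2 = GPUs per process
gpus_for_rank() {
    first=$(( $1 * $2 ))
    last=$(( first + $2 - 1 ))
    list=$first
    i=$(( first + 1 ))
    while [ "$i" -le "$last" ]; do
        list="$list,$i"
        i=$(( i + 1 ))
    done
    echo "$list"
}

# Per the tip above, CUDA_VISIBLE_DEVICES should not restrict the devices.
unset CUDA_VISIBLE_DEVICES

# Show the assignment for 2 processes with 4 GPUs each.
for rank in 0 1; do
    echo "process $rank: ROCR_VISIBLE_DEVICES=$(gpus_for_rank "$rank" 4)"
done

# To launch for real (./mybinary is a placeholder):
#   ROCR_VISIBLE_DEVICES=$(gpus_for_rank 0 4) ./mybinary &
#   ROCR_VISIBLE_DEVICES=$(gpus_for_rank 1 4) ./mybinary &
#   wait
```

With these settings the first process would see GPUs 0,1,2,3 and the second 4,5,6,7.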

Posted 11/12/2021

Question: I need to launch a bunch of single GPU jobs on 21g. Is there any way to run multiple instances of those single GPU jobs on a single node?

Answer: There is no way to reserve just a single GPU on 21g. Instead, run 8 separate programs (without srun) inside one job, with each program configured to "see" a different GPU. This can be accomplished by setting ROCR_VISIBLE_DEVICES for each program, as in the example below:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p 21g

export OMP_NUM_THREADS=16
ROCR_VISIBLE_DEVICES=0 ./mybinary &
ROCR_VISIBLE_DEVICES=1 ./mybinary &
ROCR_VISIBLE_DEVICES=2 ./mybinary &
ROCR_VISIBLE_DEVICES=3 ./mybinary &
ROCR_VISIBLE_DEVICES=4 ./mybinary &
ROCR_VISIBLE_DEVICES=5 ./mybinary &
ROCR_VISIBLE_DEVICES=6 ./mybinary &
ROCR_VISIBLE_DEVICES=7 ./mybinary &
wait
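
The eight launch lines above can also be written as a loop. The following is a minimal sketch; the helper name launch_per_gpu is hypothetical, and ./mybinary is the same placeholder as in the script above.

```shell
#!/bin/bash
# Hypothetical helper: run one copy of a command per GPU, each copy pinned to
# a different device through ROCR_VISIBLE_DEVICES, then wait for all of them.
#   $1 = number of GPUs, remaining arguments = command to run
launch_per_gpu() {
    ngpus=$1
    shift
    gpu=0
    while [ "$gpu" -lt "$ngpus" ]; do
        ROCR_VISIBLE_DEVICES=$gpu "$@" &
        gpu=$(( gpu + 1 ))
    done
    wait
}

# Inside the batch script this would replace the eight explicit lines:
#   launch_per_gpu 8 ./mybinary
```

The copies run concurrently in the background, so their output may interleave; redirecting each copy to its own log file is one way to keep results separate.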

Posted 8/25/2021

Here is one simple way to compile a kernel for MI100 on 21g: pass --amdgpu-target=gfx906,gfx908 to hipcc. This flag plays a role similar to the sm_XX architecture setting in CUDA; gfx908 is the target for MI100 and gfx906 is the target for MI50.

[@qcdi2001]$> module load rocm
[@qcdi2001]$> hipcc --amdgpu-target=gfx906,gfx908 -o helloWorld helloWorld.cpp

Compile flags for hipcc can be obtained by executing hipconfig --cxx
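
The contents of helloWorld.cpp are not given above. As an illustration only, the sketch below writes one possible minimal HIP program under that name and compiles it with the flags shown, provided hipcc is on the PATH. The kernel and the assumption that module load rocm makes hipcc available are not taken from the cluster documentation.

```shell
#!/bin/bash
# Write a minimal HIP test program (illustrative contents for helloWorld.cpp).
cat > helloWorld.cpp <<'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void hello() {
    printf("Hello from GPU thread %d\n", (int)threadIdx.x);
}

int main() {
    // One block of four threads, no dynamic shared memory, default stream.
    hipLaunchKernelGGL(hello, dim3(1), dim3(4), 0, 0);
    hipDeviceSynchronize();     // wait for the kernel and flush its output
    return 0;
}
EOF

# Compile only if hipcc is available (assumed to be provided by the rocm module).
if command -v hipcc >/dev/null 2>&1; then
    hipcc --amdgpu-target=gfx906,gfx908 -o helloWorld helloWorld.cpp && echo "built ./helloWorld"
else
    echo "hipcc not found: run 'module load rocm' first"
fi
```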
