
Frequently Asked Questions

  1. How do I renew my expired JLab computer account?
  2. How do I change my JLab Common User Account password?
  3. What partition/queue should I use?
  4. What is the default MPI?
  5. How can I request nodes booted in flat-quad mode?
  6. How can I explicitly request 18p or 16p nodes?
  7. What is the USQCD allocation jeopardy policy?
  8. Why do I not see CUDA on the cluster login nodes?
  9. I need to launch a bunch of single GPU jobs on 21g. Is there any way to run multiple instances of those single GPU jobs on a single node?
  10. How do I transfer source tarballs to the cluster login nodes as some sites seem to be blocked?

1. How do I renew my expired JLab computer account?

If you had a computer account at JLab in the past, please verify this by searching for your information in the JLab phone book.

  • If an entry for you exists in the phone book, the quickest turnaround on this request is to call the JLab Computing Center Help Desk at (757) 269 7155 on weekdays between 8 am and 4 pm Eastern time. You may also send them an email (slower response than a phone call) at helpdesk@jlab.org.
    • Once your account password has been reset by Helpdesk staff, please send the LQCD cluster admin team a note using this support form so that they can reset your local cluster accounts as well.
  • If an entry for you does not exist in the phone book then please follow the instructions as indicated at the following web page.

2. How do I change my JLab Common User Account password?

You can change your unexpired password by logging into any of the central Linux systems and typing /apps/bin/jpasswd. Or, you may use the web interface by logging in to the JLab Computer Center web site (https://cc.jlab.org) and clicking on the "Password Change" link in the "Web Utilities" section on the right side of the page. Before you change your password, we recommend that you review the password rules at https://cc.jlab.org/passwordrules. These rules are based on federal requirements for JLab computer systems and must be followed by all JLab Computer Account Holders.
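
For example, from one of the central Linux systems (jlabl5 is used here purely for illustration):

[@jlabl5 ~]$ /apps/bin/jpasswd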

If you get an ERROR when attempting to change your password, please contact the IT Division Helpdesk (email: helpdesk@jlab.org or phone: 757-269-7155).


3. What partition/queue should I use?

There are several partitions, as listed below. For an up-to-date partition list, please see the following web page.

  • phi:        KNL cluster partition for 16p and 18p.
  • phi_test:   KNL cluster test partition for 16p and 18p.
  • gpu:        GPU cluster partition for 19g.
  • 21g_test:   GPU cluster test partition for 21g.

The default partition is phi. Use '-p partition-name' in your SLURM job submission command to select a partition other than the default.
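
For example, a minimal sketch of submitting a job to the gpu partition (my_job.sh is just a placeholder script name, not an actual script on the cluster):

$ sbatch -p gpu my_job.sh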


4. What is the default MPI?

Right now there is no default MPI configured in SLURM. The following command lists the MPI types that srun supports:

$ srun --mpi=list
srun: MPI types are...
...
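
Since no default is configured, you can pass one of the listed types explicitly when launching your job; for example, assuming pmi2 appears in the list above and ./mybinary is your MPI executable:

$ srun --mpi=pmi2 -n 64 ./mybinary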

5. How can I request nodes booted in flat-quad mode?

Use the '--constraint=flat,quad' or '-Cflat,quad' option to request nodes in flat-quad mode. If there are not enough nodes already booted in that mode, SLURM will reboot nodes into the requested mode. Similarly, if you need cache-quad mode, use '--constraint=cache,quad' or '-Ccache,quad'.
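
For example, a minimal sketch of requesting flat-quad nodes at submission time (my_job.sh is just a placeholder script name):

$ sbatch -p phi -Cflat,quad my_job.sh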


6. How can I explicitly request 18p or 16p nodes?

Use the '--constraint=cache,quad,18p' or '-Ccache,quad,18p' option to request 18p nodes in cache-quad mode. Similarly, use the '--constraint=flat,quad,18p' or '-Cflat,quad,18p' option to request 18p nodes in flat-quad mode. To request 16p nodes, replace 18p with 16p in the options above.
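
As a sketch, the same constraints can also be placed directly in a batch script rather than on the command line (partition and constraint values taken from the options above):

#!/bin/bash
#SBATCH -p phi
#SBATCH --constraint=cache,quad,18p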


7. What is the USQCD allocation jeopardy policy?

The latest USQCD jeopardy policy is on this web page.


8. Why do I not see CUDA on the cluster login nodes?

CUDA is provided through environment modules. To compile code that uses CUDA on the cluster login nodes (qcdi1401 or qcdi1402), first check the available CUDA versions as follows:

[@qcdi1402 ~]$ module use /dist/modulefiles/
[@qcdi1402 ~]$ module avail

----------------------------------------------------------- /dist/modulefiles/ -----------------------------------------------------------
anaconda2/4.4.0   anaconda3/5.2.0   cmake/3.21.1      curl/7.59         gcc/7.1.0         gcc/8.4.0         go/1.15.4
anaconda2/5.2.0   cmake/3.17.5      cuda/10.0         gcc/10.2.0        gcc/7.2.0         gcc/9.3.0         singularity/2.3.1
anaconda3/4.4.0   cmake/3.18.4      cuda/9.0          gcc/5.3.0         gcc/7.5.0         go/1.13.5         singularity/3.6.4

------------------------------------------------------------ /etc/modulefiles ------------------------------------------------------------
anaconda           ansys18            gcc_4.6.3          gcc-4.9.2          gcc-6.2.0          gsl-1.15           mvapich2-1.8
anaconda2          ansys2020r1        gcc-4.6.3          gcc_5.2.0          gcc-6.3.0          hdf5-1.8.12        mvapich2-2.1
......

Load the desired CUDA version as follows:

[@qcdi1402 ~]$ module load cuda/10.0
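
After loading the module, a quick sanity check (this assumes the cuda module adds the CUDA compiler to your PATH) is:

[@qcdi1402 ~]$ nvcc --version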


9. I need to launch a bunch of single GPU jobs on 21g. Is there any way to run multiple instances of those single GPU jobs on a single node?

There is no way to reserve just a single GPU on 21g. Instead, run 8 separate programs within one job (without srun), with each instance configured to "see" a different GPU. That can be accomplished by setting ROCR_VISIBLE_DEVICES appropriately for each instance, as shown in the example below:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p 21g

export OMP_NUM_THREADS=16
ROCR_VISIBLE_DEVICES=0 ./mybinary &
ROCR_VISIBLE_DEVICES=1 ./mybinary &
ROCR_VISIBLE_DEVICES=2 ./mybinary &
ROCR_VISIBLE_DEVICES=3 ./mybinary &
ROCR_VISIBLE_DEVICES=4 ./mybinary &
ROCR_VISIBLE_DEVICES=5 ./mybinary &
ROCR_VISIBLE_DEVICES=6 ./mybinary &
ROCR_VISIBLE_DEVICES=7 ./mybinary &
wait
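
Equivalently, here is a sketch of the same script written as a loop over the device IDs (it assumes 8 GPUs numbered 0 through 7, as above):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH -p 21g

export OMP_NUM_THREADS=16
# launch one instance per GPU, each seeing only its own device
for dev in 0 1 2 3 4 5 6 7; do
    ROCR_VISIBLE_DEVICES=$dev ./mybinary &
done
# wait for all background instances to finish
wait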


10. How do I transfer source tarballs to the cluster login nodes as some sites seem to be blocked?

The blocking of certain remote sites from the cluster login nodes is a mitigation strategy implemented by the JLab Computer Security team. While we cannot circumvent this blocking, the workaround below should work in most cases.

Example of a failing command run on a cluster login node:

[@qcdi2001 ~]$ wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.8/src...
--2022-02-28 13:10:01--  https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.8/src...
Resolving support.hdfgroup.org (support.hdfgroup.org)... 50.28.50.143
Connecting to support.hdfgroup.org (support.hdfgroup.org)|50.28.50.143|:443... connected.
HTTP request sent, awaiting response... 503 Service Unavailable
2022-02-28 13:10:01 ERROR 503: Service Unavailable.

Recommended workaround: run the download through another host (jlabl5) and redirect the output to a local file:

[@qcdi2001 ~]$ ssh jlabl5 curl https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.8/src... > hdf5-1.10.8.tar.bz2
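
If the remote site issues redirects, a variant of the same approach adds curl's -s (silent) and -L (follow redirects) options; the URL below is the same truncated path shown above:

[@qcdi2001 ~]$ ssh jlabl5 curl -sL https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.10/hdf5-1.10.8/src... > hdf5-1.10.8.tar.bz2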


If you have additional questions, please use the following support web page.