The data center is a lights-out operation, staffed for only 5 shifts a week, and the process of writing to tape is not considered a "live" system that must be available 24/7. Any part of the system (networking, data center disk area, tape control system) is allowed to be offline for 24 hours, and the tape library is allowed to be offline for 72 hours. If a hall notices that copies to the data center are failing, that hall should submit a trouble ticket, which will be handled the next day (or the next work day, if the problem is not preventing the hall from acquiring data).
Do NOT call the on-call phone for DAQ-to-tape problems except between the hours of 9 a.m. and 9 p.m.! The halls must be capable of dealing locally with all DAQ-to-tape problems for up to 24 hours.
The current status of the tape-writing software can be found using the jmirror "status" option, which the halls can use to monitor the system.
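A minimal monitoring sketch follows, assuming jmirror is installed in the same directory as the other jasmine tools and that its "status" option exits nonzero when the system is unhealthy (both points are assumptions; the actual invocation may differ):

  #!/bin/bash
  # Hedged sketch: periodic check of the tape-writing status from a hall machine.
  # The install path and the exit-code behavior of "jmirror status" are assumptions.
  JMIRROR=/usr/local/scicomp/jasmine/bin/jmirror
  if ! "$JMIRROR" status > "$HOME/jmirror-status.txt" 2>&1; then
      echo "$(date): jmirror status check failed on $(hostname)" >> "$HOME/jmirror-status.log"
  fi

A check like this could be run from cron every 15-30 minutes; it only flags a problem, since per the policy above the hall responds with a trouble ticket rather than a phone call.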
The halls are free to choose their own policy for when they request data copying to the data center and when they delete data from their local disk, so long as they maintain the ability to acquire 24 hours of data after first noticing a problem in the data flow to tape. While it might feel more comfortable not to delete the copy in the counting house until the file is "safely" on tape, it is perfectly reasonable to delete the local copy once a valid copy has been written to a staging area. The staging areas are RAID-6 protected and can therefore sustain the loss of 2 disks in each set of 10; losing a 3rd disk in a set of 10 would not be expected to happen in the lifetime of the laboratory. Thus, if the transfer to the staging area in the data center is functioning, the tape library is not, and the counting house local disk is full, the halls are expected to begin deleting their local copy of raw data while awaiting the repair of the tape library.
If the tape library is down for an excessively long time (a week or more), then a larger fraction of Lustre will be allocated to buffering the raw data. This might result in offline computing grinding to a halt (a very rare event).
Implementation Details
The SSD file system is not exported and so is not visible to the users. Under normal use, files will appear in the pseudo file system /mss within a few hours (for normal ongoing transfers).
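As a quick sanity check, a hall can confirm that a given raw-data file has shown up under /mss. The path below is purely hypothetical and should be replaced with the hall's own /mss layout:

  # Hedged sketch: check whether a file's stub has appeared in the /mss pseudo file system.
  # The path is a hypothetical example only.
  f=/mss/hallx/rawdata/run_012345_000.dat
  if [ -e "$f" ]; then
      echo "tape copy made (stub present): $f"
  else
      echo "not yet migrated: $f"
  fi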
If the system is using secondary storage (Lustre), the halls can monitor the buffering location. The jasmine file fairies each stage their files into a unique subdirectory under /lustre/scicomp/jasmine/fairy2. The Lustre gateway nodes are named scidaqgw10a/b/c/d/e/f; their utilization can be seen on the ganglia pages.
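For example, the amount of raw data currently buffered there can be summarized from any machine that mounts Lustre; the per-hall subdirectory names are not listed here, so this simply walks the documented parent directory:

  # Hedged sketch: summarize how much data the file fairies are currently buffering on Lustre.
  du -sh /lustre/scicomp/jasmine/fairy2/* 2>/dev/null | sort -h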
Note that the Unix user who runs the jasmine client software (jmirror) must have an appropriate scientific computing certificate in its home directory; specifically, a .scicomp directory containing a file named keystore. For raw data this certificate needs to identify the client as user "halldata", a special account that requires a manually generated certificate. When configuring a new machine it is easiest to simply copy the certificate from an existing machine.
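A minimal sketch of that copy, assuming the new machine is being configured for the same account and that scp between the two machines is available ("existing-host" is a placeholder):

  # Hedged sketch: run as the account that will execute the jasmine client on the new machine.
  # "existing-host" is a placeholder for a machine that already has a working certificate.
  scp -rp existing-host:.scicomp ~/
  chmod 700 ~/.scicomp
  chmod 600 ~/.scicomp/keystore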
Hall A
Contact: Bob Michaels
Machine: adaq1
Bob maintains a cron task for user adaq that uses jput to send raw data files to tape and to remove originals once tape copies have been made.
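A hedged sketch of that kind of job is below; it is not the actual production script, and the local paths, file pattern, and jput argument order are assumptions. It removes an original only after the corresponding stub is visible in /mss, i.e. after a tape copy has been made:

  #!/bin/bash
  # Hedged sketch, not Hall A's actual script: send raw data to tape with jput and
  # clean up originals once the /mss stub shows the tape copy exists.
  DATADIR=/adaq1/data            # hypothetical local raw-data area
  MSSDIR=/mss/halla/raw          # hypothetical tape destination
  for f in "$DATADIR"/*.dat; do
      [ -e "$f" ] || continue
      stub="$MSSDIR/$(basename "$f")"
      if [ -e "$stub" ]; then
          rm -f "$f"             # tape copy exists; remove the original
      else
          jput "$f" "$MSSDIR"/   # request a copy to tape (assumed source/destination order)
      fi
  done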
Hall B
Contact: Serguei Boiarinov
Machines: clondaq5, clondaq3
A cron task for user root runs the jasmine command jmigrate (/usr/local/scicomp/jasmine/bin/jmigrate) to find and stage files found under /data/to-tape/mss.
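A hedged example of such a crontab entry is shown below; the schedule and logging are assumptions, and whether jmigrate takes the directory as an argument or reads it from configuration is not confirmed here:

  # Hedged sketch of a root crontab entry (schedule, logging, and argument form are assumptions):
  # m    h  dom mon dow  command
  */15  *  *   *   *     /usr/local/scicomp/jasmine/bin/jmigrate /data/to-tape/mss >> /var/log/jmigrate.log 2>&1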
Hall C
Contact: Hanjie Liu, Robert Michaels
Machines: cdaql4, cdaql5 (DAQ); cdaql3 (file server); cdaql1 (analysis)
Hall D
Contact: David Lawrence, Paul Letta
Machines: gluonraid3, gluonraid4
A cron task for user root runs a modified version of jmigrate from /root, finding files under /raid/rawdata/staging. The script is modified to remove empty subdirectories on each run (see the sketch below). For more details: https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy
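The cleanup step amounts to something like the following (a sketch of the described modification, not the actual script in /root):

  # Hedged sketch: remove empty subdirectories left under the staging area after each run.
  find /raid/rawdata/staging -mindepth 1 -type d -empty -delete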