DAQ Support: Writing Raw Data to Tape

All raw data from the experimental physics detectors goes through a number of steps to reach permanent storage on tape:
  1. The DAQ system assembles events and writes them into a raw data file on a local staging area.  This area is large enough to hold at least 24 hours of high-rate running.
  2. A script on the DAQ system periodically calls the jmirror Jasmine client to copy the data from the hall system to the data center file system.  There are currently 2 possible staging destinations: (1) flash storage on a special SSD file server, or (2) the large Lustre file system.
  3. Jasmine "file fairy" servers run on the special SSD file server and also on six gateway hosts, and will coordinate their work to pull files from the halls' local disks via direct socket communication with the jasmine clients.  For the SSD server, the files are then deposited on the local SSDs; for the 6 DAQ gateway nodes, the files are written to a special area in Lustre.  The SSD server has dual 40g Ethernet ports and can support the full DAQ rates of all four halls.  The older DAQ gateway nodes have 10g ports, but collectively they too can support the full data rates.
  4. Note: the file fairies do impose per-hall data transfer rate limits, averaged over a running window of 4 hours.  Thus, if a hall suddenly issues a request to move 2 days' worth of data, it might take over a day to accomplish this task (so don't allow too much data to pile up in the counting house).
  5. Starting in Sept 2018, the SSD server is the "normal" staging area.  It has adequate bandwidth to take the data from all 4 halls concurrently, and to serve the data out to multiple tape drive servers ("datamovers") to support 2 simultaneous copies (raw and raw-dup) for all data.  This file server, however, can only buffer data for a few hours, so if the tape library is down, it will quickly fill.  The secondary staging area, Lustre, is large enough to hold data for multiple weeks.
  6. Using Lustre has an impact on the performance of offline computing, so if the tape library goes down and the SSDs fill, no data will be pulled by the file fairies from the halls for up to 24 hours.  That is, each hall will buffer its own data locally for up to 24 hours; this "local buffering" mode is normal and serves to protect offline computing, at least temporarily.  After 24 hours, if the tape library is still down and the SSDs are still full (or the SSD file server is offline), the other file fairies will take over in a "buffer to Lustre" mode until the problems are resolved.
  7. Once a file has been staged in the Data Center (on either SSD or Lustre), it immediately becomes eligible for writing to tape.  Because hall raw data has the highest scheduling priority, it will almost always move to tape very quickly.  The tape drive scheduler will initially allocate one drive per hall when raw data starts arriving (the operations team can adjust the number of drives allocated to raw data).  One drive is capable of streaming compressed data to tape at 360 MB/s for the raw copy; for the raw-dup copy, rates are 130 MB/s to LTO-5 and over 150 MB/s to LTO-6.  If the hall writes data to the data center staging area faster than one drive can sustain, and the backlog exceeds the capacity of one tape (LTO-5 or LTO-6) or 2 TB (LTO-8), then a second drive will be activated.  This process continues up to an operator-imposed limit on the total number of drives.
  8. Once a file is written to tape (2 copies, raw and raw-dup), the file is removed from the staging area.  A small percentage of the data can be left behind in Lustre for quick turnaround by analysis jobs.  The selection of files to keep is programmable via a regular expression matching the file name, with the total data cached limited by the hall's disk quota on Lustre.
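The 4-hour windowed throttle described in step 4 can be sketched as a sliding-window byte budget.  This is a hypothetical illustration only: the class and method names and the numbers in the example are made up, and only the windowed-average idea comes from the text above.

```python
from collections import deque
import time


class SlidingWindowRateLimiter:
    """Hypothetical sketch of a per-hall transfer limit averaged over a
    running window (4 hours in the text above)."""

    def __init__(self, max_bytes_per_sec, window_sec=4 * 3600):
        self.max_bytes_per_sec = max_bytes_per_sec
        self.window_sec = window_sec
        self.transfers = deque()  # (timestamp, nbytes) pairs inside the window
        self.total = 0            # bytes transferred within the window

    def _expire(self, now):
        # Drop transfers that have aged out of the window.
        while self.transfers and now - self.transfers[0][0] > self.window_sec:
            _, n = self.transfers.popleft()
            self.total -= n

    def allow(self, nbytes, now=None):
        """Return True if transferring nbytes keeps the windowed average
        at or under the cap; record the transfer if allowed."""
        now = time.time() if now is None else now
        self._expire(now)
        budget = self.max_bytes_per_sec * self.window_sec
        if self.total + nbytes > budget:
            return False
        self.transfers.append((now, nbytes))
        self.total += nbytes
        return True
```

Because the limit is an average over the whole window, a hall that dumps 2 days' worth of backlog sees its requests refused until enough earlier transfers age out, which is why the text warns that draining a large backlog can take over a day.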
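The drive escalation in step 7 amounts to adding one drive per tape's worth of backlog, capped at the operator-imposed limit.  The helper below is a sketch of that policy under that reading; the real scheduler's logic is not documented here, and the function name and thresholds in the example are assumptions.

```python
def drives_needed(backlog_bytes, tape_capacity_bytes, max_drives):
    """Sketch of step 7's escalation: start with one drive per hall, add
    one more each time the backlog grows by another tape's worth of data
    (or 2 TB for LTO-8), up to the operator-imposed limit."""
    drives = 1 + backlog_bytes // tape_capacity_bytes
    return min(max_drives, max(1, drives))
```

For example, with a 1.5 TB tape and an operator cap of 4 drives, a 2 TB backlog triggers a second drive, while an arbitrarily large backlog still uses at most 4.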
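Step 8's cache-retention rule (a configurable regular expression plus a quota cap) might look like the following sketch.  The function name, the newest-first ordering, and the sample file names are all assumptions; only the regex-plus-quota mechanism comes from the text.

```python
import re


def select_cached_files(files, keep_pattern, quota_bytes):
    """Keep files whose names match keep_pattern, newest first, until the
    hall's Lustre quota would be exceeded.  'files' is a list of
    (name, size_bytes, mtime) tuples."""
    matcher = re.compile(keep_pattern)
    kept, used = [], 0
    for name, size, mtime in sorted(files, key=lambda f: -f[2]):
        if matcher.search(name) and used + size <= quota_bytes:
            kept.append(name)
            used += size
    return kept
```

Everything not selected is simply removed from the staging area once both tape copies exist, as described above.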

Quality of Service Guarantees

The data center is a lights-out operation, staffed for only 5 shifts a week, and the process of writing to tape is not considered a "live" system that must be available 24/7.  Any part of the system (networking, data center disk area, tape control system) is allowed to be offline for 24 hours, and the tape library is allowed to be offline for 72 hours.  If a hall notices that copies to the data center are failing, that hall should submit a trouble ticket, which will be handled on the next day (or next work day, if the problem is not preventing the hall from acquiring data).

Do NOT call the on-call phone for DAQ-to-tape problems except between the hours of 9 a.m. and 9 p.m.!  The halls must be capable of dealing locally with all DAQ-to-tape problems for up to 24 hours.

The current status of the tape writing software can be found using the jmirror "status" option, and this can be used by the halls to monitor the system.

The halls are free to choose their policy for when they request data copying to the data center and when they delete data from their local disk, so long as they retain the ability to acquire 24 hours of data after first noticing a problem in the data flow to tape.  While it might feel more comfortable not to delete the copy in the counting house until the file is "safely" on tape, it is perfectly reasonable to delete the local copy once a valid copy has been written to a staging area.  The staging areas are RAID-6 protected, and thus can sustain the loss of 2 disks in each set of 10.  Losing a 3rd disk in a set of 10 would not be expected to happen in the lifetime of the laboratory.  Thus, if the transfer to the staging area in the data center is functioning and the tape library is not, and the counting house local disk is full, the halls are expected to begin deleting their local copy of raw data while awaiting the repair of the tape library.

If the tape library is down for an excessively long time (a week or more), then a larger fraction of Lustre will be allocated to buffering the raw data.  This might result in offline computing grinding to a halt (a very rare event).

Implementation Details

The SSD file system is not exported and so is not visible to users.  Under normal operation, files will appear in the pseudo file system /mss within a few hours of transfer.

If the system is using secondary storage (Lustre), the halls can monitor the buffering location.  The location to which the jasmine file fairies stage their files is a unique subdirectory under /lustre/scicomp/jasmine/fairy2.  The Lustre gateway nodes are named scidaqgw10a/b/c/d/e/f; their utilization can be viewed on the Ganglia pages.

Note that the unix user who runs the jasmine client software (jmirror) must have an appropriate scientific computing certificate in its home directory.  Specifically, it is the .scicomp directory with a file named keystore.  For raw data this certificate needs to identify the client as user "halldata".  This is a special account that requires a manually generated certificate.  When configuring a new machine it's easiest to simply copy the certificate from an existing machine.

Hall A
Contact: Bob Michaels
Machine: adaq1

Bob maintains a cron task for user adaq that uses jput to send raw data files to tape and to remove originals once tape copies have been made.

Hall B
Contact: Serguei Boiarinov
Machines: clondaq5, clondaq3

A cron task for user root runs the jasmine command jmigrate (/usr/local/scicomp/jasmine/bin/jmigrate) to find and stage files found under /data/to-tape/mss.

Hall C
Contact: Brad Sawatzky, Steve Wood
Machines: cdaql4, cdaql5 (DAQ); cdaql3 (file server); cdaql1 (analysis)

Hall D
Contact: David Lawrence, Paul Letta
Machines: gluonraid3, gluonraid4

A cron task for user root runs a modified version of jmigrate from /root, finding files under /raid/rawdata/staging.  The script is modified to remove empty subdirectories on each run.  For more details: https://halldweb.jlab.org/wiki/index.php/Raid-to-Silo_Transfer_Strategy
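The empty-subdirectory cleanup that Hall D adds to its jmigrate wrapper can be sketched as a bottom-up directory walk.  This is an illustration, not the actual script: the function name is made up, and only the "remove empty subdirectories on each run" behavior comes from the text.

```python
import os


def prune_empty_dirs(root):
    """Remove empty subdirectories beneath root (but not root itself).
    Walking bottom-up means a parent emptied by removing its children
    is itself pruned in the same pass."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        # Re-check with listdir: os.walk's cached dirnames may still list
        # children that were just removed.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

In the Hall D setup this would run against the staging area (e.g. /raid/rawdata/staging) after each jmigrate pass, so directories whose files have all been migrated to tape do not accumulate.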