
Lustre File System

The Lustre system (a distributed file system) spans multiple file servers, called Object Storage Servers (OSS), while presenting a single unified namespace as if it were a single server.  File system directory information is held on a Metadata Server (MDS), which at JLab is implemented as a dual-headed system with the two servers sharing a SAS disk array. One of the two heads is active and the other inactive, providing hot failover if the primary server fails.

All of the current OSS nodes are Supermicro disk servers connected to both the Infiniband fabric (high bandwidth) and Ethernet (low bandwidth), and each has internal disks and a RAID or JBOD controller.  Most of these servers are configured as active-active pairs, each serving half of the disks in 2 chassis, with the possibility of an operator-requested failover if one of the two servers stops working. (Automatic failover is not well supported in our current version of Lustre.)

All user files are held in RAID-6 (or ZFS RAID-Z2) sets, mostly 8+2 (some larger), called Object Storage Targets (OSTs): data remains protected if any one disk fails, and no data is lost even if any 2 disks in a set fail. Older OSS have either 2 or 3 RAID sets per OSS.  The new servers will have 4 OSTs per OSS, using 2 JBOD controllers with 2 OSTs per controller.

Each RAID or ZFS set is one OST (Object Storage Target).  The system disk and the Lustre journal disk are mirrored.  To avoid fragmentation, disks are kept below 85% full (the target is 80%).

Current generations include:

  • 4 nodes: Dual 2.2 GHz 6-core Broadwell CPUs, 40 × 10 TB drives, 285 TB target / node
  • 4 nodes: Dual 2.2 GHz 10-core Broadwell CPUs, 40 × 8 TB drives, 215 TB target / node
  • 2 nodes: Dual 2.6 GHz 6-core Broadwell CPUs, 40 × 8 TB drives, 215 TB target / node
  • 4 nodes: Dual 2.6 GHz 6-core Ivy Bridge CPUs, 36 × 4 TB drives, 96 TB target / node
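As an illustrative calculation only (not an official capacity statement): in an 8+2 set, 8 of every 10 drives hold data, so a node with 40 × 10 TB drives arranged as four 8+2 OSTs provides roughly 4 × 8 × 10 TB = 320 TB of raw data space; the lower per-node targets listed above reflect filesystem formatting overhead and unit conventions.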

Lustre holds 4 "file systems", partitions in the Lustre namespace with differing management policies:

  • /cache for LQCD/HPC, a read/write cache with MSS tape as a backing store
  • /volatile for LQCD/HPC, large scratch space
  • /cache for Experimental Physics, a read/write cache with MSS tape as backing store
  • /volatile for Experimental Physics, large scratch space

In addition, Lustre holds system file areas for staging data to be written to tape.

Lustre Performance

https://lqcd.jlab.org/lqcd/general/lustre.jsf

Lustre performs best as a file system for large sequential I/O. For other workloads, especially small-file and random I/O loads in the batch farm, copy data to the local compute node's scratch disk via Auger, process the data from the local disk, then move the output back to the Lustre filesystem.  Do not read and process files directly from /cache or /volatile unless you are reading in chunks of 4 MB or greater: many thousands of jobs accessing multiple files and directories at the same time cause disk head thrashing and degrade filesystem performance for all Lustre users.
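A rough sketch of this pattern (the paths, file names, and the application name myanalysis are placeholders, not site-specific values; in practice Auger normally handles the input staging for batch jobs):

> cp /cache/halld/rawdata.evio /scratch/
   stage the input from Lustre onto the compute node's local scratch disk

> myanalysis /scratch/rawdata.evio -o /scratch/output.root
   do the small-block / random I/O against the local disk

> cp /scratch/output.root /volatile/halld/
   write the result back to Lustre as one large sequential copy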

Lustre System Status

https://lqcd.jlab.org/ganglia/?m=cpu_wio&r=day&s=by%2520name&c=Disk+Serv...

Lustre File Striping

Lustre is capable of striping files across OSTs (Object Storage Targets); the default setting is not to stripe files.  For users with large files where higher single-file bandwidth would be helpful, a directory may be set so that files in that directory default to multi-OST striping; stripe counts of 2 or 4 are most useful for file sizes of 10-100 GB.   Striping should only be used for applications capable of reading or writing at more than 500 MB/s.  If you intend to run many copies of an application at once, striping should NOT be used, or total performance will in fact decrease due to increased head thrashing.

To see striping information, use the lfs getstripe command on a file, directory, or filesystem; to configure striping, use lfs setstripe.  Parameters include the stripe size, the stripe count, and the index of the first OST to use. For example, the command lfs setstripe -s 2M -c 4 -i -1 mydir would cause new files created in the directory to be written in 2-megabyte stripes over 4 OSTs; an index of -1 lets the system choose the starting OST.
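In command form (mydir is a placeholder directory name; on newer Lustre releases the stripe-size option is spelled -S rather than -s):

> lfs getstripe mydir
   show the current striping settings of the directory and the files in it

> lfs setstripe -s 2M -c 4 -i -1 mydir
   new files created in mydir will be written in 2 MB stripes across 4 OSTs, with the starting OST chosen by the system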

Lustre lfs Command

https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artif...

The Lustre filesystem command lfs has several useful equivalents of oft-used Linux commands that are faster than their counterparts and avoid hangs if parts of the Lustre filesystem are unavailable:

> lfs df -hl
   to show how full each MDT and OST is, in human-readable units

> lfs df -h -p lustre2.production
   to show how full the disks in the production pool (lustre2.production) are

> lfs quota -gh halld /lustre
   to show quota information for the halld group in a human-readable format

> lfs find
   a faster, Lustre-aware equivalent of find

> lfs ls
   a Lustre-aware equivalent of ls
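For example, a sketch of an lfs find usage (the path is a placeholder; see lfs help find for the options supported by the installed version):

> lfs find /volatile/halld -type f -size +1G
   list regular files larger than 1 GB, with much less overhead than running plain find over a large Lustre directory tree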

Problematic Lustre files - unlink command

Infrequently, a file in Lustre becomes problematic because of an issue with the underlying storage.  Such a file may show up with ???? in place of its attributes in directory listings, or produce a "Transport endpoint failure" or permission denied error when accessed.  If the file remains permanently inaccessible, the unlink command can be used to remove it, allowing jcache or jget to retrieve it from tape if it exists there, or allowing it to be recreated.
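For example (the path below is a placeholder for the damaged file):

> unlink /cache/halld/run1234/bad_file.evio
   removes the directory entry directly via unlink(2), which usually succeeds even when rm fails because it cannot stat the broken file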