ZFS (see ZFS wiki entry) is a file system with several distinct features and advantages. Storage is managed as pools, and a pool may be hierarchically divided into smaller file systems (datasets), each with its own quota. This gives the effect of hierarchical quotas, which is useful for our application and serves as a model for our own in-house storage management software running above Lustre. ZFS (Z File System) implements RAID-Z for redundancy and maintains a tree of embedded checksums. These checksums are always verified on read, so data integrity is very high. OpenZFS is used as the local file system in our most recent Lustre Object Storage Servers. In contrast, our oldest commodity disk servers used in Lustre do not check RAID parity on read, and only use the redundant information to rebuild a RAID set after a failed disk is replaced.
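As a rough illustration of the hierarchical-quota idea, the sketch below drives the standard zpool/zfs commands from Python; the pool name, device names, dataset names, and quota sizes are hypothetical placeholders, not our actual configuration.

    import subprocess

    def zfs_cmd(args):
        """Run a zpool/zfs command and fail loudly on error."""
        subprocess.run(args, check=True)

    # Create a RAID-Z2 pool from a set of (hypothetical) disks.
    zfs_cmd(["zpool", "create", "work",
             "raidz2", "sdb", "sdc", "sdd", "sde", "sdf", "sdg"])

    # Nested datasets give per-project and per-user quotas inside one pool.
    zfs_cmd(["zfs", "create", "-o", "quota=20T", "work/projectA"])
    zfs_cmd(["zfs", "create", "-o", "quota=2T", "work/projectA/userX"])

Because the datasets are nested, the per-user quota is enforced within the project quota, which in turn is bounded by the capacity of the pool.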
Because ZFS is not a distributed file system, a single part of the namespace, such as /work/projectA, is most easily implemented as a single ZFS pool on a single server, mounted as /work/<project>. If the pools on one server fill up, then all of those pools have to stop growing, even if other ZFS servers have space. Further, each project is limited to the I/O bandwidth of a single server. ZFS therefore has both strengths and weaknesses, and we use it for the parts of the namespace where higher data integrity is of value (e.g. databases, software source files, project workflow management, etc.). In particular, LQCD user home directories are on the ZFS Appliance.
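Continuing the hypothetical projectA example above, placing a project dataset at its mount point and checking its remaining quota might look like the following sketch (names and paths are illustrative).

    import subprocess

    # Mount the project dataset at its place in the namespace.
    subprocess.run(["zfs", "set", "mountpoint=/work/projectA", "work/projectA"],
                   check=True)

    # Report usage and remaining quota for the datasets on this server only;
    # ZFS has no view of free space on other servers.
    subprocess.run(["zfs", "list", "-o", "name,used,avail,quota", "-r", "work"],
                   check=True)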
ZFS Appliance
The ZFS Appliance consists of two head nodes and two trays of disks with Fibre Channel controllers. Under normal operation, one head node serves LQCD and the other serves Experimental Physics; if either head node fails, the other takes over both functions. The head nodes have QDR InfiniBand connections and export the local ZFS file systems over NFS.
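A minimal sketch of exporting a ZFS file system over NFS from a head node follows; the dataset name and export option are hypothetical, and the sharenfs property shown is the standard ZFS mechanism rather than necessarily the exact settings used on the appliance.

    import subprocess

    # Export the (hypothetical) LQCD home dataset read-write over NFS.
    subprocess.run(["zfs", "set", "sharenfs=rw", "pool0/home/lqcd"], check=True)

    # Confirm that the dataset is now shared.
    subprocess.run(["zfs", "get", "sharenfs", "pool0/home/lqcd"], check=True)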