Ran across an article discussing Orion, ORNL’s new storage system, which has 100s of PB of file storage and supports TB/sec of bandwidth, so naturally I thought the GreyBeards need to talk to these people. I reached out and Dustin Leverman, Group Leader HPC storage at Oak Ridge National Labs (ORNL), answered the call. Dustin has been in HPC storage for a long time and at ORNL, he has helped deploy Orion, an almost 670PB, multi-tier file storage system for Frontier supercomputer users.
Orion is a LUSTRE file system based on HPE (Cray) ClusterStor with ~10PB of metadata, 11PB of NVMe flash and 649PB of disk. How the system handles multi-tiering is unique AFAIK. It performs 11TB/sec of write IO and 14TB/sec of read IO. Note, that’s TeraBytes/sec not TeraBits. Listen to the podcast to learn more.
While designing Orion, ORNL found their users have a very bi-(tri-?)modal file size distribution. That is, many of their files are under 256KB, a lot are under 8MB and the remainder are all over 8MB. As a result they added Progressive File Placement to support multi-tiering on LUSTRE.
Orion has 3 tiers of data storage. The 1st tier is a 10PB NVMe SSD metadata tier. Orion also uses Data on Metadata, which stores the 1st 256KB of every file along with the file metadata. So, accessing very small files (<256KB) is all done out of the metadata tier. But what’s interesting is that the first 256KB of every file on Orion, large or small, is located on the metadata tier.
Orion’s 2nd tier is an 11PB NVMe SSD flash tier. On this tier they store all file data over 256KB and under 8MB. The NVMe flash tier is not as fast as the metadata tier but it supports another large chunk of ORNL files.
The final Orion tier is 649PB of spinning disk storage. Here it stores all file data that is larger than 8MB. Yes it’s slower than the other 2 tiers, but it makes that up in volume. Users of very large files will find they can predictably access the first 256KB from the metadata tier, the next 8MB (minus 256KB) from the flash tier and then have to use disk to access any file data after that.
It’s important to note that Orion doesn’t keep hot data in the upper tiers and cold data in the lowest tier as many multi-tier storage systems do. Rather, Orion multi-tiering just places different segments of every file’s data on different tiers depending on where that data resides in the file’s storage space.
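To make the offset-based placement concrete, here is a minimal sketch of how a file's bytes would split across the three tiers. It is an illustration only, not ORNL's or LUSTRE's implementation: the function name `split_by_tier` is hypothetical, and treating the 256KB and 8MB boundaries as exact binary byte offsets is an assumption.

```python
KB = 1024
MB = 1024 * KB

# Tier boundaries as described in the text. Treating them as exact
# byte offsets in binary units is an assumption of this sketch.
METADATA_LIMIT = 256 * KB   # first 256KB of every file: metadata tier
FLASH_LIMIT = 8 * MB        # from 256KB up to 8MB: NVMe flash tier


def split_by_tier(file_size: int) -> dict[str, int]:
    """Return how many bytes of a file of `file_size` bytes land on each tier."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Bytes at offsets [0, 256KB)
    metadata = min(file_size, METADATA_LIMIT)
    # Bytes at offsets [256KB, 8MB)
    flash = min(max(file_size - METADATA_LIMIT, 0), FLASH_LIMIT - METADATA_LIMIT)
    # Bytes at offsets [8MB, end of file)
    disk = max(file_size - FLASH_LIMIT, 0)
    return {"metadata": metadata, "flash": flash, "disk": disk}


if __name__ == "__main__":
    for size in (100 * KB, 1 * MB, 1024 * MB):
        print(size, split_by_tier(size))
```

Note that placement here depends only on a byte's offset within the file, never on how recently or how often it was accessed, which is the difference from hot/cold tiering.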
In addition to Orion file storage, ORNL also has archive storage that uses HPSS and Spectrum Archive. Dustin mentioned that ORNL’s HPC data archive is accessed more frequently than typical archive storage, so there’s lots of data movement between the archive and Orion.
Orion supports metadata nodes and object storage targets (OSTs, storage nodes). Each OST has 1 flash target (made up of many SSDs) and 2 disk targets (made up of many disk drives).
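Dustin mentioned that Orion has 450 OSTs, which in aggregate support 5.4K 3.84TB NVMe SSDs and 47.7K 18TB disk drives. Doing the math, that’s 20.7PB of raw NVMe flash and 858.6PB of raw disk storage, so the 11PB and 649PB figures quoted earlier are presumably what’s left after RAID parity, spares and file system overhead.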
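The arithmetic behind those totals can be checked with a few lines of Python. This is just a sketch of the multiplication, using decimal units (1PB = 1000TB) as drive vendors do; the helper name `raw_capacity_pb` is hypothetical.

```python
TB_PER_PB = 1000  # decimal units, as drive capacities are normally quoted


def raw_capacity_pb(drive_count: int, drive_size_tb: float) -> float:
    """Raw capacity in PB, before RAID parity, spares or file system overhead."""
    return drive_count * drive_size_tb / TB_PER_PB


# Drive counts and sizes as quoted in the text
flash_pb = raw_capacity_pb(5_400, 3.84)
disk_pb = raw_capacity_pb(47_700, 18)

print(f"NVMe flash: {flash_pb:.1f} PB raw")
print(f"Disk:       {disk_pb:.1f} PB raw")
```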
Orion data is protected using ZFS RAID2 (double-parity RAID), which can sustain up to 2 drive failures without losing data. Their stripe has 8 data and 2 parity drives plus 2 spares.
Keith asked how one manages 670PB of LUSTRE storage. Dustin said they have a team of people with many software tools to support it. First and foremost, they take lots of telemetry off of all OSTs and metadata servers to understand what’s going on in the storage cluster. They use SMART data to predict which drives will go bad before they actually go bad. He mentioned that using telemetry, they can tell what kind of performance an app is driving and can use this to tweak what file systems an app uses.
I asked Dustin how he updates a 450 OST + [N] metadata node storage system. They take the cluster down when it needs to be updated. But before that, they regression test any update in their lab and, when ready, roll it out to the whole cluster. Dustin said many problems only show up at scale, which means that an update can only truly be tested when the whole cluster is in operation.
I asked Dustin whether they were doing any AI/ML work at ORNL. He said yes, but it doesn’t run on Orion directly. Instead it uses mirrored DAS NVMe storage on the compute servers. He said that AI/ML workloads don’t require lots of data and using DAS makes them go as quickly as possible.
Dustin mentioned that ORNL is a DoE funded lab, so any changes they make to LUSTRE are submitted back to the repository for inclusion in the next release of LUSTRE.
Dustin Leverman, Group Leader HPC storage at Oak Ridge National Labs
Dustin Leverman is the Group Leader for HPC Storage and Archive Group of the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL). The NCCS is home to Orion, the 700 petabyte file system that supports Frontier, the world’s first exascale supercomputing system and fastest computer in the world.
Dustin began his career at ORNL in 2009. He was previously a team leader in the HPC and Data Operations Group. In his current role, Dustin oversees procurement, administration, and support of high-speed parallel file systems and archive capabilities to enable the National Center for Computational Sciences’ overall mission of leadership-class and scalable computing programs.