139: GreyBeards talk HPC file systems with Marc-André Vef and Alberto Miranda of GekkoFS

In honor of SC22 conference this month in Dallas, we thought it time to check in with our HPC brethren to find out what’s new in storage for their world. We happened to see that IO500 had some recent (ISC22) results using a relative new comer, GekkoFS (@GekkoFS). So we reached out to the team to find out how they managed to crack into the top 10. We contacted Marc-André Vef (@MarcVef), a Ph.D. student at Johannes Guttenberg University Mainz and Alberto Miranda (@amiranda_hpc) Ph.D. of Barcelona Supercomputing Center two of the authors on the GekkoFS paper.

GekkoFS is a new burst file system that is tailor made to create, process and tear down scratch data sets for HPC workloads. It turns out that HPC does lots of work using scratch files as working data sets. Burst file systems typically use another parallel file systems to (stage) read (permanent) data into the scratch files and write (permanent) result data out. But during processing, the burst file system handles all scratch data access. Listen to the podcast to learn more

We had never heard of a burst file system before but it’s been around for a while now in HPC. For example, BeeGFS provides one (check out our GreyBeards podcast on BeeGFS). BeeGFS supports both a PFS and a burst file system. GekkoFS only offers a burst file system.

GekkoFS is a distributed burst file systems which operates across nodes to stitch together a single global file system. GekkoFS is strictly open source at the moment and can be downloaded (see: GekkoFS Gitlab) and used by anyone.

They are considering in the future of supplying professional support but at the moment if you have an issue, Marc and Alberto suggest you use the GekkoFS GitLab incident tracking system to tell them about it.

Turns out Lustre, IBM Spectrum Scale, DAOS and other HPC file systems take gobs of overhead to create scratch files. And even though it takes a lot of IO to load scratch file data and write out results, there’s a whole lot more IO that gets done to scratch files during HPC jobs.

This sort of IO also occurs for AI/ML/DLL where training data is staged into a sort of scratch area (typically in memory, depending on size) and then repeatedly (re-)processed there. GekkoFS can offer significant advantages to AI/ML/DL work when training data is very large. Normally without a burst file system, one would need to shard this data across nodes and then deal with the partial training that results. But with GekkoFS, all you need do is stage it into the burst file system and read it from there.

GekkoFS is partially posix compliant. They install a client-side interposer library that intercepts those posix requests destined for GekkoFS files.

GekkoFS has no central metadata server, which means that all nodes in the GekkoFS cluster support metadata services. Filenames are hashed to tell GekkoFS which node has its (metadata &) data.

GekkoFS stores their data and metadata on local disks, SSDs or in memory (tempfs) storage. All local node storage in the cluster is stitched together into a single global file system.

GekkoFS supports strict consistency for IO and file creation/deletion within nodes. They use an internal transaction database to enforce this strict consistency.

Across nodes they support eventual consistency. Which means files created on one node may not be immediately viewable/accessible by other nodes in the cluster for a short period of time while (meta) data updates are propagated across the cluster.

As part of their consistency paradigm, GekkoFS doesn’t support directory locking. Jason mentioned that HPC “LS” (directory listings) commands can sometimes take forever due to directory locking No directory locking makes LS commands happen faster but may show inconsistent results (due to eventual consistency).

We had some discussion on this lack of directory locking and eventual consistency in file systems, but we agreed to disagree. They did say that for the HPC workloads (and probably AI/ML/DLL) workloads, their approach seems appropriate as they are way more read intensive than write intensive.

In any case, they must be doing something right as they have a screaming scratch file system for HPC work.

Marc will be attending SC22 in Dallas this month, so if your attending please look him up and say hello from us.

Marc-André Vef, Ph.D. student

Marc-André Vef is a Ph.D. candidate at the Johannes Gutenberg University Mainz. He started his Ph.D. in 2016 after receiving his B.Sc. and M.Sc. degrees in computer science from the Johannes Gutenberg University Mainz. His master’s thesis was in cooperation with IBM Research about analyzing file create performance in the IBM Spectrum Scale parallel file system (formerly GPFS).

During his Ph.D., he has worked on several projects focusing on file system tracing (in collaboration with IBM Research) and distributed file systems, among others. Most notably, he designed two ad-hoc distributed file systems: DelveFS (in collaboration with OpenIO), which won the Best Paper in its category, and GekkoFS (in collaboration with the Barcelona Supercomputing Center). GekkoFS placed fourth in its first entry in the 10-node challenge of the IO500 benchmark. The file system is actively developed in the scope of the EuroHPC ADMIRE project.

His research interests focus on file systems and system analytics.

Alberto Miranda, Ph.D., Senior Researcher, Barcelona Supercomputing Center

Dr. Eng. Alberto Miranda is a Senior Researcher in
advanced storage systems in the Computer Science Department of the Barcelona Supercomputing Center (BSC) and co-leader of the Storage Systems Research Group since 2019. Dr. Eng. Miranda received a diploma in Computer Engineering (2004), a M.Sc. degree in Computer Science (2006) and a M.Sc. degree in Computer Architectures, Networks and Systems (2008) from the Technical University of Catalonia (UPC-BarcelonaTech). He later received a Ph.D. degree Cum Laude in Computer Science from the Technical University of Catalonia in 2014 with his thesis “Scalability in Extensible and Heterogeneous Storage Systems”.

His current research interests include efficient file and storage systems, operating systems, distributed system architectures, as well as information retrieval systems. Since he started his work at BSC in 2007, he has published 14 papers in international conferences and journals, as well as 5 white papers and technical reports and 1 book chapter. Dr. Eng. Miranda is currently involved in several European and national research projects and has participated in competitively funded EU projects XtreemOS, IOLanes, Prace2IP, IOStack, Mont-Blanc 2, EUDAT2020, Mont-Blanc 3, and NEXTGenIO.

Leave a Reply

Discover more from Grey Beards on Storage

Subscribe now to keep reading and get access to the full archive.

Continue reading