139: GreyBeards talk HPC file systems with Marc-André Vef and Alberto Miranda of GekkoFS

In honor of SC22 conference this month in Dallas, we thought it time to check in with our HPC brethren to find out what’s new in storage for their world. We happened to see that IO500 had some recent (ISC22) results using a relative new comer, GekkoFS (@GekkoFS). So we reached out to the team to find out how they managed to crack into the top 10. We contacted Marc-André Vef (@MarcVef), a Ph.D. student at Johannes Guttenberg University Mainz and Alberto Miranda (@amiranda_hpc) Ph.D. of Barcelona Supercomputing Center two of the authors on the GekkoFS paper.

GekkoFS is a new burst file system that is tailor made to create, process and tear down scratch data sets for HPC workloads. It turns out that HPC does lots of work using scratch files as working data sets. Burst file systems typically use another parallel file systems to (stage) read (permanent) data into the scratch files and write (permanent) result data out. But during processing, the burst file system handles all scratch data access. Listen to the podcast to learn more

We had never heard of a burst file system before but it’s been around for a while now in HPC. For example, BeeGFS provides one (check out our GreyBeards podcast on BeeGFS). BeeGFS supports both a PFS and a burst file system. GekkoFS only offers a burst file system.

GekkoFS is a distributed burst file systems which operates across nodes to stitch together a single global file system. GekkoFS is strictly open source at the moment and can be downloaded (see: GekkoFS Gitlab) and used by anyone.

They are considering in the future of supplying professional support but at the moment if you have an issue, Marc and Alberto suggest you use the GekkoFS GitLab incident tracking system to tell them about it.

Turns out Lustre, IBM Spectrum Scale, DAOS and other HPC file systems take gobs of overhead to create scratch files. And even though it takes a lot of IO to load scratch file data and write out results, there’s a whole lot more IO that gets done to scratch files during HPC jobs.

This sort of IO also occurs for AI/ML/DLL where training data is staged into a sort of scratch area (typically in memory, depending on size) and then repeatedly (re-)processed there. GekkoFS can offer significant advantages to AI/ML/DL work when training data is very large. Normally without a burst file system, one would need to shard this data across nodes and then deal with the partial training that results. But with GekkoFS, all you need do is stage it into the burst file system and read it from there.

GekkoFS is partially posix compliant. They install a client-side interposer library that intercepts those posix requests destined for GekkoFS files.

GekkoFS has no central metadata server, which means that all nodes in the GekkoFS cluster support metadata services. Filenames are hashed to tell GekkoFS which node has its (metadata &) data.

GekkoFS stores their data and metadata on local disks, SSDs or in memory (tempfs) storage. All local node storage in the cluster is stitched together into a single global file system.

GekkoFS supports strict consistency for IO and file creation/deletion within nodes. They use an internal transaction database to enforce this strict consistency.

Across nodes they support eventual consistency. Which means files created on one node may not be immediately viewable/accessible by other nodes in the cluster for a short period of time while (meta) data updates are propagated across the cluster.

As part of their consistency paradigm, GekkoFS doesn’t support directory locking. Jason mentioned that HPC “LS” (directory listings) commands can sometimes take forever due to directory locking No directory locking makes LS commands happen faster but may show inconsistent results (due to eventual consistency).

We had some discussion on this lack of directory locking and eventual consistency in file systems, but we agreed to disagree. They did say that for the HPC workloads (and probably AI/ML/DLL) workloads, their approach seems appropriate as they are way more read intensive than write intensive.

In any case, they must be doing something right as they have a screaming scratch file system for HPC work.

Marc will be attending SC22 in Dallas this month, so if your attending please look him up and say hello from us.

Marc-André Vef, Ph.D. student

Marc-André Vef is a Ph.D. candidate at the Johannes Gutenberg University Mainz. He started his Ph.D. in 2016 after receiving his B.Sc. and M.Sc. degrees in computer science from the Johannes Gutenberg University Mainz. His master’s thesis was in cooperation with IBM Research about analyzing file create performance in the IBM Spectrum Scale parallel file system (formerly GPFS).

During his Ph.D., he has worked on several projects focusing on file system tracing (in collaboration with IBM Research) and distributed file systems, among others. Most notably, he designed two ad-hoc distributed file systems: DelveFS (in collaboration with OpenIO), which won the Best Paper in its category, and GekkoFS (in collaboration with the Barcelona Supercomputing Center). GekkoFS placed fourth in its first entry in the 10-node challenge of the IO500 benchmark. The file system is actively developed in the scope of the EuroHPC ADMIRE project.

His research interests focus on file systems and system analytics.

Alberto Miranda, Ph.D., Senior Researcher, Barcelona Supercomputing Center

Dr. Eng. Alberto Miranda is a Senior Researcher in
advanced storage systems in the Computer Science Department of the Barcelona Supercomputing Center (BSC) and co-leader of the Storage Systems Research Group since 2019. Dr. Eng. Miranda received a diploma in Computer Engineering (2004), a M.Sc. degree in Computer Science (2006) and a M.Sc. degree in Computer Architectures, Networks and Systems (2008) from the Technical University of Catalonia (UPC-BarcelonaTech). He later received a Ph.D. degree Cum Laude in Computer Science from the Technical University of Catalonia in 2014 with his thesis “Scalability in Extensible and Heterogeneous Storage Systems”.

His current research interests include efficient file and storage systems, operating systems, distributed system architectures, as well as information retrieval systems. Since he started his work at BSC in 2007, he has published 14 papers in international conferences and journals, as well as 5 white papers and technical reports and 1 book chapter. Dr. Eng. Miranda is currently involved in several European and national research projects and has participated in competitively funded EU projects XtreemOS, IOLanes, Prace2IP, IOStack, Mont-Blanc 2, EUDAT2020, Mont-Blanc 3, and NEXTGenIO.

137: GreyBeards talk VMware Explore 2022 Wrap-up

Jason Collier Principle Member of Technical Staff, AMD (@bocanuts), a current GreyBeardsOnStorage co-host and I both attended VMware Explore 2022 this past week and we recorded a podcast discussing VMware’s announcements on the show floor. It turns out that Keith Townsend, TheCTOAdvisor (@thectoadvisor) had brought his Airstream &studio and was exhibiting on the show floor. Keith kindly offered the use of his studio to record the podcast.

This one is a video. Let us know what you think. I clearly need a cowboy hat and Jason said (off camera) that I’m showing more grey in my beard than before. I take that as a compliment here.

Here’s the news as we saw it:

  • vSphere 8 – has a number of new features but the ones we thought important were the GA of Project Monterey. This supports new DPUs that now run ESXi out board from the CPU. They are able to offload lot’s of the CPU networking cycles to the DPU freeing up these for other (more important) work. vSphere 8 supports 2 DPUs now, the NVIDIA (Mellanox) BlueField(-2?) DPU and the AMD (Pensando) DPU. AMD recently purchased Pensando and Jason seemed to know an awful lot about this tech. VMware also announced support for concurrent ESXi upgrades which can now allow upgrading ESXi running in DPUs while hosts and clusters continue to operate. Finally, the other item of interest was vSphere is now more API driven. I guess it’s only a matter of time before all VMware functionality is API driven to make it even more cloud-like
  • vSAN 8 – also has a number of new features. The first we discussed was is a faster data path. This means more IOPS, more bandwidth and lower latency for IOs. Next, vSAN 8 now supports single tier storage pools . These will no longer require a caching layer. This should also speed up IO operations (as long as the single tier is at least as fast as the old caching layer). They also announced faster snapshots. Apparently this has been a problem in the past and they’ve done the work to speed this up considerably. Jason mentioned an AMD open source VM migration tool (from somebody else’s X86 CPUs to AMDs) that depends a lot on vSAN snapshots.
  • Cloud Flex Storage – mentioned at the show but not well explained, Jason and I speculated that this was an internal storage service available on for Cloud Foundation users on AWS where customers could subscribe to storage as-a-service in much lower increments (maybe even GB/month) than standing up more vSAN hosts to increase storage.
  • NetApp FsX (ONTAP) storage – along the same line, VMware announced support for NetApp’s FsX as yet another storage option for Cloud Foundation users on AWS. Supplying yet another storage-as-a-service option for this environment.
  • Cloud Flex Compute – also mentioned at the show was their new Compute-As-A-Service for Cloud Foundation users on AWS. This way users could subscribe to more or less compute, on an as needed basis rather than having to spin up new ESXi hosts. I later found out this allows users to run a single VM and pay for it on a subscription basis.
  • Tanzu Application Platform (TAP) – is a new VMware supplied (and supported) “development experience” for K8s on vSphere. Note, it doesn’t include any advanced Tanzu services such as Tanzu K8s Grid (TKG) so it’s a true DevOps bare bones environment.
  • Tanzu K8S Operations (TKO) – another new Tanzu based service which offers operations complete control over the Tanzu services running on vSphere. Note Tanzu Mission Control (TMC) is not part of TKO.
  • Aria management – VMware rebranded vRealize and CloudHealth, which now comes in 3 bundles, Aria Cost (CloudHealth+), Aria Operations and Aria Automation. Which are all built onto of Aria Graph that graphs all the nodes in your VMware clusters with all their connections so that Aria management can traverse this graph to find out what’s where. On top of Aria Graph are Aria Hub, Aria Insights, and Aria Guardrails (sort of like providing boundary’s where services can be deployed).

They also announced Ransomware Recovery [changed 7Sep22, the Eds] as a Service which builds on VMware’s DR-aaS announced last year and Tanzu now works with Red Hat OpenShift

We also discussed the show. I heard somewhere there were 10K people there, Jason heard somewhere between 6K and 9K. In any case much smaller than VMworlds prior to Covid (25kish). And of course the rebranding of the show seemed counter-intuitive at best.

The show floor was much smaller than usual, (not withstanding Keith’s Airstream RV exhibit). And there were a number of storage vendors not at the show?? There was less hardware on the show floor, this could be a Covid thing but there were just as many mini-white boards/class rooms per large exhibiter, so don’t think it was because of Covid.

But the elephant in the room was Broadcom’s acquisition of VMware. At one of the analyst briefings I asked an exec about attrition. He made a couple of comments but in the end said VMware has been bought and sold before and has always come out of it in better shape. This will be no different.

That’s about all from the show.

And Thanks again to Keith and his crew, for lending us his studio to record the show. It’s been a while since I’ve seen an RV on a show floor. Keith seemed to have a ball with it

Tell us how you like our video. If everyone is for it we could do something like this with a Zoom (in this case Zencastr) recording, Or just try this at the next joint conference. .

Jason Collier, Principle Member of Technical Staff at AMD

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology.

He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He’s on LinkedIN. He’s currently working with AMD on new technology and he has been a GreyBeards on Storage co-host since the beginning of 2022

134: GreyBeards talk (storage) standards with Dr. J Metz, SNIA Chair & Technical Director AMD

We have known Dr. J Metz (@drjmetz, blog), Chair of SNIA (Storage Networking Industry Association) BoD, for over a decade now and he has always been an intelligent industry evangelist. DrJ was elected Chair of SNIA BoD in 2020.

SNIA has been instrumental in the evolution of storage over the years working to help define storage networking, storage form factors, storage protocols, etc. Over the years it’s been crucial to the high adoption of storage systems in the enterprise and still is.. Listen to the podcast to learn more.

SNIA started out helping to define and foster storage networking before people even knew what it was. They were early proponents of plugfests to verify/validate compatibility of all the hardware, software and systems in a storage network solution.

One principal that SNIA has upheld, since the very beginning, is strict vendor and technology neutrality. SNIA goes out of it’s way to insure that all their publications,  media and technical working group (TWGs) committees maintain strict vendors and technology neutrality.

The challenge with any evolving technology arena is that new capabilities come and go with a regular cadence and one cannot promote one without impacting another. Ditto for vendors, although vendors seem to stick around a bit longer.

One SNIA artifact that has stood well the test of time is the SNIA dictionary.  Free to download and free copies available at every conference that SNIA attends. The dictionary covers just about every relevant acronym, buzzword and technology present in the storage networking industry today as well as across its long history.

SNIA also presents and pushes the storage networking point of view at every technical alliance in the IT industry. .

In addition, SNIA holds storage conferences around the world, as well as plugfests and  hackathons focused on the needs of the storage industry. Their Storage Developer Conference (SDC), coming up in September in the USA, is a highly technical conference specifically targeted at storage system developers. 

SDC presenters include many technology inventors driving the leading edge of storage (and memory, see below) industries. So, if you are developing storage systems, SDC is a must attend conference.

As for plugfests, SNIA has held FC storage networking plugfests over the years which have been instrumental in helping storage networking adoption.

We also talked about SNIA hackathons. Apparently a decade or so back, SNIA held a hackathon on SMB (the file protocol formerly known as CIFS) where most of the industry experts and partners doing work on SAMBA (open source SMB implementation) and SMB proprietary software were present.

At the time, Jason was working for another company, developing an SMB protocol. While attending the hackathon, Jason found that he was able to develop 1-1 relationships with many of the lead SMB/SAMBA developers and was able to solve problems in days that would have taken months before.

SNIA also has technology alliances with just about every other standards body involved in IT infrastructure, software and hardware today. As an indicator of where they are headed, SNIA recently joined with CNCF (Cloud Native Computing Foundation) to push for better storage under K8s.

SNIA has TWGs focused on technological areas that impact storage access. One TWG that has been going on now, for a long time, is Swordfish, an extension to the DMTF Redfish that focuses on managing storage.

Swordfish has struggled over the years to achieve industry adoption. We spent time discussing some of the issues with Swordfish, but honestly,  IMHO, it may be too late to change course.

Given the recent SNIA alliance with CNCF, we started discussing the state of storage under K8s and containers. DrJ and Jason mentioned that storage access under K8s goes through so many layers of abstraction that IO performance is almost smothered in overhead. The thinking at SNIA is we need to come up with a better API that bypasses all this software overhead to  directly access hardware.

 SNIA’s been working on SDXI (Smart Data Acceleration Interface), a new hardware memory to memory, direct path protocol. Apparently, this is a new byte level, (storage?) protocol for moving data between memories. I believe SDXI assumes that at least one memory device is shared. The other could be in a storage server, smartNIC, GPU, server, etc. If SDXI were running in your shared memory and server, one could use the API to strip away all of the software abstraction layers that have built up over the years to accessi shared memory at near hardware speeds

DrJ mentioned was NVMe as another protocol that strips away software abstractions to allow direct access to (storage) hardware. The performance of Optane and SSDs (and it turns out disks) was being smothered by SCSI device protocols/abstrations that were the only way to talk to storage devices in the past. But NVM and NVMe came along, and stripped all the non-essential abstractions and protocol overhead away and all of a sudden sub 100 microsecond IO’s were possible. 

Dr. J Metz,  SNIA Chair & Technical Director AMD

J is the Chair of SNIA’s (Storage Networking Industry Association) Board of Directors and Technical Director for Systems Design for AMD where he works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the innerworkings and application of emerging technologies.

J has previously held roles in both startups and Fortune 100 companies as a Field CTO,  R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.

J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.

132: GreyBeards talk fast embedded k-v stores with Speedb’s Co-Founder&CEO Adi Gelvan

We’ve been talking a lot about K8s of late so we thought it was time to get back down to earth and spend some time with Adi Gelvan (@speedb_io), Co-founder and CEO of Speedb, an embedded key-value store, drop-in/replacement for RocksDB, that significantly improves on its IO performance for large metadata databases.

At Adi’s last job they were searching for a key-values store or database to manage the substantial metadata they needed. After looking at RocksDB, they found it had a number of performance problems, especially as you got up to lots of metadata. Speedb was specifically designed to address the problems they found. Listen to the podcast to learn more.

RocksDB is a key-value store engine that manages the metadata for just about every open source project in existence that uses metadata. RocksDB is a Facebook open source fork of Google’s LevelDB database.

The main issues with RocksDB is that when you have a lot of metadata (key:volume pairs), RocksDB performance suffers from highly variable latency and write stalls.

Most RocksDB users are aware of these problems and turn to sharding the database to address them (by essentially shrinking the amount of metadata under management within a single node/instance..

Historically, key-volume stores used B+-trees to store data. B+-trees are great for reading, but bad for writing. Namely the B+-tree usually had to be rebalanced when entries were added and potentially when they were updated. This could cause a cascade of read-write IO throughout the tree, delaying the original IO.

Log Structured Merge trees (LMS-trees) were created to reduce write problems while at the same time provide B+-tree speed for reading. Essentially, an LSM-tree is an in-memory, sequence of (sometimes sorted) key-value pairs that can be written (destaged) to multiple sequential (sorted) string tables (SST) files on some backing store. A hierarchical index is maintained in memory to identify which SSTs holds which key-value data.

RocksDB uses LSM-tree, in memory data structures, to buffer writes. When memory becomes full, the LSM tree can be destaged to backing store to one or more SST files. Howewer, SSTs, when first written, aren’t necessarily in sorted key order, and they may contain duplicate key-value entries to what’s already in other SSTs.

So earlier versions of SSTs will need to be read back in, compacted (duplicate key-value entries deleted), sorted and written back out. The earliest version of the SSTs is considered Level 0 (L0), the next (1st level compacted and sorted) is considered L1, and this process can go on generating L2 to Ln SSTs. We would call this garbage collection, the metadata world calls it compaction.

But each time an SST is written out that’s another read of all the key-value pairs AND another write to storage. In SSDs we would call these repeated writes, write amplification. It turns out that RocksDB can have up to a 30X write amplification for a key-value entry. This means that instead of just writing it once or twice it’s written (and reread) up to 30 times. This IO takes away bandwidth and processing power from normal metadata read and write activity, which impacts IO performance

As GreyBeards know, storage (and flash) garbage collection can lead to unpredictable latencies and system busy times. Intense garbage collection (for SSDs) can seemingly hold off or stall all other IO for some amount of time during this activity. This is the main reason why RocksDB has highly variable latencies and write stalls.

Garbage collection is not an issue when you have limited amounts of metadata entries (key:volume pairs), but as you get more entries, ongoing garbage collection can become a serious impediment to performing IO. When we say “large metadata stores” we are talking 30GBs of metadata, with probably, billions of key-volume pair entries.

There appears to be two dimensions to (RocksDB) LSM-tree/SST file performance. One is the number of levels allowed and the other is the size of SST file.

Speedb determined that two dimensions weren’t sufficient to solve RocksDB performance problems. And sharding the database seemed to be putting the burden on the customer to fix the issue. So Speedb restructured their LSM-trees and SSTs to create 3 or more dimensions to tune for database performance.

With Speedb’s restructured LMS-tree and SST files, they reduce write amplification for large metadata databases, from 30X to 5X. That alone could easily increase system performance by a factor of 6.

Adi mentioned that for one cloud based customer, they were able to double performance with 1/4 the (cloud instance) server hardware, essentially providing an ~8X improvement in performance over RocksDB.

Adi also mentioned that they are targeting system developers with large metadata stores. Luckily Speedb is a fully RocksDB compatible replacement. This means developers should only take ~30 minutes to convert a system to use Speedb.

We also asked about pricing. Adi said there’s two current pricing models: 1) OEMs pay a revenue share to use Speedb and 2) non-OEMs can license the product on a per node per month basis. Given that Speedb node efficiency over RocksDB, there should be a less nodes required to support the same performance for any given metadata store.

Adi also mentioned they are in the process of releasing an open source version of Speedb that incorporates some of the enterprise product. This way developers can try Speedb to see how it works for free. It won’t bethe complete product but it’s better than native RocksDB.

Adi Gelvan, Co-Founder and CEO Speedb

Adi Gelvan is co-founder and CEO of Speedb, a data management startup, that provides a drop-in replacement for RocksDB embedded storage engine. 

Adi was a former IT infrastructure manager with over two decades of management, commercialization and executive sales position. Adi specializes in leading global software technology companies like Infinidat and SQream to outstanding growth.

Adi holds a double academic degree in mathematics & computer science.

129: GreyBeards talk composable infrastructure with GigaIO’s, Matt Demas, Field CTO

We haven’t talked composable infrastructure in a while now but it’s been heating up lately. GigaIO has some interesting tech and I’ve been meaning to have them on the show but scheduling never seemed to work out. Finally, we managed to sync schedules and have Matt Demas, field CTO at GigaIO (@giga_io) on our show.

Also, please welcome Jason Collier (@bocanuts), a long time friend, technical guru and innovator to our show as another co-host. We used to have these crazy discussions in front of financial analysts where we disagreed completely on the direction of IT. We don’t do these anymore, probably because the complexities in this industry can be hard to grasp for some. From now on, Jason will be added to our gaggle of GreyBeard co-hosts.

GigaIO has taken a different route to composability than some other vendors we have talked with. For one, they seem inordinately focused on speed of access and reducing latencies. For another, they’re the only ones out there, to our knowledge, demonstrating how today’s technology can compose and share memory across servers, storage, GPUs and just about anything with DRAM hanging off a PCIe bus. Listen to the podcast to learn more.

GigaIO started out with pooling/composing memory across PCIe devices. Their current solution is built around a ToR (currently Gen4) PCIe switch with logic and a party of pooling appliances (JBoG[pus], JBoF[lash], JBoM[emory],…). They use their FabreX fabric to supply rack-scale composable infrastructure that can move (attach) PCIe componentry (GPUs, FPGAs, SSDs, etc.) to any server on the fabric, to service workloads.

We spent an awful long time talking about composing memory. I didn’t think this was currently available, at least not until the next version of CXL, but Matt said GigaIO together with their partner MemVerge, are doing it today over FabreX.

We’ve talked with MemVerge before (see: 102: GreyBeards talk big memory … episode). But when last we met, MemVerge had a memory appliance that virtualized DRAM and Optane into an auto-tiering, dual tier memory. Apparently, with GigaIO’s help they can now attach a third tier of memory to any server that needs it. I asked Matt what the extended DRAM response time to memory requests were and he said ~300ns. And then he said that the next gen PCIe technology will take this down considerably.

Matt and Jason started talking about High Bandwidth Memory (HBM) which is internal to GPUs, AI boards, HPC servers and some select CPUs that stacks synch DRAM (SDRAM) into a 3D package. 2nd gen HBM silicon is capable of 256 GB/sec per package. Given this level of access and performance. Matt indicated that GigaIO is capable of sharing this memory across the fabric as well.

We then started talking about software and how users can control FabreX and their technology to compose infrastructure. Matt said GigaIO has no GUI but rather uses Redfish management, a fully RESTfull interface and API. Redfish has been around for ~6 yrs now and has become the de facto standard for management of server infrastructure. GigaIO composable infrastructure support has been natively integrated into a couple of standard cluster managers. For example. CIQ Singularity & Fuzzball, Bright Computing cluster managers and SLURM cluster scheduling. Matt also mentioned they are well plugged into OCP.

Composable infrastructure seems to have generated new interest with HPC customers that are deploying bucketfuls of expensive GPUs with their congregation of compute cores. Using GigaIO, HPC environments like these can overnight, go from maybe 30% average GPU utilization to 70%. Doing so can substantially reduce acquisition and operational costs for GPU infrastructure significantly. One would think the cloud guys might be interested as well.

Matt Demas, Field CTO, GigaIO

Matt’s career spans two decades of experience in architecting innovative IT solutions, starting with the US Air Force. He has built federal, healthcare, and education-based vertical solutions at companies like Dell, where he was a Senior Solutions Architect. Immediately prior to joining GigaIO, he served as Field CTO at Liqid. 

Matt holds a Bachelor’s degree in Information Technology from American InterContinental University, and an MBA from Concordia University Austin.