139: GreyBeards talk HPC file systems with Marc-André Vef and Alberto Miranda of GekkoFS

In honor of SC22 conference this month in Dallas, we thought it time to check in with our HPC brethren to find out what’s new in storage for their world. We happened to see that IO500 had some recent (ISC22) results using a relative new comer, GekkoFS (@GekkoFS). So we reached out to the team to find out how they managed to crack into the top 10. We contacted Marc-André Vef (@MarcVef), a Ph.D. student at Johannes Guttenberg University Mainz and Alberto Miranda (@amiranda_hpc) Ph.D. of Barcelona Supercomputing Center two of the authors on the GekkoFS paper.

GekkoFS is a new burst file system that is tailor made to create, process and tear down scratch data sets for HPC workloads. It turns out that HPC does lots of work using scratch files as working data sets. Burst file systems typically use another parallel file systems to (stage) read (permanent) data into the scratch files and write (permanent) result data out. But during processing, the burst file system handles all scratch data access. Listen to the podcast to learn more

We had never heard of a burst file system before but it’s been around for a while now in HPC. For example, BeeGFS provides one (check out our GreyBeards podcast on BeeGFS). BeeGFS supports both a PFS and a burst file system. GekkoFS only offers a burst file system.

GekkoFS is a distributed burst file systems which operates across nodes to stitch together a single global file system. GekkoFS is strictly open source at the moment and can be downloaded (see: GekkoFS Gitlab) and used by anyone.

They are considering in the future of supplying professional support but at the moment if you have an issue, Marc and Alberto suggest you use the GekkoFS GitLab incident tracking system to tell them about it.

Turns out Lustre, IBM Spectrum Scale, DAOS and other HPC file systems take gobs of overhead to create scratch files. And even though it takes a lot of IO to load scratch file data and write out results, there’s a whole lot more IO that gets done to scratch files during HPC jobs.

This sort of IO also occurs for AI/ML/DLL where training data is staged into a sort of scratch area (typically in memory, depending on size) and then repeatedly (re-)processed there. GekkoFS can offer significant advantages to AI/ML/DL work when training data is very large. Normally without a burst file system, one would need to shard this data across nodes and then deal with the partial training that results. But with GekkoFS, all you need do is stage it into the burst file system and read it from there.

GekkoFS is partially posix compliant. They install a client-side interposer library that intercepts those posix requests destined for GekkoFS files.

GekkoFS has no central metadata server, which means that all nodes in the GekkoFS cluster support metadata services. Filenames are hashed to tell GekkoFS which node has its (metadata &) data.

GekkoFS stores their data and metadata on local disks, SSDs or in memory (tempfs) storage. All local node storage in the cluster is stitched together into a single global file system.

GekkoFS supports strict consistency for IO and file creation/deletion within nodes. They use an internal transaction database to enforce this strict consistency.

Across nodes they support eventual consistency. Which means files created on one node may not be immediately viewable/accessible by other nodes in the cluster for a short period of time while (meta) data updates are propagated across the cluster.

As part of their consistency paradigm, GekkoFS doesn’t support directory locking. Jason mentioned that HPC “LS” (directory listings) commands can sometimes take forever due to directory locking No directory locking makes LS commands happen faster but may show inconsistent results (due to eventual consistency).

We had some discussion on this lack of directory locking and eventual consistency in file systems, but we agreed to disagree. They did say that for the HPC workloads (and probably AI/ML/DLL) workloads, their approach seems appropriate as they are way more read intensive than write intensive.

In any case, they must be doing something right as they have a screaming scratch file system for HPC work.

Marc will be attending SC22 in Dallas this month, so if your attending please look him up and say hello from us.

Marc-André Vef, Ph.D. student

Marc-André Vef is a Ph.D. candidate at the Johannes Gutenberg University Mainz. He started his Ph.D. in 2016 after receiving his B.Sc. and M.Sc. degrees in computer science from the Johannes Gutenberg University Mainz. His master’s thesis was in cooperation with IBM Research about analyzing file create performance in the IBM Spectrum Scale parallel file system (formerly GPFS).

During his Ph.D., he has worked on several projects focusing on file system tracing (in collaboration with IBM Research) and distributed file systems, among others. Most notably, he designed two ad-hoc distributed file systems: DelveFS (in collaboration with OpenIO), which won the Best Paper in its category, and GekkoFS (in collaboration with the Barcelona Supercomputing Center). GekkoFS placed fourth in its first entry in the 10-node challenge of the IO500 benchmark. The file system is actively developed in the scope of the EuroHPC ADMIRE project.

His research interests focus on file systems and system analytics.

Alberto Miranda, Ph.D., Senior Researcher, Barcelona Supercomputing Center

Dr. Eng. Alberto Miranda is a Senior Researcher in
advanced storage systems in the Computer Science Department of the Barcelona Supercomputing Center (BSC) and co-leader of the Storage Systems Research Group since 2019. Dr. Eng. Miranda received a diploma in Computer Engineering (2004), a M.Sc. degree in Computer Science (2006) and a M.Sc. degree in Computer Architectures, Networks and Systems (2008) from the Technical University of Catalonia (UPC-BarcelonaTech). He later received a Ph.D. degree Cum Laude in Computer Science from the Technical University of Catalonia in 2014 with his thesis “Scalability in Extensible and Heterogeneous Storage Systems”.

His current research interests include efficient file and storage systems, operating systems, distributed system architectures, as well as information retrieval systems. Since he started his work at BSC in 2007, he has published 14 papers in international conferences and journals, as well as 5 white papers and technical reports and 1 book chapter. Dr. Eng. Miranda is currently involved in several European and national research projects and has participated in competitively funded EU projects XtreemOS, IOLanes, Prace2IP, IOStack, Mont-Blanc 2, EUDAT2020, Mont-Blanc 3, and NEXTGenIO.

138: GreyBeards talk big data orchestration with Adit Madan, Dir. of Product, Alluxio

We have never talked with Alluxio before but after coming back last week from Cloud Field Day 15 (CFD15) it seemed a good time to talk with other solution providers attempting to make hybrid cloud easier to use. Adit Madan (@madanadit) , Director of Product Management, Alluxio, which is a data orchestration solution that’s available in both a free to download/use, open source, community edition (apparently, Meta is a customer ) or a licensed, closed source, enterprise edition.

Alluxio data orchestration is all about suppling local like, IO access to data that resides elsewhere for BI, AI/ML/DL, and just about any other application needing to process data residing elsewhere. Listen to the podcast to learn more

Alluxio started out at UC Berkeley’s AMPlab, which is focused on big data problems and was designed to provide local access to massive amounts of distributed data. Alluxio ends up constructing a locally accessible, federation of data sources for compute apps running elsewhere,

Alluxio software installs near where compute apps run that need access to remote data. We asked about a typical cloud bursting case where S3 object data needed by an app are sitting on prem, but the apps need to run in a cloud, e.g., AWS.

He said Alluxio software would be deployed in AWS, close to app compute and that’s all there is. There’s no Alluxio software running on prem, as Alluxio just uses normal (remote access) S3 APIs to supply data to the compute apps running in AWS.

Adit mentioned that BI was one of the main applications to take advantage of Alluxio, but AI/ML/DL learning is another that could use data orchestration. It turns out that AI/ ML/DL training’s consumption of data is repetitive and highly sequential, so caching, sequential pre-fetch and other Alluxio techniques can work well there to provide local-like access to remote data.

Adit said that enterprises are increasingly looking to avoid vendor lock-in and this applies equally well to the cloud. By supporting data access in one location, say GC,P and accessing that data from another, say Azure, data gravity need no longer limit where work is done.

Adit said what makes their solution so valuable is that instead of duplicating all data from one place to another all that Alluxio moves is just the data required/requested by the apps running there.

Keith asked whether Adit considered Alluxio a data mesh or data fabric. Keith had to explain the terms to me and said data fabrics are pipes and physical infrastructure/functionality that moves data around and data mesh is what gives clients/apps/users access to that data. From that perspective Alluxio is a data mesh.

Alluxio Caching

Adit said that caching is one of the keys to making Alluxio work. Much of the success of their solution depends on applications having a well behaved working set. He also mentioned they use pre-fetching and other techniques to minimize access latency and maximize throughput. However, the first byte of data being accessed may take some time to get to where compute executes.

Adit said it’s not unusual for them to have a 1/2PB of cache (storage) for an application with multiPBs of source data.

Keith asked how Alluxio’s performance can be managed. Adit said they (we assume enterprise edition) have a solution called Cache Insights which uses Alluxio’s extensive access pattern history to predict application IO performance with larger cache (storage), higher speed networking, higher performing/more compute cores, etc. In this way, customers can see what can be done to improve application IO performance and what it would cost.

Keith asked if Alluxio were available as a SaaS solution. Adit said, although it could be deployed in that fashion, it’s not currently a SaaS solution. When asked how Alluxio (enterprise) was priced, Adit said it’s a function of the total resources consumed by their service, i.e, storage (cache), cores, networking that runs Alluxio software etc.

As for deployment options, it turns out for Spark, Alluxio is just another lib package installed inside Spark. For K8s, Alluxio is installed as a CSI drivers and a set of containers and can be deployed as containers within a cluster that needs access to data or in an external, standalone K8s cluster, servicing IO from other clusters. Alluxio HA is supplied by using multiple nodes to provide IO access.

Alluxio also supports access to multiple data locations. In this case, the applications would just access different mount points.

Data reads are easy, writes can be harder due to data integrity issues. As such, trying to supply IO performance becomes a trade off for data integrity when data updates are supported. Adit said Alluxio offers a couple of different configuration options for write concurrency (data integrity) that customers can select from. We assume this includes write through, write back and perhaps other write consistency options.

Alluxio supports AWS, Azure and GCP cloud compute accessing HDFS, S3 and Posix protocol access to data residing at remote sites. At remote sites, they currently support MinIO, Cloudian and any other S3 compatible storage solutions as well as NetApp (ONTAP) and Dell (ECS) storage as data sources.

Adit Madan, Director of Product, Alluxio

Adit Madan is the Director of Product Management at Alluxio. Adit has extensive experience in distributed systems, storage systems, and large-scale data analytics.

Adit holds an MS from Carnegie Mellon University and a BS from the Indian Institute of Technology – Delhi.

Adit is the Director of Product Management at Alluxio and is also a core maintainer and Project Management Committee (PMC) member of the Alluxio Open Source project.

137: GreyBeards talk VMware Explore 2022 Wrap-up

Jason Collier Principle Member of Technical Staff, AMD (@bocanuts), a current GreyBeardsOnStorage co-host and I both attended VMware Explore 2022 this past week and we recorded a podcast discussing VMware’s announcements on the show floor. It turns out that Keith Townsend, TheCTOAdvisor (@thectoadvisor) had brought his Airstream &studio and was exhibiting on the show floor. Keith kindly offered the use of his studio to record the podcast.

This one is a video. Let us know what you think. I clearly need a cowboy hat and Jason said (off camera) that I’m showing more grey in my beard than before. I take that as a compliment here.

Here’s the news as we saw it:

  • vSphere 8 – has a number of new features but the ones we thought important were the GA of Project Monterey. This supports new DPUs that now run ESXi out board from the CPU. They are able to offload lot’s of the CPU networking cycles to the DPU freeing up these for other (more important) work. vSphere 8 supports 2 DPUs now, the NVIDIA (Mellanox) BlueField(-2?) DPU and the AMD (Pensando) DPU. AMD recently purchased Pensando and Jason seemed to know an awful lot about this tech. VMware also announced support for concurrent ESXi upgrades which can now allow upgrading ESXi running in DPUs while hosts and clusters continue to operate. Finally, the other item of interest was vSphere is now more API driven. I guess it’s only a matter of time before all VMware functionality is API driven to make it even more cloud-like
  • vSAN 8 – also has a number of new features. The first we discussed was is a faster data path. This means more IOPS, more bandwidth and lower latency for IOs. Next, vSAN 8 now supports single tier storage pools . These will no longer require a caching layer. This should also speed up IO operations (as long as the single tier is at least as fast as the old caching layer). They also announced faster snapshots. Apparently this has been a problem in the past and they’ve done the work to speed this up considerably. Jason mentioned an AMD open source VM migration tool (from somebody else’s X86 CPUs to AMDs) that depends a lot on vSAN snapshots.
  • Cloud Flex Storage – mentioned at the show but not well explained, Jason and I speculated that this was an internal storage service available on for Cloud Foundation users on AWS where customers could subscribe to storage as-a-service in much lower increments (maybe even GB/month) than standing up more vSAN hosts to increase storage.
  • NetApp FsX (ONTAP) storage – along the same line, VMware announced support for NetApp’s FsX as yet another storage option for Cloud Foundation users on AWS. Supplying yet another storage-as-a-service option for this environment.
  • Cloud Flex Compute – also mentioned at the show was their new Compute-As-A-Service for Cloud Foundation users on AWS. This way users could subscribe to more or less compute, on an as needed basis rather than having to spin up new ESXi hosts. I later found out this allows users to run a single VM and pay for it on a subscription basis.
  • Tanzu Application Platform (TAP) – is a new VMware supplied (and supported) “development experience” for K8s on vSphere. Note, it doesn’t include any advanced Tanzu services such as Tanzu K8s Grid (TKG) so it’s a true DevOps bare bones environment.
  • Tanzu K8S Operations (TKO) – another new Tanzu based service which offers operations complete control over the Tanzu services running on vSphere. Note Tanzu Mission Control (TMC) is not part of TKO.
  • Aria management – VMware rebranded vRealize and CloudHealth, which now comes in 3 bundles, Aria Cost (CloudHealth+), Aria Operations and Aria Automation. Which are all built onto of Aria Graph that graphs all the nodes in your VMware clusters with all their connections so that Aria management can traverse this graph to find out what’s where. On top of Aria Graph are Aria Hub, Aria Insights, and Aria Guardrails (sort of like providing boundary’s where services can be deployed).

They also announced Ransomware Recovery [changed 7Sep22, the Eds] as a Service which builds on VMware’s DR-aaS announced last year and Tanzu now works with Red Hat OpenShift

We also discussed the show. I heard somewhere there were 10K people there, Jason heard somewhere between 6K and 9K. In any case much smaller than VMworlds prior to Covid (25kish). And of course the rebranding of the show seemed counter-intuitive at best.

The show floor was much smaller than usual, (not withstanding Keith’s Airstream RV exhibit). And there were a number of storage vendors not at the show?? There was less hardware on the show floor, this could be a Covid thing but there were just as many mini-white boards/class rooms per large exhibiter, so don’t think it was because of Covid.

But the elephant in the room was Broadcom’s acquisition of VMware. At one of the analyst briefings I asked an exec about attrition. He made a couple of comments but in the end said VMware has been bought and sold before and has always come out of it in better shape. This will be no different.

That’s about all from the show.

And Thanks again to Keith and his crew, for lending us his studio to record the show. It’s been a while since I’ve seen an RV on a show floor. Keith seemed to have a ball with it

Tell us how you like our video. If everyone is for it we could do something like this with a Zoom (in this case Zencastr) recording, Or just try this at the next joint conference. .

Jason Collier, Principle Member of Technical Staff at AMD

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology.

He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He’s on LinkedIN. He’s currently working with AMD on new technology and he has been a GreyBeards on Storage co-host since the beginning of 2022

136: Flash Memory Summit 2022 wrap-up with Tom Coughlin, President, Coughlin Assoc.

We have known Tom Coughlin (@thomascoughlin), President, Coughlin Associates for a very long time now. He’s been an industry heavyweight almost as long as Ray (maybe even longer). Tom has always been very active in storage media, storage drives, storage systems and memory as well as active in the semiconductor space. All this made him a natural to perform as Program Chair at Flash Memory Summit (FMS)2022, so it’s great to have on the show to talk about the conference.

Just prior to the show, Micron announced that they had achieved 232 layer 3D NAND(in sampling methinks). Which would be a major step on the roadmap to higher density NAND. Micron was not at the show, but held an event at Levi stadium, not far from the conference center.

During a keynote, SK Hynix announced they had achieved 238 layer NAND, just exceeding Micron’s layer count. Other vendors at the show promised more layers as well but also discussed different ways other than layer counts to scale capacity, such as shrinking holes, moving logic, logical (more bits/cell) scaling, etc. PLC (5 bits/cell) was discussed and at least one vendor mentioned 6LC (not sure there’s a name yet but HxLC maybe?). Just about any 3D NAND is capable of logical scaling in bits/cell. So 200+ layers will mean more capacity SSDs over time.

The FMS conference seems to be expanding beyond Flash into more storage technologies as well as memory systems. In fact they had a session on DNA storage at the show.

In addition, there was a lot of talk about CXL, the new shared memory standard which supports shared memory over PCIe at FMS2022. PCIe is becoming a near universal connection protocol and is being used for 2d scaling of chips as a chip to chip interconnect as well as distributed storage and shared memory interconnect.

The CXL vision is that servers will still have DDR DRAM memory but they can share external memory systems. With shared memory systems in place memory, memory could be pooled and aggregated into one large repository which could then be carved up and parceled out to servers to support the workload dejour. And once those workloads are done, recarved up for the next workload to come. Almost like network attached storage only in this world its network attached memory.

Tom mentioned that CXL is starting to adopting other memory standers such as the Open Memory Interface (OMI) which has also been going on for a while now.

Moreover, CXL can support a memory hierarchy, which includes different speed memories such as DRAM, SCM, and SSDs. If the memory system has enough smarts to keep highly active data in the highest speed devices, an auto-tiering, shared memory pool could provide substantial capacities (10s-100sTB) of memory at a much reduced cost. This sounds a lot like what was promised by Optane.

Another topic at the show was Software Enabled/Defined Flash. There are a few enterprise storage vendors (e.g., IBM, Pure Storage and Hitachi) that design their own proprietary flash devices, but with SSD vendors coming out with software enabled flash, this should allow anyone to do something similar. Much more to come on this. Presumably, the hyper-scalers are driving this but having software enabled flash should benefit the entire IT industry.

The elephant in the room at FMS was Intel’s winding down of Optane. There were a couple of the NAND/SSD vendors talking about their “almost” storage class memory using SLC and other NAND tricks to provide Optane like performance/endurance using NAND storage.

Keith mentioned a youtube clip he saw where somebody talked about an Radeon Pro SSG ( (AMD GPU that had M.2 SSDs attached to it). And tried to show how it improved performance for some workloads (mostly 8k video using native SSG APIs). He replaced the old M.2 SSDs with newer ones with more capacity which increased the memory but it still had many inefficiencies and was much slower than HBM2 memory or VRAM. Keith thought this had some potential seeing as how in memory databases seriously increase performance but as far as I could see the SSG and it’s moded brethren died before it reached that potential.

As part of the NAND scaling discussion, Tom said one vendor (I believe Samsung) mentioned that by 2030, with die stacking and other tricks, they will be selling an SSD with 1PB of storage behind it. Can’t wait to see that.

By the way, if you are an IEEE member and are based in the USA, Tom is running for IEEE USA president this year, so please vote for him. It would be nice having a storage person in charge at IEEE.

Thomas Coughlin, President Coughlin Associates

Tom Coughlin, President, Coughlin Associates is a digital storage analyst and business and technology consultant. He has over 40 years in the data storage industry with engineering and senior management positions at several companies. Coughlin Associates consults, publishes books and market and technology reports (including The Media and Entertainment Storage Report and an Emerging Memory Report), and puts on digital storage-oriented events.

He is a regular storage and memory contributor for forbes.com and M&E organization websites. He is an IEEE Fellow, Past-President of IEEE-USA, Past Director of IEEE Region 6 and Past Chair of the Santa Clara Valley IEEE Section, Chair of the Consultants Network of Silicon Valley and is also active with SNIA and SMPTE.

For more information on Tom Coughlin and his publications and activities go to

135: Greybeard(s) talk file and object challenges with Theresa Miller & David Jayanathan, Cohesity

Sponsored By:

I’ve known Theresa Miller, Director of Technology Advocacy Group at Cohesity, for many years now and just met David Jayanathan (DJ), Cohesity Solutions Architect during the podcast. Theresa could easily qualify as an old timer, if she wished and DJ was very knowledgeable about traditional file and object storage.

We had a wide ranging discussion covering many of the challenges present in today’s file and object storage solutions. Listen to the podcast to learn more.

IT is becoming more distributed. Partly due to moving to the cloud, but now it’s moving to multiple clouds and on prem has never really gone away. Further, the need for IT to support a remote work force, is forcing data and systems that use them, to move as well.

Customers need storage that can reside anywhere. Their data must be migrate-able from on prem to cloud(s) and back again. Traditional storage may be able to migrate from one location to a select few others or replicate to another location (with the same storage systems present), but migration to and from the cloud is just not easy enough.

Moreover, traditional storage management has not kept up with this widely disbursed data world we live in. With traditional storage, customers may require different products to manage their storage depending on where data resides.

Yes, having storage that performs, provides data access, resilience and integrity is important, but that alone is just not enough anymore.

And to top that all off, the issues surrounding data security today have become just too complex for traditional storage to solve alone, anymore. One needs storage, data protection and ransomware scanning/detection/protection that operates together, as one solution to deal with IT security in today’s world

Ransomware has rapidly become the critical piece of this storage puzzle needing to be addressed. It’s a significant burden on every IT organization today. Some groups are getting hit each day, while others even more frequently. Traditional storage has very limited capabilities, outside of snapshots and replication, to deal with this ever increasing threat.

To defeat ransomware, data needs to be vaulted, to an immutable, air gapped repository, whether that be in the cloud or elsewhere. Such vaulting needs to be policy driven and integrated with data protection cycles to be recoverable.

Furthermore, any ransomware recovery needs to be quick, easy, AND securely controlled. RBAC (role-based, access control) can help but may not suffice for some organizations. For these environments, multiple admins may need to approve ransomware recovery, which will wipe out all current data by restoring a good, vaulted copy of the organizations data.

Edge and IoT systems also need data storage. How much may depend on where the data is being processed/pre-processed in the IoT system. But, as these systems mature, they will have their own storage requirements which is yet another data location to be managed, protected, and secured.

Theresa and DJ had mentioned Cohesity SmartFiles during our talk which I hadn’t heard about. Turns out that SmartFiles is Cohesity’s file and object storage solution that uses the Cohesity storage cluster. Cohesity data protection and other data management solutions also use the cluster to store their data. Adding SmartFiles to the mix, brings a more complete storage solution to support customer data needs. .

We also discussed Helios, Cohesity’s, next generation, data platform that provides a control and management plane for all Cohesity products and services,.

Theresa Miller, Director, Technology Advocacy Group, Cohesity

Theresa Miller is the Director, Technology Advocacy Group at Cohesity.  She is an IT professional that has worked as a technical expert in IT for over 25 years and has her MBA.

She is uniquely industry recognized as a Microsoft MVP, Citrix CTP, and VMware vExpert.  Her areas of expertise include Cloud, Hybrid-cloud, Microsoft 365, VMware, and Citrix.

David Jayanathan, Solutions Architect, Cohesity

David Jayanathan is a Solutions Architect at Cohesity, currently working on SmartFiles. 

DJ is an IT professional that has specialized in all things related to enterprise storage and data protection for over 15 years.