163: GreyBeards talk Ultra Ethernet with Dr J Metz, Chair of UEC steering committee, Chair of SNIA BoD, & Tech. Dir. AMD

Dr J Metz (@drjmetz, blog) has been on our podcast before, mostly in his role as SNIA spokesperson and BoD Chair, but this time he's here discussing some of his latest work at the Ultra Ethernet Consortium (UEC) (LinkedIn: @ultraethernet, X: @ultraethernet).

The UEC is a full stack re-think of what Ethernet could do for large single application environments. UEC was originally focused on HPC, with 400-800 Gbps networks and single applications like simulating a hypersonic missile or airplane. But with the emergence of GenAI and LLMs, UEC could also be very effective for large AI model training with massive clusters doing a single LLM training job over months. Listen to the podcast to learn more.

The UEC is outside the realm of normal enterprise environments. But as AI training becomes more ubiquitous, who knows, UEC may yet find a place in the enterprise. However, it's not intended for mixed network environments with multiple applications. It's a single-application network.

One wouldn't think HPC would be a big user of Ethernet for its main network. But Dr J pointed out that the top 3 systems on the HPC Top500 all use Ethernet, and more are looking to use it in the future.

UEC is essentially an optimized software stack and hardware for networking used by single-application environments. These types of workloads are constantly pushing the networking envelope. And by taking advantage of the "special networking personalities" of these workloads, UEC can significantly reduce networking overheads, boosting bandwidth and speeding up workload execution.

The scale of these networks is extreme. The UEC is targeting up to a million endpoints, across >100K servers, with each network link >100Gbps and more likely 400-800Gbps. With the new (AMD and others) networking cards coming out that support four 400/800Gbps network ports, having a pair of these on each server in a 100K-server cluster gives you 800K endpoints. A million is not that far away when you think of it at that scale.
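A quick back-of-the-envelope check on that endpoint math, using the illustrative numbers from the discussion (100K servers, a pair of 4-port NICs per server; these are not UEC sizing rules, just the example cited above):

```python
# Illustrative endpoint arithmetic -- assumed server/NIC counts, not a UEC sizing rule.
servers = 100_000
nics_per_server = 2      # a pair of networking cards per server
ports_per_nic = 4        # each card with four 400/800Gbps ports

endpoints = servers * nics_per_server * ports_per_nic
print(f"{endpoints:,} network endpoints")   # 800,000 -- a million isn't far off
```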

Moreover, LLM training and HPC work are starting to look more alike these days. Yes, there are differences, but the scale of their clusters is similar, and the way work is sometimes fed to them is similar, which leads to similar networking requirements.

UEC is attempting to handle a 5% problem. That is, 95% of users will not have 1M endpoints in their LAN, but maybe 5% will, and for that 5%, support for a mixed networking workload is unnecessary. In fact, a mixed network becomes a burden, slowing down packet transmission.

UEC is finding that with a few select networking parameters, almost like workload fingerprints, network stacks can be optimized well beyond current Ethernet, thereby reducing packet overheads and increasing bandwidth.

AI and HPC networks share a very limited set of characteristics which can be used as fingerprints. These characteristics are things like reliable or unreliable transport, ordered or unordered delivery, multi-path packet spraying or not, etc. With a set of these types of parameters, selected for an environment, UEC can optimize a network stack to better support a million networking endpoints.
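To make the "fingerprint" idea concrete, here's a minimal sketch of such a parameter set expressed as a profile object. The field names and the two example profiles are our own illustration, not the UEC specification's actual profile definitions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransportProfile:
    """Illustrative 'workload fingerprint' -- not the UEC spec's actual profiles."""
    reliable_delivery: bool    # must the fabric guarantee delivery, or can the app tolerate loss?
    ordered_delivery: bool     # must packets arrive in order, or can the endpoint reorder?
    multipath_spraying: bool   # spread a flow's packets across all available paths?

# Two hypothetical fingerprints. The point is that fixing a handful of such
# parameters up front lets the stack skip per-packet work it would otherwise do.
AI_TRAINING = TransportProfile(reliable_delivery=True, ordered_delivery=False, multipath_spraying=True)
HPC_BULK    = TransportProfile(reliable_delivery=False, ordered_delivery=False, multipath_spraying=True)
```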

We asked where CXL fits in with UEC. Dr J said it could potentially be an entity on the network, but he sees CXL more as a solution within a server, or between a tight (limited) cluster of servers, rather than something on a UEC network.

Just 12 months ago the UEC had 10 members or so and this past week they were up to 60. UEC seems to have struck a chord.

The UEC plans to release a 1.0 specification near the end of this year. UEC 1.0 is intended to operate on current (>100Gbps) networking equipment with firmware/software changes.

Considering the UEC was just founded in 2023, putting out their 1.0 technical spec within 1.5 years is astonishing. It also speaks volumes about the interest in the technology.

The UEC has a blog post which talks more about UEC 1.0 specification and the technology behind it.

Dr J Metz, Chair of UEC Steering Committee, Chair of SNIA BoD, Technical Director of Systems Design, AMD

J works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the inner workings and application of emerging technologies.

J has previously held roles in both startups and Fortune 100 companies as a Field CTO,  R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.

J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.

139: GreyBeards talk HPC file systems with Marc-André Vef and Alberto Miranda of GekkoFS

In honor of the SC22 conference this month in Dallas, we thought it time to check in with our HPC brethren to find out what's new in storage for their world. We happened to see that IO500 had some recent (ISC22) results using a relative newcomer, GekkoFS (@GekkoFS). So we reached out to the team to find out how they managed to crack into the top 10. We contacted Marc-André Vef (@MarcVef), a Ph.D. student at Johannes Gutenberg University Mainz, and Alberto Miranda (@amiranda_hpc), Ph.D., of the Barcelona Supercomputing Center, two of the authors of the GekkoFS paper.

GekkoFS is a new burst file system that is tailor-made to create, process and tear down scratch data sets for HPC workloads. It turns out that HPC does lots of work using scratch files as working data sets. Burst file systems typically use another parallel file system to (stage) read (permanent) data into the scratch files and write (permanent) result data out. But during processing, the burst file system handles all scratch data access. Listen to the podcast to learn more.

We had never heard of a burst file system before, but it's been around for a while now in HPC. For example, BeeGFS provides one (check out our GreyBeards podcast on BeeGFS). BeeGFS supports both a PFS (parallel file system) and a burst file system; GekkoFS only offers a burst file system.

GekkoFS is a distributed burst file system which operates across nodes to stitch together a single global file system. GekkoFS is strictly open source at the moment and can be downloaded (see: GekkoFS GitLab) and used by anyone.

They are considering offering professional support in the future, but at the moment, if you have an issue, Marc and Alberto suggest you use the GekkoFS GitLab issue tracking system to tell them about it.

Turns out Lustre, IBM Spectrum Scale, DAOS and other HPC file systems take gobs of overhead to create scratch files. And even though it takes a lot of IO to load scratch file data and write out results, there’s a whole lot more IO that gets done to scratch files during HPC jobs.

This sort of IO also occurs for AI/ML/DL, where training data is staged into a sort of scratch area (typically in memory, depending on size) and then repeatedly (re-)processed there. GekkoFS can offer significant advantages to AI/ML/DL work when training data is very large. Normally, without a burst file system, one would need to shard this data across nodes and then deal with the partial training that results. But with GekkoFS, all you need to do is stage it into the burst file system and read it from there.

GekkoFS is partially POSIX compliant. They install a client-side interposer library that intercepts those POSIX requests destined for GekkoFS files.

GekkoFS has no central metadata server, which means that all nodes in the GekkoFS cluster provide metadata services. Filenames are hashed to tell GekkoFS which node holds their (metadata and) data.
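A minimal sketch of that hash-based placement, assuming a fixed list of nodes: every client computes the same owner for a given path, so no central metadata server is needed. (GekkoFS's actual hash function and distribution scheme are its own; this just shows the general technique.)

```python
import hashlib

NODES = ["node00", "node01", "node02", "node03"]   # hypothetical cluster membership

def owning_node(path: str, nodes=NODES) -> str:
    """Hash the file path to pick the node that holds its metadata (and data)."""
    digest = hashlib.sha256(path.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

# Every client computes the same answer without asking a central server.
print(owning_node("/scratch/job42/checkpoint.0001"))
```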

GekkoFS stores its data and metadata on local disks, SSDs, or in-memory (tmpfs) storage. All local node storage in the cluster is stitched together into a single global file system.

GekkoFS supports strict consistency for IO and file creation/deletion within nodes. They use an internal transaction database to enforce this strict consistency.

Across nodes, they support eventual consistency, which means files created on one node may not be immediately viewable/accessible by other nodes in the cluster for a short period of time while (meta)data updates are propagated across the cluster.

As part of their consistency paradigm, GekkoFS doesn't support directory locking. Jason mentioned that HPC "ls" (directory listing) commands can sometimes take forever due to directory locking. No directory locking makes ls commands faster but may show inconsistent results (due to eventual consistency).

We had some discussion on this lack of directory locking and eventual consistency in file systems, but we agreed to disagree. They did say that for HPC (and probably AI/ML/DL) workloads, their approach seems appropriate, as those workloads are way more read intensive than write intensive.

In any case, they must be doing something right as they have a screaming scratch file system for HPC work.

Marc will be attending SC22 in Dallas this month, so if you're attending please look him up and say hello from us.

Marc-André Vef, Ph.D. student

Marc-André Vef is a Ph.D. candidate at the Johannes Gutenberg University Mainz. He started his Ph.D. in 2016 after receiving his B.Sc. and M.Sc. degrees in computer science from the Johannes Gutenberg University Mainz. His master’s thesis was in cooperation with IBM Research about analyzing file create performance in the IBM Spectrum Scale parallel file system (formerly GPFS).

During his Ph.D., he has worked on several projects focusing on file system tracing (in collaboration with IBM Research) and distributed file systems, among others. Most notably, he designed two ad-hoc distributed file systems: DelveFS (in collaboration with OpenIO), which won the Best Paper in its category, and GekkoFS (in collaboration with the Barcelona Supercomputing Center). GekkoFS placed fourth in its first entry in the 10-node challenge of the IO500 benchmark. The file system is actively developed in the scope of the EuroHPC ADMIRE project.

His research interests focus on file systems and system analytics.

Alberto Miranda, Ph.D., Senior Researcher, Barcelona Supercomputing Center

Dr. Eng. Alberto Miranda is a Senior Researcher in advanced storage systems in the Computer Science Department of the Barcelona Supercomputing Center (BSC) and co-leader of the Storage Systems Research Group since 2019. Dr. Eng. Miranda received a diploma in Computer Engineering (2004), an M.Sc. degree in Computer Science (2006), and an M.Sc. degree in Computer Architectures, Networks and Systems (2008) from the Technical University of Catalonia (UPC-BarcelonaTech). He later received a Ph.D. degree Cum Laude in Computer Science from the Technical University of Catalonia in 2014 with his thesis "Scalability in Extensible and Heterogeneous Storage Systems".

His current research interests include efficient file and storage systems, operating systems, distributed system architectures, as well as information retrieval systems. Since he started his work at BSC in 2007, he has published 14 papers in international conferences and journals, as well as 5 white papers and technical reports and 1 book chapter. Dr. Eng. Miranda is currently involved in several European and national research projects and has participated in competitively funded EU projects XtreemOS, IOLanes, Prace2IP, IOStack, Mont-Blanc 2, EUDAT2020, Mont-Blanc 3, and NEXTGenIO.

122: GreyBeards talk big data archive with Floyd Christofferson, CEO StrongBox Data Solutions

The GreyBeards had a great discussion with Floyd Christofferson, CEO, StrongBox Data Solutions on their big data/HPC file and archive solution. Floyd is very knowledgeable about the problems of extremely large data repositories and has been around the HPC and other data-intensive industries for decades.

StrongBox's StrongLink solution offers a global namespace file system that virtualizes NFS, SMB, S3 and POSIX file environments and maps these to a software-only, multi-tier, multi-site data repository that can span onsite flash, disk, S3-compatible or Azure object, and LTFS tape library storage, as well as offsite versions of all the above tiers.

Typical StrongLink customers range from 10s to 100s of PB, ingesting or processing PBs a day. 200TB is a minimum StrongLink configuration, though Floyd said any shop with over 500TB has problems with data silos and other issues, even if they may not realize it yet. StrongLink manages data placement and movement throughout this hierarchy to better support data access and economical storage. In the process, StrongLink eliminates any data silos due to limitations of NAS systems while providing the most economical placement of data to meet user performance requirements.
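To make "policy-driven placement across the hierarchy" a bit more concrete, here's a minimal sketch of what an age-based placement policy could look like. The tier names and thresholds are purely our own illustration, not StrongLink's actual policy language:

```python
# Hypothetical tiering policy -- illustrative only, not StrongLink's policy syntax.
PLACEMENT_POLICY = [
    # (maximum days since last access, tier that should hold the data)
    (7,    "onsite-flash"),   # hot working set stays on flash
    (90,   "onsite-disk"),    # warm data moves to disk
    (365,  "s3-object"),      # cool data lands on object storage
    (None, "ltfs-tape"),      # everything older goes to LTFS tape
]

def target_tier(days_since_access: int) -> str:
    """Return the tier the policy says this data should live on."""
    for max_age, tier in PLACEMENT_POLICY:
        if max_age is None or days_since_access <= max_age:
            return tier
    return PLACEMENT_POLICY[-1][1]

print(target_tier(30))    # -> "onsite-disk"
print(target_tier(1000))  # -> "ltfs-tape"
```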

Floyd said that StrongLink first installs in the customer's environment and then operates in the background to discover and ingest metadata from the customer's primary file storage environment. At some point later, the customer reconfigures their end-users' share and mount points to the StrongLink servers, and it's up and running.

The minimal StrongLink HA environment consists of 3 nodes. They use a NoSQL metadata database which is replicated and sharded across the nodes. It's sharded for performance/load balancing and fully replicated (2-way or 3-way) across all the StrongLink server nodes for HA.

The StrongLink nodes create a cluster, called a star in StrongBox vernacular. Multiple clusters onsite can be grouped together to form a StrongLink constellation. And multiple data center sites can be grouped together to form a StrongLink galaxy. Presumably, if you have a constellation or a galaxy, the same metadata is available to all the star clusters across all the sites.

They support any tape library and any NFS, SMB, S3- or Azure-compatible object or file storage. StrongLink can move or copy data from one tier/cluster to another based on policies AND end users never see any difference in their workflow or mount/share points.

One challenge with typical tape archives is that they can make use of proprietary tape data formats which are not accessible outside those systems. StrongLink has gone with a completely open-source, LTFS file format on tape, which is well documented and is available to anyone.

Floyd also made it a point of saying they don’t use any stubs, or soft links to provide their data placement magic. They only use standard file metadata.

File data moves across the hierarchy based on policies or by request. One of the secrets to StrongLink success is all the work they have done to ensure that any data movement can occur at line rate speeds. They heavily parallelize any data movement that’s required to support data placement across as many servers as the customer wants to throw at it. StrongBox services will help right-size the customer deployment to support any data movement performance that is required.
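As a rough illustration of the general technique of parallelizing bulk data movement, here's a toy worker-pool copier. This is only a sketch of the idea (more concurrent streams, more aggregate throughput), not StrongLink's actual mover engine:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def move_files(files, dest_dir: str, workers: int = 16):
    """Copy many files concurrently so the aggregate transfer can approach line rate."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def copy_one(src: str) -> str:
        target = dest / Path(src).name
        shutil.copy2(src, target)   # copy2 also preserves file metadata
        return str(target)

    # More workers (or, in a real system, more mover nodes) -> more parallel streams.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, files))
```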

StrongLink supports up to 3-way replication of a customer’s data archives. This supports a primary archive and 3 more replicas of data.

Floyd mentioned a couple of big customers:

  • One autonomous automobile supplier was downloading 2PB of data from cars in the field, processing this data, and then moving it off their servers to get ready for the next day's data load.
  • Another weather science research organization had 150PB of data in an old tape archive, and they brought in StrongLink to migrate all this data off and onto LTFS tape format, as well as to support their research activities, which entail staging a significant chunk of file data on research servers to do a climate run/simulation.

NASA, another StrongLink customer, operates slightly differently than the above, in that they have integrated StrongLink functionality directly into their applications by making use of StrongBox’s API.

StrongLink can work in three ways.

  • Using normal file access services where StrongLink virtualizes your NFS, SMB, S3 or POSIX file environment. For this service, StrongLink is in the data path and you can use policy-based management to have data moved or staged as the need arises.
  • Using StrongLink CLI to move or copy data from one tier to another. Many HPC customers use this approach through SLURM scripts or other orchestration solutions.
  • Using StrongLink API to move or copy data from one tier to another. This requires application changes to take advantage of data placement.

StrongBox customers can, of course, use all three modes of operation at the same time for their StrongLink data galaxy. StrongLink is billed by CPU/vCPU level and not by the amount of data customers throw into the archive. This gives customers a flat expense once StrongLink is deployed, at least until they decide to modify their server configuration.

Floyd Christofferson, CEO StrongBox Data Solutions

As a professional involved in content management and storage workflows for over 25 years, Floyd has focused on methods and technologies needed to manage massive volumes of data across many different storage types and use cases.

Prior to joining SBDS, Floyd worked with software and hardware companies in this space, including over 10 years at SGI, where he managed storage and data management products. In that role, he was part of the team that provided solutions used in some of the largest data environments around the world.

Floyd’s background includes work at CBS Television Distribution, where he helped implement file-based content management and syndicated content distribution strategies, and Pathfire (now ExtremeReach), where he led the team that developed and implemented a satellite-based IP-multicast content distribution platform that manages delivery of syndicated content to nearly 1,000 TV stations throughout the US.

Earlier in his career, he ran Potomac Television, a news syndication and production service in Washington DC, and Manhattan Center Studios, an audio, video, graphics, and performance facility in New York.

117: GreyBeards talk HPC file systems with Frank Herold, CEO of ThinkParQ, makers of BeeGFS

We return to our storage thread with a discussion of HPC file systems with Frank Herold, (@BeeGFS) CEO of ThinkParQ GmbH, the makers of BeeGFS. I've seen BeeGFS start to show up in some IO500 top storage benchmark results, and as more and more data keeps coming online every day, we thought it time to start finding out how our friends in the HPC world handle their data deluge.

Frank's a former rocket scientist who's been in and around the storage industry for years and was very knowledgeable about BeeGFS's software-defined, parallel file system. He seemed to have a great grasp of the IO requirements in HPC, Life Sciences and other HPC-like applications. Listen to the podcast to learn more.

Turns out that ThinkParQ is a spinoff of the German research institute that originally developed the BeeGFS parallel file system. There are apparently two versions of their product: one which is publicly available (downloadable from their website) and another with commercial support. It's not quite 100% open source, but it's got a lot of open source in it, and their Git repository is available.

BeeGFS was primarily focused on HPC workloads, but as this type of work has become more mainstream, they have moved beyond HPC and now have significant installations in Life Sciences, Oil & Gas and many other big data environments.

It runs on x86/AMD, OpenPOWER, and ARM CPUs. BeeGFS comes as a number of services, one of which is a storage service that uses a ZFS or XFS file system as its backend. It also uses (POSIX compliant) host client software to access the system. There are also metadata and monitoring services. Most of the time these services run on separate servers, but BeeGFS also supports a "converged mode", where all these services run on a single server. And you can have multiple converged-mode servers in a cluster.

BeeGFS is a parallel file system. This means that it intrinsically supports multiple metadata services/servers and multiple storage servers, which allows it to scale storage bandwidth and performance considerably beyond single-appliance systems. Data is automatically distributed across all the storage servers in the configuration, unless you specify that data reside on specific (say, all-flash) storage servers. Similarly, metadata is automatically distributed across all metadata servers in the system.
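To picture how distributing a file's data across multiple storage servers raises aggregate bandwidth, here's a toy round-robin chunk-placement sketch. The chunk size and target names are made up; BeeGFS's real striping (chunk size, number of targets per file, target choice) is configurable and more sophisticated:

```python
CHUNK_SIZE = 1 << 20   # 1 MiB chunks -- illustrative, not BeeGFS's default
STORAGE_TARGETS = ["storage01", "storage02", "storage03", "storage04"]

def chunk_placement(file_size: int, targets=STORAGE_TARGETS, chunk=CHUNK_SIZE):
    """Round-robin each chunk of a file across the storage targets."""
    placement = []
    for i in range((file_size + chunk - 1) // chunk):   # number of chunks, rounded up
        placement.append((i, targets[i % len(targets)]))
    return placement

# A 4 MiB file ends up with one chunk on each of the four targets,
# so reads and writes can hit all four servers in parallel.
print(chunk_placement(4 * (1 << 20)))
```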

They don't support any specific RAID protection other than mirroring, and that's really there to speed up read throughput. Rather, they depend on the underlying XFS/ZFS file system to provide drive failure protection (RAID5/6).

One of BeeGFS’s selling points is that it has few tuning parameters that a customer needs to fiddle with. Frank said it runs quite well right out of the box.

BeeGFS offers a single name space that spans the cluster (of metadata servers/storage servers). But customers can elect to split this name space across a subset of these metadata and storage servers, and by doing so they create multiple BeeGFS clusters.

There's no inherent support for NFS or SMB, but customers can configure NFS or Samba servers that use BeeGFS as backend storage. Also, there's no data reduction built into BeeGFS and no automatic data tiering across the backend storage (file systems).

But as noted above, customers can direct which backend storage is used to hold their data. And they do offer a CLI data movement primitive, which customers can use in conjunction with other software to implement storage tiering themselves.

Metadata performance is extremely important for small files and for large, multi-billion-object file systems. BeeGFS uses extensive metadata caching to provide faster access to this information.
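As a minimal illustration of the metadata-caching idea (the general technique only, not how BeeGFS actually implements its cache), consider memoizing stat() lookups so repeated hits on hot paths skip the round trip to a metadata service:

```python
from functools import lru_cache
import os

@lru_cache(maxsize=1_000_000)
def cached_stat(path: str) -> os.stat_result:
    """Cache stat() results; os.stat here stands in for a metadata-server round trip."""
    return os.stat(path)

# First call pays the lookup cost; repeats are served from the local LRU cache.
# (A real file system cache also has to handle invalidation when metadata changes.)
info = cached_stat(".")
print(info.st_size, cached_stat.cache_info())
```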

Speaking of small file performance, we had a decent discussion on the tradeoffs involved between small and large file performance. And although BeeGFS has decent small file performance, it's not a be-all for every small-file-intensive application. According to Frank, not every small file workload is optimal for BeeGFS.

They offer BeeOND, which is BeeGFS On Demand. This is an integration with the Slurm workload scheduler (an HPC job scheduler) that allows customers to spin up a scratch BeeGFS parallel file system across compute servers with storage.

Slurm's BeeOND integration brings all BeeGFS services up and deploys them on the compute nodes you specify. At that point you have a fully installed BeeGFS (scratch) parallel file system. Customers may use this scratch file system to support any compute/data-intensive workload they need to run. When it's no longer needed, Slurm can be directed to automatically dismantle the BeeGFS file system.

We talked about BeeGFS partners. They have a number of regional partners that provide installation and onsite support and a number of technical partners, such as NetApp, Dell, HPE and INSPUR, that supply BeeGFS configured servers and systems for deployment/installation.

Frank Herold, CEO ThinkparQ

Frank Herold is the CEO of ThinkParQ GmbH – the company behind BeeGFS. He actively leads the company and the product strategy of BeeGFS as a global player for parallel high-performance file systems.

Prior to joining ThinkParQ, he held various senior management positions within ADIC and Quantum Corporation, responsible for market segments within the academic and scientific research, oil and gas, broadcast and video surveillance sectors, focusing on large scale, high-performance and enterprise accounts within EMEA. 

Frank has over 25 years of experience in the IT industry and holds a master’s degree in engineering (Dipl. -Ing.) in rocket science.