144: GreyBeards talk AI IO with Subramanian Kartik & Howard Marks of VAST Data


Today we talked with VAST Data’s Subramanian Kartik (@phyzzycyst), Global Systems Engineering Lead, and Howard Marks (@DeepStorage@mastodon.social, @deepstoragenet), former GreyBeards co-host and now Technologist Extraordinary & Plenipotentiary at VAST. Howard needs no introduction to our listeners, but Kartik does. Kartik has supported a number of customers implementing AI apps at VAST and prior companies, so he is well versed in the realities of AI/ML/DL. Moreover, VAST recently funded Silverton Consulting to write a paper discussing Deep Learning IO.

Although AI/ML/DL applications have become very popular in IT these days, understanding their IO requirements remains a continuing challenge. Listen to the podcast to learn more.

AI/ML/DL neural network (NN) models train on data, and lots of it, while inferencing is also very data dependent. Kartik said AI model IO consists of small block, random reads with very few writes.
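To make that access pattern concrete, here’s a minimal, hypothetical PyTorch-style data loader sketch (the directory layout, path and parameters are our assumptions, not anything from the episode). With shuffling enabled, each epoch fetches samples in random order, which shows up at the storage layer as lots of small, random reads and essentially no writes:

```python
import os
from torch.utils.data import Dataset, DataLoader

class SampleFileDataset(Dataset):
    """Toy dataset: one small file per training sample (layout is hypothetical)."""
    def __init__(self, root_dir):
        self.paths = sorted(os.path.join(root_dir, f) for f in os.listdir(root_dir))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Each fetch is one small read of a single sample file, chosen by the sampler.
        with open(self.paths[idx], "rb") as f:
            return f.read()

# shuffle=True makes the sampler pick indices at random every epoch, so the
# storage system sees many small-block, random reads and essentially no writes.
loader = DataLoader(SampleFileDataset("/mnt/training_data"),
                    batch_size=256, shuffle=True, num_workers=8)

for batch in loader:
    pass  # decode + forward/backward pass on the GPU would go here
```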

Some models contain huge NNs which consume mountains of data to train, while others are relatively small and consume much less. GPT-3(.5), the model behind the original ChatGPT, has ~175B parameters in its ~800GB NN.
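As a rough back-of-the-envelope check (our arithmetic, not a figure from the episode), the parameter count alone accounts for most of that footprint:

$$175 \times 10^{9}\ \text{parameters} \times 4\ \text{bytes/parameter (FP32)} \approx 700\ \text{GB}$$

which lands in the same ballpark as the ~800GB figure once serialization and framework overhead are included.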

As many of us know, the key to AI processing is GPU hardware, which performs most, if not all, of the computations to train models and supply inferences. Moreover, to maximize training throughput, many organizations deploy model parallelism, using 10s to 1000s of GPUs.

For instance, in the paper mentioned earlier, we showed a model training IO chart based on all six storage vendors’ published NVIDIA DGX-A100 Reference Architecture reports for ResNet-50. On this single chart, all 6 storage systems supplied roughly the same images processed/sec (or ~IO bandwidth) performance to train the model on each of the 8, 16 & 32 GPU configurations. This is very unusual from our perspective, but it shows that ResNet-50 training is not IO bound.

However, another approach to speeding up NN training is to take advantage of newer, more advanced IO protocols. NVIDIA GPUDirect Storage transfers data directly from storage to GPU memory, bypassing CPU memory altogether, which can significantly speed up GPU data consumption. It turns out that one bottleneck for AI training is CPU memory bandwidth.

In addition, most AI model training reads data from a single file system mount point. Historically, an NFS mount point was limited to a single TCP connection and a maximum of ~2.5GB/sec of IO bandwidth. Recently, however, NConnect for NFS was introduced, which raises that limit to 16 TCP connections per mount point.

Beyond that, VAST Data found that by adding some code to Linux’s NFS TCP stack, they could increase NConnect to 64 TCP connections per compute node. Howard mentioned that with these changes and a 16-node VAST Data storage cluster, they sustained 175GB/sec of GPUDirect Storage bandwidth to DGX-A100 systems.

Subramanian Kartik, Global Systems Engineering Lead, VAST Data

Subramanian Kartik has been the Vice President of Systems Engineering at VAST Data since January of 2020, running the global presales organization. He is part of the incredible success of VAST Data, which has increased almost 10-fold in valuation and revenue over this period.

An accomplished technologist and executive in the industry, he has a wide array of experience in Cloud Architectures, AI/Machine Learning/Deep Learning, as well as in the Life Sciences, covering high-performance computing and storage. He has had a lifelong, deep passion for studying complex problems in all spheres, spanning both workloads and infrastructure at the vanguard of current-day technology.

Prior to his work at VAST Data, he was with EMC (later Dell) for two decades, as both a Distinguished Engineer and a global executive running the Converged and Hyperconverged Division go-to-market. He has a Ph.D. in Particle Physics with over 75 publications and 3 patents to his credit over the years. He enjoys mathematics, jazz, cooking and travelling with his family in his non-existent spare time.

Howard Marks, (former GreyBeards Co-Host) Technologist Extraordinary and Plenipotentiary, VAST Data

Howard Marks brings over forty years of experience as a technology architect for hire and industry observer to his role as VAST Data’s Technologist Extraordinary and Plenipotentiary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage, an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents, including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.

143: GreyBeards talk Chia crypto with Jonmichael Hands, VP Storage at Chia Project

Today we interview Jonmichael Hands (@LebanonJon, LinkedIn), VP Storage at Chia Project, who has been in and around the storage business forever, mostly with Intel and their SSD team before it was sold. He did technical marketing for NVMe, and he also ran the security and crypto track at FMS2022. He recently worked on sustainability, helping to create a circular economy for disk and SSD storage. Moreover, he assisted IEEE with their new (media) sanitization standard to make storage reuse/recycling easier.

Chia was born to provide a way to take advantage of storage media for blockchains in a government-compliant way, so that it could be spun off as a public company someday. Chia is a cryptocurrency that depends on proof of space (storage space exists) and proof of time (storage space is reserved for a period of time). There have been many crypto coins based on proof of work (running hard cryptographic algorithms to come up with some specific bit pattern). And ETH was forked last year to support proof of stake (where one stakes some amount of ETH for a defined period). But few, if any, have been based on proof of space and time.

Disk and SSD commands already exist to provide “Secure Erase” (multiple passes of different bit patterns overwriting the same blocks) and cryptographic erasure (for encrypted drives, the encryption key is changed). Both approaches ensure that customer/organization data is no longer retained on media leaving an organization’s control. And yet, many companies use secure erase/cryptographic erasure and still shred disk drives and SSDs, just to be sure that no data is retained. This is a vast waste of energy and resources.
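As a conceptual illustration only (a toy Python sketch using the cryptography package, not how a self-encrypting drive actually implements it), cryptographic erasure works because ciphertext written under one key becomes unrecoverable the instant that key is replaced:

```python
from cryptography.fernet import Fernet, InvalidToken

# Data written to a self-encrypting drive is stored only as ciphertext
# under the drive's media encryption key.
media_key = Fernet.generate_key()
ciphertext = Fernet(media_key).encrypt(b"customer database records")

# Cryptographic erase: the drive discards the old key and generates a new one.
# The ciphertext is still physically on the media, but it can no longer be read.
media_key = Fernet.generate_key()

try:
    Fernet(media_key).decrypt(ciphertext)
except InvalidToken:
    print("Data is unrecoverable without the original key -- effectively erased.")
```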

Jonmichael said that both disk and SSD drives typically have another 5 years beyond their guaranteed (5-year) production life where both can function perfectly well as storage devices (OK, performance may not be the same as current drives). And after using them for another 5 years, they are much easier to recycle if left un-shredded and returned to manufacturers, who can dismantle them to reuse expensive components and rare earth materials.

We didn’t spend much time on the technical underpinnings of Chia so if you are interested in that we suggest you check out Jonmichael’s FMS2022 presentation video.

But if you’re interested in a high level understanding of Chia and what one can do with it we did cover that. For example, Chia has farmers (not miners). Farmers create (~100GB) Chia plot files and store these on media.

Plot files take some amount of CPU power and memory to create, but once created can stay on storage forever. What makes Chia work is that the network periodically checks whether you still hold a given plot file and, if you do, you get rewarded for it (a simplified sketch of that loop follows below). Jonmichael said that with a typical Chia crypto setup, one could make $0.50/TB/month farming Chia.
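Here’s a heavily simplified, hypothetical sketch of that farming loop (real Chia plots are elaborate precomputed lookup tables, and the real protocol also involves proof of time; the function names and sizes below are ours). The point it illustrates is that plotting is done once, farming is cheap, and more stored plot data means better odds of answering a challenge:

```python
import hashlib
import os
import secrets

def make_plot(num_entries=100_000):
    """Plotting: burn CPU/memory once to precompute entries, then just store them."""
    seed = secrets.token_bytes(32)
    return [hashlib.sha256(seed + i.to_bytes(8, "big")).digest()
            for i in range(num_entries)]

def best_response(plot, challenge):
    """Farming: on each challenge, find the stored entry 'closest' to the challenge."""
    target = int.from_bytes(challenge, "big")
    return min(plot, key=lambda e: abs(int.from_bytes(e, "big") - target))

plot = make_plot()                      # done once, kept on disk for years
challenge = os.urandom(32)              # broadcast periodically by the network
proof = best_response(plot, challenge)  # the bigger your plots, the better your odds
print(proof.hex())
```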

The Chia project currently has about 24EB of plots online and at their peak had over 300EB. They also have 130K farmers in their current network. Bitcoin, at its peak, had about 60K miners. Jonmichael thinks Chia crypto coin may be the most distributed crypto coin in existence today.

A couple of years back Chia accounted for a significant amount of new disk drive purchases but that has died down considerably since then. As discussed earlier, Jonmichael is working to create a circular economy for storage that could lead to media reuse for Chia farming.

Jonmichael mentioned that Chia has matured significantly since peak use. It used to be that creating Chia plot files required high end CPUs and lots of technical skills, but today Jonmichael said you can be a farmer with an RPi. He did say that they have moved to making better use of available memory in the plotting process and have reduced the write load on the storage media.

Another aspect to Chia’s maturation is that they now support Chia smart coins or smart contracts. They have created ChiaLisp, a Turing complete language, as their language to implement Chia smart coins. It turns out that Lisp and other functional languages provide a natural way to implement secure code. Jonmichael mentioned that other crypto coins are starting to move towards using ChiaLisp.

Some recent innovations in Chia smart coins include:

  • Chia Offer Management – anything you wish to trade can be digitally tracked and traded using Chia Offer Management smart coins.
  • Chia NFT (non-fungible token) Management – NFTs have been used by other blockchains to sell digital rights to assets; Chia’s support for NFTs opens Chia up to this as well. The reference implementation for Chia’s NFT management is Chia Friends, where all proceeds are being donated to the Marmot Recovery Foundation.
  • Chia Data Layer Management, a federated database – here the Chia blockchain is used to support a K-V store, where the blockchain stores the key and a hash of the value. Users can use this Chia Data Layer to store any key-hash(value) database they wish. It’s important to realize that the actual data (the value) is stored external to the Chia blockchain (see the sketch after this list).
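Here’s a minimal, hypothetical Python sketch of that pattern (a dict stands in for the blockchain and the names are ours, not Chia’s actual Data Layer API): only the key and a hash of the value go “on chain”, the value itself lives in external storage, and readers verify what they fetch against the on-chain hash:

```python
import hashlib

on_chain = {}         # stand-in for the blockchain: stores key -> hash(value)
external_store = {}   # stand-in for off-chain storage: stores key -> value

def put(key: str, value: bytes) -> None:
    on_chain[key] = hashlib.sha256(value).hexdigest()  # only the hash goes on chain
    external_store[key] = value                        # the data itself stays off chain

def get(key: str) -> bytes:
    value = external_store[key]
    # Verify the off-chain value against the tamper-evident on-chain hash.
    if hashlib.sha256(value).hexdigest() != on_chain[key]:
        raise ValueError("off-chain value does not match on-chain hash")
    return value

put("carbon-credit-0001", b'{"issuer": "example", "tonnes": 10}')
print(get("carbon-credit-0001"))
```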

The Data Layer solution is currently being used to develop a way to track carbon credits by the World Bank (see: the Climate Action Data Trust).

Chia has come a long way. In its heyday it was a significant consumer of new disk media, but what Jonmichael and others have planned for it is to take advantage of the longer-term life of storage media and use it for the benefit of all humanity.

Jonmichael Hands, VP Storage at Chia Project

Jonmichael Hands partners with the storage vendors for Chia optimized product development, market modeling, and Chia blockchain integration.

Jonmichael spent the last ten years at Intel in the Non-Volatile Memory Solutions group working on product line management, strategic planning, and technical marketing for the Intel data center SSDs.

In addition, he served as the chair for NVM Express (NVMe), SNIA (Storage Networking Industry Association) SSD special interest group, and Open Compute Project for open storage hardware innovation.

Jonmichael started his storage career at Sun Microsystems designing storage arrays (JBODs) and holds an electrical engineering degree from the Colorado School of Mines.

142: GreyBeards talk scale-out, software defined storage with Bjorn Kolbeck, Co-Founder & CEO, Quobyte

Software defined storage is a pretty full segment of the market these days. So, it’s surprising when a new entrant comes along. We saw a story on Quobyte in Blocks and Files and thought it would be great to talk with Bjorn Kolbeck (LinkedIn), Co-Founder & CEO, Quobyte. Bjorn got his PhD in scale-out storage and went to work at Google on anything but storage. While there, he was amazed that Google’s vast infrastructure was managed by only a few people and thought this approach should be commercialized, so Quobyte was born. Listen to the podcast to learn more.

Quobyte is a scale-out file and object storage system with mirrored metadata and data which is 3-way mirrored or erasure coded (EC). The minimum cluster is 4 nodes (fault tolerant to a single node failure). Quobyte has current customers with ~250 nodes and ~20K clients accessing a storage cluster.

Although they support NFSv3 and NFSv4 for file (and object) access, their solution is typically deployed using host client and storage services software, accessing files via POSIX or objects via S3. Objects can also be accessed as files within the file system directories.

Host client software runs on Linux, Mac or Windows machines. Storage server software runs on Linux systems, bare metal or under VMs, in user space. Quobyte also supports containerized storage server software for K8s, but their bare metal/VM storage server software option doesn’t require containers.

Quobyte is also available in the GCP marketplace and can run in AWS, Azure and Oracle Cloud.

Their metadata service is a mirrored key-value store distributed across any number of (customer configured, I believe) storage nodes. Metadata resides on flash and distribution is designed to eliminate the metadata service as a performance bottleneck.

Their data services support (any number of) storage tiers. Storage policies determine how tiering is used for files, directories, objects, etc. For example, with 3 tiers (NVMe flash, SSD, and disk), file data could first land on NVMe flash, but as it grows, it gets moved off to SSD, and as it grows even more, it’s moved to disk. This could also be triggered using time since last access.

Bjorn said anything in file system metadata could be used to trigger data movement across tiers. Each tier could be defined with different data protection policies, like mirroring or EC 8+3.
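As a purely conceptual sketch (this is not Quobyte’s policy language or API, just our illustration of metadata-driven tiering), a policy engine could walk file metadata and pick a tier from size and last-access age:

```python
import os
import time

def pick_tier(path: str) -> str:
    """Illustrative only: choose a tier from file metadata (size, last access)."""
    st = os.stat(path)
    idle_days = (time.time() - st.st_atime) / 86400

    if st.st_size < 1 << 30 and idle_days < 7:      # small and hot
        return "nvme-flash"
    if st.st_size < 100 << 30 and idle_days < 30:   # growing or warm
        return "ssd"
    return "disk"                                   # large or cold

print(pick_tier(__file__))  # e.g., classify this script itself
```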

Backend storage is split up into Volumes. They also support thinly provisioned volumes for file creation.

It’s unclear how tiering and thin provisioning apply to objects, with their much richer metadata options, but as objects can be mapped to files, we suppose that anything in the object’s file metadata could conceivably be used to trigger tiering, as a bare minimum.

As for security:

  1. Quobyte supports end-to-end data encryption. This is done once and the customer owns the keys. They do support external key servers. I believe this is another option that is enabled by file-based policy management. It seems like different files can have different keys to encrypt them.
  2. Quobyte supports TLS. Depending on customer requirements, data may go across open networks, and this is where TLS could very well be used. Quobyte also supports X.509 certificates for user, device and system authentication.
  3. Quobyte supports file access controls. They support a subset of Windows capabilities but have full support for Linux and Mac access controls.

Quobyte also supports two forms of cluster-to-cluster replication. One is event driven, where an event occurrence (e.g., file close) triggers data replication, and the other is time driven (e.g., every 5 minutes); both are asynchronous.

Quobyte was designed from the start to be completely API driven. But they do support a CLI and a GUI for those customers that want them.

They have a Free (forever) edition, a downloadable version of the software without 24/7 support and minus some enterprise capabilities (think encryption). This is gated at 150TB disk/30TB flash with a limited number of clients and volumes.

The Infrastructure edition is their full-featured solution with 24/7 enterprise support. It comes with a yearly service fee, priced by capacity with volume discounts.

Bjorn Kolbeck, Co-Founder & CEO, Quobyte

Bjorn Kolbeck, Co-Founder and CEO of Quobyte attended the Technical University of Berlin and Humboldt University of Berlin.

His PhD thesis dealt with fault-tolerant replication, but he gained several years’ experience in distributed and storage systems while developing the distributed research file system XtreemFS at the Zuse Institute Berlin.

He then spent time at Google working as a Software Engineer before he and fellow Co-Founder Felix Hupfeld decided to combine the innovative research from XtreemFS and the operations experience from Google to build a highly reliable and scalable enterprise-grade storage system, now known as Quobyte.

141: GreyBeards annual 2022 wrap-up podcast

Well, it has been another year and time for our annual year-end wrap-up. Since COVID hit, every year has certainly been interesting. This year we saw the return of in-person conferences, which was a welcome change from the COVID lockdowns. We are very glad to start seeing everybody again.

From the tech standpoint, the big news this year was CXL. As everyone should recall, CXL is a new-ish PCIe hardware and protocol standard that supports larger memory sitting out on a PCIe bus and, in the future, shared memory between servers. All this is to enable a new wave of memory-based computing. We spent probably half our time discussing CXL and its impact on IT.

The other major topic was the Cloud Native ecosystem. In the past all we talked about was K8s, but nowadays the ecosystem that surrounds it is almost as important as K8s itself. The final topic was a bit of a shock earlier this year, and yes, it was Broadcom’s acquisition of VMware. Jason and I spent our Explore podcast talking about it (see our 137: VMware Explore wrap-up). Keith has high hopes that the EU will shut it down, but the jury’s still out on that one. Listen to the podcast to learn more.

As for CXL, it turns out that AMD has just released full support for CXL hardware and protocols with their latest round of CPU chips. But the new AMD CPUs only support DDR5 memory (something about there being only so much logic one can fit on a chip…), which means all those DDR4 DIMMs out in the wild need somewhere to land. CXL could supply a new lease on life for DDR4 DIMMs.

And it’s not just about shared memory or increased memory sizes; CXL can also provide a tiered memory hierarchy, with gobs of flash behind memory DIMMs (see: 136: FMS2022 wrap up …). So, now it’s no longer a TB or ten of server memory but potentially 100s of TBs. What this means for SAP HANA, AWS Aurora and other heavy-memory solutions has yet to play out.

Cloud Native won. We see this in the increasing adoption of containers and K8s in the enterprise, cloud and just about anywhere IT happens these days. But the ecosystem surrounding K8s is chaos.

Over time, many of these ecosystem solutions will die off, be purchased, or consolidated, but in the meantime, it’s entirely too confusing. Red Hat’s OpenShift is one answer and VMware’s Tanzu is another. And of course all the clouds have their own packaged K8s solutions. But just to cover their bets, everyone also supports native K8s and just about every software package that works with it. So, K8s’s ecosystem is in a state of flux and may take time to become a stable set of tools usable by enterprise IT.

Finally, Broadcom’s acquisition of VMware has everyone up in arms. Customers are concerned the R&D juggernaut that VMware has been, since its very beginning, will be jettisoned in favor of profits. And HCI vendors that always felt Dell EMC had an unfair advantage will all look at Broadcom in a similar light.

Keith says there’s a major difference in how USA regulators view an acquisition and how EU regulators view one. According to Keith, the EU views acquisitions on how they help or hurt the customer, while USA regulators view acquisitions on how they help or hurt the competition. We’ll have to wait and see how this all plays out for Broadcom-VMware.

On the other hand, speaking of competition, Nutanix seems to be feeling the heat as well. Rumors are it’s up for sale. Who will want it, and how the regulators view both of these acquisitions, may be an interesting story for 2023.

2023 looks to be another year of transition for enterprise IT. The cloud players all seem to be coming around to the view that they can’t be all things to all (IT) people. And the enterprise vendors are finally seeing some modicum of staying power in the face of a relentless push to the cloud. How this plays out over the next few years will be of major interest to everybody.

Happy New Year from the GreyBeards!

Keith Townsend, The CTO Advisor

Keith Townsend (@CTOAdvisor) is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations. Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIn.

Jason Collier, Principal Member of Technical Staff, AMD

Jason Collier (@bocanuts) is a long-time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology. He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years. He’s on LinkedIn.

140: GreyBeards talk data orchestration with Matt Leib, Product Marketing Manager for IBM Spectrum Fusion

As our listeners should know, Matt Leib (@MBleib) was a GreyBeards co-host. But since then, Matt has joined IBM to become Product Marketing Manager for IBM Spectrum Fusion, a data orchestration solution for Red Hat OpenShift environments. Matt’s been in and around the storage and data management industry for many years, which is why we had tapped him for GreyBeards co-host duties.

IBM Fusion, in its previous incarnation, came as an OpenShift software defined storage or as an OpenShift (H)CI solution. But recently, Fusion has taken on more of a data orchestration role for OpenShift stateful containerized applications. Listen to the podcast to learn more.

Fusion can run in any OpenShift deployment, whether in the cloud (currently AWS, Azure & IBM), under VMware (wherever it runs), or on (x86 or IBM Z) bare metal. It supplies NFS file or S3-compatible object storage for container applications running under OpenShift. But it does more than just storage.

Beyond storage, Fusion includes backup/recovery, site-to-site DR and global (file & object) data access. It’s almost like someone opened up the IBM Spectrum software pantry, took out the best available functionality and cooked it up into an OpenShift solution. IBM’s current Spectrum Fusion website (linked to above (Dec. ’22)) still refers only to the software defined storage and (H)CI solution, but today’s Fusion includes all of the functions identified above.

All Fusion facilities run as containers under OpenShift. Customers can elect to run all Fusion services or pick and choose which ones they want for their environment. IBM Fusion supports an API, an API-backed GUI, and a CLI for its storage & data management, as well as REST access. Fusion is fully compatible with Red Hat Ansible.

IBM Fusion is intended to be storage agnostic, which means it can supply its data management services for any NFS file storage as well as anyone’s S3-compatible object storage.

Now that Red Hat software defined CEPH and ODF are under IBM product management, CEPH and ODF options will become available under Fusion. And CEPH offers block as well as file and object. We’ve talked about CEPH before, packaged in a hardware appliance, see our SoftIron podcast.

One intriguing part of the Fusion solution is its global data access. With global access, any OpenShift application can access data from any Fusion data store, across clouds, across on prem installations, or just about anywhere OpenShift is running. Matt mentioned that compute could be on AWS OpenShift, Fusion’s data control plane could be running on prem OpenShift and the data storage could be running on Azure OpenShift. All this would be glued together by Fusion global access, so that AWS compute had access to data on Azure.

There’s some sophisticated caching magic to make global access happen seamlessly and with decent levels of performance, but customers no longer have to copy whole file systems over from one cloud to another in order to move compute or data. IBM Fusion would need to run in all those locations for global access.

Keith asked if it was directly available in the AWS marketplace. Matt said not yet but you can deploy OpenShift out of the marketplace and then deploy IBM Fusion onto that.

It took us some time to get our heads wrapped around what Fusion has to offer, and throughout it all, Keith and I had a bit of fun with Matt.

Matthew Leib, Product Marketing Manager, IBM Spectrum Fusion

Matt has spent years in IT, from Engineering, to Architecture, from PreSales to analyst work, and finally to Product Marketing at IBM.

He’s spent years trying to achieve both credibility in the space, as a podcaster, blogger, and community member.

In his spare time, he’s a dad, dog owner, and amateur guitar player.