144: Greybeard talks AI IO with Subramanian Kartik & Howard Marks of VAST Data

Sponsored by

Today we talked with VAST Data’s Subramanian Kartik (@phyzzycyst), Global Systems Engineering Lead and Howard Marks (@DeepStorage@mastodon.social, @deepstoragenet) former GreyBeards co-host and now Technologist Extraordinary & Plenipotentiary at VAST. Howard needs no introduction to our listeners but Kartik does. Kartik has supported a number of customers implementing AI apps at VAST and prior companies, so he is well versed in the reality of AI ML DL. Moreover, VAST recently funded Silverton Consulting to write a paper discussing Deep Learning IO.

Although AI ML DL applications have been very popular these days in IT, there’s been a continuing challenge trying to understand their IO requirements. Listen to the podcast to learn more.

AI ML DL Neural Networks (NN) models train with data and lots of it while inferencing is also very data dependent. Kartik said AI model IO consists of small block, random reads with very few writes.

Some models contain huge NNs which consume mountains of data to train while others are relatively small and consume much less. GPT-3(.5), the model behind the original ChatGPT, has ~175B parameters in its ~800GB NN.

As many of us know, the key to AI processing is GPU hardware, which performs most, if not all, of the computations to train models and supply inferences. Moreover, to maximize training throughput, many organizations deploy model parallelism, using 10s to 1000s of GPUs.

For instance, in the paper mentioned earlier, we showed a model training IO chart based on the NVIDIA DGX-A100 Reference Architecture reports for ResNet-50 published by six storage vendors. On this single chart, all six storage systems supplied roughly the same images processed/sec (or ~IO bandwidth) performance to train the model on each of the 8, 16 & 32 GPU configurations. This is very unusual from our perspective but shows that ResNet-50 training is not IO bound.

However, another approach to speeding up NN training is to take advantage of newer, more advanced IO protocols. NVIDIA GPUDirect Storage transfers data directly from storage memory to GPU memory, bypassing CPU memory altogether, which can significantly speed up GPU data consumption. It turns out that one bottleneck for AI training is CPU memory bandwidth.

In addition, most AI model training reads data from a single file system mount point. Historically, an NFS mount point was limited to a single TCP connection and a maximum of ~2.5GB/sec of IO bandwidth. Recently, however, NConnect for NFS has been introduced, which increased TCP connections to 16 per mount point.
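As a back-of-the-envelope illustration of why connection count matters, here is a minimal sketch that assumes (optimistically) that bandwidth scales linearly with the number of TCP connections, using the ~2.5GB/sec per-connection figure quoted above. The function name and the linear-scaling assumption are ours for illustration, not measured results.

```python
# Rough ceiling on NFS mount bandwidth as TCP connections are added.
# ASSUMPTION: perfect linear scaling, which real systems will not reach.
PER_CONNECTION_GB_S = 2.5  # approximate single-TCP-connection limit cited above


def mount_bandwidth_ceiling(connections: int,
                            per_conn_gb_s: float = PER_CONNECTION_GB_S) -> float:
    """Return a naive upper bound (GB/sec) for one mount point."""
    if connections < 1:
        raise ValueError("connections must be at least 1")
    return connections * per_conn_gb_s


for n in (1, 16):
    print(f"{n} connection(s): up to ~{mount_bandwidth_ceiling(n):.1f} GB/sec")
```

On Linux clients the connection count is typically requested with the `nconnect=` NFS mount option. Actual throughput will depend on the network, the client and the storage system.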

Beyond that, VAST Data found that by adding some code to Linux’s NFS TCP stack, they were able to increase NConnect to 64 TCP connections per compute node. Howard mentioned that with these changes and a 16 (compute) node VAST Data storage cluster they sustained 175GB/sec of GPUDirect Storage bandwidth using a DGX-A100 system.

Subramanian Kartik, Global Systems Engineering Lead, VAST Data

Subramanian Kartik has been the Vice President of Systems Engineering at VAST Data since January of 2020, running the global presales organization. He is part of the incredible success of VAST Data which increased almost 10-fold in valuation and revenue in this period.

An accomplished technologist and executive in the industry, he has a wide array of experience in Cloud Architectures, AI/Machine Learning/Deep Learning, as well as in the Life Sciences, covering high-performance computing and storage. He has had a lifelong deep passion for studying complex problems in all spheres spanning both workloads and infrastructure at the vanguard of current day technology.

Prior to his work at VAST Data, he was with EMC (later Dell) for two decades, as both a Distinguished Engineer and global executive running the Converged and Hyperconverged Division go-to-market. He has a Ph.D in Particle Physics with over 75 publications and 3 patents to his credit over the years. He enjoys mathematics, jazz, cooking and travelling with his family in his non-existent spare time.

Howard Marks, (former GreyBeards Co-Host) Technologist Extraordinary and Plenipotentiary, VAST Data

Howard Marks brings over forty years of experience as a technology architect for hire and industry observer to his role as VAST Data’s Technologist Extraordinary and Plenipotentiary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage, an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.

127: Annual year end wrap up podcast with Keith, Matt & Ray

[Ray’s sorry about his audio, it will be better next time he promises, The Eds] This was supposed to be the year where we killed off COVID for good. Alas, it was not to be and it’s going to be with us for some time to come. However, this didn’t stop that technical juggernaut we call the GreyBeards on Storage podcast.

Once again we got Keith, Matt and Ray together to discuss the past year’s top 3 technology trends that would most likely impact the year(s) ahead. Given our recent podcasts, Kubernetes (K8s) storage was top of the list. To this we add AI-MLops in the enterprise and continued our discussion from last year on how Covid & WFH are remaking the world, including offices, data centers and downtowns around the world. Listen to the podcast to learn more.

K8s rulz

For some reason, we spent many of this year’s podcasts discussing K8s storage. K8s was never meant to provide (storage) state AND as a result, any K8s data storage has had to be shoehorned in.

Moreover, why would any IT group even consider containerizing enterprise applications, let alone deploy these onto K8s? The most common answers seem to be automatic scalability, cloud like automation and run-anywhere portability.

Keith chimed in with enterprise applications aren’t going anywhere and we were off. Just like the mainframe, client-server and OpenStack applications before them, enterprise apps will likely outlive most developers, continuing to run on their current platforms forever.

But any new apps will likely be born, live a long life and eventually fade away on the latest runtime environment, which is K8s.

Matt mentioned hybrid and multi-cloud as becoming the raison d’être for enterprise apps to migrate to containers and K8s. Further, enterprises have a pressing need to move their apps to the hybrid- & multi-cloud model. AWS’s recent hiccups notwithstanding, multi-cloud’s time has come.

Ray and Keith then discussed which is bigger, K8s container apps or enterprise “normal” (meaning virtualized/bare metal) apps. It all comes down to how you define bigger. Sheer numbers of unique applications – enterprise wins. Compute power devoted to running those apps – it’s a much more difficult race to call. But even Keith had to agree that based on compute power containerized apps are inching ahead.

AI-MLops coming on strong

AI-MLops in the enterprise was up next. For me the most significant indicator of heightened interest in AI-ML was VMware announcing native support for NVIDIA management and orchestration AI-MLops technologies.

Just like K8s before it and VMware’s move to Tanzu and its predecessors, their move to natively support NVIDIA AI tools signals that the enterprise is starting to seriously consider adding AI to their apps.

We think VMware’s crystal ball is based on

  • Cloud rolling out more and more AI and MLops technologies for enterprises to use on their infrastructure
  • GPUs are becoming more and more pervasive in enterprise AND in cloud infrastructure
  • Data to drive training and inferencing is coming out of the woodwork like never before.

We had some discussion as to where AMD and Intel will end up in this AI trend. Consensus is that there’s still space for CPU inferencing and “some” specialized training which is unlikely to go away. And of course AMD has their own GPUs and Intel is coming out with their own shortly.

COVID & WFH impacts the world (again)

And then there was COVID and WFH. COVID will be here for some time to come. As a result, WFH is not going away, at least not totally, any time soon. It is just becoming another way to do business.

WFH works well for some things (like IT office work) and not so well for others (K-12 education). If the GreyBeards were into (non-crypto) investing, we’d be shorting office real estate. What could move into those millions of square feet (meters) of downtown office space is anyone’s guess. But just like the factories of old, cities and downtowns in particular can take anything and make it useable for other purposes.

That’s about it, 2021 was another “interesting” year for infrastructure technology. It just goes to show you, “May you live in interesting times” is actually an old (Chinese) curse.

Keith Townsend, (@TheCTOadvisor)

Keith is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations. Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIn.

Matt Leib, (@MBLeib)

Matt Leib has been blogging in the storage space for over 10 years, with work experience on both the engineering and presales/product marketing sides. His blog is at Virtually Tied to My Desktop and he’s on LinkedIn.

Ray Lucchesi, (@RayLucchesi)

Ray is the host and co-founder of GreyBeardsOnStorage and is President/Founder of Silverton Consulting, and a prominent (AI/storage/systems technology) blogger at RayOnStorage.com. Sign up for SCI’s free, monthly industry e-newsletter, published continuously since 2007. Ray can also be found on LinkedIn.

113: GreyBeards talk storage for next gen. workloads with Liran Zvibel, Co-Founder & CEO WekaIO

Sponsored By:

I’ve known Liran Zvibel, Co-founder and CEO of WekaIO, for many years now and it’s the second time he’s been on our show (see: Episode 56: GreyBeards talk high performance file storage...). In those days, WekaIO was just coming out and hitting the world with this extremely high-performing, scale out unstructured data solution. Well since then, they’ve just gotten better.

Keith and I had a great time talking with Liran again. Liran has deep knowledge about unstructured data and how enterprises use it these days. WekaIO’s story over the last two years has gone beyond great performance to real world, hybrid cloud offerings as well as going after cloud native apps’ (read Kubernetes [K8S]) persistent storage. Listen to the podcast to learn more.

We started with a history lesson on WekaIO. Back in those days (and it persists today, I might add) there were many IO workloads that required companies to purchase different solutions for different work. For example, they needed DAS or SAN for performance, NAS for ease of access and object for scale. WekaIO came out with an answer to all these problems in a single, scalable storage system. That is, they performed IO as fast as DAS or SAN block, had all the ease of access of NAS, and could scale as much as object.

However, the real culprit holding the world back was “NFS”. At the outset NFS was designed (back in the 1980s) with the then current networking speeds available (10-100Mbps), and it performed just fine at those speeds. But when 10-100GbE came out in the 2000’s, NFS’s metadata overhead was too chatty to support wire speeds. Thus, any storage that depended on NFS protocols couldn’t supply (small) files fast enough for modern applications.

This is why WekaIO has moved to not only support NFS and SMB but also POSIX and NVIDIA® GPUDirect® Storage interfaces. By offering POSIX, WekaIO is able to plug into standard Linux and Windows server systems and provide excellent small file performance. Of course applications that demand small file performance today are mostly data analytics and AI/ML/DL workloads.

Consequently, NVIDIA came out with their GPUDirect Storage protocol to address getting small file (data) into GPUs faster. With GPUDirect, storage systems can RDMA data directly from storage to GPU memory and vice versa, with no OS intervention (other than to set up the transfer). If you happen to have a small file, high performing storage system attached to your fabric that supports GPUDirect, like WekaIO, you can significantly speed up your AI/ML/DL workloads.

Next we started talking K8S storage. WekaIO uses their POSIX interface in their CSI plugin to support K8S container persistent storage. Again, supplying high performance for small files seems to be tailor made for K8S container applications that exist today and will for the foreseeable future.

Enter the cloud. Among other things, WekaIO is an AWS primary storage vendor. It also offers snap to cloud. And with both of these in tandem, it’s just become a lot easier to move and access your unstructured data in the cloud. Liran mentioned that WekaIO primary storage in AWS operates across AZ’s. This means it can be configured to support better availability than EBS.

Large BioPharma companies are using WekaIO in AWS to store and process field data and research data, so that this work can be done around the world. Some companies have run out of compute in a single AZ (unbelievable I know but it’s COVID). With WekaIO offering multi-AZ unstructured data access, these companies can spread their compute across AZ’s and regions and still access their data. And when their products are ready for gov’t certification, having all this data in the cloud can provide an easy way to give gov’t access to this same data.

Liran Zvibel, Co-founder and CEO WekaIO

As Co-Founder and CEO, Mr. Liran Zvibel guides long term vision and strategy at WekaIO. Prior to creating the opportunity at WekaIO, he ran engineering at social startup and Fortune 100 organizations including Fusic, where he managed product definition, design, and development for a portfolio of rich social media applications.

Liran also held principal architectural responsibilities for the hardware platform, clustering infrastructure and overall systems integration for XIV Storage System, acquired by IBM in 2007.

Mr. Zvibel holds a BSc. in Mathematics and Computer Science from Tel Aviv University.

109: GreyBeards talk SmartNICs & DPUs with Kevin Deierling, Head of Marketing at NVIDIA Networking

We decided to take a short break (of sorts) from storage to talk about something equally important to the enterprise, networking. At (virtual) VMworld a month or so ago, Pat made mention of developing support for SmartNIC-DPUs and even porting vSphere to run on top of a DPU. So we thought it best to go to the source of this technology and talk with Kevin Deierling (TechSeerKD), Head of Marketing at NVIDIA Networking who are the ones supplying these SmartNICs to VMware and others in the industry.

Kevin is always a pleasure to talk with and comes with a wealth of expertise and understanding of the technology underlying data centers today. The GreyBeards found our discussion to be very educational on what a SmartNIC or DPU can do and why VMware and others would be driving to rapidly adopt the technology. Listen to the podcast to learn more.

NVIDIA’s recent acquisition of Mellanox brought them Mellanox’s NIC, switch and router technology. And while Mellanox, and now NVIDIA have some pretty impressive switches and routers, what interested the GreyBeards was their SmartNIC technology.

Essentially, SmartNICs provide acceleration and offload of the data handling needed to move data around an enterprise network. These offload services include, at a minimum, encryption/decryption, packet pacing (delivering a gadzillion video streams at the right speed to ensure proper playback by all), compression, firewalls, NVMeoF/RoCE, TCP/IP, GPUDirect Storage (GDS) transfers, VLAN micro-segmentation, scaling, and anything else that requires real time processing to perform at line speeds.

For those who haven’t heard of it, GDS transfers data from storage directly into GPU memory and from GPU memory directly to storage without any CPU cycles or server memory involvement, other than to set up the transfer. This extends NVMeoF RDMA tech to/from storage and server memory, to GPUs. That is, GDS offers an RDMA-like path between storage and GPU memory. A direct GPU to/from server memory interface already exists over the PCIe bus.

Beyond all the offloads and accelerators above, they can also offer an additional secure enclave outside the TPM in the CPU, to better isolate security sensitive functionality for a data center. (See DPU below).

Kevin mentioned multiple times that the new unit of computation is no longer a server but rather is now a data center. When you have public cloud, private cloud and other systems that all serve up virtual CPUs, NICs, GPUs and storage, what’s really being supplied to a user is a virtual data center. Cloud providers can carve up their hardware and serve it to you any way you want or need it. Virtual data centers can provide a multitude of VMs and any infrastructure that customers need to use to run their workloads.

Kevin mentioned that by using SmartNICs, IT or cloud providers can return 30% of the processor cycles (that were being spent doing networking work on CPUs) back to workloads that run on CPUs. Any data center can effectively obtain 30% more CPU cycles and increased networking speed and performance just by deploying SmartNICs throughout all the servers in their environment.

SmartNICs are an outgrowth of Mellanox technology embedded in their HPC InfiniBand and high end Ethernet switches/routers. Mellanox had been well known for their support of NVMeoF/RoCE to supply high IOPs/low-latency IO activity for NVMe storage over Ethernet and before that their InfiniBand RDMA technologies.
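The 30% figure above can be read against two different baselines, and a little arithmetic shows the difference. This is only a sketch: the function name is ours, and it assumes the networking share is a fixed fraction of total CPU cycles that is fully offloaded.

```python
from typing import Tuple


# Two ways to read "30% of CPU cycles returned to workloads".
# ASSUMPTION: networking consumed a fixed fraction of all cycles and the
# SmartNIC offloads all of it.
def offload_gain(networking_fraction: float) -> Tuple[float, float]:
    """Return (gain as a share of total cycles,
               gain relative to what workloads had before)."""
    if not 0.0 <= networking_fraction < 1.0:
        raise ValueError("networking_fraction must be in [0, 1)")
    workload_before = 1.0 - networking_fraction  # cycles workloads had before offload
    share_of_total = networking_fraction         # cycles handed back, as a share of the CPU
    relative_to_workload = networking_fraction / workload_before
    return share_of_total, relative_to_workload


total_share, relative = offload_gain(0.30)
print(f"{total_share:.0%} of total cycles returned, "
      f"{relative:.0%} more than workloads had before")
```

In other words, handing back 30% of a server's total cycles would give workloads that previously had only 70% of the CPU an even larger relative boost.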

As Mellanox came out with their 2nd Gen SmartNIC they began to call their solution a “DPU” (data processing unit), which they see forming part of a “holy trinity” underpinning the new data center which has CPUs, GPUs and now DPUs. But a DPU is more than just a SmartNIC.

All NVIDIA SmartNICs and DPUs are based on Mellanox’s BlueField cards and chip technology. Their DPU uses BlueField2 (gen 2 technology) chips, which have a multi-core ARM engine and memory inside them that can be used to perform computational processing in addition to the onboard offload/acceleration capabilities.

Besides adding VMware support for SmartNICs, PatG also mentioned that they were porting vSphere (ESX) to run on top of NVIDIA Networking DPUs. This would move VMware’s core hypervisor functionality from running on CPUs to running on DPUs. This of course would free up most if not all VMware Hypervisor CPU cycles for use by customer workloads.

During our discussion with Kevin, we talked a lot about the coming of AI-ML-DL workloads, which will require ever more bandwidth, ever lower latencies and ever more compute power. NVIDIA was a significant early enabler of AI-ML-DL with their CUDA API that allowed a GPU to be used to perform DL network training and inferencing. As such, CUDA became an industry wide phenomenon allowing industry wide GPUs to be used as DL compute engines.

NVIDIA plans to do the same with their SmartNICs and DPUs. NVIDIA Networking is releasing the DOCA (Data center On a Chip Architecture) SDK and API. DOCA provides the API to use the BlueField2 chips and cards which are the central technology behind their DPU. They have also announced a roadmap to continue enhancing DOCA, as they have done with CUDA, over the foreseeable future, to add more bandwidth, speed and functionality to DPUs.

It turns out the real problem which forced Mellanox and now NVIDIA to create SmartNICs was the need to support the extremely low latencies required for NVMeoF and GDS IO.

It wasn’t clear that the public cloud providers were using SmartNICs but Kevin said it’s been sort of a widely known secret that they have been using the tech. The public clouds (AWS, Azure, Alibaba) have been deploying SmartNICs in their environments for some time now. Always on the lookout for any technology that frees up compute resources to be deployed for cloud users, it appears that public cloud providers were early adopters of SmartNICs.

Kevin Deierling, Head of Marketing NVIDIA Networking

Kevin is an entrepreneur, innovator, and technology executive with a proven track record of creating profitable businesses in highly competitive markets.

Kevin has been a founder or senior executive at five startups that have achieved positive outcomes (3 IPOs, 2 acquisitions). Combining both technical and business expertise, he has variously served as the chief officer of technology, architecture, and marketing of these companies where he led the development of strategy and products across a broad range of disciplines including: networking, security, cloud, Big Data, machine learning, virtualization, storage, smart energy, bio-sensors, and DNA sequencing.


Kevin has over 25 patents in the fields of networking, wireless, security, error correction, video compression, smart energy, bio-electronics, and DNA sequencing technologies.

When not driving new technology, he finds time for fly-fishing, cycling, bee keeping, & organic farming.

106: Greybeards talk Intel’s new HPC file system with Kelsey Prantis, Senior Software Eng. Manager, Intel

We had talked with Intel at Storage Field Day 20 (SFD20), about a month ago. At the virtual event, Intel’s focus was on their Optane PMEM (persistent memory) technology. Kelsey Prantis (@kelseyprantis), Senior Software Engineering Manager, Intel was on the show and gave an introduction to Intel’s DAOS (Distributed Asynchronous Object Storage, DAOS.io), a new HPC (high performance computing, supercomputers) file system they developed from scratch to use leading edge Intel technologies, and Optane PMEM was one of them.

Kelsey has worked on Lustre and other HPC file systems for a long time now and came into the company from the acquisition of Whamcloud. Currently, she manages the development team working on DAOS. DAOS is a new HPC object storage file system which is completely open source (available on GitHub).

DAOS was designed from the start to take advantage of NVMe SSDs and Optane PMEM. With PMEM, current servers can support up to 20TB of memory. Besides the large memory sizes, Optane PMEM also offers non-volatile memory and byte addressability (just like DRAM). These two characteristics open up new functionality that allows DAOS to move beyond legacy, block oriented, storage architectures that have been the only storage solution for HPC (and the enterprise) for decades now.

What’s different about DAOS

DAOS uses PMEM for all metadata and for storing small files. HPC IO has always been heavily bandwidth oriented (IO using large blocks), but lately newer applications have emerged, such as AI/ML/DL, data analytics and others, that use smaller files/blocks. Indeed, most new HPC clusters and supercomputers are deploying almost as many GPUs as CPUs in their configurations to support AI activities.

The problem is that these newer applications typically consume much smaller files. Matt mentioned one HPC client he worked with was processing small batches of seismic data, to predict, in real time, earthquakes that were happening around the world.

By using PMEM for metadata and small files, DAOS can be much more responsive to file requests (open, close, delete, status) as well as provide higher performing IO for small files. All this leads to a much better performing system for the new HPC workloads as well as great sustainable performance for the more traditional large file workloads.

DAOS storage

DAOS provides a cluster storage system that can be configured with as few as 1 node (no data protection), though 3 nodes (with data protection) is the more usual minimum, and with up to 512 nodes (lab tested). Data protection in DAOS is currently based on mirroring data and can use from 0 to the number of nodes in a cluster as data mirrors.

DAOS system nodes are homogeneous. That is they all come with the same amount of PMEM and NVMe SSDs. Note, DAOS doesn’t support disk drives. Kelsey mentioned DAOS node hardware can be tailored to suit any particular application environment. But they typically require an average of 6% of overall DAOS system capacity in PMEM for metadata and small file activity.
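To make the ~6% figure concrete, here is a small sizing sketch. The cluster size and per-node capacity are made-up numbers, and we assume the 6% applies to total system capacity; treat it as illustrative arithmetic rather than Intel sizing guidance.

```python
# Sizing sketch: PMEM needed if it averages ~6% of overall system capacity
# (the figure quoted above). Node count and capacities are HYPOTHETICAL.
PMEM_FRACTION = 0.06


def pmem_needed_tb(total_capacity_tb: float,
                   fraction: float = PMEM_FRACTION) -> float:
    """Return the PMEM (TB) implied by a given total capacity (TB)."""
    if total_capacity_tb < 0:
        raise ValueError("capacity must be non-negative")
    return total_capacity_tb * fraction


nodes = 16                 # hypothetical cluster size
capacity_per_node_tb = 50  # hypothetical capacity per (homogeneous) node
total_tb = nodes * capacity_per_node_tb
print(f"{total_tb} TB total -> ~{pmem_needed_tb(total_tb):.0f} TB PMEM "
      f"(~{pmem_needed_tb(capacity_per_node_tb):.0f} TB per node)")
```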

DAOS currently supports their own API, POSIX, HDF5, MPIIO and Apache Spark storage protocols. Kelsey mentioned that standard POSIX uses a pessimistic conflict resolution mode which leads to performance bottlenecks during parallel access. In contrast, DAOS’s version of POSIX uses optimistic conflict resolution, which means DAOS starts writes assuming there’s no conflict, but if one occurs it handles the conflict in real time. Of course with all the metadata byte addressable and in PMEM this doesn’t take up a lot of (IO) time.

As mentioned earlier, DAOS data protection uses mirror-replicas. However, unlike most other major file systems, DAOS mirroring can be done at the object level. DAOS internally is an object store. Data organization on DAOS starts at the pool level, underneath that are data containers, and then under those are objects. Any object in DAOS can have its own mirroring configuration. DAOS is working towards supporting Erasure Coding as another form of data protection for a future release.
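For readers unfamiliar with the idea, here is a generic sketch of optimistic conflict resolution using a version counter. It is not DAOS code and says nothing about how DAOS implements this internally; the class and function names are invented for illustration.

```python
import threading


class VersionedRecord:
    """A value guarded by a version counter (generic optimistic concurrency)."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # held only for the brief commit step

    def read(self):
        with self._commit_lock:
            return self.value, self.version

    def try_commit(self, new_value, expected_version):
        """Commit only if nobody else committed since we read."""
        with self._commit_lock:
            if self.version != expected_version:
                return False  # conflict detected at commit time
            self.value = new_value
            self.version += 1
            return True


def optimistic_update(record, update_fn, max_retries=100):
    """Compute the new value without holding a lock; retry only on conflict."""
    for _ in range(max_retries):
        value, version = record.read()
        if record.try_commit(update_fn(value), version):
            return True
    return False


record = VersionedRecord(10)
value, version = record.read()                     # writer A reads
record.try_commit(99, version)                     # writer B commits first
assert not record.try_commit(value + 1, version)   # A's stale commit is rejected
optimistic_update(record, lambda v: v + 1)         # A retries against fresh state
print(record.value, record.version)
```

A pessimistic scheme would instead take a lock before reading and hold it through the whole update, which serializes writers even when they never actually collide.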
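The pool, container, object hierarchy with per-object mirroring can be pictured with a toy data model. This is a hypothetical sketch for illustration only: the class names, the `replicas` field and the methods are ours and do not reflect the actual DAOS API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StoredObject:
    name: str
    replicas: int = 1  # total copies; 1 means unmirrored (hypothetical field)


@dataclass
class Container:
    name: str
    objects: Dict[str, StoredObject] = field(default_factory=dict)

    def put(self, name: str, replicas: int = 1) -> StoredObject:
        """Store an object with its own mirroring setting."""
        obj = StoredObject(name, replicas)
        self.objects[name] = obj
        return obj


@dataclass
class Pool:
    name: str
    containers: Dict[str, Container] = field(default_factory=dict)

    def create_container(self, name: str) -> Container:
        cont = Container(name)
        self.containers[name] = cont
        return cont

    def raw_copies(self) -> int:
        """Total object copies stored across every container in the pool."""
        return sum(obj.replicas
                   for cont in self.containers.values()
                   for obj in cont.objects.values())


pool = Pool("pool0")
scratch = pool.create_container("scratch")
results = pool.create_container("results")
scratch.put("tmp.dat")                # no mirroring for throwaway data
results.put("final.dat", replicas=3)  # three copies for this object only
print(pool.raw_copies())
```

The point of the sketch is that protection is a property of each object, not of the file system or volume as a whole.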

DAOS performance

There’s a new storage benchmark that was developed specifically for HPC, called the IO500. The IO500 benchmark simulates a number of different HPC workloads, measures performance for each of them, and computes an (aggregate) performance score to rank HPC storage systems.
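As we understand the published IO500 rules, the aggregate score is a geometric mean of a bandwidth score and a metadata (IOPS) score, each of which is itself a geometric mean over several test phases. The sketch below follows that understanding with made-up phase results; the function names are ours and the official benchmark should be consulted for the exact phases and units.

```python
import math


def geometric_mean(values):
    """Geometric mean of a sequence of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def aggregate_score(bandwidth_results, metadata_results):
    """Combine per-phase results into one score (assumed IO500-style formula)."""
    bw_score = geometric_mean(bandwidth_results)  # e.g. GiB/sec per bandwidth phase
    md_score = geometric_mean(metadata_results)   # e.g. kIOPS per metadata phase
    return math.sqrt(bw_score * md_score)


# Made-up phase results, for illustration only
print(aggregate_score([4.0, 16.0], [100.0, 400.0]))
```

Because geometric means are used, a system has to do reasonably well on every phase; one very strong result cannot fully mask a weak one.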

IO500 ranks system performance using two lists: one is for any sized configuration, which typically ranges from 50 to 1000s of nodes, and the other list limits the configuration to 10 nodes. The first performance ranking can sometimes be gamed by throwing more hardware into a cluster. The 10 node rankings are much harder to game this way and, from our perspective, show a fairer comparison of system performance.

As presented (virtually) at ISC 2020, DAOS took the top spot on the IO500 any size configuration list and performed better than 2X the next best solution. And on the IO500 10 node list, Intel’s DAOS configuration, the Texas Advanced Computing Center (TACC) DAOS configuration, and the Argonne National Labs DAOS configuration took the top 3 spots and had 3X better performance than the next best, non-DAOS storage system.

The Argonne National Labs has already stated that they will be using DAOS in their new HPC system to be deployed in the near future. Early specifications for storage at the new Argonne Lab required support for 230PB of data and 25TB/sec of bandwidth.

The podcast ran ~43 minutes. Kelsey was great to talk with and very knowledgeable about HPC systems and HPC IO in particular. Matt has worked at Argonne in the past so understood these systems better than I. Sadly, we lost Matt’s end of the conversation about 1/2 way into the recording. Both Matt and I thought that DAOS represents the birth of a new generation of HPC storage. Listen to the podcast to learn more.


Kelsey Prantis, Senior Software Engineering Manager, Intel

Kelsey Prantis heads the Extreme Storage Architecture and Development division at Intel Corporation. She leads the development of Distributed Asynchronous Object Storage (DAOS), an open-source, low-latency and high IOPS object store designed from the ground up for massively distributed Non-Volatile Memory (NVM).

She joined Intel in 2012 with the acquisition of Whamcloud, where she led the development of the Intel Manager for Lustre* product.

Prior to Whamcloud, she was a software developer at personal genomics and biotechnology company 23andMe.

Prantis holds a Bachelor’s degree in Computer Science from Rochester Institute of Technology.