163: GreyBeards talk Ultra Ethernet with Dr J Metz, Chair of UEC steering committee, Chair of SNIA BoD, & Tech. Dir. AMD

Dr J Metz, (@drjmetz, blog) has been on our podcast before mostly in his role as SNIA spokesperson and BoD Chair, but this time he’s here discussing some of his latest work on the Ultra Ethernet Consortium (UEC) (LinkedIN: @ultraethernet, X: @ultraethernet)

The UEC is a full stack re-think of what Ethernet could do for large single application environments. UEC was originally focused on HPC, with 400-800 Gbps networks and single applications like simulating a hypersonic missile or airplane. But with the emergence of GenAI and LLMs, UEC could also be very effective for large AI model training with massive clusters doing a single LLM training job over months. Listen to the podcast to learn more.

The UEC is outside the realm of normal enterprise environments. But as AI training becomes more ubiquitous, who knows whether UEC may not find a place in the enterprise. However, it’s not intended for mixed network environments with multiple applications. It’s a single application network.

One wouldn’t think, HPC was a big user of Ethernet for their main network. But Dr J pointed out that the top 3 of the HPC 500, all use Ethernet and more are looking to use it in the future.

UEC is essentially an optimized software stack and hardware for networking used by single application environments. These types of workloads are constantly pushing the networking envelope. And by taking advantage of the “special networking personalities” of these workloads, UEC can significantly reduce networking overheads, boosting bandwidth and workload execution.

The scale of networks is extreme. The UEC is targeting up to a million endpoints, over >100K servers, with each network link >100Gbps and more likely 400-800Gpbs. With the new (AMD and others) networking cards coming out that support 4 400/800Gbps network ports, having a pair of these on each server, with 100K server cluster gives one 800K endpoints. A million is not that far away when you think of it at that scale.

Moreover, LLM training and HPC work are starting to look more alike these days. Yes there are differences but the scale of their clusters are similar, and the way work is sometimes fed to them is similar, which leads to similar networking requirements

UEC is attempting to handle a 5% problem. That is 95% of the users will not have 1M endpoints in their LAN, but maybe 5% will and for these 5%, a more mixed networking workload is unnecessary. In fact, a mixed network becomes a burden slowing down packet transmission.

UEC is finding that with a few select networking parameters, almost like workload fingerprints, network stacks can be much more optimized than current Ethernet and thereby support reduced packet overheads, and more bandwidth.

AI and HPC networks share a very limited set of characteristics which can be used as fingerprints. These characteristics are like reliable or unreliable transport, ordered or unordered delivery, multi-path packet spraying or not, etc, With a set of these types of parameters, selected for an environment, UEC can optimize a network stack to better support a million networking endpoints

We asked where CXL fits in with UEC? DrJ said it could potentially be an entity on the network but he sees CXL more as a within server or between a tight (limited) cluster of servers, solution rather than something on a UEC network.

Just 12 months ago the UEC had 10 members or so and this past week they were up to 60. UEC seems to have struck a chord.

The UEC plans to release a 1.0 specification, near the end of this year. UEC 1.0 is intended to operate on current (>100Gbps) networking equipment with firmware/software changes.

Considering the UEC was just founded in 2023, putting out their 1.0 technical spec. within 1.5 years is astonishing. But also speaks volumes to the interest in the technology.

The UEC has a blog post which talks more about UEC 1.0 specification and the technology behind it.

Dr J Metz, Chair of UEC Steering Committee, Chair of SNIA BoD, Technical Director of Systems Design, AMD

J works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the innerworkings and application of emerging technologies.

J has previously held roles in both startups and Fortune 100 companies as a Field CTO,  R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.

J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.

159: GreyBeards Year End 2023 Wrap Up

Jason and Keith joined Ray for our annual year end wrap up and look ahead to 2024. I planned to discuss infrastructure technical topics but was overruled. Once we started talking AI, we couldn’t stop.

It’s hard to realize that Generative AI and ChatGPT in particular, haven’t been around that long. We discussed some practical uses Keith and Jason had done with the technology.

Keith mentioned its primary skill is language expertise. He has used it to help write up proposals. He often struggles to convince CTO Advisor non-sponsors of the value they can bring and found that using GenAI has helped do this better.

Jason mentioned he uses it to create BASH, perl, and PowerShell scripts. He says it’s not perfect but can get ~80% there and with a few tweaks, is able to have something a lot faster than if he had to do it completely by hand. He also mentioned its skill in translating from one scripting language to others and how well the code it generates is documented (- that hurt).

I was the odd GreyBeard out, having not used any GenAI, proprietary or not. I’m still working to get a reinforcement learning task to work well and consistently. I figured once I mastered that, I train an LLM on my body of (text and code) work (assuming of course someone gifts me a gang of GPUs).

I agreed GenAI are good at (English) language and some coding tasks (where lot’s of source code exists, such as java, scripting, python, etc.).

However, I was on a MLops slack channel and someone asked if GenAI could help with IBM RPG II code. I answered, probably not. There’s just not a lot of RPG II code publicly accessible on the web and the structure of RPG was never line of text/commands oriented.

We had some heated discussion on where LLMs get the data to train with. Keith was fine with them using his data. I was not. Jason was neutral.

We then turned to what this means to the white collar workers who are coding and writing text. Keith made the point that this has been a concern throughout history, at least since the industrial revolution.

Machines come along, displace work that was done by hand, increase production immensely, reduce costs. Organizations benefit, but people doing those jobs need to up level their skills, to take advantage of the new capabilities.

Easy for us to say, as we, except for Jason, in his present job, are essentially entrepreneurs and anything that helps us deliver more value, faster, easier or less expensively, is a boon for our businesses.

Jason mentioned, Stephen Wolfram wrote a great blog post discussing LLM technology (see What is ChatGPT doing … and why does it work). Both Jason and Keith thought it did a great job about explaining the science and practice behind LLMs.

We moved on to a topic harder to discuss but of great relevance to our listeners, GenAI’s impact on the enterprise.

It reminds me of when Cloud became most prominent. Then “C” suites tasked their staff to adopt “the cloud” anyway they could. Today, “C” suites are tasking their staff to determine what their “AI strategy” is and when will it be implemented.

Keith mentioned that this is wrong headed. The true path forward (for the enterprise) is to focus on what are the business problems and how can (Gen)AI address (some of) them.

AI is so varied and its capabilities across so many fields, is so good nowadays ,that organizations should really look at AI as a new facility that can recognize patterns, index/analyze/transform images, summarize/understand/transform text/code, etc., in near real-time and see where in the enterprise that could help.

We talked about how enterprises can size AI infrastructure needed to perform these activities. And it’s more than just a gaggle of GPUs.

MLcommons’s MLperf benchmarks can help show the way, for some cases, but they are not exhaustive. But it’s a start.

The consensus was maybe deploy in the cloud first and when the workload is dialed in there, re-home it later. With the proviso that hardware needed is available.

Our final topic was the Broadcom VMware acquisition. Keith mentioned their recent subscription pricing announcements vastly simplified VMware licensing, that had grown way too complex over the decades.

And although everyone hates the expense of VMware solutions, they often forget the real value VMware brings to enterprise IT.

Yes hyperscalars and their clutch of coders, can roll their own hypervisor services stacks, using open source virtualization. But the enterprise has other needs for their developers. And the value of VMware virtualization services, now that 128 Core CPUs are out, is even higher.

We mentioned the need for hybrid cloud and how VCF can get you part of the way there. Keith said that dev teams really want something like “AWS software” services running on GCP or Azure.

Keith mentioned that IBM Cloud is the closest he’s seen so far to doing what Dev wants in a hybrid cloud.

We all thought when DNN’s came out and became trainable, and reinforcement learning started working well, that AI had turned a real corner. Turns out, that was just a start. GenAI has taken DNNs to a whole other level and Deepmind and others are doing the same with reinforcement learning.

This time AI may actually help advance mankind, if it doesn’t kill us first. On the latter topic you may want to checkout my RayOnStorage AGI series of blog posts (latest … AGI part-8)

Jason Collier, Principal Member Of Technical Staff at AMD, Data Center and Embedded Solutions Business Group

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology.

He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He’s on LinkedIN. He’s currently working with AMD on new technology and he has been a GreyBeards on Storage co-host since the beginning of 2022

Keith Townsend, President of The CTO Advisor a Futurum Group Company

Keith Townsend (@CTOAdvisor) is a IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations.

Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIN.

154: GreyBeards annual VMware Explore wrap-up podcast

Thanks, once again to The CTO Advisor|Keith Townsend, (@CTOadvisor) for letting us record the podcast in his studio. VMware Explore this year was better than last year. The show seemed larger. the show floor busier, the Hub better and the Hands-On Lab much larger than I ever remember before. The show seems to be growing, but still not at the pre-pandemic levels, but the trend is good.

The engineers have been busy at VMware this past year. Announced at the show include Private AI Foundation, a way for enterprises to train open source LLMs on corporate data kept private, a significant re-direct to VMware Edge environments moving from the push model updates to push model updates, and vSAN Max, NSX+, Tanzu App Engine, and more. And we heard that Brocade is clearing more hurdles to the acquisition. Listen to the podcast to learn more.

Private AI plays to VMware’s strengths and its control over on-prem processing. Customers need a safe space and secured data to train corporate ChatBots curated on corporations knowledge base. VMware rolled this out two ways,

  • Reference architecture approach based on Ray cluster management, KubeFlow, PyTorch, VectorDB, GPU Scaling (NVLink/NVswitch), vSAN fast path (RDMA, GPUdirect), and deep learning VMs. There was no discussion of tie ins to the Data Persistence (object) storage.
  • Proprietary NVIDIA approach based on NVIDIA workbench, TensorRT, NeMO, NVIDIA GPU & Network Operator

By having both approaches VMware provides alternatives for those wanting a non-proprietary solution. And with with AI/MLOps moving so fast, the open source may be better able to keep up.

The tie in with NVIDIA is a natural extension of what VMware have been doing with GPUs and DPUs, etc.

Also, VMware announced a technological partnership with Hugging Face. We were somewhat concerned with all the focus on LLM and GenAI but the agreement with Hugging Face goes beyond just LLMs.

VMware Edge solutions are pivoting. Apparently, VMware is moving from the vSphere pull model of code updates in the field which seems to handle 64 server, multi-cluster environments without problem to more of a YAML-GitHub push model of IoT device updates that seems better able to manage fleets of 1K to 100K devices in the field.

With the new model one creates a GitHub repo and a YAML file describing the code update to be done and all your IoT devices just get’s updated to the new level.

Once again the Brocade acquisition is on everyone’s mind. As I got to the show, one analyst asked if this was going to be the last VMware Explore. I highly doubt that, but Brocade will make lots of changes once the transaction closes. One thing mentioned at the show was that Brocade will make an immediate, additional $1B investment in R&D. The deal had provisionally passed the UK regulatory body and was on track to close near the end of October.

Other news from the show:

  • The Tanzu brand is broadening. Tanzu Application Platform (TAP) still exists but they have added a new App Engine is to take the VMware management approach to K8s clusters, other cloud infrastructure and the rest of the IT world. Tanzu Intelligent Services also now supports policy guardrails, cost control, management insight and migration services for other environments.
  • vSAN Max, which supports disaggregation (separation) of storage and compute is available. vSAN Max becomes a full fledged, standalone storage system that just happens to run on top of vSphere. Disaggregated (vSAN Max) storage and (regular vSAN) HCI can co-exist as different mounted datastores and vSAN Max supports PB of storage.
  • Workspace One is updated to provide enhanced digital experience monitoring that adds coverage of what Workspace One users are actually experiencing.
  • NSX+ continues to roll out. VMware mentioned that the number one continuing problem with hybrid cloud/multi-cloud setup is getting the networking right. NSX+ will reduce this complexity by becoming a management/configuration overlay over any and all cloud/on-prem networking for your environment(s).
  • VMware chatbots for Tanzu, Workspace One and NSX+ are now in tech preview and will supply intelligent assistants for these solutions. Based on LLM/GenAI and trained on VMware’s extensive corporate knowledge base, the chatbots will help admins focus on the signal over the noise and will provide recommendations on how to resolve issues. .

Jason Collier, Principal Member of Technical Staff, AMD

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology. Jason currently works at AMD focused on emerging technology for IT, IoT and anywhere else in the world and across the universe that needs compute, storage or networking resources.

He was Chief Evangelist, CTO & Co-Founder of Scale Computing and has been an innovator in the field of hyper-convergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He has also been another co-founder, director of research, VP of technical operations and director of operations at other companies over his long career prior to AMD and Scale.

He’s on LinkedIN.

151: GreyBeards talk AI (ML) performance benchmarks with David Kanter, Exec. Dir. MLCommons

Ray’s known David Kanter (@TheKanter), Executive Director, MLCommons, for quite awhile now and has been reporting on MLCommons Mlperf AI benchmark results for even longer. MLCommons releases new benchmark results each quarter and this last week they released new Data Center Training (v3.0) and new Tiny Inferencing (v1.1) results. So, the GreyBeards thought it was time to get a view of what’s new in AI benchmarking and what’s coming later this year.

David’s been around the startup community in the Bay Area for a while now and sort of started at MLPerf early on as a technical guru working on submissions and other stuff and worked his way up to being the Executive Director/CEO. The big news this week from MLCommons is that they have introduced a new training benchmark and updated an older one. The new one simulates training GPT-3 and they also updated their Recommendation Engine benchmark. Listen to the podcast to learn more.

MLCommons is an industry association focused on supplying recreatable, verifiable benchmarks for machine learning (ML) and AI which they call MLperf benchmarks. Their benchmark suite includes a number of different categories such as data center training, HPC training, data center inferencing, edge inferencing, mobile inferencing and finally tiny (IoT device) inferencing. David likes to say MLperf benchmarks range from systems consuming Megawatts (HPC, literally a supercomputer) to Microwatts (Tiny) solutions.

The challenge holding AI benchmarking back early on was a few industry players had done their own thing but there was way to compare one to another. MLcommons was born out of that chaos, and sought to create a benchmarking regimen that any industry player could use to submit AI work activity and would allow customers to compare their solution to any other submission on a representative sample of ML model training and inferencing activity

MLCommons has both an Open and Closed class of submissions. For the Closed class, submissions have a very strict criteria for submission. These include known open source AI models and data, accuracy metrics that training and inferencing need to hit, and reporting a standard set of metrics information for the benchmark. All of which need to be done in order to create a repeatable and verifiable submission.

Their Open class is a way for any industry participant to submit whatever model they would like, to whatever accuracy level they want, and it’s typically used to benchmark new hardware, software or AI models.

As mentioned above MLcommons training benchmarks use accuracy specification that must be achieved to have a valid submission. Benchmarks also have to be run 3 times. All submissions list hardware (CPU and Accelerators) and software (AI framework). And these could range from 0 accelerators (e.g. CPU only with no GPUs) to 1000’s of GPUs.

The new GPT-3 model is a very complex AI model, that seemed until yesterday, unlikely to ever be benchmarked. But apparently the developers at MLCommons (and their industry partners) have been working on this for some time now. In this round of results there were 3 cloud submissions and 4 on prem submissions for GPT-3 training.

GPT-3, -3.5 & -4 are all OpenAI solutions which power their ChatGPT text transformer Large Language Model (LLM). GPT-3 has 175B parameters and was trained on TBs of data covering web crawls, book crawls, official documentation, code, etc. OpenAI said, at GPT-3 announcement, it took over $10M and months to train.

MLcommons GPT-3 benchmark is not a full training run of GPT-3 but uses a training checkpoint, trained on a subset of data used for the original GPT-3 training, Checkpoints are used for long running jobs (training sessions, weather simulations, fusion energy simulations, etc) and copy all internal state of a job/system while its running (ok, quiesced) at some interval (say every 8hrs, 24 hrs, 48hrs, etc), so that in case of a failure, one could just restart the activity from the last checkpoint rather than the beginning.

MLCommons GPT-3 checkpoint has been trained on a 10B token data set. The benchmark starts with loading this checkpoint and trains on an even smaller subset of the data for GPT-3 and trains to achieve the accuracy baseline.

Accuracy for text transformers is not as simple as other models (correct image classification, object identification, etc.) and uses “perplexity”. Hugging Face defines perplexity as “the exponentiated average negative log-likelihood of a sequence.”

The 4 on-premises submissions for GPT-3 using 45 minutes (768 NVIDIA H100 GPUs) to 442 minutes (64 Habana Guadi2 GPUS). The 3 cloud submissions all used NVIDIA H100 GPUs and ranged from 768 (@47 minutes to train) to 3584 GPUs (@11 min. to train).

Aside from DataCenter training, MLcommons also released a new round of Tiny (IoT) inferencing benchmarks. These generally use smaller ARM processors and no GPUs with much smaller AI models such as Keyword spotting (“Hey SIRI”), visual wake words (door opening), image classification, etc.

We ended our discussion with me asking David why there was no storage oriented MLcommons benchmark. David said creating a storage benchmarks for AI is much different than inferencing or training benchmarks. But MLCommons has taken this on and now have a storage MLcommons series of benchmarks for storage that uses emulated accelerators.

At the moment, anyone with a storage system can submit MLcommons storage benchmark. After some time, MLcommons will only allow submissions from member companies but early on it’s open for all.

For their storage benchmarks, rather than using accuracy as benchmark criteria they use keeping (emulated) accelerators X% busy. This way storage support of the MLops activities can be isolated from the training and inferencing.

The GreyBeards eagerly anticipate the first round of MLcommons storage benchmark results. Hopefully coming out later this year.

144: Greybeard talks AI IO with Subramanian Kartik & Howard Marks of VAST Data

Sponsored by

Today we talked with VAST Data’s Subramanian Kartik (@phyzzycyst), Global Systems Engineering Lead and Howard Marks (@DeepStorage@mastodon.social, @deepstoragenet) former GreyBeards co-host and now Technologist Extraordinary & Plenipotentiary at VAST. Howard needs no introduction to our listeners but Kartik does. Kartik has supported a number of customers implementing AI apps at VAST and prior companies, so he is well versed in the reality of AI ML DL. Moreover, VAST recently funded Silverton Consulting to write a paper discussing Deep Learning IO.

Although AI ML DL applications have been very popular these days in IT, there’s been a continuing challenge trying to understand its IO requirements. Listen to the podcast to learn more.

AI ML DL Neural Networks (NN) models train with data and lots of it while inferencing is also very data dependent. Kartik said AI model IO consists of small block, random reads with very few writes.

Some models contain huge NNs which consume mountains of data to train while others are relatively small and consume much less. GPT-3(.5), the model behind the original ChatGPT, has ~75B parameters in its ~800GB NN.

As many of us know, the key to AI processing is GPU hardware, which performs most, if not all, of the computations to train models and supply inferences. Moreover, to maximize training throughput, many organizations deploy model parallelism, using 10s to 1000s of GPUs.

For instance, in the paper mentioned earlier, we showed a model training IO chart based on all six storage vendor published NVIDIA DGX-A100 Reference Architecture reports for ResNet-50. On this single chart, all 6 storage systems supplied roughly the same images processed/sec (or ~IO bandwidth) performance to train the model on each of 8, 16 & 32 GPUs configurations. This is very unusual from our perspective but shows that ResNet-50 training is not IO bound.

However, another approach to speeding up NN training is to take advantage of newer, more advanced IO protocols. NVIDIA GPUDirect Storage transfers data directly from storage memory to GPU memory bypassing CPU memory all together which can significantly speed up GPU data consumption. It turns out that one bottleneck for AI training is CPU memory bandwidth

In addition, most AI model training reads data from a single file system mount point. Historically, an NFS mount point was limited to a single TCP connection and a maximum of ~2.5GB/sec of IO bandwidth. Recently, however, NConnect for NFS has been introduced which increased TCP connections to 16 per mount point .

Despite that, VAST Data found that by adding some code to Linux’s NFS TCP stack, they were able to increase NConnect to 64 TCP connections per compute node. Howard mentioned that with these changes and a 16 (compute) node VAST Data storage cluster they sustained 175GB/sec of GPUDirect Storage bandwidth using a DGX-A100 systems .

Subramanian Kartik, Global Systems Engineering Lead, VAST Data

Subramanian Kartik has been the Vice President of Systems Engineering at VAST Data since January of 2020, running the global presales organization. He is part of the incredible success of VAST Data which increased almost 10-fold in valuation and revenue in this period.

An accomplished technologist and executive in the industry, he has a wide array of experience in Cloud Architectures, AI/Machine Learning/Deep Learning, as well as in the  Life Sciences, covering high-performance computing and storage. He has had a lifelong deep passion for studying complex problems in all spheres spanning both workloads and infrastructure at the vanguard of current day technology. 

Prior to his work at VAST Data, he was with EMC (later Dell) for two decades, as both a Distinguished Engineer and global executive running the Converged and Hyperconverged Division  go-to-market. He has a Ph.D in Particle Physics with over 75 publications and 3 patents to his credit over the years. He enjoys mathematics, jazz, cooking and travelling with his family in his non-existent spare time.

Howard Marks, (former GreyBeards Co-Host) Technologist Extraordinary and Plenipotentiary, VAST Data

Howard Marks brings over forty years of experience as a technology architect for hire and Industry observer to his role as VAST Data’s Technologist Extraordinary and Plienopotentary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.