154: GreyBeards annual VMware Explore wrap-up podcast

Thanks, once again, to The CTO Advisor's Keith Townsend (@CTOadvisor) for letting us record the podcast in his studio. VMware Explore this year was better than last year. The show seemed larger, the show floor busier, the Hub better, and the Hands-On Lab much larger than I ever remember before. The show seems to be growing; it's still not at pre-pandemic levels, but the trend is good.

The engineers have been busy at VMware this past year. Announcements at the show included Private AI Foundation, a way for enterprises to train open source LLMs on corporate data kept private; a significant redirect of VMware Edge environments, moving from pull model updates to push model updates; plus vSAN Max, NSX+, Tanzu App Engine, and more. And we heard that Broadcom is clearing more hurdles on the way to completing the acquisition. Listen to the podcast to learn more.

Private AI plays to VMware's strengths and its control over on-prem processing. Customers need a safe space and secured data to train corporate chatbots curated on the corporation's knowledge base. VMware rolled this out two ways:

  • Reference architecture approach based on Ray cluster management, Kubeflow, PyTorch, VectorDB, GPU scaling (NVLink/NVSwitch), vSAN fast path (RDMA, GPUDirect), and deep learning VMs. There was no discussion of tie-ins to the Data Persistence (object) storage.
  • Proprietary NVIDIA approach based on NVIDIA AI Workbench, TensorRT, NeMo, and the NVIDIA GPU & Network Operators.

By offering both approaches, VMware provides alternatives for those wanting a non-proprietary solution. And with AI/MLOps moving so fast, the open source stack may be better able to keep up.

The tie-in with NVIDIA is a natural extension of what VMware has been doing with GPUs, DPUs, etc.

Also, VMware announced a technological partnership with Hugging Face. We were somewhat concerned with all the focus on LLMs and GenAI, but the agreement with Hugging Face goes beyond just LLMs.

VMware Edge solutions are pivoting. Apparently, VMware is moving from the vSphere pull model of code updates in the field, which handles 64-server, multi-cluster environments without problem, to more of a YAML/GitHub push model of IoT device updates, which seems better able to manage fleets of 1K to 100K devices in the field.

With the new model, one creates a GitHub repo and a YAML file describing the code update to be done, and all your IoT devices just get updated to the new level.
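
To make that concrete, here's a minimal Python sketch of that GitOps-style flow. Everything in it is hypothetical (the repo URL, manifest layout and update command are illustrative, not VMware's actual Edge tooling): a device-side agent syncs the repo, reads the YAML manifest describing the desired software level, and applies an update only if its current version differs.

    # Hypothetical device-side agent illustrating a GitOps-style update flow.
    # Repo URL, manifest fields and update command are made up for illustration.
    import subprocess
    from pathlib import Path

    import yaml  # PyYAML

    REPO_URL = "https://github.com/example-org/edge-fleet-config.git"  # hypothetical
    LOCAL_REPO = Path("/var/lib/edge-agent/fleet-config")
    CURRENT_VERSION_FILE = Path("/etc/edge-agent/version")

    def sync_repo() -> None:
        """Clone the fleet config repo on first run, otherwise pull the latest commit."""
        if LOCAL_REPO.exists():
            subprocess.run(["git", "-C", str(LOCAL_REPO), "pull", "--ff-only"], check=True)
        else:
            subprocess.run(["git", "clone", REPO_URL, str(LOCAL_REPO)], check=True)

    def desired_version() -> str:
        """Read the target software level from the YAML manifest in the repo."""
        manifest = yaml.safe_load((LOCAL_REPO / "update.yaml").read_text())
        return manifest["target_version"]

    def main() -> None:
        sync_repo()
        target = desired_version()
        current = CURRENT_VERSION_FILE.read_text().strip()
        if current != target:
            # Apply the update; this command is a stand-in for whatever
            # package/image mechanism the device actually uses.
            subprocess.run(["/usr/local/bin/apply-update", target], check=True)
            CURRENT_VERSION_FILE.write_text(target + "\n")

    if __name__ == "__main__":
        main()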

Once again the Broadcom acquisition is on everyone's mind. As I got to the show, one analyst asked if this was going to be the last VMware Explore. I highly doubt that, but Broadcom will make lots of changes once the transaction closes. One thing mentioned at the show was that Broadcom will make an immediate, additional $1B investment in R&D. The deal had provisionally passed the UK regulatory body and was on track to close near the end of October.

Other news from the show:

  • The Tanzu brand is broadening. Tanzu Application Platform (TAP) still exists, but they have added a new App Engine to take the VMware management approach to K8s clusters, other cloud infrastructure, and the rest of the IT world. Tanzu Intelligent Services also now supports policy guardrails, cost control, management insight and migration services for other environments.
  • vSAN Max, which supports disaggregation (separation) of storage and compute, is available. vSAN Max becomes a full-fledged, standalone storage system that just happens to run on top of vSphere. Disaggregated (vSAN Max) storage and (regular vSAN) HCI can co-exist as different mounted datastores, and vSAN Max supports petabytes of storage.
  • Workspace One is updated to provide enhanced digital experience monitoring that adds coverage of what Workspace One users are actually experiencing.
  • NSX+ continues to roll out. VMware mentioned that the number one continuing problem with hybrid cloud/multi-cloud setup is getting the networking right. NSX+ will reduce this complexity by becoming a management/configuration overlay over any and all cloud/on-prem networking for your environment(s).
  • VMware chatbots for Tanzu, Workspace One and NSX+ are now in tech preview and will supply intelligent assistants for these solutions. Based on LLMs/GenAI and trained on VMware's extensive corporate knowledge base, the chatbots will help admins focus on the signal over the noise and will provide recommendations on how to resolve issues.

Jason Collier, Principal Member of Technical Staff, AMD

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology. Jason currently works at AMD focused on emerging technology for IT, IoT and anywhere else in the world and across the universe that needs compute, storage or networking resources.

He was Chief Evangelist, CTO & Co-Founder of Scale Computing and has been an innovator in the field of hyper-convergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He has also been a co-founder, director of research, VP of technical operations and director of operations at other companies over his long career prior to AMD and Scale.

He's on LinkedIn.

153: GreyBeards annual FMS2023 wrapup with Jim Handy, General Director, Objective Analysis

Jim Handy, General Director of Objective Analysis, and I were at the FMS 2023 conference in Santa Clara last week, and there were a number of interesting discussions at the show. I was particularly struck by the progress being made on the CXL front. I was just a participant, but Jim moderated and sat on many panels during the show, and he comes with a much deeper understanding of the technologies. Listen to the podcast to learn more.

We asked for some of Jim’s top takeaways from the show.

Jim thought that the early Tuesday morning market sessions on the state of the flash, memory and storage markets were particularly well attended. As the first day's earliest sessions, in past years they hadn't been as well attended.

The flash and memory markets both seem to be in a downturn. As the great infrastructure buy-out of COVID ends, demand seems to have collapsed. As always, these and other markets go through cycles: from downturn, where demand collapses and prices fall, to price stability as demand starts to pick up, to supply constrained, where demand can't be satisfied. The general consensus seems to be that we may see a turn in the market by the middle of next year.

CXL is finally catching on. At the show there were a couple of vendors showing memory extension/expansion products using CXL 1.1 as well as CXL switches (extenders) based on CXL 2.0. The challenge with memory today, in this 100+ core CPU world, is trying to keep core-to-memory bandwidth flat while keeping up with application memory demand. CXL was built to deal with both of these concerns.
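
To put rough, illustrative numbers on that bandwidth concern (generic DDR5 figures, not data from the show): a single DDR5-4800 channel moves about 38.4 GB/s, so even a 12-channel socket tops out around 460 GB/s, which spread across 96 cores is only about 4.8 GB/s per core:

    \[
    4800\,\text{MT/s} \times 8\,\text{B} = 38.4\,\text{GB/s per channel}, \qquad
    \frac{12 \times 38.4\,\text{GB/s}}{96\ \text{cores}} \approx 4.8\,\text{GB/s per core}
    \]

CXL attacks this by letting servers attach additional memory (and memory bandwidth) over PCIe-class links instead of relying on CPU memory channels alone.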

CXL has additional latency, but it's very similar to that of dual CPUs accessing shared memory. Jim mentioned that Microsoft Azure actually checked to see whether they could handle CXL latencies by testing with dual socket systems.

There was a lot of continuing discussion on new and emerging memory technologies, and Jim Handy mentioned that his team has just published a new report on this. He also mentioned that CXL could be the killer app for all these new memory technologies, as it can easily handle multiple different technologies with different latencies.

The next big topic was chiplets and the rise of UCIe (Universal Chiplet Interconnect Express) links. AMD led the way with their chiplet-based, multi-core CPU chips, but Intel is there now as well.

Chiplets are becoming the standard way to create significant functionality on silicon. But the problem up to now has been that every vendor had their own proprietary chiplet interconnect.

UCIe is meant to end proprietary interconnects. With UCIe, companies can focus on developing the best chiplet functionality and major manufacturers can pick and choose whichever chiplet offers the best bang for their buck and be assured that it will all talk well over UCIe. Or at least that’s the idea.

Computational storage is starting to become mainstream. Although everyone thought these devices would become general purpose compute engines, they seem to be having more success doing specialized (data) compute services like compression, transcoding, ransomware detection, etc. They are being adopted by companies that need to do that type of work.

Computational memory is becoming a thing. Yes, memristor, PCM, MRAM, etc. have always offered computational capabilities in their technologies, but now organizations are starting to add compute logic to DIMMs to carry out computations close to the memory. We wonder if this will find niche applications, just like computational storage did.

AI continues to drive storage and compute. But we are starting to see some IoT applications of AI as well, and Jim thinks it won't take long to see AI become ubiquitous throughout IT, industry and everyday devices, each with special purpose AI models trained to perform very specific functionality better and faster than general purpose algorithms could.

One thing that's starting to happen is that SSD intelligence is moving out of the SSD (controllers) and to the host. We can see this with the use of Zoned Namespaces (ZNS), but OCP is also pushing flexible data placement (FDP) so hosts can provide hints as to where to place newly written data.

There was more to the show as well. It was interesting to see the continued investment in 3D NAND (1,000 layers by 2030), SSD capacity (256TB SSDs coming in a couple of years), and some emerging tech like memristor development boards and a 3D memory idea, but it's a bit early to tell about that last one.

Jim Handy, Director Objective Analysis

Jim Handy of Objective Analysis has over 35 years in the electronics industry including 20 years as a leading semiconductor and SSD industry analyst. Early in his career he held marketing and design positions at leading semiconductor suppliers including Intel, National Semiconductor, and Infineon.

A frequent presenter at trade shows, Mr. Handy is known for his technical depth, accurate forecasts, widespread industry presence and volume of publication.

He has written hundreds of market reports, articles for trade journals, and white papers, and is frequently interviewed and quoted in the electronics trade press and other media. 

He posts blogs at www.TheMemoryGuy.com and www.TheSSDguy.com.

151: GreyBeards talk AI (ML) performance benchmarks with David Kanter, Exec. Dir. MLCommons

Ray's known David Kanter (@TheKanter), Executive Director, MLCommons, for quite a while now and has been reporting on MLCommons MLPerf AI benchmark results for even longer. MLCommons releases new benchmark results each quarter, and this last week they released new Data Center Training (v3.0) and new Tiny Inferencing (v1.1) results. So, the GreyBeards thought it was time to get a view of what's new in AI benchmarking and what's coming later this year.

David’s been around the startup community in the Bay Area for a while now and sort of started at MLPerf early on as a technical guru working on submissions and other stuff and worked his way up to being the Executive Director/CEO. The big news this week from MLCommons is that they have introduced a new training benchmark and updated an older one. The new one simulates training GPT-3 and they also updated their Recommendation Engine benchmark. Listen to the podcast to learn more.

MLCommons is an industry association focused on supplying repeatable, verifiable benchmarks for machine learning (ML) and AI, which they call MLPerf benchmarks. Their benchmark suite includes a number of different categories, such as data center training, HPC training, data center inferencing, edge inferencing, mobile inferencing and, finally, tiny (IoT device) inferencing. David likes to say MLPerf benchmarks range from systems consuming megawatts (HPC, literally a supercomputer) to microwatts (Tiny) solutions.

The challenge holding AI benchmarking back early on was that a few industry players had done their own thing, but there was no way to compare one to another. MLCommons was born out of that chaos and sought to create a benchmarking regimen that any industry player could use to submit AI work activity, one that would allow customers to compare their solution to any other submission on a representative sample of ML model training and inferencing activity.

MLCommons has both an Open and a Closed class of submissions. For the Closed class, submissions must meet very strict criteria. These include known open source AI models and data, accuracy metrics that training and inferencing need to hit, and reporting a standard set of metrics for the benchmark, all of which need to be done in order to create a repeatable and verifiable submission.

Their Open class is a way for any industry participant to submit whatever model they would like, to whatever accuracy level they want, and it’s typically used to benchmark new hardware, software or AI models.

As mentioned above, MLCommons training benchmarks use accuracy specifications that must be achieved to have a valid submission. Benchmarks also have to be run 3 times. All submissions list hardware (CPUs and accelerators) and software (AI framework), and these could range from 0 accelerators (e.g., CPU only with no GPUs) to 1000s of GPUs.

The new GPT-3 model is a very complex AI model that, until recently, seemed unlikely to ever be benchmarked. But apparently the developers at MLCommons (and their industry partners) have been working on this for some time now. In this round of results there were 3 cloud submissions and 4 on-prem submissions for GPT-3 training.

GPT-3, -3.5 & -4 are all OpenAI solutions which power their ChatGPT text transformer Large Language Model (LLM). GPT-3 has 175B parameters and was trained on TBs of data covering web crawls, book crawls, official documentation, code, etc. OpenAI said, at GPT-3 announcement, it took over $10M and months to train.

The MLCommons GPT-3 benchmark is not a full training run of GPT-3 but uses a training checkpoint, trained on a subset of the data used for the original GPT-3 training. Checkpoints are used for long running jobs (training sessions, weather simulations, fusion energy simulations, etc.) and copy all internal state of a job/system while it's running (ok, quiesced) at some interval (say every 8 hrs, 24 hrs, 48 hrs, etc.), so that in case of a failure, one can restart the activity from the last checkpoint rather than from the beginning.
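
For readers unfamiliar with checkpointing, here's a minimal PyTorch sketch of the general idea (an illustration of checkpoint/restart in general, not the MLCommons GPT-3 benchmark harness); the model, file name and interval are arbitrary stand-ins.

    # Minimal checkpoint/restart sketch (illustrative, not the MLPerf harness).
    import os
    import torch

    CKPT_PATH = "checkpoint.pt"   # arbitrary file name
    CKPT_EVERY = 1000             # arbitrary interval, in training steps

    model = torch.nn.Linear(512, 512)          # stand-in for a real model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    start_step = 0

    # On restart, resume from the last saved checkpoint instead of step 0.
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH)
        model.load_state_dict(ckpt["model"])
        optimizer.load_state_dict(ckpt["optimizer"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 100_000):
        x = torch.randn(32, 512)               # stand-in for a real batch
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Periodically copy all training state to disk so a failure only loses
        # the work done since the last checkpoint.
        if step % CKPT_EVERY == 0:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, CKPT_PATH)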

The MLCommons GPT-3 checkpoint has been trained on a 10B token data set. The benchmark starts by loading this checkpoint and then trains on an even smaller subset of the GPT-3 data until it achieves the accuracy baseline.

Accuracy for text transformers is not as simple as for other models (correct image classification, object identification, etc.) and instead uses "perplexity". Hugging Face defines perplexity as "the exponentiated average negative log-likelihood of a sequence."
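
Written out, for a tokenized sequence X = (x_1, ..., x_N) scored by a model with parameters θ, that definition is:

    \[
    \mathrm{PPL}(X) \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
    \]

Lower perplexity means the model is less "surprised" by held-out text, so the benchmark trains until perplexity (or its log) reaches the specified accuracy target.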

The 4 on-premises submissions for GPT-3 took from 45 minutes (768 NVIDIA H100 GPUs) to 442 minutes (64 Habana Gaudi2 accelerators) to train. The 3 cloud submissions all used NVIDIA H100 GPUs and ranged from 768 GPUs (@47 minutes to train) to 3,584 GPUs (@11 min. to train).

Aside from data center training, MLCommons also released a new round of Tiny (IoT) inferencing benchmarks. These generally use smaller ARM processors and no GPUs, with much smaller AI models such as keyword spotting ("Hey Siri"), visual wake words (door opening), image classification, etc.

We ended our discussion with me asking David why there was no storage-oriented MLCommons benchmark. David said creating a storage benchmark for AI is much different than creating inferencing or training benchmarks. But MLCommons has taken this on and now has an MLCommons Storage series of benchmarks that uses emulated accelerators.

At the moment, anyone with a storage system can submit an MLCommons Storage benchmark result. After some time, MLCommons will only allow submissions from member companies, but early on it's open to all.

For their storage benchmarks, rather than using accuracy as the benchmark criterion, they use keeping (emulated) accelerators X% busy. This way the storage side of MLOps activity can be isolated from the training and inferencing itself.
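
As a rough illustration of that idea (a sketch of the emulated-accelerator approach in general, not MLCommons' actual storage benchmark code): the "accelerator" is just a timed sleep standing in for a training step, and the metric is the fraction of wall-clock time it spends "computing" rather than waiting on storage reads.

    # Sketch of an emulated-accelerator storage benchmark loop (illustrative only).
    import time

    DATA_FILES = [f"/data/shard_{i}.bin" for i in range(100)]  # hypothetical dataset shards
    EMULATED_STEP_SECS = 0.05   # pretend each training step takes 50 ms of "compute"
    TARGET_UTILIZATION = 0.90   # storage passes if the fake accelerator stays 90% busy

    compute_time = 0.0
    wait_time = 0.0

    for path in DATA_FILES:
        # Time how long storage takes to deliver the next batch.
        t0 = time.monotonic()
        with open(path, "rb") as f:
            batch = f.read()
        wait_time += time.monotonic() - t0

        # Emulate the accelerator chewing on the batch; no real training happens.
        time.sleep(EMULATED_STEP_SECS)
        compute_time += EMULATED_STEP_SECS

    utilization = compute_time / (compute_time + wait_time)
    print(f"Emulated accelerator utilization: {utilization:.1%} "
          f"({'PASS' if utilization >= TARGET_UTILIZATION else 'FAIL'})")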

The GreyBeards eagerly anticipate the first round of MLCommons Storage benchmark results, hopefully coming out later this year.

147: GreyBeards talk ransomware protection with Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Sponsored By:

This is another in our series of sponsored podcasts with Jonathan Halstuch (@JAHGT), Co-Founder and CTO of RackTop Systems. You can hear more in Episode 145.

We asked Jonathan what was wrong with ransomware protection today. Jonathan started by mentioning that bad actors had been present, on average, 277 days in an environment before being detected. That much dwell time means they could have easily corrupted most backups and snapshots, stolen copies of all of your most sensitive/proprietary data, and of course, encrypted all your storage.

Backup ransomware protection works ok if dwell time is a couple of days or even a week, but not multiple months or longer. The only real solution to this level of ransomware sophistication is real-time monitoring of IO, looking for illegal activity. Listen to the podcast to learn more.

Often, any data corruption, when discovered, is just notification to an unsuspecting IT organization that they have been compromised and lost control over their systems. Sort of like having a thief ring the door bell to tell you they stole all your stuff after the fact.

The only real solution to data breaches and ransomware attacks with significant dwell time, one that protects both your data and your reputation, is something like RackTop Systems and their BrickStor SP storage system. BrickStor SP offers an ongoing, real-time, active defense against ransomware that's embedded in your data storage and is continuously looking for bad actors and their activities during IO, all day, every day.

When BrickStor detects ransomware in progress, it shuts it down by halting any further access by that user/application and snapshots the data, before corruption, to immutable snapshots. That way admins have a good copy of the data.

In addition, RackTop BrickStor SP supplies run-book-like recovery procedures that tell IT how to retrieve good data from snapshots, without wasting valuable time searching for the "last good backup", which could be months old.

I asked whether data-at-rest encryption could offer any help. Jonathan said data encryption can thwart only some types of attacks. But it's not that useful against ransomware, as bad actors who infiltrate your system masquerade as valid users/admins and, by doing so, gain access to decrypted data.

RackTop Systems uses AI in its labs to create ransomware "assessors", automated routines embedded in their storage data path, which continuously execute looking for bad actor IO patterns. It's these assessors that provide the first line of defense against ransomware.
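
RackTop doesn't disclose how its assessors work internally, but one commonly cited signal is illustrative: encrypted data looks close to random, so a sudden stream of high-entropy overwrites from a single client is suspicious. Here's a minimal sketch of that entropy check (a generic detection technique shown for illustration, not RackTop's actual assessor logic):

    # Generic write-entropy check, for illustration only -- not RackTop's implementation.
    import math
    import os
    from collections import Counter

    def shannon_entropy(data: bytes) -> float:
        """Shannon entropy in bits per byte (~8.0 means effectively random, e.g. encrypted)."""
        if not data:
            return 0.0
        total = len(data)
        return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())

    def looks_like_encryption(write_buffers: list[bytes], threshold: float = 7.5) -> bool:
        """Flag a burst of writes whose average entropy is near-random."""
        if not write_buffers:
            return False
        avg = sum(shannon_entropy(b) for b in write_buffers) / len(write_buffers)
        return avg >= threshold

    # Text-like writes score low; random bytes (a stand-in for encrypted data) score high.
    print(looks_like_encryption([b"quarterly report draft " * 100]))    # False
    print(looks_like_encryption([os.urandom(4096) for _ in range(8)]))  # True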

In addition to assessors, RackTop Systems supplies many reports which depict data access permissions, user/admin access permissions, data being accessed, etc. All of which helps IT and security teams better understand how data is being used and provides the visibility needed to support better cyber security.

When ransomware is detected, RackTop BrickStor offers a number of different notification features that range from webhooks and Slack channels to email notices and just about everything in between, to notify IT and security teams that a breach is occurring and where.

RackTop Systems BrickStor SP is available in many deployments. One new option, from HPE, uses their block storage to present LUNs to BrickStor SP. Jonathan mentioned that other enterprise class block storage vendors are starting to use BrickStor SP to supply secure NAS services for their customers as well.

Jonathan mentioned that RackTop attended the HIMSS conference in Chicago last week and will be attending many others throughout the year. So check them out at a conference near you if you get a chance.

Jonathan Halstuch, Co-Founder & CTO RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and co-founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.

146: GreyBeards talk K8s cloud storage with Brian Carmody, Field CTO, Volumez

We’ve known Brian Carmody (@initzero), Field CTO, Volumez for over a decade now and he’s always been very technically astute. He moved to Volumez earlier this year and has once again joined a storage startup. Volumez is a cloud K8s storage provider with a new twist, K8s persistent volumes hosted on ephemeral storage.

Volumez currently works in public clouds (AWS & Azure (soft launch), with GCP coming soon) and is all about supplying high performing, enterprise class data services to K8s container apps. But it does this using transient (Azure ephemeral & AWS instance) storage and standard Linux. Hyperscalers offer transient storage almost as an afterthought with customer compute instances. Listen to the podcast to learn more.

It turns out that, over the last decade or so, a lot of time and effort has been devoted to maturing Linux's storage stack, and nowadays, with appropriate configuration, Linux can offer enterprise class data services and performance using direct attached NVMe SSDs. These services include thin provisioning, encryption, RAID/erasure coding, snapshots, etc., which, on top of NVMe SSDs, provide IOPS, bandwidth and latency performance that boggles the mind.
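
To give a sense of the by-hand layering involved (a sketch with illustrative device names and sizes, using standard md/LUKS/LVM tooling; this is not Volumez's code), the steps below stack RAID, encryption, a thin pool and a snapshot on top of NVMe devices:

    # Illustrative by-hand layering of Linux data services over NVMe SSDs.
    # Device names and sizes are examples; these commands destroy data if actually run.
    import shlex
    import subprocess

    DRY_RUN = True  # flip to False only on a scratch machine

    def run(cmd: list[str]) -> None:
        print("+", shlex.join(cmd))
        if not DRY_RUN:
            subprocess.run(cmd, check=True)

    # RAID-5 across three NVMe devices.
    run(["mdadm", "--create", "/dev/md0", "--level=5", "--raid-devices=3",
         "/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"])
    # Encrypt the array with LUKS and open it as /dev/mapper/secure.
    run(["cryptsetup", "luksFormat", "/dev/md0"])
    run(["cryptsetup", "open", "/dev/md0", "secure"])
    # LVM on top: a thin pool, a thin-provisioned volume, and a snapshot of it.
    run(["pvcreate", "/dev/mapper/secure"])
    run(["vgcreate", "vg_fast", "/dev/mapper/secure"])
    run(["lvcreate", "--type", "thin-pool", "-L", "500G", "-n", "pool0", "vg_fast"])
    run(["lvcreate", "-V", "100G", "--thinpool", "vg_fast/pool0", "-n", "vol0"])
    run(["lvcreate", "-s", "-n", "vol0_snap", "vg_fast/vol0"])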

However, configuring Linux's sophisticated, high performing data services by hand is a hard problem to solve.

Enter Volumez. They have a SaaS control plane, client software, plus CSI drivers that will configure Linux with ephemeral storage to support any performance level and data service that can be obtained from NVMe SSDs.

Once installed on your K8s cluster, Volumez software profiles all ephemeral storage and supplies that information to their SaaS control plane. Once that's done, your platform engineers can define specific storage class policies or profiles usable by DevOps to consume ephemeral storage.

These policies identify volume [IOPS, bandwidth, latency] x [read, write] performance specifications as well as data protection, resiliency and other data service requirements. DevOps engineers consume this storage using PVCs that call for these storage classes at some capacity. When it sees the PVC, the Volumez SaaS control plane will carve out slices of ephemeral storage that can support the performance and other storage requirements defined in the storage class.
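
For example, the DevOps side of that workflow is just an ordinary PVC naming one of the platform team's storage classes. Here's a minimal sketch using the Kubernetes Python client (the claim name, storage class name, namespace and size are hypothetical; applying the equivalent YAML with kubectl works just as well):

    # Creating a PVC against a (hypothetical) Volumez-backed storage class.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a pod

    pvc_manifest = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "orders-db-data"},                # hypothetical name
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": "vlz-high-iops-mirrored",      # hypothetical class
            "resources": {"requests": {"storage": "200Gi"}},   # hypothetical size
        },
    }

    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc_manifest
    )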

Once that’s done, their control plane next creates a network path from the compute instances with ephemeral storage to the worker nodes running container apps. After that it steps out of the picture and the container apps have a direct (network) data path to the storage they requested. Note, Volumez’s SaaS control plane is not in the container app storage data path at all.

Volumez supports multi-AZ data resiliency for PVCs. In this case, another mirror K8s cluster would reside in another AZ, with Volumez software active and similar if not equivalent ephemeral storage. Volumez will configure the container volume to mirror data between AZs. Similarly, if the policy requests erasure coding, Volumez SaaS software configures the ephemeral storage to provide erasure coding for that container volume.

Brian said they’ve done some amazing work to increase the speed of Linux snapshotting and restoring.

As noted above, the Volumez control plane SaaS software is outside the data path, so even if the K8s cluster running Volumez enabled storage loses access to the control plane, container apps continue to run and perform IO to their storage. This can continue until there’s a new PVC request that requires access to their control plane.

Ephemeral storage is accessed through special compute instances. These are not K8s worker nodes; they essentially act as a passthrough or network attachment between worker nodes running apps with PVCs and the Volumez-configured Linux logical volumes hosted on slices of ephemeral storage.

Volumez is gaining customer traction with data platform clients, DBaaS companies, and some HPC environments. But just about anyone needing high performing data services for cloud K8s container apps should give Volumez a try.

I looked at AWS to see how they price instance store capacities and found out it’s not priced separately, but rather instance storage is bundled into the cost of EC2 compute instances.

Volumez is priced based on the number of media devices (instance/ephemeral stores) and the performance (IOPS) available. They also have different tiers depending on support level requirements (e.g., community, business hours, 24x7), which also offer different levels of enterprise security functionality.

Brian said they have a free tier that customers can easily signup for and try out by going to their web site (see link above), or if you would like a guided demo, just contact him directly.

Brian Carmody, Field CTO, Volumez

Brian Carmody is Field CTO at Volumez. Prior to joining Volumez, he served as Chief Technology Officer of data storage company Infinidat where he drove the company’s technology vision and strategy as it ramped from pre-revenue to market leadership.

Before joining Infinidat, Brian worked in the Systems and Technology Group at IBM where he held senior roles in product management and solutions engineering focusing on distributed storage system technologies.

Prior to IBM, Brian served as a technology executive at MTV Networks Viacom, and at Novus Consulting Group as a Principal in the Media & Entertainment and Banking practices.