157: GreyBeards talk commercial cloud computer with Bryan Cantrill, CTO, Oxide Computer

Bryan Cantrill (@bcantrill), CTO, Oxide Computer was a hard man to interrupt once started but the GreyBeards did their best to have a conversation. Nonetheless, this is a long podcast. Oxide are making a huge bet on rack scale computing and have done everything they can to make their rack easy to unbox, set up and deploy VMs on.

They use commodity parts (AMD EPYC CPUs) and package them in their own designed hardware (server) sleds, which blind mate to networking and power in the back of their own designed rack. They use their own OS, Helios (an OpenSolaris derivative), with their own RTOS, Hubris, for system bringup, monitoring and the start of their hardware root of trust. And of course, to make it all connect easier, they designed and developed their own programmable networking switch. Listen to the podcast to learn more.

Oxide essentially provides rack hardware which supports EC2-like compute and EBS-like storage to customers. It also has Terraform plugins to support infrastructure as code. In addition, all their software is completely API driven.

Bryan said time and time again that developing their own hardware and software made everything easier for them and their customers. Customers pay for hardware but there are absolutely NO SOFTWARE LICENSING FEES, because all their software is open source.

For example, the problem with AMI BIOS and UEFIs is their opacity. There’s really no way to understand what packages are included in their root of trust because it’s proprietary. Bryan said one company’s UEFI they examined had URLs embedded in firmware. It seemed odd to have another vendor’s web pages linked to their root of trust.

Bryan said they did their own switch to reduce integration and validation test time. The Oxide rack supports all internal networking, compute sled to compute sled, and ToR switch (with no external cabling) and has 32 networking ports to connect the rack to the data center’s core networking.

As for storage, Bryan said each of the 10 U.2 NVMe drives in their compute sled is a separate, ZFS file system and customer data is 3 way mirrored across any of them. ZFS also provides end to end checksumming across all customer data for IO integrity.
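To make the combination of mirroring and end to end checksumming concrete, here is a minimal sketch of the idea. It is not Oxide's or ZFS's actual implementation; the `MirroredBlock` class, the use of SHA-256 and the in-memory replicas are all assumptions for illustration only. The point is that a checksum stored apart from the data lets a read detect a silently corrupted copy and fall back to a good one.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum used to verify a block on every read (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


class MirroredBlock:
    """Hypothetical block kept as N identical copies plus one expected checksum."""

    def __init__(self, data: bytes, copies: int = 3):
        self.expected = checksum(data)
        self.replicas = [bytearray(data) for _ in range(copies)]

    def read(self) -> bytes:
        # Return the first replica whose contents still match the checksum.
        for replica in self.replicas:
            if checksum(bytes(replica)) == self.expected:
                return bytes(replica)
        raise IOError("all replicas failed checksum verification")


block = MirroredBlock(b"customer data")
block.replicas[0][0] ^= 0xFF  # simulate silent corruption of one copy
recovered = block.read()      # served from an intact copy instead
```

With three copies, two can go bad before a read fails, which is the practical argument for 3 way mirroring over 2 way.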

Bryan said Oxide Computer rack bring up is 1) plug it into core networking and power, 2) power it on, 3) attach a laptop to their service processor, 4) SSH into it, 5) run a configuration script and you’re ready to assign VMs. He said that the time from an Oxide Rack hitting your dock until you are firing up VMs could be as short as an HOUR.

The Rust programming language is the other secret to Oxide’s success. More to the point, their company is named after Rust (oxide, get it?). Apparently just about all the software they developed is written in Rust.

The question for Oxide and every other computer and storage vendor is: do you believe that on-premises computing will continue for the foreseeable future? The GreyBeards and Oxide believe yes, not only for compliance and better latency but also because it often costs less.

Bryan mentioned they have their own podcast, Oxide and Friends. On their podcast, they did a board bring up series (Tales from the Bring-Up Lab) and a series on taking their rack through FCC compliance (Oxide and the Chamber of Mysteries).

Bryan Cantrill, CTO, Oxide Computer

Bryan Cantrill is a software engineer who has spent over a quarter of a century at the hardware/software interface. He is the co-founder and CTO of Oxide Computer Company, the creator of the world’s first commercial cloud computer.

Prior to Oxide, he spent nearly a decade at Joyent, a cloud computing pioneer; prior to Joyent, he spent 14 years at Sun Microsystems.

Bryan received the Sc.B. magna cum laude with honors in Computer Science from Brown University, and is an MIT Technology Review 35 Top Young Innovators alumnus.

You can learn more about his work with Oxide at oxide.computer, or listen in on their weekly live show, Oxide and Friends (link above), on Discord or anywhere you get your podcasts.

154: GreyBeards annual VMware Explore wrap-up podcast

Thanks, once again, to The CTO Advisor|Keith Townsend (@CTOadvisor) for letting us record the podcast in his studio. VMware Explore this year was better than last year. The show seemed larger, the show floor busier, the Hub better and the Hands-On Lab much larger than I ever remember before. The show seems to be growing. It's still not at pre-pandemic levels, but the trend is good.

The engineers have been busy at VMware this past year. Announcements at the show included Private AI Foundation, a way for enterprises to train open source LLMs on corporate data kept private; a significant redirection for VMware Edge environments, moving from pull model updates to push model updates; and vSAN Max, NSX+, Tanzu App Engine, and more. And we heard that Broadcom is clearing more hurdles to the acquisition. Listen to the podcast to learn more.

Private AI plays to VMware’s strengths and its control over on-prem processing. Customers need a safe space and secured data to train corporate ChatBots curated on the corporation’s knowledge base. VMware rolled this out two ways:

  • Reference architecture approach based on Ray cluster management, KubeFlow, PyTorch, VectorDB, GPU scaling (NVLink/NVSwitch), vSAN fast path (RDMA, GPUDirect), and deep learning VMs. There was no discussion of tie-ins to the Data Persistence (object) storage.
  • Proprietary NVIDIA approach based on NVIDIA Workbench, TensorRT, NeMo, and the NVIDIA GPU & Network Operator.

By having both approaches, VMware provides alternatives for those wanting a non-proprietary solution. And with AI/MLOps moving so fast, the open source approach may be better able to keep up.

The tie in with NVIDIA is a natural extension of what VMware have been doing with GPUs and DPUs, etc.

Also, VMware announced a technological partnership with Hugging Face. We were somewhat concerned with all the focus on LLM and GenAI but the agreement with Hugging Face goes beyond just LLMs.

VMware Edge solutions are pivoting. Apparently, VMware is moving from the vSphere pull model of code updates in the field which seems to handle 64 server, multi-cluster environments without problem to more of a YAML-GitHub push model of IoT device updates that seems better able to manage fleets of 1K to 100K devices in the field.

With the new model, one creates a GitHub repo and a YAML file describing the code update to be done, and all your IoT devices just get updated to the new level.

Once again the Broadcom acquisition is on everyone’s mind. As I got to the show, one analyst asked if this was going to be the last VMware Explore. I highly doubt that, but Broadcom will make lots of changes once the transaction closes. One thing mentioned at the show was that Broadcom will make an immediate, additional $1B investment in R&D. The deal had provisionally passed the UK regulatory body and was on track to close near the end of October.

Other news from the show:

  • The Tanzu brand is broadening. Tanzu Application Platform (TAP) still exists but they have added a new App Engine to take the VMware management approach to K8s clusters, other cloud infrastructure and the rest of the IT world. Tanzu Intelligent Services also now supports policy guardrails, cost control, management insight and migration services for other environments.
  • vSAN Max, which supports disaggregation (separation) of storage and compute is available. vSAN Max becomes a full fledged, standalone storage system that just happens to run on top of vSphere. Disaggregated (vSAN Max) storage and (regular vSAN) HCI can co-exist as different mounted datastores and vSAN Max supports PB of storage.
  • Workspace One is updated to provide enhanced digital experience monitoring that adds coverage of what Workspace One users are actually experiencing.
  • NSX+ continues to roll out. VMware mentioned that the number one continuing problem with hybrid cloud/multi-cloud setup is getting the networking right. NSX+ will reduce this complexity by becoming a management/configuration overlay over any and all cloud/on-prem networking for your environment(s).
  • VMware chatbots for Tanzu, Workspace One and NSX+ are now in tech preview and will supply intelligent assistants for these solutions. Based on LLM/GenAI and trained on VMware’s extensive corporate knowledge base, the chatbots will help admins focus on the signal over the noise and will provide recommendations on how to resolve issues.

Jason Collier, Principal Member of Technical Staff, AMD

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology. Jason currently works at AMD focused on emerging technology for IT, IoT and anywhere else in the world and across the universe that needs compute, storage or networking resources.

He was Chief Evangelist, CTO & Co-Founder of Scale Computing and has been an innovator in the field of hyper-convergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He has also been another co-founder, director of research, VP of technical operations and director of operations at other companies over his long career prior to AMD and Scale.

He’s on LinkedIn.

152: GreyBeards talk agent-less data security with Jonathan Halstuch, Co-Founder & CTO, RackTop Systems

Sponsored By:

Once again we return to our ongoing series with RackTop Systems, and their Co-Founder & CTO, Jonathan Halstuch (@JAHGT). This time we discuss how agent-less, storage based security works and how it can help secure the many organizations with (IoT) end points they may not control or can’t deploy agents on. But agent-less security can also help other organizations that do have security agents deployed across their end points. Listen to the podcast to learn more.

The challenge for enterprises with agent based security is that not all end points support agents. Jonathan mentioned one health care customer with an older electron microscope that couldn’t be modified. These older, outdated systems are often targeted by cyber criminals because they are seldom updated.

But even the newest IoT devices often can’t be modified by organizations that use them. Agent-less, storage based security can be a final line of defense to any environment with IoT devices deployed.

But security exposures go beyond IoT devices. Agents can sometimes take manual effort to deploy and update. And as such, sometimes they are left un-deployed or improperly configured.

The advantage of a storage based, agent-less security approach is that it’s always on/always present, because it’s in the middle of the data path and is updated by the storage company, where possible. Yes, not every organization may allow this, and for those organizations, storage agent updates will also require manual effort.

Jonathan mentioned the term Data Firewall. I (a networking novice, at best) have always felt firewalls were a configuration nightmare.

But as we’ve discussed previously in our series, RackTop has a “learning” and an “active” mode. During learning, the system automatically configures application/user IO assessors to characterize normal IO activity. Once learning has completed, the RackTop systems in the environment understand what sorts of IO to expect from users/applications and can then flag anything outside normal IO patterns.
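As a rough illustration of the learning/active idea, the sketch below records which (operation, directory) pairs each user touches while learning and flags anything outside that baseline once active. It also checks a small list of known-bad patterns in both modes. This is only a toy model: the `IOAssessor` class, the suffix list and the return values are hypothetical and say nothing about how RackTop's assessors are actually built.

```python
from collections import defaultdict

# Hypothetical known-bad signatures; checked in both learning and active mode.
KNOWN_BAD_SUFFIXES = (".locked", ".encrypted")


class IOAssessor:
    """Toy model: learn a per-user baseline, then flag IO outside it."""

    def __init__(self):
        self.learning = True
        self.baseline = defaultdict(set)  # user -> {(operation, directory)}

    def observe(self, user: str, op: str, path: str) -> str:
        # Known-bad activity is stopped regardless of mode.
        if op == "write" and path.endswith(KNOWN_BAD_SUFFIXES):
            return "blocked"
        key = (op, path.rsplit("/", 1)[0])
        if self.learning:
            self.baseline[user].add(key)
            return "learned"
        return "ok" if key in self.baseline[user] else "flagged"


assessor = IOAssessor()
assessor.observe("alice", "read", "/finance/q3.xlsx")  # builds the baseline
assessor.learning = False                               # switch to active mode
normal = assessor.observe("alice", "read", "/finance/q4.xlsx")
unusual = assessor.observe("alice", "write", "/hr/payroll.csv")
```

A real system would characterize far more than operation and directory (rates, sizes, times of day), but the shape is the same: build a baseline first, then compare against it.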

But even during “learning” mode, the system is actively monitoring for known malware signatures and other previously characterized bad actor IO. These assessors are always active.

Keith mentioned that most organizations run special jobs on occasion (quarterly, yearly) which might not have been characterized during learning. Jonathan said these will be flagged and may be halted (depending on RackTop’s configuration). But authorized parties can easily approve that application’s IO activity, using a web link provided in the storage security alert.

Once alerted, authorized personnel can allow that IO activity for a specific time period (say Dec-Jan), or just for a one time event. When the time period expires, that sort of IO will be flagged again.

Some sophisticated customers have change control and may know, ahead of time, that end of quarter or end of year processing is coming up. If so, they can easily configure RackTop Systems, ahead of time, to authorize the application’s IO activity. In this case there wouldn’t be any interruption to the application.
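The time-boxed approvals described above could be modeled along these lines. This is a sketch under assumptions: the `Approval` record, the application name and the dates are made up for illustration and are not RackTop's data model. An approval covers one application for a date window; outside the window, the same IO would be flagged again.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Approval:
    """Hypothetical record: one application approved for a date window."""
    app: str
    start: date
    end: date


def is_approved(approvals, app: str, today: date) -> bool:
    # Approved only if some approval for this app covers today's date.
    return any(a.app == app and a.start <= today <= a.end for a in approvals)


# Pre-authorize year-end processing ahead of time (Dec through Jan).
approvals = [Approval("year-end-close", date(2023, 12, 1), date(2024, 1, 31))]

during_window = is_approved(approvals, "year-end-close", date(2023, 12, 15))
after_window = is_approved(approvals, "year-end-close", date(2024, 2, 1))
```

A one time approval is just the degenerate case where `start` and `end` are the same day.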

With RackTop Systems, security agents are centrally located, in the data path, and are always operating. This has no dependency on your backend storage (SAN, cloud, hybrid storage, etc.) or on any end point. If anything in your environment accesses data, those RackTop Systems assessors will be active, checking IO activity and securing your data.

Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and co-founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.

150: GreyBeard talks Zero Trust with Jonathan Halstuch, Co-founder & CTO, RackTop Systems

Sponsored By:

This is another in our series of sponsored podcasts with Jonathan Halstuch (@JAHGT), Co-Founder and CTO of RackTop Systems. You can hear more in Episode #147 on RansomWare protection and Episode #145 on proactive NAS security.

Zero Trust Architecture (ZTA) has been touted as the next level of security for a while now. As such, it spans all of IT infrastructure. But from a storage perspective, it’s all about the latest NFS and SMB protocols together with an extreme level of security awareness that infuses storage systems.

RackTop has, from the get-go, always focused on secure storage. ZTA with RackTop adds, on top of protocol logins, an understanding of what normal IO looks like for apps, users, & admins and makes sure IO doesn’t deviate from what it should be. We discussed some of this in Episode #145, but this podcast provides even more detail. Listen to the podcast to learn more.

ZTA starts by requiring all participants in an IT infrastructure transaction to mutually authenticate one another. In modern storage protocols this is done via protocol logins. Besides logins, ZTA can establish different timeouts to tell servers and clients when to re-authenticate.

Furthermore, ZTA doesn’t just authenticate user/app/admin identity, it can also require that clients access storage only from authorized locations. That is, a client’s location on the network and in servers is also authenticated and, when changed, triggers a system response.

Also, with ZTA, PBAC/ABAC (policy/attribute based access controls) can be used to associate different files with different security policies. Above we talked about authentication timeouts and location requirements but PBAC/ABAC can also specify different authentication methods that need to be used.
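The three checks discussed above (required authentication method, re-authentication timeout and authorized location) can be sketched as a single policy evaluation. Everything here is illustrative: the `Policy` and `Session` fields, the return strings and the example network are assumptions, not any product's API. Only the standard library `ipaddress` module is used for the location test.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical per-file (or per-share) security policy."""
    allowed_networks: list   # CIDR strings clients may connect from
    reauth_seconds: int      # how long a login stays valid
    required_auth: str       # authentication method this data demands


@dataclass
class Session:
    user: str
    client_ip: str
    auth_method: str
    authenticated_at: float  # seconds, same clock as `now` below


def check_access(policy: Policy, session: Session, now: float) -> str:
    if session.auth_method != policy.required_auth:
        return "deny: auth method"
    if now - session.authenticated_at > policy.reauth_seconds:
        return "reauthenticate"
    ip = ipaddress.ip_address(session.client_ip)
    if not any(ip in ipaddress.ip_network(net) for net in policy.allowed_networks):
        return "deny: location"
    return "allow"


policy = Policy(["10.1.0.0/16"], reauth_seconds=900, required_auth="kerberos")
session = Session("alice", "10.1.2.3", "kerberos", authenticated_at=1000.0)

fresh_result = check_access(policy, session, now=1100.0)  # within timeout, right place
stale_result = check_access(policy, session, now=2000.0)  # login older than 900s
```

Attaching a different `Policy` to different files is the essence of the PBAC/ABAC idea: the same user can be allowed on one file and asked to re-authenticate on another.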

RackTop Systems does all of that and more. But where RackTop really differs from most other storage is that it supports two modes of operation: an observation mode and an enforcement mode. During observation mode, the system observes all the IO a client performs to characterize its IO history.

Even during observation mode, RackTop has been factory pre-trained with what bad actor IO has looked like in the past. This includes all known ransomware IO, unusual user IO, unusual admin IO, etc. During observation mode, if it detects any of this bad actor IO, it will flag and report it. For example, admins performing high read/write IO to multiple files will be detected as abnormal, flagged and reported.

But after some time in observation mode, admins can change RackTop into enforcement mode. At this point, the system understands what normal client IO looks like and if anything abnormal occurs, the system detects, flags and reports it.

RackTop customers have many options as to what the system will do when abnormal IO is detected. This can range from completely shutting down client IO to just reporting and logging it.

Jonathan mentioned that RackTop is widely installed in multi-level security environments. For example, in many government agencies, it’s not unusual to have top secret, secret, and unclassified information, each with their own PBAC/ABAC enforcement criteria.

RackTop has a long history of supporting storage for these extreme security environments. As such, customers should be well assured that their data can be as secured as any data in national government agencies.

Jonathan Halstuch, Co-Founder & CTO RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and co-founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.

144: Greybeard talks AI IO with Subramanian Kartik & Howard Marks of VAST Data

Sponsored by

Today we talked with VAST Data’s Subramanian Kartik (@phyzzycyst), Global Systems Engineering Lead and Howard Marks (@DeepStorage@mastodon.social, @deepstoragenet) former GreyBeards co-host and now Technologist Extraordinary & Plenipotentiary at VAST. Howard needs no introduction to our listeners but Kartik does. Kartik has supported a number of customers implementing AI apps at VAST and prior companies, so he is well versed in the reality of AI ML DL. Moreover, VAST recently funded Silverton Consulting to write a paper discussing Deep Learning IO.

Although AI ML DL applications have been very popular these days in IT, there’s been a continuing challenge trying to understand their IO requirements. Listen to the podcast to learn more.

AI ML DL Neural Networks (NN) models train with data and lots of it while inferencing is also very data dependent. Kartik said AI model IO consists of small block, random reads with very few writes.

Some models contain huge NNs which consume mountains of data to train while others are relatively small and consume much less. GPT-3(.5), the model behind the original ChatGPT, has ~175B parameters in its ~800GB NN.

As many of us know, the key to AI processing is GPU hardware, which performs most, if not all, of the computations to train models and supply inferences. Moreover, to maximize training throughput, many organizations deploy model parallelism, using 10s to 1000s of GPUs.

For instance, in the paper mentioned earlier, we showed a model training IO chart based on all six storage vendor published NVIDIA DGX-A100 Reference Architecture reports for ResNet-50. On this single chart, all 6 storage systems supplied roughly the same images processed/sec (or ~IO bandwidth) performance to train the model on each of 8, 16 & 32 GPUs configurations. This is very unusual from our perspective but shows that ResNet-50 training is not IO bound.

However, another approach to speeding up NN training is to take advantage of newer, more advanced IO protocols. NVIDIA GPUDirect Storage transfers data directly from storage memory to GPU memory, bypassing CPU memory altogether, which can significantly speed up GPU data consumption. It turns out that one bottleneck for AI training is CPU memory bandwidth.

In addition, most AI model training reads data from a single file system mount point. Historically, an NFS mount point was limited to a single TCP connection and a maximum of ~2.5GB/sec of IO bandwidth. Recently, however, NConnect for NFS has been introduced, which increased TCP connections to 16 per mount point.

Despite that, VAST Data found that by adding some code to Linux’s NFS TCP stack, they were able to increase NConnect to 64 TCP connections per compute node. Howard mentioned that with these changes and a 16 (compute) node VAST Data storage cluster they sustained 175GB/sec of GPUDirect Storage bandwidth using DGX-A100 systems.
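A quick back-of-the-envelope calculation shows why the connection count matters. The sketch below simply multiplies the ~2.5GB/sec single-connection figure quoted above by the number of TCP connections. Treat the results as theoretical ceilings only: the assumption that bandwidth scales linearly with connections is ours, and real results depend on NICs, CPUs and the storage behind the mount.

```python
# Approximate single-TCP-connection limit cited above (GB/sec).
PER_CONNECTION_GBPS = 2.5


def mount_ceiling(connections: int, per_connection: float = PER_CONNECTION_GBPS) -> float:
    """Upper bound on bandwidth, ASSUMING perfectly linear scaling per connection."""
    return connections * per_connection


# 1 = classic NFS mount, 16 = NConnect, 64 = the modified stack described above.
for n in (1, 16, 64):
    print(f"{n:>2} connections -> up to ~{mount_ceiling(n):.1f} GB/sec")
```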

Subramanian Kartik, Global Systems Engineering Lead, VAST Data

Subramanian Kartik has been the Vice President of Systems Engineering at VAST Data since January of 2020, running the global presales organization. He is part of the incredible success of VAST Data which increased almost 10-fold in valuation and revenue in this period.

An accomplished technologist and executive in the industry, he has a wide array of experience in Cloud Architectures, AI/Machine Learning/Deep Learning, as well as in the  Life Sciences, covering high-performance computing and storage. He has had a lifelong deep passion for studying complex problems in all spheres spanning both workloads and infrastructure at the vanguard of current day technology. 

Prior to his work at VAST Data, he was with EMC (later Dell) for two decades, as both a Distinguished Engineer and global executive running the Converged and Hyperconverged Division  go-to-market. He has a Ph.D in Particle Physics with over 75 publications and 3 patents to his credit over the years. He enjoys mathematics, jazz, cooking and travelling with his family in his non-existent spare time.

Howard Marks, (former GreyBeards Co-Host) Technologist Extraordinary and Plenipotentiary, VAST Data

Howard Marks brings over forty years of experience as a technology architect for hire and industry observer to his role as VAST Data’s Technologist Extraordinary and Plenipotentiary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage, an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.