163: GreyBeards talk Ultra Ethernet with Dr J Metz, Chair of UEC steering committee, Chair of SNIA BoD, & Tech. Dir. AMD

Dr J Metz (@drjmetz, blog) has been on our podcast before, mostly in his role as SNIA spokesperson and BoD Chair, but this time he’s here discussing some of his latest work with the Ultra Ethernet Consortium (UEC) (LinkedIn: @ultraethernet, X: @ultraethernet).

The UEC is a full-stack rethink of what Ethernet could do for large, single-application environments. UEC was originally focused on HPC, with 400-800Gbps networks and single applications like simulating a hypersonic missile or airplane. But with the emergence of GenAI and LLMs, UEC could also be very effective for large AI model training, with massive clusters running a single LLM training job over months. Listen to the podcast to learn more.

The UEC is outside the realm of normal enterprise environments. But as AI training becomes more ubiquitous, who knows whether UEC may find a place in the enterprise. However, it’s not intended for mixed network environments with multiple applications. It’s a single-application network.

One wouldn’t think HPC was a big user of Ethernet for its main network. But Dr J pointed out that the top 3 of the HPC Top500 all use Ethernet, and more are looking to use it in the future.

UEC is essentially an optimized networking software stack, plus supporting hardware, for single-application environments. These types of workloads are constantly pushing the networking envelope. And by taking advantage of the “special networking personalities” of these workloads, UEC can significantly reduce networking overheads, boosting bandwidth and speeding up workload execution.

The scale of these networks is extreme. The UEC is targeting up to a million endpoints, over >100K servers, with each network link >100Gbps and more likely 400-800Gbps. With the new networking cards coming out (from AMD and others) that support 4 400/800Gbps network ports, a pair of these on each server of a 100K-server cluster gives 800K endpoints. A million is not that far away when you think of it at that scale.
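To make the arithmetic concrete, here’s a minimal back-of-the-envelope sketch in Python; the server, NIC, and port counts are just the round numbers from the discussion above, not anything from the UEC spec.

```python
# Back-of-the-envelope endpoint math for a UEC-scale cluster.
# All figures are illustrative round numbers from the discussion above.
servers = 100_000       # ~100K servers in the cluster
nics_per_server = 2     # a pair of multi-port NICs per server
ports_per_nic = 4       # 4x 400/800Gbps ports per NIC

endpoints = servers * nics_per_server * ports_per_nic
print(f"{endpoints:,} network endpoints")  # 800,000; a million isn't far off
```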

Moreover, LLM training and HPC work are starting to look more alike these days. Yes, there are differences, but the scale of their clusters is similar, and the way work is sometimes fed to them is similar, which leads to similar networking requirements.

UEC is attempting to handle a 5% problem. That is, 95% of users will not have 1M endpoints in their LAN, but maybe 5% will, and for that 5%, a more mixed networking workload is unnecessary. In fact, a mixed network becomes a burden, slowing down packet transmission.

UEC is finding that with a few select networking parameters, almost like workload fingerprints, network stacks can be much more optimized than current Ethernet and thereby support reduced packet overheads, and more bandwidth.

AI and HPC networks share a very limited set of characteristics which can be used as fingerprints. These characteristics are things like reliable or unreliable transport, ordered or unordered delivery, multi-path packet spraying or not, etc. With a set of these types of parameters, selected for an environment, UEC can optimize a network stack to better support a million networking endpoints.
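As a rough illustration of the idea, a workload “fingerprint” can be thought of as a handful of transport choices that the stack is then specialized around. The field names and optimization hints below are hypothetical, invented for illustration; they are not terms from the UEC specification.

```python
from dataclasses import dataclass

# Hypothetical workload "fingerprint": a handful of transport choices like
# those discussed above. Field names are illustrative, not UEC spec terms.
@dataclass(frozen=True)
class TrafficProfile:
    reliable_delivery: bool   # reliable vs. unreliable transport
    ordered_delivery: bool    # in-order vs. out-of-order completion
    packet_spraying: bool     # spray packets across multiple paths or not

# An AI-training-like profile: reliability matters, strict ordering doesn't,
# and multipath packet spraying keeps all links busy.
ai_training = TrafficProfile(reliable_delivery=True,
                             ordered_delivery=False,
                             packet_spraying=True)

def stack_hints(p: TrafficProfile) -> list[str]:
    """Translate a profile into the kinds of optimizations a stack might enable."""
    hints = []
    if not p.ordered_delivery:
        hints.append("skip per-flow reorder buffers")
    if p.packet_spraying:
        hints.append("load-balance packets across all paths")
    if p.reliable_delivery:
        hints.append("use selective retransmit instead of go-back-N")
    return hints

print(stack_hints(ai_training))
```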

We asked where CXL fits in with UEC. Dr J said it could potentially be an entity on the network, but he sees CXL more as a within-server, or tightly coupled (limited) cluster-of-servers, solution rather than something on a UEC network.

Just 12 months ago the UEC had 10 members or so and this past week they were up to 60. UEC seems to have struck a chord.

The UEC plans to release a 1.0 specification near the end of this year. UEC 1.0 is intended to operate on current (>100Gbps) networking equipment with firmware/software changes.

Considering the UEC was just founded in 2023, putting out their 1.0 technical spec within 1.5 years is astonishing. It also speaks volumes about the interest in the technology.

The UEC has a blog post which talks more about UEC 1.0 specification and the technology behind it.

Dr J Metz, Chair of UEC Steering Committee, Chair of SNIA BoD, Technical Director of Systems Design, AMD

J works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the inner workings and application of emerging technologies.

J has previously held roles in both startups and Fortune 100 companies as a Field CTO,  R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.

J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.

159: GreyBeards Year End 2023 Wrap Up

Jason and Keith joined Ray for our annual year end wrap up and look ahead to 2024. I planned to discuss infrastructure technical topics but was overruled. Once we started talking AI, we couldn’t stop.

It’s hard to believe that Generative AI, and ChatGPT in particular, haven’t been around that long. We discussed some practical uses Keith and Jason have found for the technology.

Keith mentioned its primary skill is language expertise. He has used it to help write up proposals. He often struggles to convince CTO Advisor non-sponsors of the value the firm can bring, and he has found that using GenAI helps him make that case better.

Jason mentioned he uses it to create BASH, Perl, and PowerShell scripts. He says it’s not perfect, but it gets him ~80% of the way there, and with a few tweaks he has something working a lot faster than if he had to write it completely by hand. He also mentioned its skill at translating from one scripting language to another, and how well the code it generates is documented (that hurt).

I was the odd GreyBeard out, having not used any GenAI, proprietary or not. I’m still working to get a reinforcement learning task to work well and consistently. I figured once I mastered that, I’d train an LLM on my body of (text and code) work (assuming, of course, someone gifts me a gang of GPUs).

I agreed GenAI is good at (English) language and some coding tasks (where lots of source code exists, such as Java, Python, shell scripting, etc.).

However, I was on a MLops slack channel and someone asked if GenAI could help with IBM RPG II code. I answered, probably not. There’s just not a lot of RPG II code publicly accessible on the web and the structure of RPG was never line of text/commands oriented.

We had some heated discussion on where LLMs get the data to train with. Keith was fine with them using his data. I was not. Jason was neutral.

We then turned to what this means to the white collar workers who are coding and writing text. Keith made the point that this has been a concern throughout history, at least since the industrial revolution.

Machines come along, displace work that was done by hand, increase production immensely, reduce costs. Organizations benefit, but people doing those jobs need to up level their skills, to take advantage of the new capabilities.

Easy for us to say, since all of us, except for Jason in his present job, are essentially entrepreneurs, and anything that helps us deliver more value faster, easier, or less expensively is a boon for our businesses.

Jason mentioned that Stephen Wolfram wrote a great blog post discussing LLM technology (see What is ChatGPT doing … and why does it work). Both Jason and Keith thought it did a great job of explaining the science and practice behind LLMs.

We moved on to a topic harder to discuss but of great relevance to our listeners, GenAI’s impact on the enterprise.

It reminds me of when cloud first became prominent. Back then, “C” suites tasked their staff with adopting “the cloud” any way they could. Today, “C” suites are tasking their staff to determine what their “AI strategy” is and when it will be implemented.

Keith mentioned that this is wrong-headed. The true path forward (for the enterprise) is to focus on what the business problems are and how (Gen)AI can address (some of) them.

AI is so varied, and its capabilities across so many fields are so good nowadays, that organizations should really look at AI as a new facility that can recognize patterns; index, analyze, and transform images; summarize, understand, and transform text and code; etc., all in near real-time, and see where in the enterprise that could help.

We talked about how enterprises can size AI infrastructure needed to perform these activities. And it’s more than just a gaggle of GPUs.

MLCommons’ MLPerf benchmarks can help show the way for some cases, but they are not exhaustive. Still, it’s a start.

The consensus was to maybe deploy in the cloud first and, once the workload is dialed in there, re-home it later, with the proviso that the needed hardware is available.

Our final topic was the Broadcom VMware acquisition. Keith mentioned their recent subscription pricing announcements vastly simplified VMware licensing, which had grown way too complex over the decades.

And although everyone hates the expense of VMware solutions, they often forget the real value VMware brings to enterprise IT.

Yes, hyperscalers and their clutch of coders can roll their own hypervisor services stacks using open source virtualization. But the enterprise has other needs for its developers. And the value of VMware virtualization services, now that 128-core CPUs are out, is even higher.

We mentioned the need for hybrid cloud and how VCF can get you part of the way there. Keith said that dev teams really want something like “AWS software” services running on GCP or Azure.

Keith mentioned that IBM Cloud is the closest he’s seen so far to doing what Dev wants in a hybrid cloud.

We all thought when DNNs came out and became trainable, and reinforcement learning started working well, that AI had turned a real corner. Turns out, that was just a start. GenAI has taken DNNs to a whole other level, and DeepMind and others are doing the same with reinforcement learning.

This time AI may actually help advance mankind, if it doesn’t kill us first. On the latter topic, you may want to check out my RayOnStorage AGI series of blog posts (latest … AGI part-8).

Jason Collier, Principal Member Of Technical Staff at AMD, Data Center and Embedded Solutions Business Group

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology.

He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He’s on LinkedIn. He’s currently working with AMD on new technology, and he has been a GreyBeards on Storage co-host since the beginning of 2022.

Keith Townsend, President of The CTO Advisor a Futurum Group Company

Keith Townsend (@CTOAdvisor) is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations.

Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIn.

158: GreyBeards talk software defined storage with Brian Dean, Tech. Mkt., Dell PowerFlex

Sponsored By:

This is the 2nd time Brian Dean, Technical Marketing, Dell PowerFlex Storage, has been on our show discussing their storage. Since last time, there’s been a new release with significant functional enhancements to file services, Dell CloudIQ integration, and other services. We discussed these and other topics in our talk with Brian. Please listen to the podcast to learn more.

We began the discussion with the recent (version 4.5) changes to PowerFlex file services. PowerFlex file services are provided by File Nodes, each running a NAS Container, which supplies multiple NAS Servers. NAS Servers supply tenant network namespaces, security policies, and host file systems, each of which resides on a single PowerFlex volume.

File Nodes are deployed in HA pairs, each on a separate hardware server. One can have up to 16 File Nodes, or 8 pairs of File Nodes, running on a PowerFlex cluster. If one of the pair goes down, file access fails over to the other File Node in the pair.

Each NAS Server supports multiple file systems, each of which can be up to 256TB. The NAS Container is also used for other Dell storage file services, so it’s full featured and very resilient.
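To keep the moving parts straight, here’s a minimal sketch of that hierarchy (File Node HA pairs, NAS Servers, and file systems, each on a single PowerFlex volume) as Python data classes. The class and field names are invented for illustration; this is not Dell’s actual object model.

```python
from dataclasses import dataclass, field

# Illustrative model of the PowerFlex file-services hierarchy described above.
# Class and field names are invented for this sketch, not Dell's object model.
MAX_FILE_NODES = 16      # up to 16 File Nodes (8 HA pairs) per PowerFlex cluster
MAX_FS_SIZE_TB = 256     # each file system can be up to 256TB

@dataclass
class FileSystem:
    name: str
    size_tb: int
    volume: str          # each file system resides on a single PowerFlex volume

    def __post_init__(self):
        if self.size_tb > MAX_FS_SIZE_TB:
            raise ValueError(f"{self.name}: exceeds the {MAX_FS_SIZE_TB}TB limit")

@dataclass
class NASServer:         # supplies a tenant namespace, security policy, file systems
    name: str
    file_systems: list[FileSystem] = field(default_factory=list)

@dataclass
class FileNodePair:      # File Nodes deploy in HA pairs on separate hardware servers
    primary: str
    secondary: str
    nas_servers: list[NASServer] = field(default_factory=list)

    def failover(self) -> str:
        """If the primary File Node goes down, file access moves to its partner."""
        return self.secondary

pair = FileNodePair("file-node-1", "file-node-2",
                    [NASServer("tenant-a", [FileSystem("projects", 64, "vol-0012")])])
print("tenant-a fails over to:", pair.failover())
```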

PowerFlex file services support multiple NFS and SMB versions as well as SFTP/FTP and other essential file data services. They also support a global namespace, which allows all PowerFlex cluster file systems to be accessed under a single namespace and IP target.

Next, we discussed PowerFlex’s automated LCM (Life Cycle Management) services, which are specific to the PowerFlex appliance and fully-integrated rack deployment models. Recall that PowerFlex can be deployed as an appliance, a rack solution, or a software-only solution using x86 servers.

With the appliance and rack models, a PowerFlex Manager (PFxM) service is used to deploy, change, monitor and manage PowerFlex cluster nodes. It discovers networking and PowerFlex servers/storage, loads appropriate firmware, BIOS, PowerFlex storage data services software and then brings up PowerFlex block services.

PFxM also offers automated LCM by maintaining an intelligent catalog, which declares all current software/firmware/BIOS and hardware versions compatible with PowerFlex software. When changes are made to the cluster, say when storage is increased or a server is added, the PFxM service detects the change and goes about bringing any new hardware up to proper software levels.

The PFxM service can also non-disruptively update the cluster whenever a PowerFlex code change is deployed. This involves an intelligent catalog update, after which the PFxM service detects that the cluster is out of compliance and then serially goes through, bringing each cluster node up to the proper level, without interrupting host IO access.
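A rough sketch of that compliance loop, assuming a simple dictionary-based catalog and node inventory (the names and calls below are hypothetical stand-ins, not PFxM’s actual interfaces), might look like this:

```python
# Illustrative sketch of an intelligent-catalog-driven, rolling, non-disruptive
# update, in the spirit of the PFxM behavior described above. The catalog,
# node inventory, and update call are hypothetical stand-ins, not PFxM APIs.
catalog = {"firmware": "2.1", "bios": "1.9", "powerflex": "4.5"}  # desired versions

nodes = [
    {"name": "node-1", "firmware": "2.1", "bios": "1.9", "powerflex": "4.5"},
    {"name": "node-2", "firmware": "2.0", "bios": "1.9", "powerflex": "4.4"},  # drifted
]

def out_of_compliance(node: dict) -> dict:
    """Return the catalog versions this node doesn't yet match."""
    return {k: v for k, v in catalog.items() if node.get(k) != v}

def rolling_update(nodes: list[dict]) -> None:
    # Nodes are brought up to the catalog one at a time,
    # so host IO access is never interrupted.
    for node in nodes:
        drift = out_of_compliance(node)
        if drift:
            print(f"updating {node['name']} to {drift}")
            node.update(catalog)  # stand-in for the real firmware/BIOS/software update

rolling_update(nodes)
```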

Finally, we discussed changes made to the CloudIQ-PowerFlex interface, so that CloudIQ can now troubleshoot and report performance and capacity trends at the PowerFlex storage pool, fault set, and fault domain level. Previously, CloudIQ could only do this at the full PowerFlex system level.

CloudIQ is Dell’s free cloud service used to monitor and troubleshoot all Dell storage systems and many other Dell solutions, whether on premises or in the cloud.

Brian mentioned that all technical information for PowerFlex is available on their InfoHub.

Brian Dean, Dell PowerFlex Technical Marketing

Brian is a 16+ year veteran of the technology industry, and before that spent a decade in higher education. Brian has worked at EMC and Dell for 7 years, first as a Solutions Architect and then as a TME, focusing primarily on PowerFlex and software-defined storage ecosystems.

Prior to joining EMC, Brian was on the consumer/buyer side of large storage systems, directing operations for two Internet-based digital video surveillance startups.

When he’s not wrestling with computer systems, he might be found hiking and climbing in the mountains of North Carolina. 

157: GreyBeards talk commercial cloud computer with Bryan Cantrill, CTO, Oxide Computer

Bryan Cantrill (@bcantrill), CTO, Oxide Computer, was a hard man to interrupt once started, but the GreyBeards did their best to have a conversation. Nonetheless, this is a long podcast. Oxide is making a huge bet on rack-scale computing and has done everything it can to make its rack easy to unbox, set up, and deploy VMs on.

They use commodity parts (AMD EPYC CPUs) and package them in their own designed hardware (server) sleds, which blind mate to networking and power in the back of their own designed rack. They use their own OS, Helios (an OpenSolaris derivative), with their own RTOS, Hubris, for system bring-up, monitoring, and the start of their hardware root of trust. And of course, to make it all connect easier, they designed and developed their own programmable networking switch. Listen to the podcast to learn more.

Oxide essentially provides rack hardware which supports EC2-like compute and EBS-like storage to customers. It also has Terraform plugins to support infrastructure as code. In addition, all their software is completely API driven.
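Because everything is API driven, provisioning can be scripted end to end. Below is a minimal sketch of what creating an instance over a REST API could look like; the base URL, endpoint path, and payload fields are hypothetical placeholders, not the actual Oxide API (see their docs for the real interface).

```python
import requests

# Hypothetical sketch of driving a rack's REST API to create a VM instance.
# The base URL, endpoint path, and payload fields are placeholders for
# illustration only; consult the actual Oxide API docs for the real interface.
API = "https://rack.example.internal/v1"
TOKEN = "example-token"   # placeholder; use a real token from the rack's control plane

payload = {
    "name": "build-runner-01",
    "ncpus": 8,
    "memory_gib": 32,
    "disk_gib": 200,      # backed by the rack's mirrored NVMe/ZFS storage
}

resp = requests.post(
    f"{API}/projects/demo/instances",   # hypothetical path
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("created instance:", resp.json().get("id"))
```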

Bryan said time and time again that developing their own hardware and software made everything easier for them and their customers. Customers pay for the hardware, but there are absolutely NO SOFTWARE LICENSING FEES, because all their software is open source.

For example, the problem with AMI BIOS and UEFI firmware is their opacity. There’s really no way to understand what packages are included in the root of trust because it’s proprietary. Bryan said one company’s UEFI they examined had URLs embedded in the firmware. It seemed odd to have another vendor’s web pages linked to their root of trust.

Bryan said they did their own switch to reduce integration and validation test time. The Oxide rack handles all internal networking, compute sled to compute sled and to the ToR switch (with no external cabling), and has 32 networking ports to connect the rack to the data center’s core networking.

As for storage, Bryan said each of the 10 U.2 NVMe drives in a compute sled is a separate ZFS file system, and customer data is 3-way mirrored across any of them. ZFS also provides end-to-end checksumming across all customer data for IO integrity.

Bryan said Oxide Computer rack bring-up is: 1) plug it into core networking and power, 2) power it on, 3) attach a laptop to the service processor, 4) SSH into it, and 5) run a configuration script, and you’re ready to assign VMs. He said that from the time an Oxide rack hits your dock until you are up and firing up VMs could be as short as an HOUR.

The Rust programming language is the other secret to Oxide’s success. More to the point, their company is named after Rust (oxide, get it?). Apparently just about all the software they developed is written in Rust.

The question for Oxide and every other computer and storage vendor is whether on-premises computing will continue for the foreseeable future. The GreyBeards and Oxide believe it will, if not for compliance and better latency, then because it often costs less.

Bryan mentioned they have their own podcast, Oxide and Friends. On their podcast, they did a board bring up series (Tales from the Bring-Up Lab) and a series on taking their rack through FCC compliance (Oxide and the Chamber of Mysteries).

Bryan Cantrill, CTO, Oxide Computers

Bryan Cantrill is a software engineer who has spent over a quarter of a century at the hardware/software interface. He is the co-founder and CTO of Oxide Computer Company, the creator of the world’s first commercial cloud computer.

Prior to Oxide, he spent nearly a decade at Joyent, a cloud computing pioneer; prior to Joyent, he spent 14 years at Sun Microsystems.

Bryan received his Sc.B. magna cum laude with honors in Computer Science from Brown University, and is an MIT Technology Review 35 Top Young Innovators alumnus.

You can learn more about his work with Oxide at oxide.computer, or listen in on their weekly live show, Oxide and Friends (link above), on Discord or anywhere you get your podcasts.

156: GreyBeards talk data security with Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Sponsored By:

This is another repeat appearance of Jonathan Halstuch, Co-Founder and CTO, RackTop Systems, on our podcast. This time he was here to discuss whether storage admins need to become security subject matter experts (SMEs) or not. Short answer: no, but these days security is everybody’s responsibility. Listen to the podcast to learn more.

It used to be that ransomware only encrypted data and then demanded money to decrypt. But nowadays, it’s more likely to steal data and then only encrypt some to get your attention. The criminal’s ultimate goal is to blackmail the organization not just once but possibly multiple times and then go after your clients, to extort them as well.

Data exfiltration or theft is a major concern today. And the only way to catch it happening is by checking IO activity against normal IO patterns and flagging/stopping unusual access. By doing so, one can stop theft when it’s starting, rather than later, after your data is all gone. RackTop BrickStor storage provides assessors of IO activity to catch criminal acts like this while they are occurring.
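In spirit, that kind of detection comes down to comparing current IO activity against a learned baseline and flagging outliers. Here’s a toy sketch of the idea; it is not RackTop’s actual detection logic, and the threshold and sample data are invented for illustration.

```python
from statistics import mean, stdev

# Toy sketch of baseline-vs-anomaly IO checking, in the spirit of the behavior
# described above. This is NOT RackTop's actual detection logic; the threshold
# and sample data are invented for illustration.
def is_anomalous(read_mb_per_min: float, baseline: list[float], n_sigma: float = 3.0) -> bool:
    """Flag the current read rate if it's far outside the historical baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    return read_mb_per_min > mu + n_sigma * sigma

# A week of "normal" per-minute read volume for one user, in MB.
baseline = [40, 55, 38, 60, 47, 52, 44]

print(is_anomalous(50, baseline))     # False: looks like normal access
print(is_anomalous(5_000, baseline))  # True: bulk reads, possible exfiltration
```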

Ransomware’s typical dwell time in an organization’s systems is on the order of 9 months. That is, criminals are in your servers for 9 months, using lateral movement to infect other machines on your network and escalating privileges to gain even more access to your data.

Jason mentioned that a friend of his runs a major research university’s IT organization, which is constantly under attack by foreign adversaries. They found it typically takes:

  • Russian hackers 30 minutes once in your network to start escalating privileges and move laterally to access more systems.
  • Chinese hackers 2 hours, and
  • Iranian hackers 4 hours to do the same.

Jonathan also said that 1 in 3 cyber attacks is helped by an insider. Many insider attacks are used to steal IP and other information, but are never intended to be discovered. In this case, there may never be an external event to show you’ve been hacked.

Storage admins don’t need to become cyber security SMEs, but everyone has a role to play in cyber security today. It’s important that storage admins provide proper information to upper management to identify risks and possible mitigations. This needs to include an understanding of an organization’s data risks and what could be done with that data in the wrong hands.

Storage admins also need to run data security breach scenarios/simulations/tests showing what could happen and how they plan to recover. Sort of like DR testing but for ransomware.

And everyone needs to practice proper security hygiene. Storage admins have to lead on implementing security procedures, access controls, and the other functionality needed to protect an organization’s data. None of this replaces other network and server security functionality. But all of this functionality has to be in place to secure an organization’s data.

Jonathan mentioned that the SEC in the US has recently begun enforcing regulations that require public companies to disclose ransomware attacks within 3 days of discovery. Such disclosure needs to include any external data/users that are impacted. When organizations first disclose attacks, exposure is usually very limited, but over time the organization typically finds exposure isn’t as limited as it first expected.

RackTop BrickStor maintains logs of who or what accessed which data. So when you identify an infection or culprit, BrickStor can tell you what data that entity has accessed over time, making any initial disclosure more complete.
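Conceptually, that amounts to a query over an access audit log: given a compromised user or process, list everything it touched and when. A minimal illustration follows; the log format and fields are invented for this example and are not BrickStor’s actual schema or API.

```python
from datetime import datetime

# Minimal illustration of querying an access audit log for everything a suspect
# principal touched. The log format and fields are invented for this example;
# this is not BrickStor's actual log schema or API.
audit_log = [
    {"when": datetime(2024, 1, 3, 9, 12), "who": "svc-backup", "path": "/finance/q4.xlsx", "op": "read"},
    {"when": datetime(2024, 1, 3, 9, 14), "who": "jdoe",       "path": "/hr/salaries.csv", "op": "read"},
    {"when": datetime(2024, 1, 4, 2, 31), "who": "jdoe",       "path": "/eng/designs.zip", "op": "read"},
]

def accessed_by(log: list[dict], principal: str) -> list[dict]:
    """Everything the given user or service touched, oldest first."""
    return sorted((e for e in log if e["who"] == principal), key=lambda e: e["when"])

for entry in accessed_by(audit_log, "jdoe"):
    print(entry["when"], entry["op"], entry["path"])
```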

RackTop’s software-defined storage solution can be implemented just about anywhere: in the cloud, in a VM, or on bare metal (with approved hardware vendors), and it can be used to front-end anyone’s block storage or used with direct access storage.

Having something like RackTop Systems in place as your last line of defense, assessing and logging all IO activity and looking for anomalies, seems a necessary ingredient in any organization’s cyber security regime.

Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and Co-Founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.