158: GreyBeards talk software defined storage with Brian Dean, Tech. Mkt., Dell PowerFlex

Sponsored By:

This is the 2nd time Brian Dean, Technical Marketing, Dell PowerFlex Storage, has been on our show discussing their storage. Since last time, there’s been a new release with significant functional enhancements to file services, Dell CloudIQ integration, and other services. We discussed these and other topics in our talk with Brian. Please listen to the podcast to learn more.

We began the discussion with the recent (version 4.5) changes to PowerFlex file services. PowerFlex file services are provided by File Nodes, each running a NAS Container, which supplies multiple NAS Servers. NAS Servers supply tenant network namespaces and security policies, and host file systems, each of which resides on a single PowerFlex volume.

File Nodes are deployed in HA pairs, each on a separate hardware server. One can have up to 16 File Nodes, or 8 pairs of File Nodes, running on a PowerFlex cluster. If one File Node in a pair goes down, file access fails over to its partner.

Each NAS Server supports multiple file systems each of which can be up to 256TB. The NAS Container is also used for other Dell storage file services, so it’s full featured and very resilient.

PowerFlex file services support multiple NFS and SMB versions as well as SFTP/FTP and other essential file data services. They also support a global namespace, which allows all PowerFlex cluster file systems to be accessed under a single namespace and IP target.
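For those keeping score at home, the containment hierarchy (File Node → NAS Container → NAS Server → file system → PowerFlex volume) can be sketched roughly as below. This is just an illustrative Python data model of the relationships we discussed, not anything from PowerFlex’s actual software or APIs.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative data model only -- not a PowerFlex API or internal structure.
@dataclass
class FileSystem:
    name: str
    size_tb: int          # each file system can be up to 256TB
    volume: str           # each file system resides on a single PowerFlex volume

@dataclass
class NASServer:
    name: str
    tenant_namespace: str  # per-tenant network namespace
    security_policy: str
    file_systems: List[FileSystem] = field(default_factory=list)

@dataclass
class FileNode:
    name: str
    partner: str           # File Nodes are deployed in HA pairs
    nas_servers: List[NASServer] = field(default_factory=list)

node = FileNode("file-node-01", partner="file-node-02")
nas = NASServer("nas-srv-a", tenant_namespace="tenant-a", security_policy="smb+nfs")
nas.file_systems.append(FileSystem("projects", size_tb=256, volume="pf-vol-0001"))
node.nas_servers.append(nas)
```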

Next, we discussed PowerFlex’s automated LCM (Life Cycle Management) services, which are specific to the PowerFlex appliance and fully integrated rack deployment models. Recall that PowerFlex can be deployed as an appliance, a rack solution, or as a software-only solution using x86 servers.

With the appliance and rack models, a PowerFlex Manager (PFxM) service is used to deploy, change, monitor and manage PowerFlex cluster nodes. It discovers networking and PowerFlex servers/storage, loads the appropriate firmware, BIOS, and PowerFlex storage data services software, and then brings up PowerFlex block services.

PFxM also offers automated LCM by maintaining an intelligent catalog, which declares all current software/firmware/BIOS and hardware versions compatible with PowerFlex software. When changes are made to the cluster, say when storage is increased or a server is added, the PFxM service detects the change and goes about bringing any new hardware up to proper software levels.

The PFxM service can also non-disruptively update the cluster whenever a PowerFlex code change is deployed. This involves an intelligent catalog update, after which the PFxM service detects that the cluster is out of compliance and then serially brings each cluster node up to the proper level, without interrupting host IO access.
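Conceptually, an LCM compliance pass boils down to diffing what’s installed against what the catalog says should be installed. Here’s a rough sketch of that idea in Python; the catalog format and component names are made up for illustration and are not PFxM’s actual data structures.

```python
# Hypothetical sketch of an LCM compliance check -- not PFxM code or its catalog format.
catalog = {                      # "intelligent catalog": approved versions
    "bios": "2.19.0",
    "nic_firmware": "22.31.6",
    "powerflex_sds": "4.5.0",
}

nodes = [
    {"name": "node-1", "bios": "2.19.0", "nic_firmware": "22.31.6", "powerflex_sds": "4.5.0"},
    {"name": "node-2", "bios": "2.17.1", "nic_firmware": "22.31.6", "powerflex_sds": "4.0.0"},
]

def out_of_compliance(node: dict) -> dict:
    """Return the components on a node that don't match the catalog."""
    return {k: (node[k], v) for k, v in catalog.items() if node[k] != v}

# A rolling update would then fix one node at a time so host IO is never interrupted.
for node in nodes:
    drift = out_of_compliance(node)
    if drift:
        print(f"{node['name']} needs update: {drift}")
```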

Finally, we discussed changes made to the CloudIQ-PowerFlex interface, so that CloudIQ can now troubleshoot and report performance/capacity trends at the PowerFlex storage pool, fault set, and fault domain level. Previously, CloudIQ could only do this at the full PowerFlex system level.

CloudIQ is Dell’s free cloud service used to monitor and troubleshoot all Dell storage systems and many other Dell solutions, whether on premises or in the cloud.

Brian mentioned that all technical information for PowerFlex is available on their InfoHub.

Brian Dean, Dell PowerFlex Technical Marketing

Brian is a 16+ year veteran of the technology industry, and before that spent a decade in higher education. Brian has worked at EMC and Dell for 7 years, first as a Solutions Architect and then as a TME, focusing primarily on PowerFlex and software-defined storage ecosystems.

Prior to joining EMC, Brian was on the consumer/buyer side of large storage systems, directing operations for two Internet-based digital video surveillance startups.

When he’s not wrestling with computer systems, he might be found hiking and climbing in the mountains of North Carolina. 

157: GreyBeards talk commercial cloud computer with Bryan Cantrill, CTO, Oxide Computer

Bryan Cantrill (@bcantrill), CTO, Oxide Computer, was a hard man to interrupt once started, but the GreyBeards did their best to have a conversation. Nonetheless, this is a long podcast. Oxide is making a huge bet on rack-scale computing and has done everything it can to make its rack easy to unbox, set up, and deploy VMs on.

They use commodity parts (AMD EPYC CPUs) and package them in their own designed hardware (server) sleds, which blind mate to networking and power in the back of their own designed rack. They use their own OS, Helios (an OpenSolaris derivative), with their own RTOS, Hubris, for system bring-up, monitoring and the start of their hardware root of trust. And of course, to make it all connect easier, they designed and developed their own programmable networking switch. Listen to the podcast to learn more.

Oxide essentially provides rack hardware that supports EC2-like compute and EBS-like storage to customers. It also has Terraform plugins to support infrastructure as code. In addition, all their software is completely API driven.
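We didn’t walk through the API on the call, but “completely API driven” means provisioning boils down to HTTP calls against the rack. Here’s a hypothetical sketch of what that might look like; the endpoint path, field names and auth scheme are our assumptions, not Oxide’s documented API (see their docs or Terraform provider for the real thing).

```python
# Hypothetical sketch of API-driven provisioning -- endpoint path, field names,
# and auth scheme are assumptions, not Oxide's documented API.
import requests

RACK_API = "https://oxide.example.internal"      # assumed rack API endpoint
HEADERS = {"Authorization": "Bearer <token>"}    # assumed auth scheme

instance = {
    "name": "web-01",
    "ncpus": 4,
    "memory": 8 * 1024**3,                       # 8 GiB
    "disks": [{"name": "web-01-root", "size": 64 * 1024**3}],
}

resp = requests.post(f"{RACK_API}/v1/instances", json=instance, headers=HEADERS)
resp.raise_for_status()
print("created instance:", resp.json().get("id"))
```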

Bryan said time and time again that developing their own hardware and software made everything easier for them and their customers. Customers pay for hardware but there are absolutely NO SOFTWARE LICENSING FEES, because all their software is open source.

For example, the problem with AMI BIOS and UEFI firmware is its opacity. There’s really no way to understand what packages are included in its root of trust because it’s proprietary. Bryan said one vendor’s UEFI they examined had URLs embedded in the firmware. It seemed odd to have another vendor’s web pages linked into their root of trust.

Bryan said they built their own switch to reduce integration and validation test time. The Oxide rack handles all internal networking, compute sled to compute sled and to the ToR switch (with no external cabling), and has 32 networking ports to connect the rack to the data center’s core network.

As for storage, Bryan said each of the 10 U.2 NVMe drives in a compute sled is a separate ZFS file system, and customer data is 3-way mirrored across them. ZFS also provides end-to-end checksumming of all customer data for IO integrity.
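To illustrate the idea of mirroring plus end-to-end checksums (and only the idea; this is not Oxide’s actual ZFS-based implementation), here’s a toy Python sketch where every block is written to three replicas with a checksum, and reads only return data that verifies:

```python
# Conceptual sketch of 3-way mirroring with end-to-end checksums -- illustrative only,
# not Oxide's actual storage implementation.
import hashlib

replicas = [dict(), dict(), dict()]              # stand-ins for three independent devices

def write_block(lba: int, data: bytes) -> None:
    digest = hashlib.sha256(data).hexdigest()
    for device in replicas:                      # mirror the block 3 ways
        device[lba] = (data, digest)

def read_block(lba: int) -> bytes:
    for device in replicas:                      # return the first copy that checksums clean
        data, digest = device[lba]
        if hashlib.sha256(data).hexdigest() == digest:
            return data
    raise IOError(f"all replicas of block {lba} failed checksum verification")

write_block(0, b"customer data")
assert read_block(0) == b"customer data"
```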

Bryan said Oxide Computer rack bring-up is 1) plug it into core networking and power, 2) power it on, 3) attach a laptop to their service processor, 4) SSH into it, and 5) run a configuration script, and you’re ready to assign VMs. He said that from the time an Oxide rack hits your dock until you are up and firing up VMs could be as short as an HOUR.

The Rust programming language is the other secret to Oxide’s success. More to the point, the company is named after Rust (oxide, get it?). Apparently just about all the software they developed is written in Rust.

The question for Oxide and every other computer and storage vendor is: do you believe that on-premises computing will continue for the foreseeable future? The GreyBeards and Oxide believe it will, not only for compliance and better latency but also because it often costs less.

Bryan mentioned they have their own podcast, Oxide and Friends. On their podcast, they did a board bring up series (Tales from the Bring-Up Lab) and a series on taking their rack through FCC compliance (Oxide and the Chamber of Mysteries).

Bryan Cantrill, CTO, Oxide Computer

Bryan Cantrill is a software engineer who has spent over a quarter of a century at the hardware/software interface. He is the co-founder and CTO of Oxide Computer Company, the creator of the world’s first commercial cloud computer.

Prior to Oxide, he spent nearly a decade at Joyent, a cloud computing pioneer; prior to Joyent, he spent 14 years at Sun Microsystems.

Bryan received his Sc.B. magna cum laude with honors in Computer Science from Brown University, and is an MIT Technology Review 35 Top Young Innovators alumnus.

You can learn more about his work with Oxide at oxide.computer, or listen in on their weekly live show, Oxide and Friends (link above), on Discord or anywhere you get your podcasts.

156: GreyBeards talk data security with Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Sponsored By:

This is another repeat appearance of Jonathan Halstuch, Co-Founder and CTO, RackTop Systems, on our podcast. This time he was here to discuss whether storage admins need to become security subject matter experts (SMEs) or not. Short answer: no, but these days security is everybody’s responsibility. Listen to the podcast to learn more.

It used to be that ransomware only encrypted data and then demanded money to decrypt it. But nowadays, it’s more likely to steal data and then encrypt only some of it to get your attention. The criminals’ ultimate goal is to blackmail the organization, not just once but possibly multiple times, and then go after your clients to extort them as well.

Data exfiltration, or theft, is a major concern today. And the only way to catch it happening is to check IO activity against normal IO and flag/stop unusual access. By doing so, one can stop an attack as it starts rather than later, after your data is all gone. RackTop BrickStor storage assesses IO activity to catch criminal acts like this while they are occurring.
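As a toy illustration of what “checking IO activity against normal IO” can mean (this is not RackTop’s actual detection logic), here’s a simple baseline-and-deviation check in Python that would flag a sudden bulk read of a share:

```python
# Toy illustration of flagging unusual IO activity against a normal baseline --
# not RackTop's actual detection logic.
from statistics import mean, stdev

# Baseline: bytes read per hour for a given user/share over recent history.
baseline = [120e6, 95e6, 150e6, 110e6, 130e6, 105e6, 140e6]
mu, sigma = mean(baseline), stdev(baseline)

def is_anomalous(bytes_read: float, threshold: float = 4.0) -> bool:
    """Flag activity far outside the historical norm (possible exfiltration)."""
    return (bytes_read - mu) / sigma > threshold

print(is_anomalous(125e6))    # normal working-set access -> False
print(is_anomalous(40e9))     # sudden bulk read of a share -> True, stop and investigate
```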

Ransomware’s typical dwell time in an organization’s systems is on the order of 9 months. That is, criminals are in your servers for 9 months, moving laterally to infect other machines on your network and escalating privileges to gain even more access to your data.

Jonathan mentioned that a friend of his runs a major research university’s IT organization, which is constantly under attack by foreign adversaries. They found it typically takes:

  • Russian hackers 30 minutes once in your network to start escalating privileges and move laterally to access more systems.
  • Chinese hackers 2 hours, and
  • Iranian hackers 4 hours to do the same.

Jonathan also said that 1 in 3 cyber attacks is helped by an insider. Many insider attacks are used to steal IP and other information, but are never intended to be discovered. In this case, there may never be an external event to show you’ve been hacked.

Storage admins don’t need to become cyber security SMEs, but everyone has a role to play in cyber security today. It’s important that storage admins provide proper information to upper management to identify risks and possible mitigations. This needs to include an understanding of an organization’s data risks and what could be done with that data in the wrong hands.

Storage admins also need to run data security breach scenarios/simulations/tests showing what could happen and how they plan to recover. Sort of like DR testing but for ransomware.

And everyone needs to practice proper security hygiene. Storage admins have to lead on implementing security procedures, access controls, and the other functionality needed to protect an organization’s data. None of this replaces other network and server security functionality. But all of it has to be in place to secure an organization’s data.

Jonathan mentioned that the SEC in the US has recently begun enforcing regulations requiring public companies to disclose ransomware attacks within 3 days of discovery. Such disclosure needs to include any external data/users that are impacted. When organizations first disclose attacks, exposure usually appears very limited, but over time the organization typically finds the exposure isn’t as limited as first expected.

RackTop BrickStor maintains logs of who or what accessed which data. So when you identify an infection or culprit, BrickStor can tell you what data that entity has accessed over time, making any initial disclosure more complete.
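Conceptually, scoping an incident from such logs is a simple query: everything the compromised account or process touched. A trivial sketch follows (illustrative only; not BrickStor’s log format or query interface):

```python
# Simple sketch of scoping an incident from access logs -- illustrative only, not
# BrickStor's log format or query interface.
access_log = [
    {"principal": "svc-backup", "path": "/finance/q3.xlsx", "op": "read", "ts": "2023-09-14T02:11"},
    {"principal": "jdoe",       "path": "/hr/salaries.csv", "op": "read", "ts": "2023-09-14T02:13"},
    {"principal": "jdoe",       "path": "/hr/salaries.csv", "op": "copy", "ts": "2023-09-14T02:14"},
]

def touched_by(culprit: str) -> set:
    """Everything a compromised account or process touched, for the disclosure report."""
    return {e["path"] for e in access_log if e["principal"] == culprit}

print(touched_by("jdoe"))   # -> {'/hr/salaries.csv'}
```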

RackTop’s software defined storage solution can be implemented just about anywhere, in the cloud, in a VM, on bare metal (with approved hardware vendors) and can be used to front end anyone’s block storage or used with direct access storage.

Having something like RackTop Systems in place as your last line of defense, assessing and logging all IO activity and looking for anomalies, seems a necessary ingredient in any organization’s cyber security regime.

Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and Co-Founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations with the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.

155: GreyBeards SDC23 wrap up podcast with Dr. J Metz, Technical Dir. of Systems Design AMD and Chair of SNIA BoD

Dr. J Metz (@drjmetz, blog), Technical Director of Systems Design at AMD and Chair of SNIA BoD, has been on our show before discussing SNIA research directions. We decided this year to add an annual podcast to discuss highlights from their Storage Developers Conference 2023 (SDC23).

Dr. J is working at AMD to help raise their view from a pure components perspective to a systems perspective. On the other hand, at SNIA, we can see them moving beyond storage interface technology into memory (of all things) and, in the really long term, storage archive technologies.

SDC is SNIA’s main annual conference, which brings storage developers together with storage users to discuss all the technologies underpinning storing the data we all care so much about. Listen to the podcast to learn more.

SNIA is trying to get its arms around the trends impacting the IT industry today. These days, storage, compute and networking are all starting to morph into one another, and the boundary lines, always tenuous at best, seem to be disappearing.

Aside from the industry standards work SNIA has always been known for, they are also deeply involved in education. One of their more popular artifacts is the SNIA Dictionary (recently moved online only), which provides definitions for probably over 1,000 storage terms. But SDC also has a lot of tutorials and other educational sessions worthy of time and effort. And all SDC sessions will be available online, at some point. (Update 10/25/23: they are all available now at the Sessions | SDC 2023 website.)

SNIA also presented at SFD26, while SDC23 was going on. At SFD26, SNIA discussed DNA data storage (the DNA Data Storage Alliance is a recent SNIA technology affiliate) and the new Smart Data Accelerator Interface (SDXI), a software-defined interface for memory-to-memory DMA.

First up, DNA storage. The DNA team said they are pretty much able to store and access GBs of data in DNA today without breaking a sweat, and are starting to consider how to scale that up to TBs of DNA storage. We’ve discussed DNA data storage before on GBoS podcasts (see: 108: GreyBeards talk DNA storage...).

The talk at SFD26 was pretty detailed. It turns out the DNA data storage team has had to re-invent a lot of standard storage technologies (catalogs/indexes, metadata, ECC, etc.) in order to support a DNA data soup of unstructured data.

For example, ECC for DNA segments (snippets) is needed to correctly store and retrieve DNA data segments. And these segments could potentially be replicated 1000s of times in a DNA storage cell. And all DNA data segments would be tagged with file-oriented metadata indicating (segment) address within the file, file name or identifier, date created, etc.
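To picture what such segment tagging might look like, here’s a hypothetical Python record for one DNA snippet; the fields follow the description above but the format is entirely made up, not anything from the DNA Data Storage Alliance:

```python
# Hypothetical illustration of per-segment metadata for DNA storage -- not an actual
# DNA Data Storage Alliance format.
from dataclasses import dataclass

@dataclass(frozen=True)
class DNASegment:
    file_id: str        # which file this snippet belongs to
    seg_index: int      # (segment) address within the file
    created: str        # date created
    payload: str        # encoded bases, e.g. "ACGTTAGC..."
    ecc: str            # error-correcting code protecting the payload

# The same segment may be replicated thousands of times in the "data soup";
# on read, any surviving copy whose ECC checks out can reconstruct that piece.
seg = DNASegment("file-0001", seg_index=42, created="2023-09-20",
                 payload="ACGTTAGCCGTA", ecc="parity-stub")
```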

As far as what an application for DNA storage would look like, Dr. J mentioned write once and read VERY infrequently. It turns out that while making 1000s of copies of DNA data segments is straightforward, inexpensive and trivial, reading them is another matter entirely. And as was discussed at SFD26, reading DNA storage, as presently conceived, is destructive. (So maybe having lots of copies is a good and necessary idea.)

But the DNA gurus really have to come up with methods for indexing, searching, and writing/reading data quickly. Today’s disks have file systems that are self-defining. If you hand someone an HDD, it’s fairly straightforward to read information off of it and determine the file system used to create it. These days, with LTFS, the same could be said for LTO tape.

DNA is intended to be used to store data for 1000s of years. Researchers have retrieved intact DNA from a number of organisms that are over 50K years old. Retaining applications that can access, format and process data after 1,000 years is yet another serious problem someone will need to solve.

Next up was SDXI, a software-defined DMA solution that any application can use to move data from one memory to another without having to resort to 20 abstraction layers to do it. SDXI is all about moving data between memory banks.
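As a mental model (and only that; the field names and layout below are illustrative, not the SDXI specification), think of an application filling in a copy descriptor and handing it to a data mover:

```python
# Toy sketch of descriptor-driven memory-to-memory copy, in the spirit of SDXI --
# field names and layout are illustrative, not the SDXI specification.
from dataclasses import dataclass

@dataclass
class CopyDescriptor:
    src: bytearray      # source buffer (could be another context's shared buffer)
    src_off: int
    dst: bytearray      # destination buffer
    dst_off: int
    length: int

def submit(desc: CopyDescriptor) -> None:
    """A real mover (hardware or software) would consume this from a descriptor ring."""
    desc.dst[desc.dst_off:desc.dst_off + desc.length] = \
        desc.src[desc.src_off:desc.src_off + desc.length]

a = bytearray(b"hello, sdxi")
b = bytearray(16)
submit(CopyDescriptor(src=a, src_off=0, dst=b, dst_off=0, length=len(a)))
print(bytes(b[:len(a)]))   # b'hello, sdxi'
```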

Today, this is all within one system/server, but as CXL matures and more and more hardware starts supporting CXL 2.0 and 3.0, shared memory between servers will become more pervasive, all over a CXL memory interface.

Keith tried bringing it home to moving data between containers or VMs, which is possible today within the same memory and, sometime in the future, between shared memory and local memory.

Memory-to-memory transfers have to be done securely. It’s not like accessing memory from some other process hasn’t been fraught with security exposures in the past. And Dr. J assured me that SDXI was built from the ground up with security considerations front and center.

To bring it all back home, SNIA has always been and always will be concerned with data, whether that data resides on storage, in memory or, god forbid, in transit somewhere over a network. Keith went as far as to say that the network was storage; I felt that was a step too far.

Dr. J Metz, Technical Director of Systems at AMD, Chair of SNIA BoD

J is the Chair of SNIA’s (Storage Networking Industry Association) Board of Directors and Technical Director for Systems Design for AMD, where he works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the inner workings and application of emerging technologies.

J has previously held roles in both startups and Fortune 100 companies as a Field CTO,  R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.

J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.

154: GreyBeards annual VMware Explore wrap-up podcast

Thanks, once again, to The CTO Advisor|Keith Townsend (@CTOadvisor) for letting us record the podcast in his studio. VMware Explore this year was better than last year. The show seemed larger, the show floor busier, the Hub better, and the Hands-On Labs much larger than I ever remember before. The show seems to be growing; it’s still not at pre-pandemic levels, but the trend is good.

The engineers have been busy at VMware this past year. Announcements at the show include Private AI Foundation, a way for enterprises to train open source LLMs on corporate data kept private; a significant redirect for VMware Edge environments, moving from a pull model of updates to a push model; and vSAN Max, NSX+, Tanzu App Engine, and more. And we heard that Broadcom is clearing more hurdles to the acquisition. Listen to the podcast to learn more.

Private AI plays to VMware’s strengths and its control over on-prem processing. Customers need a safe space and secured data to train corporate chatbots curated on the corporation’s knowledge base. VMware rolled this out two ways:

  • Reference architecture approach based on Ray cluster management, Kubeflow, PyTorch, VectorDB, GPU scaling (NVLink/NVSwitch), vSAN fast path (RDMA, GPUDirect), and deep learning VMs. There was no discussion of tie-ins to the Data Persistence (object) storage.
  • Proprietary NVIDIA approach based on NVIDIA AI Workbench, TensorRT, NeMo, and the NVIDIA GPU & Network Operator

By having both approaches, VMware provides alternatives for those wanting a non-proprietary solution. And with AI/MLOps moving so fast, the open source approach may be better able to keep up.

The tie-in with NVIDIA is a natural extension of what VMware has been doing with GPUs and DPUs, etc.

Also, VMware announced a technology partnership with Hugging Face. We were somewhat concerned with all the focus on LLMs and GenAI, but the agreement with Hugging Face goes beyond just LLMs.

VMware Edge solutions are pivoting. Apparently, VMware is moving from the vSphere pull model of code updates in the field, which seems to handle 64-server, multi-cluster environments without problem, to more of a YAML/GitHub push model of IoT device updates that seems better able to manage fleets of 1K to 100K devices in the field.

With the new model, one creates a GitHub repo and a YAML file describing the code update to be done, and all your IoT devices just get updated to the new level.
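As a hypothetical example of what that declarative push model might look like (the manifest fields and device update call are our assumptions, not VMware Edge’s actual format or API), consider a YAML manifest checked into a repo and a loop that pushes any drifted device to the declared version:

```python
# Hypothetical sketch of a GitOps-style push update -- manifest fields and the device
# update call are assumptions, not VMware Edge's actual format or API.
import yaml  # PyYAML

MANIFEST = """
fleet: retail-stores
target_version: 2.4.1
image: registry.example.com/edge/agent:2.4.1
rollout:
  batch_size: 500
  max_failures: 5
"""

manifest = yaml.safe_load(MANIFEST)

# Imagine a fleet of 1K-100K devices; a small stand-in fleet here.
devices = [{"id": f"store-{i:05d}", "version": "2.3.9"} for i in range(1, 1001)]

def push_update(device: dict, m: dict) -> None:
    # Placeholder for pushing the declared image/version to one device.
    device["version"] = m["target_version"]

# Devices that drifted from the declared state get pushed the new version in batches.
stale = [d for d in devices if d["version"] != manifest["target_version"]]
for device in stale[: manifest["rollout"]["batch_size"]]:
    push_update(device, manifest)
```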

Once again, the Broadcom acquisition is on everyone’s mind. As I got to the show, one analyst asked if this was going to be the last VMware Explore. I highly doubt that, but Broadcom will make lots of changes once the transaction closes. One thing mentioned at the show was that Broadcom will make an immediate, additional $1B investment in R&D. The deal had provisionally passed the UK regulatory body and was on track to close near the end of October.

Other news from the show:

  • The Tanzu brand is broadening. Tanzu Application Platform (TAP) still exists, but they have added a new App Engine to take the VMware management approach to K8s clusters, other cloud infrastructure, and the rest of the IT world. Tanzu Intelligent Services also now supports policy guardrails, cost control, management insight, and migration services for other environments.
  • vSAN Max, which supports disaggregation (separation) of storage and compute, is available. vSAN Max becomes a full-fledged, standalone storage system that just happens to run on top of vSphere. Disaggregated (vSAN Max) storage and (regular vSAN) HCI can co-exist as different mounted datastores, and vSAN Max supports PBs of storage.
  • Workspace One is updated to provide enhanced digital experience monitoring that adds coverage of what Workspace One users are actually experiencing.
  • NSX+ continues to roll out. VMware mentioned that the number one continuing problem with hybrid cloud/multi-cloud setup is getting the networking right. NSX+ will reduce this complexity by becoming a management/configuration overlay over any and all cloud/on-prem networking for your environment(s).
  • VMware chatbots for Tanzu, Workspace One and NSX+ are now in tech preview and will supply intelligent assistants for these solutions. Based on LLMs/GenAI and trained on VMware’s extensive corporate knowledge base, the chatbots will help admins focus on the signal over the noise and will provide recommendations on how to resolve issues.

Jason Collier, Principal Member of Technical Staff, AMD

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology. Jason currently works at AMD focused on emerging technology for IT, IoT and anywhere else in the world and across the universe that needs compute, storage or networking resources.

He was Chief Evangelist, CTO & Co-Founder of Scale Computing and has been an innovator in the field of hyper-convergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He has also been a co-founder, director of research, VP of technical operations, and director of operations at other companies over his long career prior to AMD and Scale.

He’s on LinkedIn.