148: GreyBeards talk software defined infrastructure with Anthony Cinelli and Brian Dean, Dell PowerFlex

Sponsored By:

This is one in a series of podcasts the GreyBeards are doing on Dell PowerFlex software defined infrastructure. Today we talked with Anthony Cinelli, Sr. Director at Dell Technologies, and Brian Dean, Technical Marketing for PowerFlex. We have talked with Brian before, but this is the first time we’ve met Anthony. Both were very knowledgeable about PowerFlex and the challenges large enterprises face today with their storage environments.

The key to PowerFlex’s software defined solution is its extreme flexibility, which comes mainly from an architecture offering scale-out deployment options ranging from HCI solutions to a fully disaggregated compute-storage environment, in seemingly any combination (see technical resources for more info). With this flexibility, PowerFlex can help consolidate enterprise storage across just about any environment, from virtualized workloads to standalone databases, big data analytics, containerized environments and, of course, the cloud. Listen to the podcast to learn more.

To support this extreme flexibility, PowerFlex uses both client and storage software that can be configured together on a server (HCI) or apart, across compute and storage nodes to offer block storage. PowerFlex client software runs on any modern bare-metal or virtualized environment.

Anthony mentioned that one problem common to enterprises today is storage sprawl. Most large customers have an IT environment with sizable hypervisor-based workloads, a dedicated database workload, a big data/analytics workload, a modern container-based workload stack, an AI/ML/DL workload and, more often than not, a vertical-specific workload.

Each workload usually has its own storage system. And the problem with 4-7 different storage systems is cost, e.g., the cost of underutilized storage. In these environments, each storage system might average, say, 60% utilization, but utilization varies a lot between silos, leaving stranded capacity.
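To make the stranded-capacity problem concrete, here’s a quick back-of-the-envelope calculation in Python. The silo names, capacities and utilization figures are invented for illustration, not numbers from the podcast.

```python
# Back-of-the-envelope stranded-capacity math (illustrative numbers only).
# Five storage silos, each with raw capacity (TB) and current utilization.
silos = {
    "virtualization": (500, 0.75),
    "database":       (300, 0.85),
    "big_data":       (800, 0.40),
    "containers":     (200, 0.55),
    "ai_ml":          (400, 0.45),
}

total = sum(cap for cap, _ in silos.values())
used = sum(cap * util for cap, util in silos.values())
print(f"Average utilization: {used / total:.0%}")      # ~56%
print(f"Stranded capacity:   {total - used:.0f} TB")   # ~960 TB bought but idle

# The catch: free space in the big_data silo can't absorb growth in the
# database silo -- it sits stranded behind a different storage array.
```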

The main reason customers haven’t consolidated yet is that each silo has different performance characteristics. As a result, they end up purchasing excess capacity, with the added cost and complexity, as a standard part of doing business.

To consolidate storage across these disparate environments requires a no-holds-barred approach to IO performance, second to none, which PowerFlex can deliver. The secret to its high levels of IO performance is RAID 10, deployed across a scale-out cluster. And PowerFlex clusters can range from 4 to 1000 or more nodes.

RAID 10 mirrors data and spreads the mirrored data across all drives and servers in a cluster, or some subset of it. As a result, as you add storage nodes, IO performance scales up almost linearly.

Yes, there can be other bottlenecks in clusters like this, most often networking, but with PowerFlex storage, IO need not be one of them. Anthony mentioned that PowerFlex will perform as fast as your infrastructure can support. So if your environment has 25 Gig Ethernet, it will perform IO at that speed; if you use 100 Gig Ethernet, it will perform at that speed.

In addition, PowerFlex offers automated LifeCycle Management (LCM), which can make operating a 1000-node PowerFlex cluster almost as easy as a 10-node cluster. However, to make use of this automated LCM, one must run its storage server software on Dell PowerEdge servers.

Brian said adding or decommissioning PowerFlex nodes is a painless process. Because data is always mirrored, customers can remove any node at any time, and PowerFlex will automatically rebuild its data across other nodes and drives. When you add nodes, those drives become immediately available to support more IO activity. Another item to note: because of RAID 10, PowerFlex mirror rebuilds happen very fast, as just about every other drive and node in the cluster (or subset) participates in the rebuild process.
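A rough sketch of why wide rebuilds beat a classic mirrored pair, with invented drive counts and throughput figures (not PowerFlex’s actual numbers):

```python
# Rebuild-time arithmetic (illustrative numbers, not PowerFlex's actual figures).
# When a drive fails, its data is re-mirrored from copies spread across the
# cluster, so nearly every surviving drive reads/writes a small slice in parallel.
failed_drive_tb = 8
per_drive_rebuild_mbps = 100   # rebuild throughput each participant contributes

for participants in (1, 10, 100, 500):
    seconds = failed_drive_tb * 1e6 / (per_drive_rebuild_mbps * participants)
    print(f"{participants:>4} drives participating: ~{seconds / 3600:.2f} hours")
# 1 participant:  ~22 hours (a classic two-drive mirror)
# 500 participants: a couple of minutes -- time shrinks ~linearly with drives
```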

PowerFlex supports Storage Pools. These partition PowerFlex storage nodes and devices into multiple pools of storage used to host volume IO and data. Storage pools can be used to segregate higher performing storage nodes from lower performing ones, so that some volumes reside exclusively on higher (or lower) performing hardware.

Although customers can configure PowerFlex to use all nodes and drives in a system or storage pool for volume data mirroring, PowerFlex offers other data placement alternatives to support high availability.

PowerFlex supports Protection Domains, which are subsets or collections of storage servers and drives in a cluster where volume data will reside. This allows one protection domain to go down while others continue to operate. Realize that, because volume data is mirrored across all devices in a protection domain, it would take lots of nodes or devices going down before a protection domain is out of action.

PowerFlex also uses Fault Sets, which are collections of storage servers and their devices within a Protection Domain that will contain one half of a volume’s data mirror. PowerFlex will ensure that a primary and its mirror copy of a volume’s data never reside on the same fault set. A fault set could be a rack of servers, multiple racks, all PowerFlex storage servers in an AZ, etc. With fault sets, customer data will always reside across a minimum of two fault sets, and if any one goes down, data is still available.
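Here’s a minimal sketch of that placement invariant, i.e., the two copies of a chunk always land in different fault sets. This is not PowerFlex’s actual placement algorithm, just an illustration of the rule; the rack and node names are made up.

```python
import random

# Fault-set placement invariant: a chunk's two mirror copies must live in
# different fault sets. Illustration only -- not PowerFlex's real algorithm.
fault_sets = {
    "rack-A": ["node1", "node2", "node3"],
    "rack-B": ["node4", "node5", "node6"],
    "rack-C": ["node7", "node8", "node9"],
}

def place_mirrors() -> tuple[str, str]:
    # Pick two *distinct* fault sets, then a node within each.
    fs_primary, fs_mirror = random.sample(list(fault_sets), 2)
    return random.choice(fault_sets[fs_primary]), random.choice(fault_sets[fs_mirror])

# Losing any one fault set (say, all of rack-A) still leaves one copy of
# every chunk alive in another fault set.
for chunk in range(4):
    print(f"chunk {chunk}: copies on {place_mirrors()}")
```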

PowerFlex also operates in the cloud. In this case, customers bring their own PowerFlex software and deploy it over cloud compute and storage.

Brian mentioned that anything PowerFlex can do, such as reconfiguring servers, can also be done through RESTful API calls. This can be particularly useful in cloud deployments as above, if customers want to scale IO performance up or down automatically.
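For a flavor of what scripting against a storage REST API looks like, here’s a hypothetical Python sketch. The gateway URL, endpoint path, payload and credentials are all invented for illustration; consult the PowerFlex REST API documentation for the real calls.

```python
import requests

# Hypothetical REST call to expand a volume -- endpoint, payload and auth are
# invented for illustration, not PowerFlex's documented API.
BASE = "https://powerflex-gateway.example.com/api"   # placeholder gateway URL

session = requests.Session()
session.auth = ("admin", "secret")                   # placeholder credentials

resp = session.post(
    f"{BASE}/volumes/vol-123/actions/expand",        # hypothetical endpoint
    json={"newSizeGB": 2048},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```

Wrap calls like this in a monitoring loop and scaling storage performance up or down becomes a closed-loop, automated operation.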

Besides block services, PowerFlex also offers native NFS and CIFS/SMB file services using a File Node Controller. This frontends PowerFlex storage nodes to support customer NFS/SMB file access to PowerFlex data.

Anthony Cinelli, Sr. Director Global PowerFlex Software Defined & MultiCloud Solutions

Anthony Cinelli is a key leader for Dell Technologies helping drive the success of our software defined and multicloud solutions portfolio across the customer landscape. Anthony has been with Dell for 13 years and in that time has helped launch our HCI and Software Defined businesses from startup to the multi-billion dollar lines of business they now represent for Dell.

Anthony has a wealth of experience helping some of the largest organizations in the world achieve their IT transformation and multicloud initiatives through the use of software defined technologies.

Brian Dean, Dell PowerFlex Technical Marketing

Brian is a 16+ year veteran of the technology industry, and before that spent a decade in higher education. Brian has worked at EMC and Dell for 7 years, first as a Solutions Architect and then as a TME, focusing primarily on PowerFlex and software-defined storage ecosystems.

Prior to joining EMC, Brian was on the consumer/buyer side of large storage systems, directing operations for two Internet-based digital video surveillance startups.

When he’s not wrestling with computer systems, he might be found hiking and climbing in the mountains of North Carolina.

147: GreyBeards talk ransomware protection with Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Sponsored By:

This is another in our series of sponsored podcasts with Jonathan Halstuch (@JAHGT), Co-Founder and CTO of RackTop Systems. You can hear more in Episode 145.

We asked Jonathan what was wrong with ransomware protection today. Jonathan started by mentioning that bad actors are present, on average, 277 days in an environment before being detected. That much dwell time means they could have easily corrupted most backups and snapshots, stolen copies of most of your sensitive/proprietary data and, of course, encrypted all your storage.

Backup-based ransomware protection works okay if dwell time is a couple of days or even a week, but not multiple months or longer. The only real solution to this level of ransomware sophistication is real-time monitoring of IO, looking for illegal activity. Listen to the podcast to learn more.

Often, data corruption, when discovered, is just the notification to an unsuspecting IT organization that it has been compromised and has lost control over its systems. Sort of like a thief ringing the doorbell to tell you, after the fact, that they stole all your stuff.

The only real solution to data breaches and ransomware attacks with significant dwell time, one that protects both your data and your reputation, is something like RackTop Systems and their BrickStor SP storage system. BrickStor offers an ongoing, real-time, active defense against ransomware that’s embedded in your data storage and continuously looks for bad actors and their activities during IO, all day, every day.

When BrickStor detects ransomware in progress, it shuts the attack down by halting any further access for that user/application and snapshotting the data before corruption, to immutable snapshots. That way admins have a known good copy of the data.

In addition, RackTop BrickStor SP supplies run-book-like recovery procedures that tell IT how to retrieve good data from snapshots, without wasting valuable time searching for the “last good backup”, which could be months old.

I asked whether data-at-rest encryption could offer any help. Jonathan said data encryption can thwart only some types of attacks. It’s not that useful against ransomware, as bad actors who infiltrate your system masquerade as valid users/admins and, by doing so, gain access to decrypted data.

RackTop Systems uses AI in its labs to create ransomware “assessors”, automated routines embedded in their storage data path, which execute continuously, looking for bad-actor IO patterns. These assessors provide the first line of defense against ransomware.
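To give a feel for what an assessor-style check might look like, here’s a toy Python sketch that flags writes whose payloads suddenly look like random bytes, a common signature of in-progress encryption. RackTop’s real assessors are proprietary and far more sophisticated; the threshold and logic below are invented for illustration.

```python
import math
from collections import Counter

# Toy "assessor": flag writes whose payloads look like random bytes
# (high entropy), a common signature of ransomware encrypting files.
# Threshold and logic are invented; RackTop's assessors are proprietary.
def shannon_entropy(buf: bytes) -> float:
    counts = Counter(buf)
    n = len(buf)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

ENTROPY_THRESHOLD = 7.5  # bits/byte; encrypted data approaches 8.0

def assess_write(user: str, payload: bytes) -> bool:
    """Return True if this write looks suspicious enough to halt the session."""
    return len(payload) > 0 and shannon_entropy(payload) > ENTROPY_THRESHOLD

print(assess_write("alice", b"quarterly report text " * 50))  # False: plain text
print(assess_write("mallory", bytes(range(256)) * 16))        # True: uniform bytes
```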

In addition to assessors, RackTop Systems supplies many reports depicting data access permissions, user/admin access permissions, data being accessed, etc. All of these help IT and security teams better understand how data is being used and provide the visibility needed to support better cyber security.

When ransomware is detected, RackTop BrickStor offers a number of notification features, ranging from web-hooks and Slack channels to email notices and just about everything in between, to tell IT and security teams that a breach is occurring and where.
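Web-hook alerting is simple to wire up on the receiving end. A minimal sketch, assuming a Slack incoming webhook (which accepts a simple JSON text payload); the URL and message fields are placeholders:

```python
import requests

# Minimal breach-notification sketch via a Slack incoming webhook.
# The URL below is a placeholder -- Slack issues the real one per channel.
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def notify_breach(share: str, user: str, reason: str) -> None:
    msg = f":rotating_light: Possible ransomware on {share}: {user} ({reason})"
    resp = requests.post(SLACK_WEBHOOK, json={"text": msg}, timeout=10)
    resp.raise_for_status()

notify_breach("/corp/finance", "svc-backup", "high-entropy overwrite burst")
```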

RackTop Systems BrickStor SP is available in many deployment options. One new option, from HPE, uses their block storage to present LUNs to BrickStor SP. Jonathan mentioned that other enterprise-class block storage vendors are starting to use BrickStor SP to supply secure NAS services for their customers as well.

Jonathan mentioned that RackTop attended the HIMSS conference in Chicago last week and will be attending many others throughout the year. So check them out at a conference near you if you get a chance.

Jonathan Halstuch, Co-Founder & CTO RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and co-founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.

146: GreyBeards talk K8s cloud storage with Brian Carmody, Field CTO, Volumez

We’ve known Brian Carmody (@initzero), Field CTO, Volumez, for over a decade now and he’s always been very technically astute. He moved to Volumez earlier this year, once again joining a storage startup. Volumez is a cloud K8s storage provider with a new twist: K8s persistent volumes hosted on ephemeral storage.

Volumez currently works in public clouds (AWS and Azure (soft launch), with GCP coming soon) and is all about supplying high performing, enterprise-class data services to K8s container apps. But it does this using transient storage (Azure ephemeral and AWS instance storage) and standard Linux. Hyperscalers offer transient storage almost as an afterthought with customer compute instances. Listen to the podcast to learn more.

It turns out that over the last decade or so, a lot of time and effort has been devoted to maturing Linux’s storage stack, and nowadays, with appropriate configuration, Linux can offer enterprise-class data services and performance using direct-attached NVMe SSDs. These services include thin provisioning, encryption, RAID/erasure coding, snapshots, etc., which, on top of NVMe SSDs, provide IOPS, bandwidth and latency performance that boggles the mind.
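For a flavor of the stock Linux plumbing behind such services, here’s an illustrative Python sketch driving the standard tools (mdraid, LVM thin provisioning, dm-crypt). Device names and sizes are placeholders, and this is emphatically not Volumez’s actual configuration, which is their secret sauce:

```python
import subprocess

def run(cmd: str) -> None:
    """Echo and execute a command (illustrative helper; run as root)."""
    print("+", cmd)
    subprocess.run(cmd.split(), check=True)

# Stock Linux data services on NVMe SSDs -- placeholders, not Volumez's config.
run("mdadm --create /dev/md0 --level=10 --raid-devices=4 "
    "/dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1")  # RAID 10 mirroring
run("pvcreate /dev/md0")
run("vgcreate vg0 /dev/md0")
run("lvcreate --type thin-pool -L 800G -n tpool vg0")       # thin provisioning
run("lvcreate --thin -V 2T -n vol1 vg0/tpool")              # 2TB thin volume
run("cryptsetup luksFormat /dev/vg0/vol1")                  # encryption at rest
run("lvcreate -s -n vol1-snap vg0/vol1")                    # thin snapshot
```

Multiply that by performance tuning, failure handling and fleet scale, and it becomes clear why doing this by hand doesn’t scale.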

However, configuring Linux’s sophisticated, high-performing data services is a hard problem to solve.

Enter Volumez. They have a SaaS control plane, client software and CSI drivers that will configure Linux with ephemeral storage to support any performance and data services that can be obtained from NVMe SSDs.

Once installed on your K8s cluster, Volumez software profiles all ephemeral storage and supplies that information to their SaaS control plane. Once that’s done, your platform engineers can define specific storage class policies or profiles usable by DevOps to consume ephemeral storage.

These policies identify volume [IOPS, bandwidth, latency] × [read, write] performance specifications as well as data protection, resiliency and other data service requirements. DevOps engineers consume this storage using PVCs that request these storage classes at some capacity. When it sees the PVC, the Volumez SaaS control plane will carve out slices of ephemeral storage that can support the performance and other storage requirements defined in the storage class.
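The DevOps side of that workflow is plain Kubernetes. A sketch using the official Kubernetes Python client; the storage class name “volumez-high-iops” is hypothetical, standing in for whatever classes your platform engineers define:

```python
from kubernetes import client, config

# Claim a volume from a platform-defined storage class. The class name
# "volumez-high-iops" is hypothetical -- platform engineers define the
# real class names and the performance policies behind them.
config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="orders-db-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="volumez-high-iops",   # hypothetical class name
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc)
```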

Once those slices are carved out, their control plane next creates a network path from the compute instances with ephemeral storage to the worker nodes running container apps. After that, it steps out of the picture and the container apps have a direct (network) data path to the storage they requested. Note that Volumez’s SaaS control plane is not in the container app storage data path at all.

Volumez supports multi-AZ data resiliency for PVCs. In this case, a mirror K8s cluster resides in another AZ, with Volumez software active and similar, if not equivalent, ephemeral storage, and Volumez configures the container volume to mirror data between AZs. Similarly, if the policy requests erasure coding, Volumez SaaS software configures the ephemeral storage to provide erasure coding for that container volume.

Brian said they’ve done some amazing work to increase the speed of Linux snapshotting and restoring.

As noted above, the Volumez control plane SaaS software is outside the data path, so even if the K8s cluster running Volumez-enabled storage loses access to the control plane, container apps continue to run and perform IO to their storage. This can continue until a new PVC request requires access to the control plane.

Ephemeral storage is accessed through special compute instances. These are not K8s worker nodes; they essentially act as a passthrough, or network attachment, between worker nodes running apps with PVCs and the Volumez-configured Linux logical volumes hosted on slices of ephemeral storage.

Volumez is gaining customer traction with data platform clients, DBaaS companies, and some HPC environments. But just about anyone needing high performing data services for cloud K8s container apps should give Volumez a try.

I looked at AWS to see how they price instance store capacity and found that it’s not priced separately; rather, instance storage is bundled into the cost of EC2 compute instances.

Volumez is priced based on the number of media devices (instance/ephemeral stores) and the performance (IOPS) available. They also have different tiers depending on support level requirements (e.g., community, business hours, 24×7), which also offer different levels of enterprise security functionality.

Brian said they have a free tier that customers can easily sign up for and try out by going to their web site (see link above), or, if you would like a guided demo, just contact him directly.

Brian Carmody, Field CTO, Volumez

Brian Carmody is Field CTO at Volumez. Prior to joining Volumez, he served as Chief Technology Officer of data storage company Infinidat where he drove the company’s technology vision and strategy as it ramped from pre-revenue to market leadership.

Before joining Infinidat, Brian worked in the Systems and Technology Group at IBM where he held senior roles in product management and solutions engineering focusing on distributed storage system technologies.

Prior to IBM, Brian served as a technology executive at MTV Networks Viacom, and at Novus Consulting Group as a Principal in the Media & Entertainment and Banking practices.

145: GreyBeards talk proactive NAS security with Jonathan Halstuch, CTO & Co-Founder, RackTop Systems

Sponsored By:

We’ve known about RackTop Systems since episode 84, and have been watching them ever since. On this episode we, once again, talk with Jonathan Halstuch (@JAHGT), CTO and Co-Founder, RackTop Systems.

RackTop has always been very security oriented, but lately they have taken this to the next level. As Jonathan says on the podcast, historically security has been mostly a network problem, but since ransomware emerged, security is now often a data concern too. The intent of proactive NAS security is to identify and thwart bad actors before they impact data, rather than after the fact. Listen to the podcast to learn more.

Proactive security for NAS storage includes monitoring user IO and administrator activity and looking for anomalies. RackTop has the ability (via config options) to halt IO activity when things look wrong, that is, when user/application IO looks different from what has been seen in the past. They also examine admin activity, a popular vector for ransomware attacks. RackTop’s IO/admin activity scanning is done in real time, as IO is processed and admin commands are received.

The customer gets to decide how far to take this. The challenge with automatically halting access is false positives, when, say, a new application starts taking off. Security admins must have an easy way to see and understand what was anomalous and what was not, and to quickly let that user/application return to normal activities or take it out.

In addition to just stopping access, the system can also simply report anomalies to admins/security staff. Moreover, it can automatically take snapshots of data when anomalous behavior is detected, giving admins and security a point-in-time view into the data before the bad behavior occurred.

RackTop Systems has a number of assessors that look for specific anomalous activity, used to detect and act to thwart malware. For example, an admin assessor looks at all admin operations to determine whether they are considered normal or not.

RackTop also supports special time-period access permissions. These provide temporary, time-dependent, unusual access rights to data for admins, users or applications that would normally be considered a breach, such as an admin copying lots of data or moving and deleting data. They cover situations that crop up where mass data deletion, movement or copying would be valid. When the time-period access permission elapses, the system goes back to monitoring for anomalous behavior.
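Mechanically, a time-boxed exception window is a simple idea. A toy sketch of the concept (not RackTop’s implementation; the principals and durations are invented):

```python
from datetime import datetime, timedelta, timezone

# Toy time-boxed exception window: activity that would normally trip an
# assessor (mass deletes/moves) is permitted for a named principal until
# the window expires. Not RackTop's implementation -- just the concept.
exceptions: dict[str, datetime] = {}   # principal -> window expiry

def grant_window(principal: str, hours: int) -> None:
    exceptions[principal] = datetime.now(timezone.utc) + timedelta(hours=hours)

def is_exempt(principal: str) -> bool:
    expiry = exceptions.get(principal)
    return expiry is not None and datetime.now(timezone.utc) < expiry

grant_window("admin-jane", hours=4)   # planned data migration tonight
print(is_exempt("admin-jane"))        # True: mass moves won't be halted
print(is_exempt("svc-backup"))        # False: still under normal scrutiny
```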

We talked about the overhead of doing all this scanning and detection in real time and how that might impact system IO performance. For other storage vendors, these sorts of activities are often done with standalone appliances, which of course add additional IO to a storage system in order to do offline scans.

Jonathan said, with recent Intel Xeon multi-core processors, they can readily afford the CPU cycles/cores required to do their scanning during IO processing, without sacrificing IO performance.

RackTop also supplies a number of reports showing system-configured data/user/application access rights as well as what accesses have occurred over time. Such reports offer admin/security teams visibility into data access rights and usage.

RackTop can be deployed in hybrid disk-flash solutions, as storage software in public clouds, in an HCI solution, or in edge environments that replicate back to core data centers. They can also be used as a backup/archive data target for backup systems. RackTop Systems NAS supports CIFS 1.0 through SMB 3.1.1, and NFSv3 through NFSv4.2.

RackTop Systems has customers in national government agencies, security-sensitive commercial sectors, state government, healthcare, and just about any organization subject to ransomware attacks on a regular basis. Which, nowadays, is pretty much every IT organization on the planet.

Jonathan Halstuch, CTO & Co-Founder, RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and co-founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.

144: GreyBeards talk AI IO with Subramanian Kartik & Howard Marks of VAST Data

Sponsored By:

Today we talked with VAST Data’s Subramanian Kartik (@phyzzycyst), Global Systems Engineering Lead, and Howard Marks (@DeepStorage@mastodon.social, @deepstoragenet), former GreyBeards co-host and now Technologist Extraordinary & Plenipotentiary at VAST. Howard needs no introduction to our listeners, but Kartik does. Kartik has supported a number of customers implementing AI apps at VAST and prior companies, so he is well versed in the realities of AI/ML/DL. Moreover, VAST recently funded Silverton Consulting to write a paper discussing deep learning IO.

Although AI/ML/DL applications are very popular in IT these days, there’s been a continuing challenge in trying to understand their IO requirements. Listen to the podcast to learn more.

AI/ML/DL neural network (NN) models train with data, and lots of it, while inferencing is also very data dependent. Kartik said AI model IO consists of small block, random reads with very few writes.

Some models contain huge NNs which consume mountains of data to train while others are relatively small and consume much less. GPT-3(.5), the model behind the original ChatGPT, has ~175B parameters in its ~800GB NN.
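The parameter count and on-disk size line up with simple arithmetic, assuming 4-byte (FP32) parameters:

```python
# Parameter-memory arithmetic at FP32 (4 bytes per parameter).
params = 175e9
raw_weights_gb = params * 4 / 1e9
print(f"~{raw_weights_gb:.0f} GB of raw weights")
# ~700 GB; extra stored model state and metadata push a saved model
# toward the ~800GB figure cited above.
```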

As many of us know, the key to AI processing is GPU hardware, which performs most, if not all, of the computations to train models and supply inferences. Moreover, to maximize training throughput, many organizations deploy model parallelism, using 10s to 1000s of GPUs.

For instance, in the paper mentioned earlier, we showed a model training IO chart based on all six storage vendor published NVIDIA DGX A100 reference architecture reports for ResNet-50. On this single chart, all six storage systems supplied roughly the same images-processed/sec (or ~IO bandwidth) performance when training the model on each of the 8, 16 and 32 GPU configurations. This is very unusual from our perspective, but it shows that ResNet-50 training is not IO bound.

However, another approach to speeding up NN training is to take advantage of newer, more advanced IO protocols. NVIDIA GPUDirect Storage transfers data directly from storage memory to GPU memory, bypassing CPU memory altogether, which can significantly speed up GPU data consumption. It turns out that one bottleneck for AI training is CPU memory bandwidth.

In addition, most AI model training reads data from a single file system mount point. Historically, an NFS mount point was limited to a single TCP connection and a maximum of ~2.5GB/sec of IO bandwidth. Recently, however, nconnect for NFS has been introduced, which increases the number of TCP connections to 16 per mount point.
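nconnect is just a mount option on current Linux kernels (mainline caps it at 16). A sketch of mounting a training data export with it, via Python’s subprocess; the server name and paths are placeholders:

```python
import subprocess

# Mount an NFS export with 16 TCP connections (nconnect is a standard Linux
# NFS mount option; mainline kernels cap it at 16). Server/paths are placeholders.
subprocess.run(
    ["mount", "-t", "nfs", "-o", "vers=3,nconnect=16",
     "nfs-server.example.com:/trainingdata", "/mnt/trainingdata"],
    check=True,
)
# Without nconnect, all NFS RPCs for this mount share one TCP connection,
# which is what historically capped a single mount at ~2.5GB/sec.
```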

Despite that, VAST Data found that by adding some code to Linux’s NFS TCP stack, they were able to increase nconnect to 64 TCP connections per compute node. Howard mentioned that with these changes and a 16 (compute) node VAST Data storage cluster, they sustained 175GB/sec of GPUDirect Storage bandwidth to a DGX A100 system.

Subramanian Kartik, Global Systems Engineering Lead, VAST Data

Subramanian Kartik has been the Vice President of Systems Engineering at VAST Data since January of 2020, running the global presales organization. He is part of the incredible success of VAST Data, which has increased almost 10-fold in valuation and revenue in this period.

An accomplished technologist and executive in the industry, he has a wide array of experience in Cloud Architectures, AI/Machine Learning/Deep Learning, as well as in the Life Sciences, covering high-performance computing and storage. He has had a lifelong deep passion for studying complex problems in all spheres spanning both workloads and infrastructure at the vanguard of current day technology.

Prior to his work at VAST Data, he was with EMC (later Dell) for two decades, as both a Distinguished Engineer and a global executive running the Converged and Hyperconverged Division go-to-market. He has a Ph.D. in Particle Physics with over 75 publications and 3 patents to his credit over the years. He enjoys mathematics, jazz, cooking and travelling with his family in his non-existent spare time.

Howard Marks, (former GreyBeards Co-Host) Technologist Extraordinary and Plenipotentiary, VAST Data

Howard Marks brings over forty years of experience as a technology architect for hire and industry observer to his role as VAST Data’s Technologist Extraordinary and Plenipotentiary. In this role, Howard demystifies VAST’s technologies for customers and customer requirements for VAST’s engineers.

Before joining VAST, Howard ran DeepStorage, an industry test lab and analyst firm. An award-winning speaker, he has appeared at events on three continents, including Comdex, Interop and VMworld.

Howard is the author of several books (all gratefully out of print) and hundreds of articles since Bill Machrone taught him journalism at PC Magazine in the 1980s.

Listeners may also remember that Howard was a founding co-Host of the Greybeards-on-Storage Podcast.