74: GreyBeards talk NVMe shared storage with Josh Goldenhar, VP Cust. Success, Excelero

Sponsored by: Excelero

In this episode we talk NVMe shared storage with Josh Goldenhar (@eeschwa), VP, Customer Success at Excelero. Josh has been on our show before (please see our April 2017 podcast), that time together with Excelero's CTO & Co-founder, Yaniv Romem.

This is Excelero’s 1st sponsored GBoS podcast and we wish to welcome them again to the show. Since Excelero’s NVMesh storage software is in customer hands now, Josh is transitioning to add customer support to his other duties.

NVMe storage industry trends

We started our discussion with the maturing NVMe market. Howard mentioned he heard that NVMe SSD sales have overtaken SATA SSD volumes. Josh mentioned that NVMe SSDs are getting harder to come by, driven primarily by Super 8 (the 8 biggest hyper-scalers) purchases. And even when these SSDs can be found, customers are paying a premium for NVMe drives.

The industry is also starting to sell larger capacity NVMe SSDs. Customers view this as a way of buying cheaper ($/GB) storage. However, most NVMe shared storage systems use mirroring for data protection, which cuts effective (protected) capacity in half, doubling cost/GB.

Another change in the market is that, with today's apps, many customers no longer need all the read AND write IO performance from their NVMe storage. For newer applications/workloads, writes are less frequent and, as such, less of a driver of application performance. But read performance is still critical.

The other industry trend is a number of new vendors offering NVMeoF (Ethernet) storage arrays (see: Pavilion Data's, Attala Systems', and Solarflare Communications' podcasts in just the last few months). Most of these startup systems are essentially top of rack, shared NVMe SSDs, some with limited data protection/management services.

Excelero's NVMesh has offered a logical volume manager as well as NVMe shared storage since the start, with both RAID 0 (unprotected) and RAID 1/10 (protected) storage.

Excelero is coming out with a new release of its NVMesh™ software defined storage.

NVMesh 2

We were particularly interested in one of NVMesh 2's new capabilities, its distributed data protection, which is based on Erasure Coding (EC, like RAID 6), with a stripe made up of 8+2 segments. Unlike mirroring/RAID 1-10, EC only reduces effective NVMe storage capacity by 20% for protection, and it protects against 2 drive failures within a RAID group.
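
To make the capacity math concrete, here's a minimal Python sketch (our own illustration, not Excelero code) comparing usable capacity under RAID 1/10 mirroring with an 8+2 erasure coded stripe, assuming 100TB of raw NVMe flash:

```python
def usable_capacity(raw_tb, data_segments, parity_segments):
    """Usable capacity for a stripe of data+parity segments (e.g., 8+2 EC)."""
    return raw_tb * data_segments / (data_segments + parity_segments)

raw = 100.0  # TB of raw NVMe flash (example figure)

mirrored = usable_capacity(raw, 1, 1)      # RAID 1/10: every byte stored twice
ec_8_plus_2 = usable_capacity(raw, 8, 2)   # 8 data + 2 parity segments

print(f"Mirroring : {mirrored:.0f} TB usable ({100*mirrored/raw:.0f}% of raw)")
print(f"8+2 EC    : {ec_8_plus_2:.0f} TB usable ({100*ec_8_plus_2/raw:.0f}% of raw)")
# Mirroring : 50 TB usable (50% of raw)
# 8+2 EC    : 80 TB usable (80% of raw)
```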

However, with distributed data protection, write IO won't perform as well as it does with mirroring. Reads, though, perform just as fast as ever.

As with any data protection, customers will need sufficient spare capacity to rebuild data for a failed device.

The latest release will be available to all current customers on service contract. When available, customers should immediately start benefiting from the space efficient, distributed data protection for new data on the system.

The new release also adds Fibre Channel (as Howard correctly guessed on the podcast) and TCP/IP protocols to their current InfiniBand, RoCE, and NVMeoF support, as well as new performance analytics to help diagnose performance issues faster and at scale.

The podcast runs ~25 minutes. Josh has an interesting perspective on the NVMe storage market as well as competitive solutions and was great to talk with again. The new data protection functionality in Excelero NVMesh 2 signals an evolving NVMe storage market. As NVMe storage matures, the tradeoff between performance and data services looks to be an active war zone for some time to come. Listen to the podcast to learn more.

Josh Goldenhar, Vice President Customer Success, Excelero

Josh has been responsible for product strategy and vision at leading storage companies for over two decades. His experience puts him in a unique position to understand the needs of customers.
Prior to joining Excelero, Josh was responsible for product strategy and management at EMC (XtremIO) and DataDirect Networks. Prior to that, his experience and passion were in large scale systems architecture and administration at companies such as Cisco Systems. He's been a technology leader in Linux, Unix and other OSs for over 20 years. Josh holds a Bachelor's degree in Psychology/Cognitive Science from the University of California, San Diego.

68: GreyBeards talk NVMeoF/TCP with Ahmet Houssein, VP of Marketing & Strategy @ Solarflare Communications

In this episode we talk with Ahmet Houssein, VP of Marketing and Strategic Direction at Solarflare Communications (@solarflare_comm). Ahmet's been in the industry forever and has a unique view on where NVMeoF needs to go. Howard had talked with Ahmet at last year's FMS. Ahmet will also be speaking at this year's FMS (this week in Santa Clara, CA).

Solarflare Communications sells Ethernet communication gear, mostly to the financial services market, and has developed a software plugin for the standard TCP/IP stack on Linux that supports both target and client mode NVMeoF/TCP. That is, their software plugin provides a complete implementation of NVMeoF across TCP Ethernet that extends the TCP protocol but doesn't require RDMA (RoCE or iWARP) or data center bridging.

Implementing NVMeoF/TCP

Solarflare's NVMeoF/TCP is a free plugin that, once approved by the NVMe(oF) standards committees, anyone can use to create an NVMeoF storage system and consume that storage from almost anywhere. The standards committee is expected to approve the protocol extension soon, and sometime after that the plugin will be added to the Linux kernel. After standards approval, maybe VMware and Microsoft will adopt it as well, but that may take more work.

Over the last year plus, most of the NVMeoF/Ethernet solutions we've encountered require sophisticated RDMA hardware. When we talked with Pavilion Data Systems, a month or so ago, they had designed a more networking-like approach to NVMeoF using RoCE and TCP [updated 8/8/18 after VR's comment, the ed.]. When we talked with Attala Systems, they had a special purpose FPGA in their host network adapters, used with Mellanox switches, to support target & client mode NVMeoF/UDP [updated 8/8/18 after VR's comment, the ed.].

Solarflare is taking a different tack.

One problem with NVMeoF/Ethernet RDMA is compatibility. You can use either RoCE or iWARP RDMA NICs but, at the moment, you can't use both. With TCP/IP plugins there's no hardware compatibility issue. (Yes, there's still software compatibility to deal with at both ends of the pipe.)

Solarflare recently measured latencies for their NVMeoF/TCP (using Iometer/FIO), which show that running the protocol adds about a 5-10% increase in latency versus running NVMeoF over RDMA (RoCE or iWARP).

Performance measurements were taken using a server, running Red Hat Linux + their TCP plugin with NVMe SSDs on the storage side and a similar configuration on the client side without the SSDs.

If they add 10% latency to a 10 microsec IO (for Optane), latency becomes 11 microsec. Similarly, for flash NVMe SSDs it moves from 100 microsec to 110 microsec.
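
A quick back-of-the-envelope in Python (our illustration of the arithmetic above, not a Solarflare benchmark; the 5% case reflects the NIC hardware optimizations Ahmet mentions next):

```python
def with_overhead(base_usec, overhead_pct):
    """Latency after adding a fractional protocol overhead."""
    return base_usec * (1 + overhead_pct / 100)

for device, base in [("Optane SSD", 10), ("Flash NVMe SSD", 100)]:
    print(f"{device}: {base} usec native -> {with_overhead(base, 10):.0f} usec at 10%, "
          f"{with_overhead(base, 5):.1f} usec at 5% (HW-optimized NIC)")
# Optane SSD: 10 usec native -> 11 usec at 10%, 10.5 usec at 5% (HW-optimized NIC)
# Flash NVMe SSD: 100 usec native -> 110 usec at 10%, 105.0 usec at 5% (HW-optimized NIC)
```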

Ahmet did mention that their NICs have some hardware optimizations which bring this added latency down to something closer to 5%. And later we discuss the immense parallelism opportunities of running the TCP stack in user space. Their hardware also better supports more threads doing IO in parallel.

Why TCP

Ahmet's on a mission. He says there's this misbelief that Ethernet RDMA hardware is required to achieve lightning fast response times using NVMeoF, but it's not true. Standard TCP with proper protocol enhancements is more than capable of performing at very close to the same latencies as RDMA, without special NICs and DCB switch configurations.

Furthermore, TCP/IP already has multipathing support, so the current high availability characteristics of TCP are readily applicable to NVMeoF/TCP.

Parallelism through user space

NVMeoF/TCP was the subject of the 1st half of our discussion, but we spent the 2nd half talking about scaling or parallelism. Even if you can do 11 or 110 microsecond latency IO, at some point, if you do enough of these IOs, the kernel overhead in processing blocks and transferring control from kernel space to user space will become a bottleneck.

However, there’s nothing stopping IT from running the TCP/IP stack in user space and eliminating any kernel control transfer whatsoever. By doing so, data centers could parallelize all this IO using as many cores as available.

Running the plugin in a TCP/IP stack in user space allows you to scale NVMeoF lightning fast IO to as many users as you have user spaces or cores, and the kernel doesn't even break into a sweat.

Anyone could simply download Solarflare's plugin, configure a white box server with Linux and 24 NVMe SSDs, and support ~8.4M IOPS (350Kx24) at ~110 microsec latency. And with user space scaling, one could easily have 1000s of user spaces connected to it.
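
The aggregate IOPS figure is just multiplication; a one-line sketch using the per-drive number quoted above:

```python
iops_per_ssd = 350_000          # per-drive IOPS figure quoted above
ssds = 24
print(f"{iops_per_ssd * ssds / 1e6:.1f}M IOPS from {ssds} SSDs")   # 8.4M IOPS from 24 SSDs
```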

They're going to need faster pipes!

The podcast runs ~39 minutes. Ahmet was very knowledgeable about NVMe, NVMeoF and TCP.  He was articulate and easy to talk with.  Listen to the podcast to learn more.

Ahmet Houssein, VP of Marketing and Strategic Direction at Solarflare Communications 

Ahmet Houssein is responsible for establishing marketing strategies and implementing programs to drive revenue growth, enter new markets and expand brand awareness to support Solarflare’s continuous development and global expansion.

He has over twenty-five years of experience in the server, storage, data center and networking industry, and held senior level executive positions in product development, marketing and business development at Intel and Honeywell. Most recently Houssein was SVP/GM at QLogic, where he successfully delivered the industry's first 25Gb Ethernet products to market, securing design wins at HP and Dell.

One of the key leaders in the creation of the InfiniBand and PCI-Express industry standards, Houssein is a recipient of the Intel Achievement Award and was a founding board member of the Storage Network Industry Association (SNIA), a global organization of 400 companies in the storage market. He was educated in London, UK and holds the equivalent of an Electrical Engineering degree.

62: GreyBeards talk NVMeoF storage with VR Satish, Founder & CTO Pavilion Data Systems

In this episode,  we continue on our NVMeoF track by talking with VR Satish (@satish_vr), Founder and CTO of Pavilion Data Systems (@PavilionData). Howard had talked with Pavilion Data over the last year or so and I just had a briefing with them over the past week.

Pavilion Data is taking a different tack to NVMeoF, innovating in software and hardware design but using merchant silicon for their NVMeoF accelerated array solution. They offer Ethernet based NVMeoF block storage.

VR is a storage “lifer“, having worked at Veritas on their Volume Manager and other products for a long time. Moreover, Pavilion Data has a number of execs from Pure Storage (including their CEO, Gurpreet Singh) and other storage technology companies, and is located in San Jose, CA.

VR says there were 5 overriding principles for Pavilion Data as they were considering a new storage architecture:

  1. The IT industry is moving to rack scale compute and hence, there is a need for rack scale storage.
  2. Great merchant silicon was coming online, so there was less of a need to design their own silicon/ASICs/FPGAs.
  3. Rack scale storage needs to provide “local” (within the rack) resiliency/high availability and let modern applications manage “global” (outside the rack) resiliency/HA.
  4. Rack scale storage needs to support advanced data management services.
  5. Rack scale storage has to be easy to deploy and run.

Pavilion Data's key insight was that, in order to meet all those principles and deal with high performance NVMe flash and up-and-coming SCM SSDs, storage had to be redesigned to look more like network switches.

Controller cards?

One can see this new networking approach in their bottom of rack, 4U storage appliance. Their appliance has up to 20 controller cards creating a heavy compute/high bandwidth cluster attached via an internal PCIe switch to a backend storage complex made up of up to 72 U.2 NVMe SSDs.

The SSDs fit into an interposer that plugs into their PCIe switch and maps single (or dual) ported SSDs to the appliance's PCIe bus. Each controller card has an Intel Xeon D microprocessor and two 100GbE ports, for up to 40 100GbE ports per appliance. The controller cards are configured in an active-active, auto-failover mode, for high availability. They don't use memory caching or have any NVram.

On their website, Pavilion Data shows 117 µsec response times and 114 GB/sec of throughput for IO performance.

Data management for NVMeoF storage

Pavilion Data storage supports widely striped/RAID6 data protection (16+2), thin provisioning, space efficient read-only (redirect-on-write) snapshots and space efficient read-write clones. With RAID6, it takes more than 2 drive failures to lose data.
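
Applied to a fully configured appliance, the 16+2 stripe arithmetic works out roughly as follows (a sketch; the per-drive capacity is our assumption, not a Pavilion Data spec):

```python
drives = 72                # max U.2 NVMe SSDs per appliance (from above)
data, parity = 16, 2       # wide-stripe RAID6 layout
drive_tb = 8               # assumed per-drive capacity, for illustration only

raw_tb = drives * drive_tb
usable_tb = raw_tb * data / (data + parity)
print(f"Raw: {raw_tb} TB, usable after 16+2 protection: {usable_tb:.0f} TB "
      f"({100 * data / (data + parity):.0f}% efficiency)")
# Raw: 576 TB, usable after 16+2 protection: 512 TB (89% efficiency)
```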

Like traditional storage, volumes (NVMe namespaces) are assigned to RAID groups. The backend layout appears to be a log-structured file. VR mentioned that they don't do garbage collection and, with no NVram and no memory caching, there's a bit of secret sauce here.

Pavilion Data storage offers two NVMeoF/Ethernet protocols:

  • Standard, off the shelf NVMeoF/RoCE interface that makes use of the v1.x Linux kernel NVMeoF/RoCE drivers and special NIC/switch hardware
  • New NVMeoF/TCP interface that doesn't need special networking hardware and, as such, offers NVMeoF over standard NICs/switches. I assume this requires host software to work.

In addition, Pavilion Data has developed their own Multi-path IO (MPIO) driver for NVMeoF high availability which they have contributed to the current Linux kernel project.

Their management software uses RESTful APIs (documented on their website). They also offer a CLI and GUI, both built using these APIs.  Bottom of rack storage appliances are managed as separate storage units, so they don’t support clusters of more than one appliance. However, there are only a few cluster storage systems we know of that support 20 controllers today for block storage.
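
To illustrate the pattern (the base URL, endpoints, fields and credentials below are hypothetical, not Pavilion Data's documented API), a CLI or GUI built on top of RESTful APIs typically reduces to calls like these:

```python
import requests

BASE = "https://pavilion-array.example.com/api/v1"   # hypothetical base URL
AUTH = ("admin", "secret")                           # hypothetical credentials

# Create a thin provisioned volume (NVMe namespace) in a RAID group -- hypothetical endpoint/fields
resp = requests.post(f"{BASE}/volumes", auth=AUTH, json={
    "name": "mongodb-vol01",
    "size_gb": 1024,
    "raid_group": "rg1",
    "thin_provisioned": True,
})
resp.raise_for_status()
print(resp.json())

# List volumes -- the same call a CLI or GUI would make under the covers
print(requests.get(f"{BASE}/volumes", auth=AUTH).json())
```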

Market

VR mentioned that they are going after new applications like MongoDB, Cassandra, CouchBase, etc. These applications are designed around rack scaling and provide “global”, off-rack/cross datacenter availability themselves. But VR also mentioned Oracle and other, more traditional applications. Pavilion Data storage is sold on a ($/GB) capacity basis.

The system comes in a minimum configuration of 5 controller cards and 18 NVMe SSDs, and can be extended in increments of 5 controllers/18 NVMe SSDs up to the full 20 controller cards and 72 NVMe SSDs.

The podcast runs ~42 minutes. VR was very knowledgeable about the storage industry, NVMeoF storage protocols, NVMe SSDs and advanced data management capabilities. We had a good talk with VR on what Pavilion Data does and how well it works.   Listen to the podcast to learn more.

VR Satish, Founder and CTO, Pavilion Data Systems

VR Satish is the Chief Technology Officer at Pavilion Data Systems and brings more than 20 years of experience in enterprise storage software products.

Prior to joining Pavilion Data, he was an Entrepreneur-in-Residence at Artiman Ventures. Satish was an early employee of Veritas and later served as the Vice President and the Chief Technology Officer for the Information & Availability Group at Symantec Corporation prior to joining Artiman.

His current areas of interest include distributed computing, information-centric storage architectures and virtualization.

Satish holds multiple patents in storage management, and earned his Master’s degree in computer science from the University of Florida.

61: GreyBeards talk composable storage infrastructure with Taufik Ma, CEO, Attala Systems

In this episode, we talk with Taufik Ma, CEO, Attala Systems (@AttalaSystems). Howard had met Taufik at last year's Flash Memory Summit (FMS17) and was intrigued by their architecture, which he thought was a harbinger of future trends in storage. The fact that Attala Systems was innovating with new, proprietary hardware made for an interesting discussion in its own right, from my perspective.

Taufik’s worked at startups and major hardware vendors in his past life and seems to have always been at the intersection of breakthrough solutions using hardware technology.

Attala Systems is based out of San Jose, CA.  Taufik has a class A team of executives, engineers and advisors making history again, this time in storage with JBoFs and NVMeoF.

Ray's written about JBoF (just a bunch of flash) before (see the Facebook moving to JBoF post). This is essentially a hardware box, filled with lots of flash storage and drive interfaces, that directly connects to servers. Attala Systems storage is JBoF on steroids.

Composable Storage Infrastructure™

Essentially, their composable storage infrastructure JBoF connects with NVMeoF (NVMe over Fabrics) using Ethernet to provide direct host access to NVMe SSDs. They have implemented special purpose, proprietary hardware in the form of an FPGA, used in a proprietary host network adapter (HNA), to support their NVMeoF storage.

Their HNA comes in a host side and a storage side version, both utilizing Attala Systems' proprietary FPGA(s). With Attala HNAs they have implemented their own NVMeoF over UDP stack in hardware. It supports multi-path IO and highly available, dual- or single-ported NVMe SSDs in a storage shelf. They use standard RDMA-capable 25/50/100GbE Ethernet (read: Mellanox) switches to connect hosts to storage JBoFs.

They also support RDMA over Converged Ethernet (RoCE) NICs for additional host access. However, I believe this requires host software (their NVMeoF over UDP stack) to connect to their storage.

From the host, Attala Systems storage on HNAs looks like directly attached NVMe SSDs, only hot pluggable and physically located across an Ethernet network. In fact, Taufik mentioned that they already support VMware vSphere servers accessing Attala Systems composable storage infrastructure.

Okay, on to the good stuff. Taufik said they measured their overhead and found they can perform an IO with only an additional 5 µsec over native NVMe SSD latencies. Current NVMe SSDs operate with a response time of 90 to 100 µsec, so with Attala Systems Composable Storage Infrastructure you should see 95 to 105 µsec response times over a JBoF(s) full of NVMe SSDs! Taufik said with Intel Optane SSDs' 10 µsec response times, they see response times at ~16 µsec (the extra µsec seems to be network switch delay)!!
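
The same back-of-the-envelope arithmetic, using the figures Taufik quoted:

```python
native_flash = (90, 100)   # usec, current NVMe flash SSD response times (from above)
native_optane = 10         # usec, Intel Optane SSD response time
overhead = 5               # usec, Attala's measured added fabric overhead
switch_delay = 1           # usec, roughly what the extra Optane microsecond implies

print(f"Flash over the fabric : {native_flash[0] + overhead}-{native_flash[1] + overhead} usec")
print(f"Optane over the fabric: ~{native_optane + overhead + switch_delay} usec")
# Flash over the fabric : 95-105 usec
# Optane over the fabric: ~16 usec
```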

Managing composable storage infrastructure

They also use a management “entity” (running on a server or as a VM) to manage their JBoF storage and configure NVMe Namespaces (like a SCSI LUN/volume). Hosts use NVMe Namespaces to access and carve up the JBoF NVMe storage space. That is, multiple Attala Systems Namespaces can be configured over a single NVMe SSD, each one corresponding to a single (virtual to real) host NVMe SSD.
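
Here's a rough sketch of that namespace-to-SSD carving idea (illustrative data structures of our own, not Attala's management model):

```python
from dataclasses import dataclass, field

@dataclass
class NvmeSsd:
    serial: str
    capacity_gb: int
    allocated_gb: int = 0
    namespaces: list = field(default_factory=list)

    def carve_namespace(self, name: str, size_gb: int) -> dict:
        """Carve a namespace out of this SSD; to a host it looks like its own NVMe drive."""
        if self.allocated_gb + size_gb > self.capacity_gb:
            raise ValueError("not enough free capacity on this SSD")
        ns = {"name": name, "size_gb": size_gb, "backing_ssd": self.serial}
        self.namespaces.append(ns)
        self.allocated_gb += size_gb
        return ns

ssd = NvmeSsd(serial="SSD-0007", capacity_gb=16000)
ssd.carve_namespace("host-a-ns1", 4000)   # host A sees a 4TB "local" NVMe drive
ssd.carve_namespace("host-b-ns1", 8000)   # host B sees an 8TB "local" NVMe drive
print(f"{ssd.allocated_gb} GB allocated of {ssd.capacity_gb} GB")
```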

The management entity has a GUI but it just uses their RESTful APIs. They also support QoS, on an IOPS or bandwidth limiting basis per Namespace, to control noisy neighbors.

Attala Systems architected their management system to support scale out storage. This means they could support many JBoFs in a rack and possibly multiple racks of JBoFs connected to swarms of servers. And nothing was said that would limit the number of Attala storage system JBoFs attached to a single server or under a single (dual for HA) management entity. I thought host software might have a problem with this (e.g., 256 NVMe namespaces/SSDs PCIe-connected to the same server), but Taufik said this isn't a problem for a modern OS.

Taufik mentioned that with their RESTful APIs, namespaces can be quickly created and torn down, on the fly. They envision their composable storage infrastructure to be a great complement to cloud compute and container execution environments.

For storage hardware, they use storage shelves from OEM vendors. One recent configuration from Supermicro has 32 hot-pluggable, dual ported NVMe slots in a 1U chassis, which at today's ~16TB capacities is ~1/2PB of raw flash. Taufik mentioned 32TB NVMe SSDs are being worked on as we speak. Imagine that, 1PB of flash NVMe SSD storage in 1U!!
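
The raw-capacity math behind those numbers, as a quick sketch:

```python
slots = 32                              # hot-pluggable NVMe slots in the 1U shelf
tb_per_ssd_today, tb_per_ssd_next = 16, 32

print(f"Today: {slots * tb_per_ssd_today} TB raw in 1U (~1/2 PB)")
print(f"Next : {slots * tb_per_ssd_next} TB raw in 1U (~1 PB)")
# Today: 512 TB raw in 1U (~1/2 PB)
# Next : 1024 TB raw in 1U (~1 PB)
```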

The podcast runs ~47 minutes. Taufik took a while to get warmed up, but once he got going, my jaw dropped. Listen to the podcast to learn more.

Taufik Ma, CEO Attala Systems

Tech-savvy business executive with track record of commercializing disruptive data center technologies.  After a short stint as an engineer at Intel after college, Taufik jumped to the business side where he led a team to define Intel’s crown jewels – CPUs & chipsets – during the ascendancy of the x86 server platform.

He honed his business skills as Co-GM of Intel’s Server System BU before leaving for a storage/networking startup.  The acquisition of this startup put him into the executive team of Emulex where as SVP of product management, he grew their networking business from scratch to deliver the industry’s first million units of 10Gb Ethernet product.

These accomplishments draw from his ability to engage and acquire customers at all stages of product maturity including partners when necessary.

56: GreyBeards talk high performance file storage with Liran Zvibel, CEO & Co-Founder, WekaIO

This month we talk high performance, cluster file systems with Liran Zvibel (@liranzvibel), CEO and Co-Founder of WekaIO, a new software defined, scale-out file system. I first heard of WekaIO when it showed up on SPEC sfs2014 with a new SWBUILD benchmark submission. They had a 60 node EC2-AWS cluster running the benchmark and achieved, at the time, the highest SWBUILD number (500) of any solution.

At the moment, WekaIO is targeting the HPC and Media & Entertainment verticals for their solution, and it is sold on an annual capacity subscription basis.

By the way, a Wekabyte is 2**100 bytes of storage, or ~1 trillion exabytes (an exabyte being 2**60 bytes).
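
For the curious, the arithmetic behind that aside:

```python
wekabyte = 2 ** 100          # bytes
exabyte = 2 ** 60            # bytes
print(wekabyte // exabyte)   # 1099511627776, i.e. ~1.1 trillion exabytes
```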

High performance file storage

The challenge with HPC file systems is that they need to handle a large number of files and large amounts of storage, with high throughput access to all this data. Where WekaIO comes into the picture is that they do all that plus support high file IOPS. That is, they can open, read or write a high number of relatively small files at an impressive speed, with low latency. Such workloads are becoming more popular with AI/machine learning and life sciences/genomic microscopy image processing.

Most file system developers will tell you that they can supply high throughput OR high file IOPS, but doing both is a real challenge. WekaIO is able to do both, while at the same time supporting billions of files per directory and trillions of files in a file system.

WekaIO has support for up to 64K cluster nodes and has tested up to 4,000 cluster nodes. WekaIO announced an OEM agreement with HPE last year and is starting to build out bigger clusters.

Media & Entertainment file storage requirements are mostly just high throughput with large (media) file sizes. Here WekaIO has more competition from other cluster file systems, but their ability to support extra-large data repositories with great throughput is another advantage.

WekaIO cluster file system

WekaIO is a software defined storage solution. Whereas many HPC cluster file systems have separate metadata and storage nodes, WekaIO's cluster nodes are combined metadata and storage nodes. So as one scales capacity (by adding nodes), one not only scales large file throughput (via more IO parallelism) but also scales small file IOPS (via more metadata processing capability). There's also some secret sauce to their metadata sharding (if that's the right word) that allows WekaIO to support more metadata activity as the cluster grows.

One secret to WekaIO's ability to support both high throughput and high file IOPS lies in their performance load balancing across the cluster. Apparently, WekaIO can be configured to constantly monitor all cluster nodes for performance and to balance all file IO activity (data transfers and metadata services) across the cluster, to ensure that no one node is overburdened with IO.

Liran says that performance load balancing was one reason they were so successful with their EC2 AWS SPEC sfs2014 SWBUILD benchmark. One problem with AWS EC2 nodes is a lot of unpredictability in node performance; when running EC2 instances, “noisy neighbors” impact node performance. With WekaIO's performance load balancing running on AWS EC2 instances, they can just redirect IO activity around slower nodes to faster nodes that can handle the work, in real time.

WekaIO performance load balancing is a configurable option. The other alternative is for WekaIO to “cryptographically” spread the workload across all the nodes in a cluster.
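
As a minimal sketch of what “cryptographically” spreading work might look like (our own illustration of hash-based placement; WekaIO hasn't disclosed its actual scheme), every client can compute the same placement with no central lookup, and a good hash spreads the load roughly evenly across nodes:

```python
import hashlib

NODES = [f"node{i:02d}" for i in range(16)]   # a 16-node cluster, for illustration

def place(extent_id: str, nodes=NODES) -> str:
    """Deterministically map a file extent ID onto a node via a cryptographic hash."""
    digest = hashlib.sha256(extent_id.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

for eid in ("file42:extent0", "file42:extent1", "file99:extent0"):
    print(eid, "->", place(eid))
```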

WekaIO uses a host driver for POSIX access to the cluster. WekaIO's frontend also natively supports (without the host driver) NFSv3, SMB3.1, HDFS and AWS S3 protocols.

WekaIO also offers configurable file system data protection that can span 100s of failure domains (racks), supporting from 4 to 16 data stripes with 2 to 4 parity stripes. Liran said this was erasure-code like, but wouldn't specifically state what they are doing differently.
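
Using the same capacity-efficiency arithmetic as in the Excelero and Pavilion discussions above, the configurable stripe widths translate into a range of protection overheads (a sketch; which data/parity combinations WekaIO actually allows wasn't stated):

```python
def efficiency(data, parity):
    """Fraction of raw capacity left usable for a data+parity stripe."""
    return data / (data + parity)

for data in (4, 8, 16):
    for parity in (2, 4):
        print(f"{data}+{parity}: {efficiency(data, parity):.0%} usable")
# Ranges from 50% usable (4+4) up to 89% usable (16+2)
```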

They also support both high performance storage and inactive storage, with automated tiering of inactive data to object storage through policy management.

WekaIO creates a global name space across the cluster, which can be sub-divided into one to thousands  of file systems.

Snapshotting, cloning & moving work

WekaIO also has file system snapshots (read-only) and clones (read-write) using a redirect-on-write methodology. After the first snapshot/clone, subsequent snapshots/clones are only differential copies.

Another feature Howard and I thought was interesting was their DR as a Service like capability. That is, using an onprem WekaIO cluster to clone a file system/directory and tiering that to an S3 storage object. Then, using that S3 storage object with an AWS EC2 WekaIO cluster, the object(s) can be imported and the file system/directory re-constituted in the cloud. Once on AWS, work can occur in the cloud and the process can be reversed to move any updates back to the onprem cluster.

This way if you had work needing more compute than available onprem, you could move the data and workload to AWS, do the work there and then move the data back down to onprem again.

WekaIO’s RtOS, network stack, & NVMeoF

WekaIO runs under Linux as a user space application. WekaIO has implemented their own  Realtime O/S (RtOS) and high performance network stack that runs in user space.

With their own network stack they have also implemented NVMeoF support for (non-RDMA) Ethernet as well as InfiniBand networks. This is probably another reason they can have such low latency file IO operations.

The podcast runs ~42 minutes. Liran has been around data storage systems for 20 years and as a result was very knowledgeable and interesting to talk with. Liran almost qualifies as a Greybeard, if not for the fact that he was clean shaven ;/. Listen to the podcast to learn more.

Liran Zvibel, CEO and Co-Founder, WekaIO

As Co-Founder and CEO, Mr. Liran Zvibel guides long term vision and strategy at WekaIO. Prior to creating the opportunity at WekaIO, he ran engineering at social startups and Fortune 100 organizations, including Fusic, where he managed product definition, design and development for a portfolio of rich social media applications.

 

Liran also held principal architectural responsibilities for the hardware platform, clustering infrastructure and overall systems integration for XIV Storage System, acquired by IBM in 2007.

Mr. Zvibel holds a BSc. in Mathematics and Computer Science from Tel Aviv University.