138: GreyBeards talk big data orchestration with Adit Madan, Dir. of Product, Alluxio

We have never talked with Alluxio before, but after coming back last week from Cloud Field Day 15 (CFD15), it seemed a good time to talk with other solution providers attempting to make hybrid cloud easier to use. Adit Madan (@madanadit) is Director of Product Management at Alluxio, a data orchestration solution that's available both as a free-to-download/use, open source community edition (apparently, Meta is a customer) and as a licensed, closed source enterprise edition.

Alluxio data orchestration is all about supplying local-like IO access to data that resides elsewhere, for BI, AI/ML/DL, and just about any other application needing to process remote data. Listen to the podcast to learn more.

Alluxio started out at UC Berkeley's AMPLab, which focused on big data problems, and was designed to provide local access to massive amounts of distributed data. Alluxio ends up constructing a locally accessible federation of data sources for compute apps running elsewhere.

Alluxio software installs near where the compute apps that need access to remote data run. We asked about a typical cloud bursting case, where S3 object data needed by an app sits on prem, but the app needs to run in a cloud, e.g., AWS.

He said Alluxio software would be deployed in AWS, close to the app compute, and that's all there is. There's no Alluxio software running on prem; Alluxio just uses normal (remote access) S3 APIs to supply data to the compute apps running in AWS.

Adit mentioned that BI was one of the main applications to take advantage of Alluxio, but AI/ML/DL is another that could use data orchestration. It turns out that AI/ML/DL training's consumption of data is repetitive and highly sequential, so caching, sequential pre-fetch, and other Alluxio techniques can work well there to provide local-like access to remote data (see the sketch below).
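To make the pre-fetch idea concrete, here's a minimal Python sketch (ours, not Alluxio code) of hiding remote-read latency behind compute: while the trainer consumes block N, a background thread fetches the next blocks. The `fetch_block` callback is a stand-in for whatever remote read the deployment actually uses.

```python
import queue
import threading

def prefetch_reader(fetch_block, num_blocks, depth=2):
    """Yield blocks 0..num_blocks-1 in order, prefetching ahead.

    fetch_block(i) stands in for a slow remote read (e.g., an S3 GET);
    a background thread keeps up to `depth` fetched blocks buffered.
    """
    buf = queue.Queue(maxsize=depth)

    def worker():
        for i in range(num_blocks):
            buf.put(fetch_block(i))    # blocks when the buffer is full

    threading.Thread(target=worker, daemon=True).start()
    for _ in range(num_blocks):
        yield buf.get()                # usually ready: fetched during compute

# Usage: for block in prefetch_reader(read_remote_block, 1000): train_on(block)
```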

Adit said that enterprises are increasingly looking to avoid vendor lock-in, and this applies equally well to the cloud. By supporting data access in one location, say GCP, and accessing that data from another, say Azure, data gravity need no longer limit where work is done.

Adit said what makes their solution so valuable is that, instead of duplicating all data from one place to another, Alluxio moves just the data required/requested by the apps running there.

Keith asked whether Adit considered Alluxio a data mesh or a data fabric. Keith had to explain the terms to me: data fabrics are the pipes and physical infrastructure/functionality that move data around, while a data mesh is what gives clients/apps/users access to that data. From that perspective, Alluxio is a data mesh.

Alluxio Caching

Adit said that caching is one of the keys to making Alluxio work. Much of the success of their solution depends on applications having a well-behaved working set. He also mentioned they use pre-fetching and other techniques to minimize access latency and maximize throughput. However, the first byte of data being accessed may take some time to get to where the compute executes.
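As a rough illustration of that read-through behavior (again our sketch, not Alluxio internals): the first access to a block pays the full remote latency, while repeat accesses within the cache's capacity are served at local speed.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Tiny LRU read-through cache: cold misses go to remote storage."""
    def __init__(self, fetch_remote, capacity):
        self.fetch_remote = fetch_remote   # slow path, e.g. a remote S3 GET
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)     # warm hit: local speed
            return self.cache[block_id]
        data = self.fetch_remote(block_id)       # cold miss: remote latency
        self.cache[block_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)       # evict least recently used
        return data
```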

Adit said it's not unusual for them to have a half PB of cache (storage) for an application with multiple PBs of source data.

Keith asked how Alluxio's performance can be managed. Adit said they (we assume the enterprise edition) have a solution called Cache Insights, which uses Alluxio's extensive access pattern history to predict application IO performance with larger cache (storage), higher speed networking, higher performing/more compute cores, etc. In this way, customers can see what can be done to improve application IO performance and what it would cost.

Keith asked if Alluxio were available as a SaaS solution. Adit said, although it could be deployed in that fashion, it's not currently a SaaS solution. When asked how Alluxio (enterprise) was priced, Adit said it's a function of the total resources consumed by their service, i.e., storage (cache), cores, networking that runs Alluxio software, etc.

As for deployment options, it turns out that for Spark, Alluxio is just another lib package installed inside Spark. For K8s, Alluxio is installed as a CSI driver and a set of containers, and can be deployed as containers within a cluster that needs access to data or in an external, standalone K8s cluster servicing IO from other clusters. Alluxio HA is supplied by using multiple nodes to provide IO access.

Alluxio also supports access to multiple data locations. In this case, the applications would just access different mount points.

Data reads are easy; writes can be harder due to data integrity issues. As such, supplying IO performance becomes a trade-off against data integrity when data updates are supported. Adit said Alluxio offers a couple of different configuration options for write concurrency (data integrity) that customers can select from. We assume this includes write-through, write-back, and perhaps other write consistency options (sketched below).
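We don't know Alluxio's exact options, but the classic trade-off looks roughly like this sketch: write-through acknowledges only after the remote store has the data (integrity first), while write-back acknowledges once the local copy lands and destages later (performance first).

```python
def write_through(cache, remote, key, data):
    remote.put(key, data)     # durable first: remote store has the data
    cache[key] = data         # then update the local copy
    # ack to the app only now -> integrity over latency

def write_back(cache, remote, dirty, key, data):
    cache[key] = data         # ack as soon as the local copy lands
    dirty.add(key)            # remember to destage later
    # a background flusher eventually does: remote.put(k, cache[k])
```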

Alluxio supports AWS, Azure, and GCP cloud compute accessing data residing at remote sites via HDFS, S3, and POSIX protocols. At remote sites, they currently support MinIO, Cloudian, and any other S3-compatible storage solution, as well as NetApp (ONTAP) and Dell (ECS) storage, as data sources.

Adit Madan, Director of Product, Alluxio

Adit Madan is the Director of Product Management at Alluxio. Adit has extensive experience in distributed systems, storage systems, and large-scale data analytics.

Adit holds an MS from Carnegie Mellon University and a BS from the Indian Institute of Technology – Delhi.

Adit is also a core maintainer and Project Management Committee (PMC) member of the Alluxio Open Source project.

115-GreyBeards talk database acceleration with Moshe Twitto, CTO & Co-founder, Pliops

We seem to be on a computational tangent this year. So we thought it best to talk with Moshe Twitto, CTO and Co-Founder at Pliops (@pliopsltd). We had first seen them at SFD21 (see videos of their sessions here), and their talk on how they could speed up database IO was pretty impressive. Essentially, they have a database/storage accelerator board used to increase block store IO activity to NVMe SSDs, which also provides a key-value store IO accelerator.

Moshe was very knowledgeable about the technology and had previously worked at Samsung in their SSD group. He knew a lot about what happens underneath the covers of an SSD and what it takes to speed up IO. It turns out that many in-memory databases use persistent key-value stores to persist data or to operate in non- (or partial-) memory mode. Listen to the podcast to learn more.

The Pliops board plugs into the PCIe bus and accelerates IO to NVMe SSDs connected to the bus, or can act to accelerate IO to a JBoF that's networked behind it. Their board uses FPGA(s), NVDIMMs of their own design, and DRAM to accelerate database IO using NVMe SSDs.

Pliops operates in one of two modes, as a Key-Value store or as a Block store. Their Key-Value store takes advantage of block store capabilities, so we start there.

In block mode, Pliops provides inline hardware data compression and encryption. Compression requires support for variable-length blocks on backend SSDs. To better support this, they pack multiple compressed blocks into physical blocks. They also use a virtualization service to map host LBAs to physical block addresses (using an internal key-value store). Hardware inline encryption is also provided on a LUN (or namespace) basis. This could enable each database to have its own key. They have a root-of-trust secret key used to encrypt customer namespace (database) keys. A sketch of the packing idea follows.
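Here's a much-simplified sketch of that packing idea (layout and names are our assumptions, not Pliops internals): compress each host block, pack the variable-length results into fixed-size physical blocks, and record where each LBA landed in a map an internal key-value store could hold.

```python
import zlib

PHYS_BLOCK = 4096  # fixed-size physical block on the SSD

def pack_blocks(host_blocks):
    """Compress host blocks and pack them into fixed physical blocks.

    Returns an LBA -> (physical block #, offset, length) mapping.
    Assumes each block compresses to less than PHYS_BLOCK bytes.
    """
    mapping, phys, cur = {}, [], b""
    for lba, data in host_blocks.items():
        c = zlib.compress(data)
        if len(cur) + len(c) > PHYS_BLOCK:           # physical block is full
            phys.append(cur.ljust(PHYS_BLOCK, b"\0"))
            cur = b""
        mapping[lba] = (len(phys), len(cur), len(c))
        cur += c
    if cur:
        phys.append(cur.ljust(PHYS_BLOCK, b"\0"))
    return mapping, phys

def read_lba(mapping, phys, lba):
    blk, off, ln = mapping[lba]
    return zlib.decompress(phys[blk][off:off + ln])
```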

They also optimize the physical block layout on the SSD to reduce write amplification (doing more than one write to the NAND for every host write to the SSD).

Block mode also supports smart caching. This is especially useful for database journaling/logging, which reuses a portion of the LBA address space (blocks) as a revolving journal/log. These blocks are overwritten with new data often, and data written to them need not be destaged to NVMe SSDs as long as it can be maintained in NVDIMM storage. At some point it gets destaged, but probably only when log activity slows down (if ever) or some timeout occurs.
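A toy model of that behavior (our guess at the mechanics, not Pliops' code): overwrites to hot journal LBAs are absorbed in fast NVDIMM memory, and a block is destaged to the SSD only once it has gone quiet.

```python
import time

class JournalCache:
    """Absorb repeated overwrites in fast memory; destage only idle blocks."""
    def __init__(self, ssd_write, idle_secs=5.0):
        self.ssd_write = ssd_write   # callback that performs the real SSD write
        self.idle_secs = idle_secs
        self.nvdimm = {}             # lba -> (data, last_write_time)

    def write(self, lba, data):
        self.nvdimm[lba] = (data, time.monotonic())   # no SSD write yet

    def destage_idle(self):
        """Called periodically: flush only blocks the log has wrapped past."""
        now = time.monotonic()
        for lba, (data, ts) in list(self.nvdimm.items()):
            if now - ts > self.idle_secs:
                self.ssd_write(lba, data)
                del self.nvdimm[lba]
```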

For their key-value storage accelerator, they have implemented an API that's similar to RocksDB, a persistent key-value store which is used as a physical storage backend for Redis and similar in-memory databases. However, the challenge with RocksDB is that there are lots of tuning knobs/parameters, so getting it right takes some work. But all this can be avoided just by using Pliops.

We didn't talk too much about how their key-value store works. Moshe says they optimize the key structures and key data so that all database keys can be retained in their board's memory, and just by doing that, they can have immediate (1 IO) access to any data block pointed to by those keys.
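In outline (our sketch, not Pliops' implementation), keeping the whole key index in memory means a get costs exactly one device read. The `device` object here is a hypothetical append-only store supporting `append(bytes) -> offset` and `read(offset, length)`.

```python
class OneIOKVStore:
    """All keys held in RAM; each value lookup costs a single device read."""
    def __init__(self, device):
        self.device = device       # hypothetical: append(b)->offset, read(off, len)
        self.index = {}            # key -> (offset, length), kept in memory

    def put(self, key, value):
        offset = self.device.append(value)       # one sequential device write
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]         # RAM lookup, no IO
        return self.device.read(offset, length)  # the single IO
```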

He did mention that a database getting 10-25% host cache hit rates using their board sees roughly the same performance as that same database would with an 80-90% host cache hit rate not using their board. Some of this was shown at SFD21 (so check out the videos above for more performance info).

They bring a couple of other advantages to the table. As they are interposed between the host and the NVMe SSDs, they can take advantage of their NVDIMMs and memory to write much wider stripes than the host writes. This allows them to reduce SSD read and write amplification (due to less garbage collection) by writing more full NAND pages. All this also reduces physical host (data) writes/day, which can significantly improve SSD endurance.
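A back-of-the-envelope way to see the endurance effect (illustrative numbers, not Pliops measurements): write amplification is total NAND bytes written per host byte written, so less garbage-collection rewriting drops it sharply.

```python
def write_amplification(host_writes, gc_rewrites):
    """WA = total NAND bytes written per host byte written."""
    return (host_writes + gc_rewrites) / host_writes

# Small random writes forcing heavy garbage collection vs. wide full stripes:
print(write_amplification(1.0, 2.0))   # 3.0x: NAND wears 3x faster than ideal
print(write_amplification(1.0, 0.2))   # 1.2x: close to ideal endurance
```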

Somewhere in all that smart caching and data compression, they are also able to decrease response times. It turns out that databases that don't use RocksDB or depend on key-value stores can easily take advantage of all their block store functionality to improve IO performance.

They mostly market their product to hyperscalers and superscalers. His definition of superscalers was any organization that operates at public cloud levels but is not a public cloud (e.g., big social media companies).

Moshe Twitto, CTO & Co-founder, Pliops

Moshe is an expert in advanced data management and coding algorithms. Prior to co-founding Pliops, Moshe served as CTO of Samsung’s SSD Controller Development Center in Israel.

Moshe holds MSEE and BSEE degrees from the Technion, Summa Cum Laude, and served in the Unit 8200 Intelligence Division of the Israel Defense Forces.

GreyBeards deconstruct storage with Brian Biles and Hugo Patterson, CEO and CTO, Datrium

In this, our 32nd episode, we talk with Brian Biles (@BrianBiles), CEO & Co-founder, and Hugo Patterson, CTO & Co-founder, of Datrium, a new storage startup. We like to call it storage deconstructed, a new view of what storage could be based on today's and future storage technologies. If I had to describe it succinctly, I would say it's a hybrid between software defined storage, server side flash, and external disk storage. We have discussed server side flash before, but this takes it to a whole other level.

Their product, the DVX, consists of Hyperdriver host software and a NetShelf external disk storage unit. The DVX was designed from the ground up with the use of host/server side flash or non-volatile memory as a given, and everything else was built around that. I hesitate to say this, but the DVX NetShelf backend storage is pretty unintelligent, just dual controller disk storage with a multi-task coordinator. In contrast, the DVX Hyperdriver host software used to access their storage system is pretty smart and is installed as a VIB in vSphere. Customers can assign up to 8TB of host-based, server side flash/non-volatile memory to the storage system per server. The Datrium DVX does the rest.

The Hyperdriver leverages host flash, DRAM, and compute cores to act as a caching layer for read and write IO and as a data management engine. Write data is written through straight from the server side flash to the NetShelf storage system, which has non-volatile DRAM (NVRAM) caching. Once write data is in NetShelf cache, it's in two places, one on the host's server side flash and the other in storage NVRAM. Reads are easier to handle, just being cached from the NetShelf storage in the server side flash. There's no unique data residing in the hosts.

The Hyperdriver looks like an NFS mount to vSphere, and the DVX uses a proprietary protocol to talk with the backend DVX NetShelf. Datrium supports up to 32 hosts, and you can define the amount of flash, DRAM, and host compute allocated to DVX Hyperdriver activity.

But the other interesting part about the DVX is that much of the storage management functionality and storage control logic is partitioned between the host Hyperdriver and the NetShelf, with both participating to do what they do best.

For example, disk rebuilds are done in combination with the host Hyperdriver. A DVX RAID rebuild brings data from the backend into host cache, computes rebuild data, and writes the reconstructed data back out to the NetShelf backend. This way, rebuild performance can scale up with the number of hosts active in a cluster.
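A sketch of why that scales (our illustration, not Datrium code): each host reconstructs a disjoint share of the stripes, so rebuild throughput grows with host count. Threads stand in for hosts here, and the XOR reconstruction assumes a RAID-5-style parity layout.

```python
from concurrent.futures import ThreadPoolExecutor

def rebuild_stripe(stripe_id, read_surviving, write_rebuilt):
    """XOR-reconstruct one stripe's missing chunk from the survivors."""
    chunks = read_surviving(stripe_id)        # pull survivors into host cache
    missing = bytes(len(chunks[0]))           # start from all zeros
    for c in chunks:
        missing = bytes(a ^ b for a, b in zip(missing, c))
    write_rebuilt(stripe_id, missing)         # push result back to the backend

def cluster_rebuild(stripes, hosts, read_surviving, write_rebuilt):
    # Each stripe is handled by one of `hosts` workers in parallel.
    with ThreadPoolExecutor(max_workers=hosts) as pool:
        for s in stripes:
            pool.submit(rebuild_stripe, s, read_surviving, write_rebuilt)
```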

DVX data are compressed and deduplicated at the host before being sent to the NetShelf. The NetShelf backend also does a global deduplication on the host data. Hashing computations and data compression activities are all done on the host and passed on to the NetShelf. Brian and Hugo were formerly with EMC Data Domain and know all about data deduplication.
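In miniature (our sketch; Data Domain-style dedup is far more sophisticated), the pipeline looks like: fingerprint each chunk on the host, compress and ship only chunks the backend hasn't seen, and let the backend keep one copy per unique hash.

```python
import hashlib
import zlib

def host_send(chunks, backend_has, backend_store):
    """Host side: fingerprint, then ship only unique, compressed chunks."""
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()   # hashing done on the host
        if not backend_has(fp):                  # global dedup check
            backend_store(fp, zlib.compress(chunk))
        # duplicates: only the fingerprint is referenced, no data shipped

# A trivial in-memory "backend" for illustration:
store = {}
host_send([b"A" * 4096, b"B" * 4096, b"A" * 4096],
          backend_has=store.__contains__,
          backend_store=store.__setitem__)
print(len(store))   # 2 unique chunks stored for 3 written
```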

At the moment, the DVX is missing some storage functionality, but they have an extensive roadmap with engineering resources to match and are plugging away at all of it. On the other hand, very few disk storage devices offer deduped/compressed data storage and warm server side caches during vMotion. They also support QoS functionality to limit the amount of host resources consumed by DVX Hyperdriver software.

The podcast runs ~41 minutes, and the episode covers a lot of ground about how the new DVX product came about, how they separated storage functionality between host and backend, and other aspects of DVX storage. Listen to the podcast to learn more.

Brian Biles, Datrium CEO & Co-founder

Prior to Datrium, Brian was Founder and VP of Product Mgmt. at EMC Backup Recovery Systems Division. Prior to that he was Founder, VP of Product Mgmt. and Business Development for Data Domain (acquired by EMC in 2009).

Hugo Patterson, Datrium CTO & Co-founder

Prior to Datrium, Hugo was an EMC Fellow serving as CTO of the EMC Backup Recovery Systems Division, and the Chief Architect and CTO of Data Domain (acquired by EMC in 2009), where he built the first deduplication storage system. Prior to that he was the engineering lead at NetApp, developing SnapVault, the first snap-and-replicate disk-based backup product. Hugo has a Ph.D. from Carnegie Mellon.

 

GreyBeards talk server DRAM IO caching with Peter Smith, Director, Product Management at Infinio

Welcome to our sixth episode. We once again dive into the technical end of the pool with an in-depth discussion of DRAM-based server side caching with Peter Smith, Director of Product Management at Infinio. Unlike PernixData (check out Episode 2, with Satyam Vaghani, CTO, PernixData) and others in the server side caching business, Infinio supplies VMware server side storage caching using DRAM for NFS VMDKs. It got a bit technical fairly fast in the podcast, sorry about that.

This month's podcast comes in at a little over 40 minutes and was recorded on 20 February 2014. The overall sound quality is much better than Episode 5, but we are still working out some of the kinks, so bear with us.

Peter comes from a number of different IT infrastructure and co-location service roles and brings a wealth of knowledge about IO caching within a VMware server environment. With all the DRAM supplied in ESX servers these days and the increasing compute power that's now available, the time seems ripe to implement a deduplicated DRAM cache for VMware IO.

Infinio clusters together segments of ESX DRAM across nodes in a VMware cluster to supply an IO cache. The software installs across the VMware cluster non-disruptively (roughly equivalent to a vMotion), and Infinio clusters can be expanded without operational impact.
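One plausible way to pool DRAM across nodes (our sketch, not Infinio's actual design) is to hash each block to an owner node, so each node's DRAM segment holds a distinct slice of the shared cache. A production design would likely use consistent hashing so that expanding the cluster doesn't reshuffle the whole cache.

```python
import hashlib

def owner_node(block_id, nodes):
    """Deterministically map a cached block to one ESX node's DRAM segment."""
    h = int(hashlib.sha1(block_id.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["esx1", "esx2", "esx3"]
print(owner_node("vmdk42:block1007", nodes))   # same answer from any node
```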

There was some discussion of the odds of a (random) SHA-1 hash collision happening in our lifetimes (as Greybeards, our lifetimes may be shorter than yours). I tried to get Peter or Howard to give me commensurate odds on this happening, but alas, no takers.
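For the curious, the birthday bound makes those odds easy to estimate: with n random inputs and a k-bit hash, the collision probability is roughly n²/2^(k+1). A quick, illustrative calculation for SHA-1's 160 bits:

```python
def collision_probability(n_hashes, hash_bits=160):
    # Birthday approximation: p ~= n^2 / 2^(bits+1)
    return n_hashes**2 / 2**(hash_bits + 1)

# An exabyte of unique 4KB blocks is ~2.5e14 hashes:
print(collision_probability(2.5e14))   # ~2e-20: not in anyone's lifetime
```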

Listen to the podcast to learn more…

Peter Smith

Director of Product Management. Peter brings more than 10 years of expertise as an infrastructure architect and IT operations director. In previous companies such as Harvard Business School and Endeca Technologies, Peter managed full-service datacenters and colocation spaces. Most recently Peter led infrastructure services for Endeca, and has also directed operations of customer-hosting infrastructure for clients including American Express, Fidelity UK, Bank of America, and Nike.

GreyBeards discuss server side flash with Satyam Vaghani, Co-Founder & CTO PernixData

Episode 2: Server Side Flash Software

Welcome to our second GreyBeards on Storage (GBoS) podcast. The podcast was recorded on October 29, 2013, when Howard and Ray talked with Satyam Vaghani, Co-Founder & CTO of PernixData, a scale-out, server side flash caching solution provider.

Well, this month's podcast runs about 40 minutes. Howard and I had an interesting and humorous discussion with Satyam Vaghani, who was a leading architect of VMware's storage and file system stack and is currently Co-Founder/CTO of PernixData, which offers a server side caching solution. Unlike other solutions in this space, PernixData offers an advanced write-back caching and scale-out clustering solution for server side flash for data used by VMs running under vSphere. We discuss how they provide these features, and why, in our second podcast episode. Although why people are running analytics in a VM using server side flash is still a mystery …

 

Satyam Vaghani Bio

Satyam Vaghani, Co-founder and CTO, PernixData
Satyam Vaghani is Co-Founder and CTO at PernixData, a company that leverages server flash to enable scale-out storage performance that is independent of capacity. Earlier, he was VMware's Principal Engineer and Storage CTO, where he spent 10 years delivering fundamental storage innovations in vSphere. He is best known for creating the Virtual Machine File System (VMFS), which set a storage standard for server virtualization. He has authored 50+ patents, led industry-wide changes in storage systems and standards via VAAI, and has been a regular speaker at VMworld and other industry and academic forums. Satyam received his Masters in CS from Stanford and Bachelors in CS from BITS Pilani.