Big data – Grey Beards on Systems

November 4, 2024November 4, 2024

167: GreyBeards talk Distributed S3 storage with Enrico Signoretti, VP Product & Partnerships, Cubbit

Long time friend, Enrico Signoretti (LinkedIn), VP Product and Partnerships, Cubbit, used to be a common participant at Storage Field Day (SFD) events and I’ve known him since we first met there. Since then, he’s worked for a startup and a prominent analyst firms. But he’s back at another startup and this one looks like it’s got legs.

Cubbit offers Distributed S3 compatible object storage that offers geo-distribution and geo-fencing for object data, in which the organization owns the hardware and Cubbit supplies the software. There’s a management component, the Coordinator, which could run on your hardware or as a SaaS service they provide but other than that, IT controls the rest of the system hardware. Listen to the podcast to learn more.

Podcast: Play in new window | Download (Duration: 52:27 — 72.0MB) | Embed

Subscribe: Apple Podcasts | Spotify | RSS

Cubbit comes in 3 components:

One or more Storage nodes which includes their agent software running ontop of a linux system with direct attached storage.
One or more Gateway nodes which provides S3 protocol acces to the objects stored on storage nodes. Typical S3 access points https://S3.company_name, com/… points to either a load balancer, front end or one or more Gateway nodes. Gateway nodes provide the mapping between the bucket name/object identifier and where the data currently resides or will reside.
One Coordinator node which provides the metadata to locate the data for objects, manage the storage nodes, gateways and monitor the service. The Coordinator node can be a SaaS service supplied by Cubbit or a VM/bare metal node running Cubbit Coordinator software. Metadata is protected internally within the Coordinator node.

With these three components one can stand up a complete, geo-distributed/geo-fenced, S3 object storage system which the organization controls.

Cubbit encrypts data as it at the gateway and decrypts data when accessed. Sign-on to the system uses standard security offerings. Security keys can be managed by Cubbit or by standard key management systems.

All data for an object is protected by nested erasure codes. That is 1) erasure code within a data center/location over its storage drives and 2) erasure code across geographical locations/data centers..

With erasure coding across locations, customer with say 10 data center locations can have their data stored in such a fashion that as long as at least 8 data centers are online they still have access to their data, that is the Cubbit storage system can still provide data availability.

Similarly for erasure coding within the data center/location or across storage drives, say with 12 drives per stripe, one could configure lets say 9+3 erasure coding, where as long as 9 of the drives still operate, data will be available.

Please note the customer decides the number of locations to stripe across for erasure coding, and diet for the number of storage drives.

The customer supplies all the storage node hardware. Some customers start with re-purposed servers/drives for their original configuration and then upgrade to higher performing storage-servers-networking as performance needs change. Storage nodes can be on prem, in the cloud or at the edge.

For adequate performance gateways and storage nodes (and coordinator nodes) should be located close to one another. Although Coordinator nodes are not in the data path they are critical to initial object access.

Gateways can provide a cache for faster local data access.. Cubbit has recommendations for Gateway server hardware. And similar to storage nodes, Gateways can operate at the edge, in the cloud or on prem.

Use cases for the Distributed S3 storage include:

As a backup target for data elsewhere
As a geographically distributed/fenced object store.
As a locally controlled object storage to feed AI training/inferencing activity.

Most backup solutions support S3 object storage as a target for backups.

Geographically distributed S3 storage means that customers control where object data is located. This could be split across a number of physical locations, the cloud or at the edge.

Geographically fenced S3 storage means that the customer controls which of its many locations to store an object. For GDPR countries with multi-nation data center locations this could provide the compliance requirements to keep customer data within country.

Cubbit’s distributed S3 objects storage is strongly consistent in that an object loaded into the system at any location is immediately available to any user accessing it through any other gateway. Access times vary but the data will be the same regardless of where you access it from.

The system starts up through an Ansible playbook which asks a bunch of questions and loads and sets up the agent software for storage nodes, gateway nodes and where applicable, the coordinator node.

At any time, customers can add more gateways or storage nodes or retire them. The system doesn’t perform automatic load balancing for new nodes but customers can migrate data off storage nodes and onto other ones through api calls/UI requests to the Coordinator.

Cubbit storage supports multi-tenancy, so MSPs can offer their customers isolated access.

Cubbit charges for their service on data storage under management. Note it has no egress charges, and you don’t pay for redundancy. But you do supply all the hardware used by the system. They offer a discount for M&E customers as the metadata to data ratio is much smaller (lots of large files) than most other S3 object stores (mix of small and large files).

Cubbit is presently available only in Europe but will be coming to USA next year. So, if you are interested in geo-distributed/geo-fenced S3 object storage that you control and can be had for much cheaper than hyperscalar object storage, check it out.

Enrico Signoretti, VP Products & Partnerships

Enrico Signoretti has over 30 years of experience in the IT industry, having held various roles including IT manager, consultant, head of product strategy, IT analyst, and advisor.

He is an internationally renowned visionary author, blogger, and speaker on next-generation technologies. Over the past four years, Enrico has kept his finger on the pulse of the evolving storage industry as the Head of Research Product Strategy at GigaOm. He has worked closely and built relationships with top visionaries, CTOs, and IT decision makers worldwide.

Enrico has also contributed to leading global online sites (with over 40 million readers) for enterprise technology news.

January 19, 2024January 18, 2024

161: Greybeards talk AWS S3 storage with Andy Warfield, VP Distinguished Engineer, Amazon

We talked with Andy Warfield (@AndyWarfield), VP Distinguished Engineer, Amazon, about 10 years ago, when at Coho Data (see our (005:) Greybeards talk scale out storage … podcast). Andy has been a good friend for a long time and he’s been with Amazon S3 for over 5 years now. Since the recent S3 announcements at AWS Re:Invent, we thought it a good time to have him back on the show. Andy has a great knack for explaining technology, I suppose that comes from his time as a professor but whatever the reason, he was great to have on the show again.

Lately, Andy’s been working on S3 Express, One Zone storage, announced last November, a new version of S3 object storage with lower response time. We talked about this later in the podcast but first we touched on S3’s history and other advances. S3 and its ancillary services have advanced considerably over the years. Listen to the podcast to learn more

Podcast: Play in new window | Download (Duration: 47:52 — 65.7MB) | Embed

Subscribe: Apple Podcasts | Spotify | RSS

S3 is ~18 years old now and was one of the first AWS offerings. It was originally intended to be the internet’s file system which is why it was based on HTTP protocols.

Andy said that S3 was designed for 11-9s durability and high availability options. AWS constantly monitors server and storage failures/performance to insure that they can maintain this level of durability. The problem with durability is that when a drive/server goes down, the data needs to be rebuilt onto another drive before another drive fails. One way to do this is to have more replicas of the data. Another way is to speed up rebuild times. I’m sure AWS does both.

S3 high availability requires replicas across availability zones (AZ). AWS availability zone data centers are carefully located so that they are power-networking isolated from others data centers in the region. Further, AZ site locations are deliberately selected with an eye towards ensuring they are not susceptible to similar physical disasters.

Andy discussed other AWS file data services such as their FSx systems (Amazon FSx for Lustre, for OpenZFS, for Windows File Server, & for NetApp ONTAP) as well as Elastic File System (EFS). Andy said they sped up one of these FSx services by 3-5X over the last year.

Andy mentioned one of the guiding principles for lot of AWS storage is to try to eliminate any hard decisions for enterprise developers. By offering FSx files, S3 objects and their other storage and data services, customers already using similar systems in house can just migrate apps to AWS without having to modify code.

Andy said one thing that struck him as he came on the S3 team was the careful deliberation that occurred whenever they considered S3 API changes. He said the team is focused on the long term future of S3 and any API changes go through a long and deliberate review before implementation.

One workload that drove early S3 adoption was data analytics. Hadoop and BigTable have significant data requirements. Early on, someone wrote an HDFS interface to S3 and over time lots of data analytics activity moved to S3 object hosted data.

Databases have also changed over the last decade or so. Keith mentioned that many customers are foregoing traditional data bases to use open source database solutions with S3 as their backend storage. It turns out that Open Table Format database offerings such as Apache Iceberg, Apache Hudi and Delta Lake are all available on AWS use S3 objects as their storage

We talked a bit about Lambda Server-less processing triggered by S3 objects. This was a new paradigm for computing when it came out and many customers have adopted Lambda to reduce cloud compute spend.

Recently Amazon introduced a file system M ount point for S 3 storage. Customers can now use an NFS mount point to access any S3 bucket.

Amazon also supports the Registry for Open Data, which holds just about every canonical data set (stored as S3 objects) used for AI training.

In the last ReInvent, Amazon announced S3 Express One Zone which is a high performance, low latency version of S3 storage. The goal for S3 express was to get latency down from 40-60 msec to less than 10 sec.

They ended up making a number of changes to S3 such as:

Redesigned/redeveloped some S3 micro services to reduce latency
Restricted S3 Express storage to a single zone reducing replication requirements, but maintained 11-9s durability
Used higher performing storage
Re-designed S3 API to move some authentication/verification to the beginning of object access from every object access call.

Somewhere during our talk Andy said that, in aggregate, S3 is providing 100TBytes/sec of data bandwidth. How’s that for a scale out storage.

Andy Warfield, VP Distinguished Engineer, Amazon

Andy is a Vice President and Distinguished Engineer in Amazon Web Services. He focusses primarily on data storage and analytics.

Andy holds a PhD from the University of Cambridge, where he was one of the authors of the Xen hypervisor. Xen is an open source hypervisor that was used as the initial virtualization layer in AWS, among multiple other early cloud companies. Andy was a founder at Xensource, a startup based on Xen that was subsequently acquired by Citrix Systems for $500M. Following XenSource,

Andy was a professor at the University of British Columbia (UBC), where he was awarded a Canada Research Chair, and a Sloan Research Fellowship. As a professor, Andy did systems research in areas including operating systems, networking, security, and storage.

Andy’s second startup, Coho Data, was a scale-out enterprise storage array that integrated NVMe SSDs with programmable networks. It raised over 80M in funding from VCs including Andreessen Horowitz, Intel Capital, and Ignition Partners.

October 13, 2022October 13, 2022

138: GreyBeards talk big data orchestration with Adit Madan, Dir. of Product, Alluxio

We have never talked with Alluxio before but after coming back last week from Cloud Field Day 15 (CFD15) it seemed a good time to talk with other solution providers attempting to make hybrid cloud easier to use. Adit Madan (@madanadit) , Director of Product Management, Alluxio, which is a data orchestration solution that’s available in both a free to download/use, open source, community edition (apparently, Meta is a customer ) or a licensed, closed source, enterprise edition.

Alluxio data orchestration is all about suppling local like, IO access to data that resides elsewhere for BI, AI/ML/DL, and just about any other application needing to process data residing elsewhere. Listen to the podcast to learn more

Podcast: Play in new window | Download (Duration: 50:21 — 69.2MB) | Embed

Subscribe: Apple Podcasts | Spotify | RSS

Alluxio started out at UC Berkeley’s AMPlab, which is focused on big data problems and was designed to provide local access to massive amounts of distributed data. Alluxio ends up constructing a locally accessible, federation of data sources for compute apps running elsewhere,

Alluxio software installs near where compute apps run that need access to remote data. We asked about a typical cloud bursting case where S3 object data needed by an app are sitting on prem, but the apps need to run in a cloud, e.g., AWS.

He said Alluxio software would be deployed in AWS, close to app compute and that’s all there is. There’s no Alluxio software running on prem, as Alluxio just uses normal (remote access) S3 APIs to supply data to the compute apps running in AWS.

Adit mentioned that BI was one of the main applications to take advantage of Alluxio, but AI/ML/DL learning is another that could use data orchestration. It turns out that AI/ ML/DL training’s consumption of data is repetitive and highly sequential, so caching, sequential pre-fetch and other Alluxio techniques can work well there to provide local-like access to remote data.

Adit said that enterprises are increasingly looking to avoid vendor lock-in and this applies equally well to the cloud. By supporting data access in one location, say GC,P and accessing that data from another, say Azure, data gravity need no longer limit where work is done.

Adit said what makes their solution so valuable is that instead of duplicating all data from one place to another all that Alluxio moves is just the data required/requested by the apps running there.

Keith asked whether Adit considered Alluxio a data mesh or data fabric. Keith had to explain the terms to me and said data fabrics are pipes and physical infrastructure/functionality that moves data around and data mesh is what gives clients/apps/users access to that data. From that perspective Alluxio is a data mesh.

Alluxio Caching

Adit said that caching is one of the keys to making Alluxio work. Much of the success of their solution depends on applications having a well behaved working set. He also mentioned they use pre-fetching and other techniques to minimize access latency and maximize throughput. However, the first byte of data being accessed may take some time to get to where compute executes.

Adit said it’s not unusual for them to have a 1/2PB of cache (storage) for an application with multiPBs of source data.

Keith asked how Alluxio’s performance can be managed. Adit said they (we assume enterprise edition) have a solution called Cache Insights which uses Alluxio’s extensive access pattern history to predict application IO performance with larger cache (storage), higher speed networking, higher performing/more compute cores, etc. In this way, customers can see what can be done to improve application IO performance and what it would cost.

Keith asked if Alluxio were available as a SaaS solution. Adit said, although it could be deployed in that fashion, it’s not currently a SaaS solution. When asked how Alluxio (enterprise) was priced, Adit said it’s a function of the total resources consumed by their service, i.e, storage (cache), cores, networking that runs Alluxio software etc.

As for deployment options, it turns out for Spark, Alluxio is just another lib package installed inside Spark. For K8s, Alluxio is installed as a CSI drivers and a set of containers and can be deployed as containers within a cluster that needs access to data or in an external, standalone K8s cluster, servicing IO from other clusters. Alluxio HA is supplied by using multiple nodes to provide IO access.

Alluxio also supports access to multiple data locations. In this case, the applications would just access different mount points.

Data reads are easy, writes can be harder due to data integrity issues. As such, trying to supply IO performance becomes a trade off for data integrity when data updates are supported. Adit said Alluxio offers a couple of different configuration options for write concurrency (data integrity) that customers can select from. We assume this includes write through, write back and perhaps other write consistency options.

Alluxio supports AWS, Azure and GCP cloud compute accessing HDFS, S3 and Posix protocol access to data residing at remote sites. At remote sites, they currently support MinIO, Cloudian and any other S3 compatible storage solutions as well as NetApp (ONTAP) and Dell (ECS) storage as data sources.

Adit Madan, Director of Product, Alluxio

Adit Madan is the Director of Product Management at Alluxio. Adit has extensive experience in distributed systems, storage systems, and large-scale data analytics.

Adit holds an MS from Carnegie Mellon University and a BS from the Indian Institute of Technology – Delhi.

Adit is the Director of Product Management at Alluxio and is also a core maintainer and Project Management Committee (PMC) member of the Alluxio Open Source project.

June 10, 2022June 10, 2022

133: GreyBeards talk trillion row databases/data lakes with Ocient CEO & Co-founder, Chris Gladwin

We saw a recent article in Blocks and Files (Storage facing trillion-row db apocalypse), about a couple of companies which were trying to deal with trillion row database queries without taking weeks to respond. One of those companies was Ocient (@Ocient), a Chicago startup, whose CEO and Co-Founder, Chris Gladwin, was an old friend from CleverSafe (now IBM Cloud Object Storage).

Chris and team have been busy creating a new way to perform data analytics on massive data lakes. It’s has a lot to do with extreme parallelism, high core counts, NVMe SSDs, and sophisticated network and compute flow control. Listen to the podcast to learn more.

Podcast: Play in new window | Download (Duration: 43:44 — 60.1MB) | Embed

Subscribe: Apple Podcasts | Spotify | RSS

The key to Ocient’s approach involves NVMe SSDs which have become ubiquitous over the last couple of years which can be deployed to deal with large data problems. Another key to Ocient is multi-core CPUs, which again seem everwhere and if anything, are almost doubling with every new generation of CPU chip.

We let Chris wax a little too long on the SSD revolution in IOPs, especially as pertains too random 4K reads. Put a 20 or so NVMe SSDs in a server with dual 50 core CPU chips and you have one fast random IO machine.

Another key to Ocient is very sophisticated network and bus data flow management. With all this data running any query on it, involves consuming lots of data that all has to be brought into the CPU. PCIe bandwidth helps, as does NVMe SSDs, but you still need to insure that nothing gets bottlenecked moving all that data around a system/server.

Yet another key to Ocient is parallelism. With one 20 NVMe SSD server and 2-50 core CPUs you’ve got a lot of capability but when you are talking about trillion row databases you need more. So in order to respond to queries in anything a second or so, they throw a lot of NVMe servers at the problem.

I asked how they split the data across all these servers and Chris mentioned that at the moment that’s part of their secret sauce and involves professional services.

Ocient supports full ANSI SQL queries against trillion row databases and replies to those queries in a matter of seconds. And we aren’t just talking about SQL selects, Ocient can do splits, joins and updates to this trillion row database at the same time as the SQL select are going on. Chris mentioned that Ocient can be loading 100K JSON files each second, while still performing SQL queries in near real time against the trillion row database.

Ocient supports Reed-Solomon error correction on database data as well as data compression and encryption.

In addition to SQL queries, Chris mentioned that Ocient supports data load and transform activities. He said that most of this data is being generated from IoT applications and often needs to be cleaned up before it can be processed. Doing this in real time, while handling queries to the database is part of their secret sauce.

Chris said there’s probably not that many organizations that have need for trillion row databases. But ad auctions, telecom routers, financial services already use trillion row databases and they all want to be able to process queries faster on this data. Ocient is betting that there will be plenty more like this over time.

Ocient is available on AWS and GCP as a cloud service, can also be used operating in their own Ocient Cloud or can be deployed on premises. Ocient services are billed on a per core pack (500 cores, I think) subscription model.

Chris Gladwin, CEO and Co-founder, Ocient

Chris is the CEO and Co-Founder of Ocient whose mission is to provide the leading platform the world uses to transform, store, and analyze its largest datasets.

In 2004, Chris founded Cleversafe which became the largest and most strategic object storage vendor in the world (according to IDC.) He raised $100M and then led the company to over a $1.3B exit in 2015 when IBM acquired the company. The technology Cleversafe created is used by most people in the U.S. every day and generated over 1,000 patents granted or filed, creating one of the ten most powerful patent portfolios in the world.

Prior to Cleversafe, Chris was the Founding CEO of startups MusicNow and Cruise Technologies and led product strategy for Zenith Data Systems. He started his career at Lockheed Martin as a database programmer and holds an engineering degree from MIT.

May 6, 2022May 6, 2022

132: GreyBeards talk fast embedded k-v stores with Speedb’s Co-Founder&CEO Adi Gelvan

We’ve been talking a lot about K8s of late so we thought it was time to get back down to earth and spend some time with Adi Gelvan (@speedb_io), Co-founder and CEO of Speedb, an embedded key-value store, drop-in/replacement for RocksDB, that significantly improves on its IO performance for large metadata databases.

At Adi’s last job they were searching for a key-values store or database to manage the substantial metadata they needed. After looking at RocksDB, they found it had a number of performance problems, especially as you got up to lots of metadata. Speedb was specifically designed to address the problems they found. Listen to the podcast to learn more.

Podcast: Play in new window | Download (Duration: 43:37 — 59.9MB) | Embed

Subscribe: Apple Podcasts | Spotify | RSS

RocksDB is a key-value store engine that manages the metadata for just about every open source project in existence that uses metadata. RocksDB is a Facebook open source fork of Google’s LevelDB database.

The main issues with RocksDB is that when you have a lot of metadata (key:volume pairs), RocksDB performance suffers from highly variable latency and write stalls.

Most RocksDB users are aware of these problems and turn to sharding the database to address them (by essentially shrinking the amount of metadata under management within a single node/instance..

Historically, key-volume stores used B+-trees to store data. B+-trees are great for reading, but bad for writing. Namely the B+-tree usually had to be rebalanced when entries were added and potentially when they were updated. This could cause a cascade of read-write IO throughout the tree, delaying the original IO.

Log Structured Merge trees (LMS-trees) were created to reduce write problems while at the same time provide B+-tree speed for reading. Essentially, an LSM-tree is an in-memory, sequence of (sometimes sorted) key-value pairs that can be written (destaged) to multiple sequential (sorted) string tables (SST) files on some backing store. A hierarchical index is maintained in memory to identify which SSTs holds which key-value data.

RocksDB uses LSM-tree, in memory data structures, to buffer writes. When memory becomes full, the LSM tree can be destaged to backing store to one or more SST files. Howewer, SSTs, when first written, aren’t necessarily in sorted key order, and they may contain duplicate key-value entries to what’s already in other SSTs.

So earlier versions of SSTs will need to be read back in, compacted (duplicate key-value entries deleted), sorted and written back out. The earliest version of the SSTs is considered Level 0 (L0), the next (1st level compacted and sorted) is considered L1, and this process can go on generating L2 to Ln SSTs. We would call this garbage collection, the metadata world calls it compaction.

But each time an SST is written out that’s another read of all the key-value pairs AND another write to storage. In SSDs we would call these repeated writes, write amplification. It turns out that RocksDB can have up to a 30X write amplification for a key-value entry. This means that instead of just writing it once or twice it’s written (and reread) up to 30 times. This IO takes away bandwidth and processing power from normal metadata read and write activity, which impacts IO performance

As GreyBeards know, storage (and flash) garbage collection can lead to unpredictable latencies and system busy times. Intense garbage collection (for SSDs) can seemingly hold off or stall all other IO for some amount of time during this activity. This is the main reason why RocksDB has highly variable latencies and write stalls.

Garbage collection is not an issue when you have limited amounts of metadata entries (key:volume pairs), but as you get more entries, ongoing garbage collection can become a serious impediment to performing IO. When we say “large metadata stores” we are talking 30GBs of metadata, with probably, billions of key-volume pair entries.

There appears to be two dimensions to (RocksDB) LSM-tree/SST file performance. One is the number of levels allowed and the other is the size of SST file.

Speedb determined that two dimensions weren’t sufficient to solve RocksDB performance problems. And sharding the database seemed to be putting the burden on the customer to fix the issue. So Speedb restructured their LSM-trees and SSTs to create 3 or more dimensions to tune for database performance.

With Speedb’s restructured LMS-tree and SST files, they reduce write amplification for large metadata databases, from 30X to 5X. That alone could easily increase system performance by a factor of 6.

Adi mentioned that for one cloud based customer, they were able to double performance with 1/4 the (cloud instance) server hardware, essentially providing an ~8X improvement in performance over RocksDB.

Adi also mentioned that they are targeting system developers with large metadata stores. Luckily Speedb is a fully RocksDB compatible replacement. This means developers should only take ~30 minutes to convert a system to use Speedb.

We also asked about pricing. Adi said there’s two current pricing models: 1) OEMs pay a revenue share to use Speedb and 2) non-OEMs can license the product on a per node per month basis. Given that Speedb node efficiency over RocksDB, there should be a less nodes required to support the same performance for any given metadata store.

Adi also mentioned they are in the process of releasing an open source version of Speedb that incorporates some of the enterprise product. This way developers can try Speedb to see how it works for free. It won’t bethe complete product but it’s better than native RocksDB.

Adi Gelvan, Co-Founder and CEO Speedb

Adi Gelvan is co-founder and CEO of Speedb, a data management startup, that provides a drop-in replacement for RocksDB embedded storage engine.

Adi was a former IT infrastructure manager with over two decades of management, commercialization and executive sales position. Adi specializes in leading global software technology companies like Infinidat and SQream to outstanding growth.

Adi holds a double academic degree in mathematics & computer science.

	174: GreyBeards talk… on 174: GreyBeards talk SDN chips…
	GreyBeards talk Agen… on 169: GreyBeards talk AgenticAI…
	Computational (DNA)… on 155: GreyBeards SDC23 wrap up…
	155: GreyBeards SDC2… on 155: GreyBeards SDC23 wrap up…
	J Metz on 134: GreyBeards talk (storage)…