Big data – Grey Beards on Storage

January 19, 2024January 18, 2024

161: Greybeards talk AWS S3 storage with Andy Warfield, VP Distinguished Engineer, Amazon

We talked with Andy Warfield (@AndyWarfield), VP Distinguished Engineer, Amazon, about 10 years ago, when at Coho Data (see our (005:) Greybeards talk scale out storage … podcast). Andy has been a good friend for a long time and he’s been with Amazon S3 for over 5 years now. Since the recent S3 announcements at AWS Re:Invent, we thought it a good time to have him back on the show. Andy has a great knack for explaining technology, I suppose that comes from his time as a professor but whatever the reason, he was great to have on the show again.

Lately, Andy’s been working on S3 Express, One Zone storage, announced last November, a new version of S3 object storage with lower response time. We talked about this later in the podcast but first we touched on S3’s history and other advances. S3 and its ancillary services have advanced considerably over the years. Listen to the podcast to learn more

Podcast: Play in new window | Download (Duration: 47:52 — 65.7MB) | Embed

Subscribe: Apple Podcasts | Google Podcasts | Spotify | Email | RSS

S3 is ~18 years old now and was one of the first AWS offerings. It was originally intended to be the internet’s file system which is why it was based on HTTP protocols.

Andy said that S3 was designed for 11-9s durability and high availability options. AWS constantly monitors server and storage failures/performance to insure that they can maintain this level of durability. The problem with durability is that when a drive/server goes down, the data needs to be rebuilt onto another drive before another drive fails. One way to do this is to have more replicas of the data. Another way is to speed up rebuild times. I’m sure AWS does both.

S3 high availability requires replicas across availability zones (AZ). AWS availability zone data centers are carefully located so that they are power-networking isolated from others data centers in the region. Further, AZ site locations are deliberately selected with an eye towards ensuring they are not susceptible to similar physical disasters.

Andy discussed other AWS file data services such as their FSx systems (Amazon FSx for Lustre, for OpenZFS, for Windows File Server, & for NetApp ONTAP) as well as Elastic File System (EFS). Andy said they sped up one of these FSx services by 3-5X over the last year.

Andy mentioned one of the guiding principles for lot of AWS storage is to try to eliminate any hard decisions for enterprise developers. By offering FSx files, S3 objects and their other storage and data services, customers already using similar systems in house can just migrate apps to AWS without having to modify code.

Andy said one thing that struck him as he came on the S3 team was the careful deliberation that occurred whenever they considered S3 API changes. He said the team is focused on the long term future of S3 and any API changes go through a long and deliberate review before implementation.

One workload that drove early S3 adoption was data analytics. Hadoop and BigTable have significant data requirements. Early on, someone wrote an HDFS interface to S3 and over time lots of data analytics activity moved to S3 object hosted data.

Databases have also changed over the last decade or so. Keith mentioned that many customers are foregoing traditional data bases to use open source database solutions with S3 as their backend storage. It turns out that Open Table Format database offerings such as Apache Iceberg, Apache Hudi and Delta Lake are all available on AWS use S3 objects as their storage

We talked a bit about Lambda Server-less processing triggered by S3 objects. This was a new paradigm for computing when it came out and many customers have adopted Lambda to reduce cloud compute spend.

Recently Amazon introduced a file system M ount point for S 3 storage. Customers can now use an NFS mount point to access any S3 bucket.

Amazon also supports the Registry for Open Data, which holds just about every canonical data set (stored as S3 objects) used for AI training.

In the last ReInvent, Amazon announced S3 Express One Zone which is a high performance, low latency version of S3 storage. The goal for S3 express was to get latency down from 40-60 msec to less than 10 sec.

They ended up making a number of changes to S3 such as:

Redesigned/redeveloped some S3 micro services to reduce latency
Restricted S3 Express storage to a single zone reducing replication requirements, but maintained 11-9s durability
Used higher performing storage
Re-designed S3 API to move some authentication/verification to the beginning of object access from every object access call.

Somewhere during our talk Andy said that, in aggregate, S3 is providing 100TBytes/sec of data bandwidth. How’s that for a scale out storage.

Andy Warfield, VP Distinguished Engineer, Amazon

Andy is a Vice President and Distinguished Engineer in Amazon Web Services. He focusses primarily on data storage and analytics.

Andy holds a PhD from the University of Cambridge, where he was one of the authors of the Xen hypervisor. Xen is an open source hypervisor that was used as the initial virtualization layer in AWS, among multiple other early cloud companies. Andy was a founder at Xensource, a startup based on Xen that was subsequently acquired by Citrix Systems for $500M. Following XenSource,

Andy was a professor at the University of British Columbia (UBC), where he was awarded a Canada Research Chair, and a Sloan Research Fellowship. As a professor, Andy did systems research in areas including operating systems, networking, security, and storage.

Andy’s second startup, Coho Data, was a scale-out enterprise storage array that integrated NVMe SSDs with programmable networks. It raised over 80M in funding from VCs including Andreessen Horowitz, Intel Capital, and Ignition Partners.

October 13, 2022October 13, 2022

138: GreyBeards talk big data orchestration with Adit Madan, Dir. of Product, Alluxio

We have never talked with Alluxio before but after coming back last week from Cloud Field Day 15 (CFD15) it seemed a good time to talk with other solution providers attempting to make hybrid cloud easier to use. Adit Madan (@madanadit) , Director of Product Management, Alluxio, which is a data orchestration solution that’s available in both a free to download/use, open source, community edition (apparently, Meta is a customer ) or a licensed, closed source, enterprise edition.

Alluxio data orchestration is all about suppling local like, IO access to data that resides elsewhere for BI, AI/ML/DL, and just about any other application needing to process data residing elsewhere. Listen to the podcast to learn more

Podcast: Play in new window | Download (Duration: 50:21 — 69.2MB) | Embed

Subscribe: Apple Podcasts | Google Podcasts | Spotify | Email | RSS

Alluxio started out at UC Berkeley’s AMPlab, which is focused on big data problems and was designed to provide local access to massive amounts of distributed data. Alluxio ends up constructing a locally accessible, federation of data sources for compute apps running elsewhere,

Alluxio software installs near where compute apps run that need access to remote data. We asked about a typical cloud bursting case where S3 object data needed by an app are sitting on prem, but the apps need to run in a cloud, e.g., AWS.

He said Alluxio software would be deployed in AWS, close to app compute and that’s all there is. There’s no Alluxio software running on prem, as Alluxio just uses normal (remote access) S3 APIs to supply data to the compute apps running in AWS.

Adit mentioned that BI was one of the main applications to take advantage of Alluxio, but AI/ML/DL learning is another that could use data orchestration. It turns out that AI/ ML/DL training’s consumption of data is repetitive and highly sequential, so caching, sequential pre-fetch and other Alluxio techniques can work well there to provide local-like access to remote data.

Adit said that enterprises are increasingly looking to avoid vendor lock-in and this applies equally well to the cloud. By supporting data access in one location, say GC,P and accessing that data from another, say Azure, data gravity need no longer limit where work is done.

Adit said what makes their solution so valuable is that instead of duplicating all data from one place to another all that Alluxio moves is just the data required/requested by the apps running there.

Keith asked whether Adit considered Alluxio a data mesh or data fabric. Keith had to explain the terms to me and said data fabrics are pipes and physical infrastructure/functionality that moves data around and data mesh is what gives clients/apps/users access to that data. From that perspective Alluxio is a data mesh.

Alluxio Caching

Adit said that caching is one of the keys to making Alluxio work. Much of the success of their solution depends on applications having a well behaved working set. He also mentioned they use pre-fetching and other techniques to minimize access latency and maximize throughput. However, the first byte of data being accessed may take some time to get to where compute executes.

Adit said it’s not unusual for them to have a 1/2PB of cache (storage) for an application with multiPBs of source data.

Keith asked how Alluxio’s performance can be managed. Adit said they (we assume enterprise edition) have a solution called Cache Insights which uses Alluxio’s extensive access pattern history to predict application IO performance with larger cache (storage), higher speed networking, higher performing/more compute cores, etc. In this way, customers can see what can be done to improve application IO performance and what it would cost.

Keith asked if Alluxio were available as a SaaS solution. Adit said, although it could be deployed in that fashion, it’s not currently a SaaS solution. When asked how Alluxio (enterprise) was priced, Adit said it’s a function of the total resources consumed by their service, i.e, storage (cache), cores, networking that runs Alluxio software etc.

As for deployment options, it turns out for Spark, Alluxio is just another lib package installed inside Spark. For K8s, Alluxio is installed as a CSI drivers and a set of containers and can be deployed as containers within a cluster that needs access to data or in an external, standalone K8s cluster, servicing IO from other clusters. Alluxio HA is supplied by using multiple nodes to provide IO access.

Alluxio also supports access to multiple data locations. In this case, the applications would just access different mount points.

Data reads are easy, writes can be harder due to data integrity issues. As such, trying to supply IO performance becomes a trade off for data integrity when data updates are supported. Adit said Alluxio offers a couple of different configuration options for write concurrency (data integrity) that customers can select from. We assume this includes write through, write back and perhaps other write consistency options.

Alluxio supports AWS, Azure and GCP cloud compute accessing HDFS, S3 and Posix protocol access to data residing at remote sites. At remote sites, they currently support MinIO, Cloudian and any other S3 compatible storage solutions as well as NetApp (ONTAP) and Dell (ECS) storage as data sources.

Adit Madan, Director of Product, Alluxio

Adit Madan is the Director of Product Management at Alluxio. Adit has extensive experience in distributed systems, storage systems, and large-scale data analytics.

Adit holds an MS from Carnegie Mellon University and a BS from the Indian Institute of Technology – Delhi.

Adit is the Director of Product Management at Alluxio and is also a core maintainer and Project Management Committee (PMC) member of the Alluxio Open Source project.

June 10, 2022June 10, 2022

133: GreyBeards talk trillion row databases/data lakes with Ocient CEO & Co-founder, Chris Gladwin

We saw a recent article in Blocks and Files (Storage facing trillion-row db apocalypse), about a couple of companies which were trying to deal with trillion row database queries without taking weeks to respond. One of those companies was Ocient (@Ocient), a Chicago startup, whose CEO and Co-Founder, Chris Gladwin, was an old friend from CleverSafe (now IBM Cloud Object Storage).

Chris and team have been busy creating a new way to perform data analytics on massive data lakes. It’s has a lot to do with extreme parallelism, high core counts, NVMe SSDs, and sophisticated network and compute flow control. Listen to the podcast to learn more.

Podcast: Play in new window | Download (Duration: 43:44 — 60.1MB) | Embed

Subscribe: Apple Podcasts | Google Podcasts | Spotify | Email | RSS

The key to Ocient’s approach involves NVMe SSDs which have become ubiquitous over the last couple of years which can be deployed to deal with large data problems. Another key to Ocient is multi-core CPUs, which again seem everwhere and if anything, are almost doubling with every new generation of CPU chip.

We let Chris wax a little too long on the SSD revolution in IOPs, especially as pertains too random 4K reads. Put a 20 or so NVMe SSDs in a server with dual 50 core CPU chips and you have one fast random IO machine.

Another key to Ocient is very sophisticated network and bus data flow management. With all this data running any query on it, involves consuming lots of data that all has to be brought into the CPU. PCIe bandwidth helps, as does NVMe SSDs, but you still need to insure that nothing gets bottlenecked moving all that data around a system/server.

Yet another key to Ocient is parallelism. With one 20 NVMe SSD server and 2-50 core CPUs you’ve got a lot of capability but when you are talking about trillion row databases you need more. So in order to respond to queries in anything a second or so, they throw a lot of NVMe servers at the problem.

I asked how they split the data across all these servers and Chris mentioned that at the moment that’s part of their secret sauce and involves professional services.

Ocient supports full ANSI SQL queries against trillion row databases and replies to those queries in a matter of seconds. And we aren’t just talking about SQL selects, Ocient can do splits, joins and updates to this trillion row database at the same time as the SQL select are going on. Chris mentioned that Ocient can be loading 100K JSON files each second, while still performing SQL queries in near real time against the trillion row database.

Ocient supports Reed-Solomon error correction on database data as well as data compression and encryption.

In addition to SQL queries, Chris mentioned that Ocient supports data load and transform activities. He said that most of this data is being generated from IoT applications and often needs to be cleaned up before it can be processed. Doing this in real time, while handling queries to the database is part of their secret sauce.

Chris said there’s probably not that many organizations that have need for trillion row databases. But ad auctions, telecom routers, financial services already use trillion row databases and they all want to be able to process queries faster on this data. Ocient is betting that there will be plenty more like this over time.

Ocient is available on AWS and GCP as a cloud service, can also be used operating in their own Ocient Cloud or can be deployed on premises. Ocient services are billed on a per core pack (500 cores, I think) subscription model.

Chris Gladwin, CEO and Co-founder, Ocient

Chris is the CEO and Co-Founder of Ocient whose mission is to provide the leading platform the world uses to transform, store, and analyze its largest datasets.

In 2004, Chris founded Cleversafe which became the largest and most strategic object storage vendor in the world (according to IDC.) He raised $100M and then led the company to over a $1.3B exit in 2015 when IBM acquired the company. The technology Cleversafe created is used by most people in the U.S. every day and generated over 1,000 patents granted or filed, creating one of the ten most powerful patent portfolios in the world.

Prior to Cleversafe, Chris was the Founding CEO of startups MusicNow and Cruise Technologies and led product strategy for Zenith Data Systems. He started his career at Lockheed Martin as a database programmer and holds an engineering degree from MIT.

May 6, 2022May 6, 2022

132: GreyBeards talk fast embedded k-v stores with Speedb’s Co-Founder&CEO Adi Gelvan

We’ve been talking a lot about K8s of late so we thought it was time to get back down to earth and spend some time with Adi Gelvan (@speedb_io), Co-founder and CEO of Speedb, an embedded key-value store, drop-in/replacement for RocksDB, that significantly improves on its IO performance for large metadata databases.

At Adi’s last job they were searching for a key-values store or database to manage the substantial metadata they needed. After looking at RocksDB, they found it had a number of performance problems, especially as you got up to lots of metadata. Speedb was specifically designed to address the problems they found. Listen to the podcast to learn more.

Podcast: Play in new window | Download (Duration: 43:37 — 59.9MB) | Embed

Subscribe: Apple Podcasts | Google Podcasts | Spotify | Email | RSS

RocksDB is a key-value store engine that manages the metadata for just about every open source project in existence that uses metadata. RocksDB is a Facebook open source fork of Google’s LevelDB database.

The main issues with RocksDB is that when you have a lot of metadata (key:volume pairs), RocksDB performance suffers from highly variable latency and write stalls.

Most RocksDB users are aware of these problems and turn to sharding the database to address them (by essentially shrinking the amount of metadata under management within a single node/instance..

Historically, key-volume stores used B+-trees to store data. B+-trees are great for reading, but bad for writing. Namely the B+-tree usually had to be rebalanced when entries were added and potentially when they were updated. This could cause a cascade of read-write IO throughout the tree, delaying the original IO.

Log Structured Merge trees (LMS-trees) were created to reduce write problems while at the same time provide B+-tree speed for reading. Essentially, an LSM-tree is an in-memory, sequence of (sometimes sorted) key-value pairs that can be written (destaged) to multiple sequential (sorted) string tables (SST) files on some backing store. A hierarchical index is maintained in memory to identify which SSTs holds which key-value data.

RocksDB uses LSM-tree, in memory data structures, to buffer writes. When memory becomes full, the LSM tree can be destaged to backing store to one or more SST files. Howewer, SSTs, when first written, aren’t necessarily in sorted key order, and they may contain duplicate key-value entries to what’s already in other SSTs.

So earlier versions of SSTs will need to be read back in, compacted (duplicate key-value entries deleted), sorted and written back out. The earliest version of the SSTs is considered Level 0 (L0), the next (1st level compacted and sorted) is considered L1, and this process can go on generating L2 to Ln SSTs. We would call this garbage collection, the metadata world calls it compaction.

But each time an SST is written out that’s another read of all the key-value pairs AND another write to storage. In SSDs we would call these repeated writes, write amplification. It turns out that RocksDB can have up to a 30X write amplification for a key-value entry. This means that instead of just writing it once or twice it’s written (and reread) up to 30 times. This IO takes away bandwidth and processing power from normal metadata read and write activity, which impacts IO performance

As GreyBeards know, storage (and flash) garbage collection can lead to unpredictable latencies and system busy times. Intense garbage collection (for SSDs) can seemingly hold off or stall all other IO for some amount of time during this activity. This is the main reason why RocksDB has highly variable latencies and write stalls.

Garbage collection is not an issue when you have limited amounts of metadata entries (key:volume pairs), but as you get more entries, ongoing garbage collection can become a serious impediment to performing IO. When we say “large metadata stores” we are talking 30GBs of metadata, with probably, billions of key-volume pair entries.

There appears to be two dimensions to (RocksDB) LSM-tree/SST file performance. One is the number of levels allowed and the other is the size of SST file.

Speedb determined that two dimensions weren’t sufficient to solve RocksDB performance problems. And sharding the database seemed to be putting the burden on the customer to fix the issue. So Speedb restructured their LSM-trees and SSTs to create 3 or more dimensions to tune for database performance.

With Speedb’s restructured LMS-tree and SST files, they reduce write amplification for large metadata databases, from 30X to 5X. That alone could easily increase system performance by a factor of 6.

Adi mentioned that for one cloud based customer, they were able to double performance with 1/4 the (cloud instance) server hardware, essentially providing an ~8X improvement in performance over RocksDB.

Adi also mentioned that they are targeting system developers with large metadata stores. Luckily Speedb is a fully RocksDB compatible replacement. This means developers should only take ~30 minutes to convert a system to use Speedb.

We also asked about pricing. Adi said there’s two current pricing models: 1) OEMs pay a revenue share to use Speedb and 2) non-OEMs can license the product on a per node per month basis. Given that Speedb node efficiency over RocksDB, there should be a less nodes required to support the same performance for any given metadata store.

Adi also mentioned they are in the process of releasing an open source version of Speedb that incorporates some of the enterprise product. This way developers can try Speedb to see how it works for free. It won’t bethe complete product but it’s better than native RocksDB.

Adi Gelvan, Co-Founder and CEO Speedb

Adi Gelvan is co-founder and CEO of Speedb, a data management startup, that provides a drop-in replacement for RocksDB embedded storage engine.

Adi was a former IT infrastructure manager with over two decades of management, commercialization and executive sales position. Adi specializes in leading global software technology companies like Infinidat and SQream to outstanding growth.

Adi holds a double academic degree in mathematics & computer science.

March 23, 2022March 23, 2022

130: GreyBeards talk high-speed database access using Apache Arrow Flight, with James Duong and David Li

We had heard about Apache Arrow and Arrow Flight as being a hi-performing database with access speeds to match for a while now and finally got a chance to hear what it was all about with James Duong, Co-Fourder of Bit Quill Technologies/Senior Staff Developer at Dremio and David Li (@lidavidm), Apache PMC and software developer at Voltron Data.

First, Apache Arrow is an open source, in memory data base (GitHub repo) for columnar data that enables lightening fast access and processing of data. Apache Arrow Flight is a set of interfaces, protocols, and services that parallelizes access to load and unload Arrow data over the network, from storage to memory and back, very fast. Listen to the podcast to learn more.

Podcast: Play in new window | Download (Duration: 42:09 — 57.9MB) | Embed

Subscribe: Apple Podcasts | Google Podcasts | Spotify | Email | RSS

Arrow Flight podcast transcript Download

Columnar databases are all the rage these days and have more or less taken over from row oriented data bases. With row based database, data is stored (and accessed) row by row. In a columnar database, data is stored in columns, i.e, all data for one column is stored in sequence and then the next column is stored in sequence. Columnar databases can be queried/processed faster than row databased (depending on whether you are looking at/accessing multiple columns per row or not). And columnar data should compress better as all the data in a single column is of the same type..

Also the fact that columns are located contiguous in memory means if you process a column at a time, CPU data caches should work better. This is because they can grab a whole vector (columns worth of data) with one request.

Arrow data is processed and accessed in record batches. These are 2D segments which represent all the columns in a sequence/set of rows. And record batches are the unit of parallelism in Arrow and Arrow Flight. So an Arrow client operating on a CPU thread/core/chip or server could be processing one record batch while another CPU thread/core/CPU or server could process a different record batch.

Arrow Flight (GitHub RPC format doc repo) is an RPC framework that includes API’s, protocols, standards (for on storage, on wire and in memory) and libraries used to transfer Arrow data and metadata (record batches) across the network. For the typical system there exists Flight clients and Flight services in a system.

Arrow Flight currently uses Google’s gRPC for data transfers. gRPC is a open source remote procedure call (RPC) service that supports within data center, across data centers and out to the edge processing services. Although Arrow Flight is currently implemented on top of gRPC, other network protocols will be supported in the future.

What makes Arrow Flight so fast is its ability to support parallel transfers. That is customers can configure Arrow (Flight) clients across clusters of servers and Arrow (Flight) services residing on one or more other servers. Any client can request metadata and record batches from any end point (Flight service) in the data center. And yes Arrow data can be supplied from multiple end points by being mirrored/replicated. All data transfers can operate in parallel across all Flight client and services, with no known bottleneck other than the network.

A single stream of Arrow Flight data was able to deliver 20GB/sec. The fact that you can have any (?) number of Arrow Flight data streams in operation at the same time makes that a very interesting number.

Also, Arrow data can be stored on or sourced from typical data lakes such as Azure Data Lake, AWS S3, Google Cloud storage, etc.

Another advantage of Arrow Flight is the ability to use the same format on the wire and in storage. Normally JDBC (and ODBC) have on storage and on wire formats which require format conversion (serialization) to move data from storage/memory to wire and another conversion (deserialization) to move data from on wire format to in storage/memory format. Arrow Flight does away with serialization and deserialization of data all together and uses the same format for on wire and in storage.

Arrow Flight SQL allows Arrow processing of SQL database data. My understanding is that customers using non Arrow databases such as Oracle, SQL Server, Postgres, etc. can use Arrow Flight SQL to provide Arrow in-memory database processin/query execution for their data.

Arrow and Arrow flight are primarily used to process data analytics workloads but Arrow also has a new execution engine, the Arrow Gandiva project, that enables vectorized processing of Arrow data. This is a special execution engine for Arrow that supports X86 cores with AVX instructions, (NVIDIA) GPUs, and FPGAs.

There’s also an open source package, Fletcher, used to create Arrow and Arrow flight processing HDLs so that customers can add Arrow data processing and Arrow Flight data transfer functionality to custom built FPGAs.

One challenge with open source software is support for problems/bugs that crop up. An active developer community helps, but enterprise customers require professional, on call 7×24 (5×12?) support for all their critical (and most non-critical) software. Voltron Data (David’s) company provides paid for support for Arrow Flight and Arrow data services.

The other major problem with open source software has been use complexity. At the moment the Arrow Flight team is very responsive in clarifying documentation and are trying to make it easier to use. But at the moment Arrow Flight is mostly a set of APIs, libraries and connectors that end users can use to standup Arrow (Flight) clients and servers to transfer Arrow data between them.

James Duong, Co-Founder Bit Quill Technologies & Sr. Staff Developer at Dremio

An Apache Arrow contributor, cofounder at Bit Quill Technologies, and contributor to Dremio Corporation projects, James Duong has worked with databases for over 15 years, from backend query engines to drivers and protocols. He’s worked with a variety of relational, big data, and cloud databases including Dremio, SQL Server, Redshift, and Hive.

Previously at Simba Technologies, James architected and built connectors for sources, as well as designing the Simba Engine SDK for developing connectivity solutions for any data source.

Bit Quill Technologies, the company James helped co-found, builds back end software in the data and cloud space. Bit Quill has built a name for itself as a producer of high-quality software, a collaborative approach to design and development, and a love for good tech and happy people.

Balancing his passion for the data ecosystem with a young family, James occasionally steps away from it all to go hiking.

David Li, Apache Arrow PMC and software engineer at Voltron Data

David is a PMC member for Apache Arrow and a software engineer at Voltron Data (formerly known as Ursa Computing). Prior to that, he worked on data services and Apache Arrow at Two Sigma.

David holds an M.Eng. in Computer Science from Cornell University.

	Computational (DNA)… on 155: GreyBeards SDC23 wrap up…
	155: GreyBeards SDC2… on 155: GreyBeards SDC23 wrap up…
	J Metz on 134: GreyBeards talk (storage)…
	Administrator on 68: GreyBeards talk NVMeoF/TCP…
	Administrator on 68: GreyBeards talk NVMeoF/TCP…