115: GreyBeards talk database acceleration with Moshe Twitto, CTO & Co-founder, Pliops

We seem to be on a computational tangent this year. So we thought it best to talk with Moshe Twitto, CTO and Co-Founder at Pliops (@pliopsltd). We had first seen them at SFD21 (see videos of their sessions here) and their talk on how they could speed up database IO was pretty impressive. Essentially, they have a database/storage accelerator board that increases block store IO activity to NVMe SSDs and also provides a key-value store IO accelerator.

Moshe was very knowledgeable about the technology and had previously worked at Samsung for their SSD group. He knew a lot about what happens underneath the covers of an SSD and what it takes to speed up IO. It turns out that many in-memory databases use persistent key-value stores to persist data or to operate in non- (or partial-) memory mode. Listen to the podcast to learn more.

The Pliops board plugs into the PCIe bus and accelerates IO to NVMe SSDs connected to the bus, or can act to accelerate IO to a JBoF that’s networked behind it. Their board uses FPGA(s), NVDIMMs of their own design, and DRAM to accelerate database IO to NVMe SSDs.

Pliops operates in one of two modes, as a Key-Value store or as a Block store. Their Key-Value store takes advantage of block store capabilities, so we start there.

In block mode, Pliops provides inline hardware data compression and encryption. Compression requires support for variable length blocks on backend SSDs. To better support this, they pack multiple compressed blocks into physical blocks. They also use a virtualization service to support mapping host LBAs to physical block addresses (using an internal key-value store). Hardware inline encryption is also provided on a LUN (or namespace) basis. This could enable each database to have its own key. They have a root-of-trust secret key used to encrypt customer namespace (database) keys.
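Pliops didn’t detail the key hierarchy on the podcast, but that root-of-trust arrangement sounds like standard key wrapping. Here’s a minimal sketch of the idea, assuming AES-GCM; all the names and choices below are our assumptions, not Pliops’ design:

```python
# Hypothetical sketch of per-namespace (per-database) key wrapping under a
# root-of-trust key. Uses AES-GCM from the 'cryptography' package; Pliops
# hasn't published its actual key hierarchy.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

root_key = AESGCM.generate_key(bit_length=256)   # root-of-trust secret key

def wrap_namespace_key(root_key: bytes, namespace_key: bytes) -> bytes:
    """Encrypt (wrap) a namespace key under the root key."""
    nonce = os.urandom(12)
    return nonce + AESGCM(root_key).encrypt(nonce, namespace_key, None)

def unwrap_namespace_key(root_key: bytes, wrapped: bytes) -> bytes:
    nonce, ciphertext = wrapped[:12], wrapped[12:]
    return AESGCM(root_key).decrypt(nonce, ciphertext, None)

db_key = AESGCM.generate_key(bit_length=256)   # one key per LUN/namespace
stored = wrap_namespace_key(root_key, db_key)  # safe to persist on media
assert unwrap_namespace_key(root_key, stored) == db_key
```

The point of the arrangement is that only the wrapped form ever needs to leave the board, and each database’s key can be revoked or rotated independently.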

They also optimize physical block layout on the SSD to reduce write amplification (doing more than one NAND write for every host write to the SSD).

Block mode also supports smart caching. This is especially useful for database journaling/logging, which reuses a portion of LBA address space (blocks) as a revolving journal/log. These blocks are overwritten with new data often and data written to them need not be destaged to NVMe SSDs as long as it can be maintained in NVDIMM storage. At some point it gets destaged, but probably only when log activity slows down (if ever) or some timeout occurs.
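As a rough sketch of that caching idea (the destage policy shown is our assumption, not Pliops’ published behavior):

```python
# Toy write-back cache for hot journal/log LBAs; the dict stands in for
# NVDIMM storage. Overwrites to the same LBA are absorbed in place, and a
# block only hits the SSD once it has been idle past a timeout.
import time

class LogBlockCache:
    def __init__(self, ssd, idle_timeout=5.0):
        self.ssd = ssd                # backing block device (a dict here)
        self.idle_timeout = idle_timeout
        self.cache = {}               # lba -> (data, last_write_time)

    def write(self, lba, data):
        # Rewriting a hot log block costs no NAND write at all.
        self.cache[lba] = (data, time.monotonic())

    def destage_idle(self):
        now = time.monotonic()
        for lba, (data, t) in list(self.cache.items()):
            if now - t > self.idle_timeout:   # log activity slowed down
                self.ssd[lba] = data          # one NAND write absorbs many
                del self.cache[lba]           # host writes to this LBA
```

A revolving log that rewrites the same few hundred LBAs all day would generate almost no NAND traffic under a policy like this.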

For their key-value storage accelerator, they have implemented an API that’s similar to RocksDB, a persistent key-value store, which is used as a physical storage backend for Redis and similar in-memory databases. However, the challenge with RocksDB is that it has lots of tuning knobs/parameters, so getting it right takes some work. All this can be avoided just by using Pliops.
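Pliops only describes its API as similar to RocksDB’s, so for flavor, here’s a basic open/put/get against RocksDB itself via the third-party python-rocksdb bindings, with a few of the many tuning knobs visible even in this trivial case:

```python
import rocksdb  # third-party 'python-rocksdb' bindings

# Even a "simple" open forces you to pick several tuning parameters;
# RocksDB has dozens more. A hardware key-value store can hide all this.
opts = rocksdb.Options(
    create_if_missing=True,
    write_buffer_size=64 * 1024 * 1024,     # memtable size
    max_write_buffer_number=3,              # memtables before stalling
    target_file_size_base=64 * 1024 * 1024, # SST file size target
)
db = rocksdb.DB("example.db", opts)

db.put(b"user:1001", b'{"name": "Ray"}')
value = db.get(b"user:1001")
```

The pitch is that an application keeps this get/put programming model while the compaction and tuning burden moves into hardware.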

We didn’t talk too much about how their key-value store works. Moshe says they optimize the key structures and key data so that all database keys can be retained in their board’s memory and just by doing that, they can have immediate (1 IO) access to any data block pointed to by those keys.
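A toy model of that claim, assuming the full key index fits in board memory (the actual index structure isn’t public):

```python
class OneIOKVStore:
    """Toy model: the whole key index lives in (board) RAM, values on flash.

    Any get() is then exactly one media read; the key lookup itself never
    touches the SSD. Purely illustrative of the 1-IO-per-access idea.
    """
    def __init__(self, flash):
        self.flash = flash    # block device: read(addr, length) -> bytes
        self.index = {}       # key -> (address, length), held in RAM

    def put(self, key, addr, length):
        self.index[key] = (addr, length)

    def get(self, key):
        addr, length = self.index[key]        # RAM lookup, zero IO
        return self.flash.read(addr, length)  # the single media IO
```

Contrast this with an LSM-tree like RocksDB, where a cold read may touch several index and data files before finding the value.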

He did mention that they provide ~the same performance for a database getting 10-25% host cache hit rates using their board as that same database would see with an 80-90% host cache hit rate not using their board. Some of this was shown at SFD21 (so check out the videos above for more performance info).
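To see why that’s a big deal, here’s some back-of-envelope math; the latency figures below are our assumptions, not Pliops numbers:

```python
def avg_latency_ns(hit_rate, hit_ns=100, miss_ns=100_000):
    """Average access latency at a given host cache hit rate.

    Assumed figures: ~100ns for a memory hit, ~100us for an NVMe read.
    """
    return hit_rate * hit_ns + (1 - hit_rate) * miss_ns

print(avg_latency_ns(0.15))  # 85,015 ns: a 15% hit rate, unaccelerated
print(avg_latency_ns(0.85))  # 15,085 ns: what the claim equates to
```

In other words, if the board really lets a 10-25% hit-rate system behave like an 80-90% one, average access latency drops by roughly 5x, or alternatively, the host can get by with a fraction of the DRAM cache.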

They bring a couple of other advantages to the table. As they are interposed between the host and the NVMe SSDs, they can take advantage of their NVDIMMs and memory to write much wider stripes than the host writes. This allows them to reduce SSD read and write amplification (due to less garbage collection) by writing more full NAND pages. All this also reduces physical (NAND) data writes/day, which can significantly improve SSD endurance.
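A quick illustration of the write amplification effect; the WA figures below are hypothetical, chosen only to show the shape of the benefit:

```python
# Hypothetical WA figures illustrating the effect of wide, full-page stripes.
def write_amplification(nand_writes_tb, host_writes_tb):
    """WA = total NAND bytes written (incl. garbage collection) / host bytes."""
    return nand_writes_tb / host_writes_tb

host = 1.0                                   # TB/day written by the host
wa_random = write_amplification(3.0, host)   # 3.0x: small scattered writes
wa_striped = write_amplification(1.2, host)  # 1.2x: wide full-page stripes
print(wa_random / wa_striped)                # 2.5x fewer NAND writes/day,
                                             # i.e. ~2.5x the SSD endurance
```

Since NAND wears out by total bytes written, any reduction in WA translates directly into more host drive-writes-per-day over the same drive lifetime.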

Somewhere in all that smart caching and data compression, they are also able to decrease response times. It turns out that databases that don’t use RocksDB or depend on key-value stores can easily take advantage of all their block store functionality to improve IO performance.

They mostly market their product to hyper-scalers and super-scalers. His definition of a super-scaler was any organization that operates at public cloud levels but is not a public cloud (e.g., big social media companies).

Moshe Twitto, CTO & Co-founder, Pliops

Moshe is an expert in advanced data management and coding algorithms. Prior to co-founding Pliops, Moshe served as CTO of Samsung’s SSD Controller Development Center in Israel.

Moshe holds MSEE and BSEE degrees, Summa Cum Laude, from the Technion and served in Unit 8200, the intelligence unit of the Israel Defense Forces.

114: GreyBeards talk computational storage with Tong Zhang, Co-Founder & Chief Scientist, ScaleFlux

Seeing as how one topic on last year’s FMS2020 wrap-up with Jim Handy was the rise of computational storage, and it’s been a long time (see GreyBeards talk with Scott Shadley at NGD Systems) since we discussed this, we thought it time to check in on the technology. So we reached out to Dr. Tong Zhang, Chief Scientist and Co-founder, ScaleFlux, to see what’s going on. ScaleFlux is seeing rising adoption of their product in hyper-scalers as well as large enterprises. Their computational storage product is a programmable, FPGA-based SSD available in 4TB and 8TB capacities.

Tong was very knowledgeable on current industry trends (Moore’s law slowing & others) that have created an opening for computational storage and other outboard compute. He is also well versed in how some of the world’s biggest customers are using the technology to work faster and cheaper in their data centers. Listen to the podcast to learn more.

At the start Tong mentioned Alibaba’s use of ScaleFlux’s transparent, line speed, outboard encryption/decryption and compression/decompression. And, depending on the data, they can see compression ratios far exceeding 2:1. As such, customers not only benefit from a cheaper $/GB but can also see better NAND endurance and higher performance.
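A quick worked example of what that buys; the drive price below is hypothetical, and the 2:1 ratio is the conservative end of what Tong described:

```python
raw_tb = 4       # ScaleFlux drive capacity
price = 1000.0   # hypothetical drive price in $, for illustration only
ratio = 2.0      # transparent compression ratio (data dependent)

print(price / (raw_tb * 1000))           # $0.25/GB at raw capacity
print(price / (raw_tb * ratio * 1000))   # $0.125/GB effective capacity
# Only ~1/ratio of each host write lands on NAND, so endurance (and often
# performance) improves roughly in proportion as well.
```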

Hosts can do compression and encryption themselves, but doing so takes a lot of CPU cycles, and it turns out compression is more compute intensive than encryption. Tong said that most modern cores can encrypt/decrypt at ~1GB/sec but, depending on the compression algorithm, can only compress at 40 to 100MB/sec. With ScaleFlux, compression and decompression run at PCIe bus speeds.
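That host-side cost is easy to sanity check yourself, e.g. with Python’s zlib; absolute numbers will vary with CPU, compression level, and data:

```python
import os
import time
import zlib

data = os.urandom(16 * 1024 * 1024)   # 16MB of incompressible input

start = time.perf_counter()
zlib.compress(data, level=6)
elapsed = time.perf_counter() - start
print(f"{len(data) / elapsed / 1e6:.0f} MB/s on one core")
# Typically lands in the same tens-to-low-hundreds of MB/s range Tong
# cited, far below the ~1GB/s a core manages for AES with AES-NI.
```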

Most storage controllers that offer compression/decompression must have some sort of LBA (logical block address) virtualization, because while the host may be writing 512 or 4096 byte blocks, what’s actually written to the NAND is more like 231 or 1999 bytes. So packing these odd, variable length blocks into NAND blocks can become a problem. But most SSDs already have a flash translation layer (FTL) where LBA addresses are mapped, over time, to different physical NAND page/block addresses. ScaleFlux has combined support for LBA virtualization and FTL into the same process and, by doing so, reduced IO overhead to perform better.
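Here’s a toy model of that combined map; the page size and layout are assumptions for illustration, not ScaleFlux’s actual design:

```python
class CompressedFTL:
    """Toy combined LBA-virtualization + FTL map.

    Host writes fixed 4KB LBAs; the compressed payloads (e.g. 231 or 1999
    bytes) are packed back-to-back into NAND pages, and a single map
    resolves LBA -> (page, offset, length) directly.
    """
    PAGE_SIZE = 16384   # assumed NAND page size

    def __init__(self):
        self.map = {}                 # lba -> (page_no, offset, length)
        self.pages = [bytearray()]    # open NAND page being filled

    def write(self, lba, compressed):
        page = self.pages[-1]
        if len(page) + len(compressed) > self.PAGE_SIZE:
            self.pages.append(bytearray())   # seal page, open a new one
            page = self.pages[-1]
        self.map[lba] = (len(self.pages) - 1, len(page), len(compressed))
        page.extend(compressed)

    def read(self, lba):
        page_no, off, length = self.map[lba]  # one lookup, one media read
        return bytes(self.pages[page_no][off:off + length])
```

Folding both translations into one table means a read costs a single lookup plus a single media access, rather than stacking a compression map on top of a separate FTL.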

ScaleFlux’s drive is an NVMe SSD, which already supports great native response times, but when you are transferring half or less of the (compressed) data from the host onto NAND, you can reduce latencies even more.

Although their current generation product is based on TLC NAND, they are working on the next generation, which will support QLC. The benefits of writing and reading less data should also help QLC endurance and performance.

Although ScaleFlux is seeing great adoption with just outboard transparent compression and encryption, there is more that could be done. For example:

  • Filtering queries at the drive rather than at the host. If customers can send a search key/phrase or other filtering request directly to the drive, the drive can scan all its data and send back just the data that matches that filter request (see the sketch after this list).
  • Transcoding and other data format changes. Although transcoding makes a lot of sense to do outboard, Tong also mentioned format changes. We asked him to clarify and he said to consider a row-based database that needs to be accessed in columnar format. If the drive could change the format from one to the other, it would open up more analytics tool sets.
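The filtering idea, sketched as a toy model; everything here is hypothetical, since NVMe has no such command today:

```python
# Toy model of pushing a filter down to the drive: the "drive" scans its
# records locally and returns only the matches, so only hits cross the bus.
def drive_side_filter(records, predicate):
    """Would run on the drive: full local scan, tiny result set to host."""
    return [r for r in records if predicate(r)]

records = [b"ERROR disk 3 offline", b"INFO heartbeat ok", b"ERROR cksum"]
hits = drive_side_filter(records, lambda r: r.startswith(b"ERROR"))
# The host receives 2 matching records instead of scanning all the data
# (potentially terabytes of it) itself.
```

The win scales with selectivity: a filter that matches 1% of an 8TB drive’s data would cut bus traffic and host CPU work by roughly 100x for that scan.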

At the moment, ScaleFlux engineering teams are the ones that program the FPGA to perform outboard functionality. But in a future release, they plan to add ARM cores in an SoC, which can handle more general purpose outboard functionality as code.

Because of the added complexity of compression, encryption and other outboard logic, we asked Tong what power loss protection was available at the drive level. Tong assured us that once data has been received by their device, it is maintained across a power failure, with capacitors (CAPs) and other logic to offload it.

Tong also mentioned that Intel, AWS and the NVMe standard committee are looking at adding some computational storage support into the NVMe standard, so applications and host software can invoke, and maybe modify, outboard functionality on the fly. Sort of like loading containers of functionality to run on an SSD.

Dr. Tong Zhang, Chief Scientist and Co-founder, ScaleFlux

Dr. Tong Zhang is a well-established researcher with significant contributions to data storage systems and VLSI signal processing. Dr. Zhang is responsible for developing key techniques and algorithms for ScaleFlux’s Computational Storage products and exploring their use in mainstream application domains.

He is currently a Professor at Rensselaer Polytechnic Institute (RPI). His current and past research spans databases, filesystems, solid-state and magnetic data storage devices and systems, digital signal processing and communication, error correction coding, VLSI architectures, and computer architecture.

He has published over 150 technical papers at prestigious USENIX/IEEE/ACM conferences and journals with a citation h-index of 36, and has served as general and technical program chair for several premier conferences. Among his many research accomplishments, he made pioneering contributions to establishing flash memory signal processing and enabling practical implementation of low-density parity-check (LDPC) codecs. He received two best paper awards and has over 20 issued/pending US patent applications.

He holds BS/MS degrees in EE from Xi’an Jiaotong University, China, and a PhD in ECE from the University of Minnesota.