In this episode we talk with Ahmet Houssein, VP of Marketing and Strategic Direction at Solarflare Communications (@solarflare_comm). Ahmet’s been in the industry forever and has a unique view on where NVMeoF needs to go. Howard had talked with Ahmet at last year’s FMS. Ahmet will also be speaking at this year’s FMS (this week in Santa Clara, CA).
Solarflare Communications sells Ethernet communication gear, mostly to the financial services market, and has developed a software plugin for the standard Linux TCP/IP stack that supports both target and client (initiator) mode NVMeoF/TCP. That is, their software plugin provides a complete implementation of NVMeoF over TCP/Ethernet, one that extends the TCP protocol but doesn’t require RDMA (RoCE or iWARP) or data center bridging.
Solarflare’s NVMeoF/TCP is a free plugin that, once approved by the NVMe(oF) standards committee, anyone can use to create an NVMeoF storage system and consume that storage from almost anywhere. The standards committee is expected to approve the protocol extension soon, and sometime after that the plugin will be added to the Linux kernel. After standards approval, VMware and Microsoft may adopt it as well, but that may take more work.
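To make that concrete, here’s a hedged sketch of what consuming NVMeoF/TCP storage from a Linux client could look like once the transport is in the kernel and nvme-cli; the address, port, and NQN below are made up for illustration:

# Load the NVMe over TCP initiator module (assumes in-kernel nvme-tcp support)
modprobe nvme-tcp
# Discover subsystems exported by a target (address and port are hypothetical)
nvme discover -t tcp -a 192.0.2.10 -s 4420
# Connect to a discovered subsystem (NQN is hypothetical)
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n nqn.2018-08.com.example:nvmeof-target1
# The remote namespace now appears as a local NVMe block device
nvme list

After the connect, the remote SSD behaves like any local /dev/nvmeXnY block device.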
Over the last year plus, most of the NVMeoF/Ethernet we’ve encountered requires sophisticated RDMA hardware. When we talked with Pavilion Data Systems, a month or so ago, they had designed a more networking-like approach to NVMeoF using RoCE and TCP [updated 8/8/18 after VR’s comment, the ed.]. When we talked with Attala Systems, they had a special purpose FPGA that’s used in their RDMA NICs and Mellanox switches to support target & client mode NVMeoF/UDP [updated 8/8/18 after VR’s comment, the ed.].
Solarflare is taking a different tack.
One problem with NVMeoF/Ethernet RDMA is compatibility. You can use either RoCE or iWARP RDMA NICs, but at the moment you can’t mix the two. With TCP/IP plugins there’s no hardware compatibility issue. (Yes, there’s still software compatibility required at both ends of the pipe.)
Solarflare recently measured latencies for their NVMeoF/TCP (using Iometer/FIO), which showed that running the protocol adds about a 5-10% increase in latency versus RDMA NVMeoF (RoCE or iWARP). Performance measurements were taken using a server running Red Hat Linux plus their TCP plugin, with NVMe SSDs on the storage (target) side, and a similar configuration, minus the SSDs, on the client side.
If they add 10% latency to a 10 microsec. IO (for Optane), latency becomes 11 microsec. Similarly, for flash NVMe SSDs it moves from 100 microsec. to 110 microsec.
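For anyone wanting to reproduce that kind of number, a minimal FIO job for per-IO read latency might look something like the following; the device name is hypothetical and the plugin setup itself isn’t shown:

# nvmeof-lat.fio - 4K random-read latency at queue depth 1
[global]
# async kernel IO; a user-space stack would replace this engine
ioengine=libaio
# bypass the page cache so we measure device + network latency
direct=1
rw=randread
bs=4k
# queue depth 1 isolates per-IO latency rather than throughput
iodepth=1
runtime=60
time_based

[nvmeof-tcp-latency]
filename=/dev/nvme1n1

Run it with fio nvmeof-lat.fio and look at the completion-latency (clat) percentiles in the output.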
Ahmet did mention that their NICs have some hardware optimizations that bring this added latency down to something closer to 5%. And later we discuss the immense parallelism opportunities of running the TCP stack in user space. Their hardware also better supports more threads doing IO in parallel.
Ahmet’s on a mission. He says there’s a misbelief that Ethernet RDMA hardware is required to achieve lightning-fast response times using NVMeoF, but it’s not true. Standard TCP, with the proper protocol enhancements, is more than capable of performing at very close to the same latencies as RDMA, without special NICs and DCB switch configurations.
Furthermore, TCP/IP already has multipathing support, so the current high-availability characteristics of TCP are readily applicable to NVMeoF/TCP.
Parallelism through user space
NVMeoF/TCP was the subject of the 1st half of our discussion, but we spent the 2nd half talking about scaling, or parallelism. Even if you can do IO at 11 or 110 microseconds, if you do enough of them, the kernel overhead of processing blocks and transferring control between kernel space and user space will at some point become a bottleneck.
However, there’s nothing stopping IT from running the TCP/IP stack in user space and eliminating kernel control transfers altogether. By doing so, data centers could parallelize all this IO across as many cores as are available.
Running the plugin in a TCP/IP stack in user space allows you to scale NVMeoF’s lightning-fast IO to as many users as you have user-space instances or cores, and the kernel doesn’t even break a sweat.
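Solarflare’s OpenOnload is one existing example of such a kernel-bypass, user-space TCP stack. As a hedged sketch, and with a purely hypothetical target binary and flags, scaling out could look like running one pinned instance per core under onload:

# Run a (hypothetical) NVMeoF/TCP target under OpenOnload's user-space TCP stack;
# onload intercepts the app's socket calls so TCP runs without kernel transitions
onload --profile=latency ./nvmeof_tcp_target --listen 0.0.0.0:4420
# Pin additional instances to their own cores to scale across the machine
taskset -c 1 onload --profile=latency ./nvmeof_tcp_target --listen 0.0.0.0:4421
taskset -c 2 onload --profile=latency ./nvmeof_tcp_target --listen 0.0.0.0:4422

Each instance owns its own stack, so there’s no shared kernel path to serialize the IO.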
Anyone could simply download Solarflare’s plugin, configure a white box server with Linux and 24 NVMe SSDs, and support ~8.4M IOPS (350K IOPS x 24 SSDs) at ~110 microsec. latency. And with user-space scaling, one could easily have 1000s of user spaces connected to it.
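To put that in perspective: assuming 4KB IOs, 8.4M IOPS works out to roughly 34GB/sec of data, or about 275Gbps of sustained network traffic, i.e., more than two 100GbE links running flat out.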
They’re going to need faster pipes!
The podcast runs ~39 minutes. Ahmet was very knowledgeable about NVMe, NVMeoF and TCP. He was articulate and easy to talk with. Listen to the podcast to learn more.
Ahmet Houssein, VP of Marketing and Strategic Direction at Solarflare Communications
Ahmet Houssein is responsible for establishing marketing strategies and implementing programs to drive revenue growth, enter new markets and expand brand awareness to support Solarflare’s continuous development and global expansion.
He has over twenty-five years of experience in the server, storage, data center and networking industries, and has held senior-level executive positions in product development, marketing and business development at Intel and Honeywell. Most recently, Houssein was SVP/GM at QLogic, where he successfully delivered first-to-market 25Gb Ethernet products, securing design wins at HP and Dell.
One of the key leaders in the creation of the InfiniBand and PCI-Express industry standards, Houssein is a recipient of the Intel Achievement Award and was a founding board member of the Storage Networking Industry Association (SNIA), a global organization of 400 companies in the storage market. He was educated in London, UK and holds an Electrical Engineering degree equivalent.
3 Replies to “68: GreyBeards talk NVMeoF/TCP with Ahmet Houssein, VP of Marketing & Strategy @ Solarflare Communications”
The information above is factually incorrect with respect to Pavilion Data.
Here is the incorrect information:
1. We don’t use Mellanox switches inside our product.
2. We don’t make NICs either, let alone use FPGAs in our NICs!! We use standard RDMA NICs from Mellanox (CX4).
3. NVMeoF/UDP is not supported by us.
Maybe you have us confused with some other company. I have no idea where this information came from.
Pavilion Data Systems
VR, gosh, I struck out on all of the above. My sincerest apologies; I will clarify this in the post ASAP with edits. RL
Post content corrected above