129: GreyBeards talk composable infrastructure with GigaIO’s Matt Demas, Field CTO

We haven’t talked composable infrastructure in a while now but it’s been heating up lately. GigaIO has some interesting tech and I’ve been meaning to have them on the show but scheduling never seemed to work out. Finally, we managed to sync schedules and have Matt Demas, field CTO at GigaIO (@giga_io) on our show.

Also, please welcome Jason Collier (@bocanuts), a long-time friend, technical guru and innovator, to our show as another co-host. We used to have these crazy discussions in front of financial analysts where we disagreed completely on the direction of IT. We don’t do these anymore, probably because the complexities in this industry can be hard to grasp for some. From now on, Jason will be added to our gaggle of GreyBeard co-hosts.

GigaIO has taken a different route to composability than some other vendors we have talked with. For one, they seem inordinately focused on speed of access and reducing latencies. For another, they’re the only ones out there, to our knowledge, demonstrating how today’s technology can compose and share memory across servers, storage, GPUs and just about anything with DRAM hanging off a PCIe bus. Listen to the podcast to learn more.

GigaIO started out with pooling/composing memory across PCIe devices. Their current solution is built around a ToR (currently PCIe Gen4) switch with switching logic plus a family of pooling appliances (JBoG[PUs], JBoF[lash], JBoM[emory], …). They use their FabreX fabric to supply rack-scale composable infrastructure that can move (attach) PCIe componentry (GPUs, FPGAs, SSDs, etc.) to any server on the fabric to service workloads.

We spent an awfully long time talking about composing memory. I didn’t think this was currently available, at least not until the next version of CXL, but Matt said GigaIO, together with their partner MemVerge, are doing it today over FabreX.

We’ve talked with MemVerge before (see: 102: GreyBeards talk big memory … episode). But when last we met, MemVerge had a memory appliance that virtualized DRAM and Optane into an auto-tiering, dual-tier memory. Apparently, with GigaIO’s help they can now attach a third tier of memory to any server that needs it. I asked Matt what the extended DRAM response time to memory requests was and he said ~300ns. He added that next-gen PCIe technology will take this down considerably.

Matt and Jason started talking about High Bandwidth Memory (HBM), which is internal to GPUs, AI boards, HPC servers and some select CPUs, and stacks synchronous DRAM (SDRAM) into a 3D package. 2nd gen HBM silicon is capable of 256 GB/sec per package. Given this level of access and performance, Matt indicated that GigaIO is capable of sharing this memory across the fabric as well.

We then started talking about software and how users can control FabreX and their technology to compose infrastructure. Matt said GigaIO has no GUI but rather uses Redfish management, a fully RESTful interface and API. Redfish has been around for ~6 yrs now and has become the de facto standard for management of server infrastructure. GigaIO composable infrastructure support has been natively integrated into a couple of standard cluster managers, for example, CIQ’s Singularity & Fuzzball, Bright Computing’s cluster manager, and SLURM cluster scheduling. Matt also mentioned they are well plugged into OCP.
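For a feel of what driving composition through Redfish looks like, here’s a minimal sketch using Python and the standard Redfish Composition Service endpoints. The management hostname, credentials, and exact resource layout are assumptions for illustration only, not GigaIO-specific details from the show; only the Redfish paths themselves (/redfish/v1/, CompositionService, ResourceBlocks) come from the DMTF standard.

```python
# Minimal sketch: browsing a Redfish Composition Service with plain HTTP.
# Hostname, credentials, and the exact resources exposed are hypothetical.
import requests

BASE = "https://fabrex-mgmt.example.com"   # hypothetical fabric management endpoint
AUTH = ("admin", "password")               # placeholder credentials

# The Redfish service root always lives at /redfish/v1/
root = requests.get(f"{BASE}/redfish/v1/", auth=AUTH, verify=False).json()

# Standard Redfish composability collections: ResourceBlocks are the pools of
# composable devices (GPUs, drives, memory); ResourceZones say what can bind where.
comp = requests.get(f"{BASE}/redfish/v1/CompositionService",
                    auth=AUTH, verify=False).json()
blocks = requests.get(BASE + comp["ResourceBlocks"]["@odata.id"],
                      auth=AUTH, verify=False).json()

# List each composable resource block and its type (e.g. Processor, Storage).
for member in blocks.get("Members", []):
    block = requests.get(BASE + member["@odata.id"], auth=AUTH, verify=False).json()
    print(block.get("Name"), block.get("ResourceBlockType"))
```

In practice a cluster manager or scheduler like the ones Matt mentioned would drive calls like these when composing resources for a job, rather than an administrator poking at the API by hand.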

Composable infrastructure seems to have generated new interest with HPC customers that are deploying bucketfuls of expensive GPUs alongside their congregation of compute cores. Using GigaIO, HPC environments like these can go, practically overnight, from maybe 30% average GPU utilization to 70%. Doing so can substantially reduce acquisition and operational costs for GPU infrastructure. One would think the cloud guys might be interested as well.
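To put that utilization jump in perspective, here’s a back-of-the-envelope sketch. Only the 30% and 70% figures come from the discussion; the GPU-hour demand and fleet-sizing assumptions are made-up numbers for illustration.

```python
# Back-of-the-envelope: how much GPU hardware does doubling utilization save?
# Only the 30% -> 70% utilization figures come from the episode; the demand
# figure below is hypothetical, chosen purely for illustration.
gpu_hours_needed = 100_000          # useful GPU-hours the workloads need per year
hours_per_gpu_per_year = 8_760      # one GPU powered on for a full year

def gpus_required(utilization: float) -> float:
    """GPUs needed to deliver the useful hours at a given average utilization."""
    return gpu_hours_needed / (hours_per_gpu_per_year * utilization)

before = gpus_required(0.30)   # siloed GPUs, ~30% average utilization
after = gpus_required(0.70)    # pooled/composable GPUs, ~70% average utilization

print(f"GPUs needed at 30%: {before:.0f}")   # ~38
print(f"GPUs needed at 70%: {after:.0f}")    # ~16
print(f"Reduction: {1 - after / before:.0%}")  # ~57% fewer GPUs for the same work
```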

Matt Demas, Field CTO, GigaIO

Matt’s career spans two decades of experience in architecting innovative IT solutions, starting with the US Air Force. He has built federal, healthcare, and education-based vertical solutions at companies like Dell, where he was a Senior Solutions Architect. Immediately prior to joining GigaIO, he served as Field CTO at Liqid. 

Matt holds a Bachelor’s degree in Information Technology from American InterContinental University, and an MBA from Concordia University Austin.

82: GreyBeards talk composable infrastructure with Sumit Puri, CEO & Co-founder, Liqid Inc.

This is the first time we’ve had Sumit Puri, CEO & Co-founder of Liqid, on the show, but both Greg and I have talked with Liqid in the past. Given that we talked with another composable infrastructure company (see our DriveScale podcast), we thought it would be nice to hear from their competition.

We started with a brief discussion of the differences between them and DriveScale. Sumit mentioned that DriveScale is mainly focused on storage and not as much on the other components of composable infrastructure.

[This was Greg Schulz’s (@storageIO & StorageIO.com), first time as a GreyBeard co-host and we had some technical problems with his feed, sorry about that.]

Multi-fabric composable infrastructure

At Dell Tech World (DTW) 2019 last week, Liqid announced a new, multi-fabric composability solution. Originally, Liqid composable infrastructure only supported PCIe switching, but with their new announcement they now also support Ethernet and InfiniBand infrastructure composability. In their multi-fabric solution, they offer JBoG(PUs) that can attach over Ethernet/InfiniBand, as well as other compute accelerators such as FPGAs or AI-specific compute engines.

For non-PCIe switch fabrics, Liqid adds an “HBA-like” board on the server side that converts PCIe protocols to Ethernet or InfiniBand, with another HBA-like board sitting in the JBoG.

As such, if you were a Media & Entertainment (M&E) shop, you could be doing 4K real-time editing during the day, with GPUs each assigned to separate servers running editing apps, and at night move all those GPUs to a central server where they could be used for rendering or transcoding. All with the same GPU-server hardware, using Liqid to re-assign those GPUs back and forth between day and night shifts.

Even before the multi-fabric option, Liqid supported composing NVMe SSDs and servers. So take a 1U server that might physically hold 4 SSDs; with Liqid you could assign 24, 48, or whatever number made the most sense to that 1U server for a specialized, IO-intensive activity. When that activity/app was done, you could then allocate those NVMe SSDs to other servers to support other apps.

Why compose infrastructure

The promise of composability is no more isolated/siloed/dedicated hardware in your environment. Resources like SSDs, GPUs, FPGAs and really servers themselves can be torn apart and put back together without sending out a service technician and waiting for hours while they power down your system and move hardware around. I asked Sumit how long it took to re-configure (compose) hardware into a new configuration and he said it was a matter of 20 seconds.

Sumit was at an NVIDIA show recently and said that Liqid could non-disruptively swap out GPUs. For this you would just isolate the GPU from any server and then go over to the JBoG and take the GPU out of the cabinet.

How does it work

Sumit mentioned that they have support for Optane SSDs to be used as DRAM memory (not Optane DC PM) using IMDT (Intel Memory Drive Technology). In this way you can extend your DRAM up to 6TB for a server. And with Liqid it could be concentrated on one server one minute and then spread across dozens the next.

I asked Sumit about the overhead of the fabrics that can be used with Liqid. He said that the PCIe switching may add on the order of 100 nanoseconds and the Ethernet/InfiniBand networks on the order of 10-15 microseconds, or roughly 2 orders of magnitude difference in overhead between the two fabrics.

Sumit made a point of saying that Liqid is a software company. Liqid software runs on switch hardware (currently Mellanox Ethernet/InfiniBand switches) or their PCIe switches.

But given their solution can require HBAs, JBoGs and potentially PCIe switches, there’s at least some hardware involved. But for Ethernet and InfiniBand their software runs in the Mellanox switch gear. Liqid control software has a CLI, GUI and supports an API.

Liqid supports any style of GPU (NVIDIA, AMD or ?). And as far as they were concerned, anything that could be plugged into a PCIe bus was fair game to be disaggregated and become composable.

Solutions using Liqid

Their solution is available from a number of vendors. And at last week’s DTW 2019, Liqid announced a new OEM partnership with Dell EMC. So now you can purchase composable infrastructure directly from Dell. Liqid’s route to market is through their partner ecosystem and Dell EMC is only the latest.

Sumit mentioned a number of packaged solutions, and one that sticks in my mind was an AI appliance pod solution (sold by Dell) that uses Liqid to compose a training data ingestion environment at one point, a data cleaning/engineering environment at another, an AI deep learning/model training environment after that, and then a scalable inferencing engine. Something that can conceivably do it all, an almost all-in-one AI appliance.

Sumit said that these types of solutions would be delivered in 1/4, 1/2, or full racks and, with multi-fabric, could span racks of data center infrastructure. The customer ultimately gets to configure these systems with whatever hardware they want to deploy: JBoGs, JBoFs, JBoFPGAs, JBoAIengines, etc.

The podcast runs ~42 minutes. Sumit was very knowledgeable about data center infrastructure and how composability could solve many of today’s problems. Some composability use cases he mentioned could apply to just about any data center. Ray and Sumit had a good conversation about the technology. Both Greg and I felt Liqid’s technology represented the next step in data center infrastructure evolution. Listen to the podcast to learn more.

Sumit Puri, CEO & Co-founder, Liqid, Inc.

Sumit Puri is CEO and Co-founder at Liqid. An industry veteran with over 20 years of experience, Sumit has been focused on defining the technology roadmaps for key industry leaders including Avago, SandForce, LSI, and Toshiba.

Sumit has a long history with bringing successful products to market with numerous teams and large-scale organizations.