82: GreyBeards talk composable infrastructure with Sumit Puri, CEO & Co-founder, Liqid Inc.

This is the first time we’ve had Sumit Puri, CEO & Co-founder of Liqid, on the show, but both Greg and I have talked with Liqid in the past. Given that we talked with another composable infrastructure company (see our DriveScale podcast), we thought it would be nice to hear from their competition.

We started with a brief discussion of the differences between Liqid and DriveScale. Sumit noted that DriveScale is mainly focused on storage and not as much on the other components of composable infrastructure.

[This was Greg Schulz’s (@storageIO & StorageIO.com) first time as a GreyBeard co-host, and we had some technical problems with his feed; sorry about that.]

Multi-fabric composable infrastructure

At Dell Tech World (DTW) 2019 last week, Liqid announced a new, multi-fabric composability solution. Originally, Liqid composable infrastructure only supported PCIe switching, but with their new announcement they now also support Ethernet and InfiniBand composability. In their multi-fabric solution, they offer JBoGs (just a bunch of GPUs) which can attach over Ethernet/InfiniBand, as well as other compute accelerators such as FPGAs or AI-specific compute engines.

For non-PCIe switch fabrics, Liqid adds an “HBA-like” board on the server side that converts PCIe protocols to Ethernet or InfiniBand, with another HBA-like board sitting in the JBoG.

As such, if you were a Media & Entertainment (M&E) shop, you could be doing 4K real-time editing during the day, with GPUs each assigned to a separate server running editing apps, and at night move all those GPUs to a central server where they could be used for rendering or transcoding. All with the same GPU-server hardware, using Liqid to re-assign those GPUs back and forth between day and night shifts.

Even before the multi-fabric option, Liqid supported composing NVMe SSDs and servers. So with a 1U server that might physically hold only 4 SSDs, with Liqid you could assign 24, 48, or whatever number made the most sense to that 1U server for a specialized, IO-intensive activity. When that activity/app was done, you could then allocate those NVMe SSDs to other servers to support other apps.

Why compose infrastructure

The promise of composability is no more isolated/siloed/dedicated hardware in your environment. Resources like SSDs, GPUs, FPGAs and even whole servers can be torn apart and put back together without sending out a service technician and waiting for hours while they power down your system and move hardware around. I asked Sumit how long it took to re-configure (compose) hardware into a new configuration and he said it was a matter of 20 seconds.

Sumit was at an NVIDIA show recently and said that Liqid could non-disruptively swap out GPUs. For this you would just isolate the GPU from any server and then go over to the JBoG and take the GPU out of the cabinet.

How does it work

Sumit mentioned that they have support for Optane SSDs to be used as DRAM memory (not Optane DC PM) using IMDT (Intel Memory Drive Technology). In this way you can extend your DRAM up to 6TB for a server. And with Liqid it could be concentrated on one server one minute and then spread across dozens the next.

I asked Sumit about the overhead of the fabrics that can be used with Liqid. He said that PCIe switching may add on the order of 100 nanoseconds, and Ethernet/InfiniBand networks on the order of 10-15 microseconds, roughly two orders of magnitude more overhead.
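To sanity-check that claim with the round figures Sumit quoted, a quick back-of-the-envelope comparison:

```python
# Rough fabric-overhead comparison using the figures quoted in the episode
pcie_overhead_ns = 100                       # ~100 nanoseconds through the PCIe switch
ethernet_ib_overhead_ns = (10_000, 15_000)   # 10-15 microseconds, expressed in nanoseconds

for ns in ethernet_ib_overhead_ns:
    print(f"Ethernet/InfiniBand adds ~{ns / pcie_overhead_ns:.0f}x the PCIe overhead")
# Prints ~100x and ~150x, i.e., roughly two orders of magnitude, as Sumit said
```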

Sumit made a point of saying that Liqid is a software company. Liqid software runs on switch hardware (currently Mellanox Ethernet/InfiniBand switches) or their PCIe switches.

But given their solution can require HBAs, JBoGs and potentially PCIe switches, there’s at least some hardware involved. For Ethernet and InfiniBand, though, their software runs in the Mellanox switch gear. Liqid’s control software has a CLI and a GUI, and supports an API.
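We didn’t dig into the API’s specifics on the show, so here’s a purely hypothetical Python sketch of what scripting a composition change might look like; the director URL, endpoint paths, payload fields and device/server names are illustrative placeholders, not Liqid’s documented API.

```python
# Hypothetical sketch of scripting composition against a REST-style control API.
# All endpoints and field names below are made up for illustration.
import requests

LIQID_DIRECTOR = "https://liqid-director.example.com/api"  # placeholder endpoint

def move_gpus(gpu_ids, from_server, to_server):
    """Detach a set of GPUs from one composed server and attach them to another."""
    for gpu in gpu_ids:
        requests.post(f"{LIQID_DIRECTOR}/detach",
                      json={"device": gpu, "server": from_server}).raise_for_status()
        requests.post(f"{LIQID_DIRECTOR}/attach",
                      json={"device": gpu, "server": to_server}).raise_for_status()

# e.g., the M&E day/night case: editing servers by day, a render farm by night
move_gpus(["gpu-0", "gpu-1", "gpu-2", "gpu-3"], "edit-node-07", "render-node-01")
```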

Liqid supports any style of GPU (NVIDIA, AMD, or others). And as far as they were concerned, anything that could be plugged into a PCIe bus was fair game to be disaggregated and become composable.

Solutions using Liqid

Their solution is available from a number of vendors. And at last week’s DTW 2019, Liqid announced a new OEM partnership with Dell EMC. So now you can purchase composable infrastructure directly from Dell. Liqid’s route to market is through their partner ecosystem, and Dell EMC is only the latest.

Sumit mentioned a number of packaged solutions, and one that sticks in my mind was an AI appliance pod solution (sold by Dell) that uses Liqid to compose a training data ingestion environment at one time, a data cleaning/engineering environment at another, an AI deep learning/model training environment at another, and then a scalable inferencing engine after that. Something that can conceivably do it all, an almost all-in-one AI appliance.

Sumit said that these types of solutions would be delivered in 1/4, 1/2, or full racks, and with multi-fabric could span racks of data center infrastructure. The customer ultimately gets to configure these systems with whatever hardware they want to deploy: JBoGs, JBoFs, JBoFPGAs, JBoAI engines, etc.

The podcast runs ~42 minutes. Sumit was very knowledgeable about data center infrastructure and how composability could solve many of today’s problems. Some composability use cases he mentioned could apply to just about any data center. Ray and Sumit had a good conversation about the technology. Both Greg and I felt Liqid’s technology represented the next step in data center infrastructure evolution. Listen to the podcast to learn more.

Sumit Puri, CEO & Co-founder, Liqid, Inc.

Sumit Puri is CEO and Co-founder at Liqid. An industry veteran with over 20 years of experience, Sumit has been focused on defining the technology roadmaps for key industry leaders including Avago, SandForce, LSI, and Toshiba.

Sumit has a long history with bringing successful products to market with numerous teams and large-scale organizations.

79: GreyBeards talk AI deep learning infrastructure with Frederic Van Haren, CTO & Founder, HighFens, Inc.

We’ve talked with Frederic before (see: Episode #33 on HPC storage) but since then, he has worked for an analyst firm and now he’s back on his own again, at HighFens. Given all the interest of late in AI, machine learning and deep learning, we thought it would be a great time to catch up and have him shed some light on deep learning and what it needs for IT infrastructure.

Frederic has worked on HPC / Big Data / AI / IoT solutions in the speech recognition industry, providing speech recognition services for some of the largest organizations in the world. As I understand it, the last speech recognition AI application he worked on implemented deep learning.

A brief history of AI

Frederic walked the Greybeards through the history of AI from the dawn of computing (1950s) until the recent emergence of deep learning (2010).

He explained that, early on, one could implement a chess playing program using hand-coded rules based on a chess expert’s playing technique. Later, when machine learning came out, one could use statistical analysis of multiple games and limited rule creation to teach an AI machine learning system how to play chess. With deep learning (DL), all you have to do now is feed a DL model all the games you have and it learns how to play chess well all by itself. No rule making needed.

AI DL training and deployment infrastructure

Frederic described some of the infrastructure and data needs for various phases of an industrial scale, AI DL workflow.

Training deep learning models takes data, and the more, the better. Gathering/saving the large amounts of data used for DL training is a massive write workload, and at the end of that process you hopefully have PBs of data to work with.

Selecting DL training data from all those PBs involves a lot of mixed read and write IO. In the end, you have selected and extracted the data used to train your DL models.

During DL training, IO needs are all about heavy read throughput. But there’s more: in the latter half of the talk, Frederic talked about the need to keep expensive GPU cores busy, which requires sophisticated caching or Tier 0 storage supporting low-latency IO.
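As an illustration of that point (not something we walked through on the show), here is a minimal PyTorch-style input pipeline sketch that overlaps storage reads with GPU compute so the GPU isn’t stalled waiting on IO; the dataset path and parameter values are placeholders.

```python
# Illustrative input pipeline: parallel readers, pinned buffers, and prefetching
# keep batches queued ahead of the GPU. The /tier0/train path is a placeholder
# for a low-latency cache or Tier 0 copy of the training set.
import torch
from torchvision import datasets, transforms

train_set = datasets.ImageFolder("/tier0/train", transform=transforms.ToTensor())

loader = torch.utils.data.DataLoader(
    train_set,
    batch_size=256,
    shuffle=True,
    num_workers=8,        # parallel reader processes pulling from storage
    pin_memory=True,      # page-locked host buffers for faster host-to-GPU copies
    prefetch_factor=4,    # batches queued ahead of the GPU per worker
)

for images, labels in loader:
    images = images.cuda(non_blocking=True)  # async copy overlaps the next batch's reads
    labels = labels.cuda(non_blocking=True)
    # ... forward/backward pass here ...
```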

Ray’s been doing a lot of blogging and other work on AI machine and deep learning (e.g., see Learning machine learning – parts 1, 2, & 3) so it was great to hear from Frederic, a real practitioner of the art. Frederic (with some of Ray’s help) explained the deep learning training process. But it wasn’t detailed enough for Howard, so per Howard’s request, we went deeper into how it really works.

Once you have a DL model trained and working within specifications (e.g., prediction accuracy), Frederic said deploying DL models into production involves creating two separate clusters: one devoted to deep learning model inferencing, which takes in data from the world and performs inferencing (prediction, classification, interpretation, etc.), and another that uses that information for model adaption, fine-tuning DL models for specific instances.
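As a rough illustration of that split (all names and interfaces below are hypothetical stand-ins, not Frederic’s actual stack), a minimal sketch might look like this:

```python
# Minimal sketch of the two-cluster split: an inferencing service answers requests
# and also queues each observation for a separate adaption (fine-tuning) cluster.
import json
import queue

adaption_queue = queue.Queue()  # stand-in for a message bus feeding the adaption cluster

class DummyModel:
    """Placeholder for a trained DL model exposing a predict() method."""
    def predict(self, payload):
        return {"label": "unknown", "confidence": 0.0}

def serve_inference(model, payload):
    """Return a prediction and queue the observation for later adaption."""
    prediction = model.predict(payload)
    adaption_queue.put(json.dumps({"input": payload, "prediction": prediction}))
    return prediction

# The adaption cluster would periodically drain the queue (e.g., every 4 hrs per SLA)
# and fine-tune per-instance models from the accumulated observations.
print(serve_inference(DummyModel(), {"utterance": "turn on the lights"}))
```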

Adaption and inferencing were both read and write IO workloads, and the performance of this IO depended on how a specific model was used.

Model adaption would personalize model predictions for each and every person, car, genotype, etc. This would be done periodically (based on SLAs, e.g. every 4 hrs). After that, a new, adapted model could be introduced into production, adapted for that specific person/car/genotype.

If the adaption applied more generally, that data and its human-machine validated/vetted prediction, classification, interpretation, etc. would be added back into the DL model training set to be used the next time a full model training pass was to be done. Frederic said AI DL model training is never done.

Sometime later, all this DL training, production and adaption data needs to be archived for long term access.

We then discussed the recent offerings from NVIDIA and major storage vendors that package up a solution for AI deep learning. It seems we are seeing another iteration of Converged Infrastructure, only this time for AI DL.

Finally, over the course of Ray’s AI DL education, he had come to the belief that AI deep learning could be applied by anyone. Frederic corrected Ray stating that AI deep learning should be applied by anyone.

The podcast runs ~44 minutes. Frederic’s been an old friend of Howard’s and Ray’s since before the last podcast. He’s one of the few people the GreyBeards know with real-world experience deploying AI DL at industrial scale. Frederic’s easy to talk with and very knowledgeable about the intersection of AI DL and IT infrastructure. Howard and I had fun talking with him again on this episode. Listen to the podcast to learn more.

Frederic Van Haren

Frederic Van Haren is the Chief Technology Officer @ HighFens. He has over 20 years of experience in high tech and is known for his insights in HPC, Big Data and AI from his hands-on experience leading research and development teams. He has provided technical leadership and strategic direction in the Telecom and Speech markets.

He spent more than a decade at Nuance Communications building large HPC and AI environments from the ground up and is frequently invited to speak at events to provide his vision on the HPC, AI, and storage markets. Frederic has also served as the president of a variety of technology user groups promoting the use of innovative technology.

As an engineer, he enjoys working directly with engineering teams from technology vendors and on challenging customer projects.

Frederic lives in Massachusetts, USA, but grew up in the northern part of Belgium, where he received his Masters in Electrical Engineering, Electronics and Automation.