161: Greybeards talk AWS S3 storage with Andy Warfield, VP Distinguished Engineer, Amazon

We talked with Andy Warfield (@AndyWarfield), VP Distinguished Engineer, Amazon, about 10 years ago, when he was at Coho Data (see our (005:) Greybeards talk scale out storage … podcast). Andy has been a good friend for a long time, and he’s been with Amazon S3 for over 5 years now. Given the recent S3 announcements at AWS re:Invent, we thought it a good time to have him back on the show. Andy has a great knack for explaining technology. I suppose that comes from his time as a professor, but whatever the reason, he was great to have on the show again.

Lately, Andy’s been working on S3 Express One Zone storage, announced last November, a new version of S3 object storage with lower response times. We talked about this later in the podcast, but first we touched on S3’s history and other advances. S3 and its ancillary services have advanced considerably over the years. Listen to the podcast to learn more.

S3 is ~18 years old now and was one of the first AWS offerings. It was originally intended to be the internet’s file system, which is why it was based on HTTP protocols.
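
To make that concrete, here’s a minimal sketch showing that an S3 object is addressed with ordinary HTTP(S). The bucket and key below are hypothetical; a private bucket would also require AWS SigV4 authentication headers (which SDKs like boto3 add for you).

```python
# Minimal sketch: an S3 GetObject is just an HTTP GET against a bucket URL.
# Bucket and key are hypothetical examples.
import requests

url = "https://example-bucket.s3.amazonaws.com/path/to/object.txt"
resp = requests.get(url)      # a plain HTTP GET is an S3 GetObject
print(resp.status_code)       # 200 for a public object, 403 if credentials are required
```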

Andy said that S3 was designed for 11-9s durability and offers high availability options. AWS constantly monitors server and storage failures/performance to ensure that it can maintain this level of durability. The problem with durability is that when a drive/server goes down, the data needs to be rebuilt onto another drive before yet another drive fails. One way to do this is to keep more replicas of the data. Another way is to speed up rebuild times. I’m sure AWS does both.
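
A back-of-the-envelope toy model (not AWS’s actual durability math, and with made-up failure rates) shows why both levers matter: more replicas and faster rebuilds both shrink the window in which all remaining copies could be lost.

```python
# Toy durability model: data is lost only if every remaining replica fails
# before the rebuild of the first failed copy completes. Numbers are illustrative.
AFR = 0.02              # assumed annual failure rate of a single drive (2%)
HOURS_PER_YEAR = 8760

def p_loss(replicas: int, rebuild_hours: float) -> float:
    """Rough probability that all other replicas also fail within the rebuild window."""
    p_fail_in_window = AFR * (rebuild_hours / HOURS_PER_YEAR)
    return p_fail_in_window ** (replicas - 1)

for replicas in (2, 3):
    for rebuild_hours in (24, 6, 1):
        print(f"{replicas} replicas, {rebuild_hours:>2}h rebuild: "
              f"loss probability ~ {p_loss(replicas, rebuild_hours):.2e}")
```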

S3 high availability requires replicas across availability zones (AZs). AWS availability zone data centers are carefully located so that they are power- and network-isolated from other data centers in the region. Further, AZ site locations are deliberately selected with an eye towards ensuring they are not susceptible to the same physical disasters.

Andy discussed other AWS file data services such as their FSx systems (Amazon FSx for Lustre, for OpenZFS, for Windows File Server, & for NetApp ONTAP) as well as Elastic File System (EFS). Andy said they sped up one of these FSx services by 3-5X over the last year.

Andy mentioned that one of the guiding principles for a lot of AWS storage is to try to eliminate hard decisions for enterprise developers. By offering FSx file services, S3 objects, and their other storage and data services, customers already using similar systems in house can just migrate apps to AWS without having to modify code.

Andy said one thing that struck him as he came on the S3 team was the careful deliberation that occurred whenever they considered S3 API changes. He said the team is focused on the long term future of S3 and any API changes go through a long and deliberate review before implementation.

One workload that drove early S3 adoption was data analytics. Hadoop and BigTable have significant data requirements. Early on, someone wrote an HDFS interface to S3, and over time lots of data analytics activity moved to S3 object-hosted data.

Databases have also changed over the last decade or so. Keith mentioned that many customers are foregoing traditional databases in favor of open source database solutions with S3 as their backend storage. It turns out that Open Table Format offerings such as Apache Iceberg, Apache Hudi and Delta Lake are all available on AWS and use S3 objects as their storage.
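
As a hedged sketch of what that looks like in practice, here’s an Apache Iceberg table whose data and metadata live in S3, created through Spark. The catalog, bucket and table names are hypothetical, and the Iceberg Spark runtime jar plus AWS credentials are assumed to be available.

```python
# Sketch: an Iceberg table backed by S3 objects (names/paths are hypothetical).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-on-s3")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://example-lakehouse-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 19.99)")
spark.sql("SELECT * FROM demo.db.orders").show()
```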

We talked a bit about Lambda serverless processing triggered by S3 objects. This was a new paradigm for computing when it came out, and many customers have adopted Lambda to reduce cloud compute spend.
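
For readers who haven’t tried it, here’s a hedged sketch of an S3-triggered Lambda handler that simply logs each newly created object. The handler name and what you do with the object are up to you; the bucket and key come from the event S3 delivers.

```python
# Sketch of a Lambda handler invoked by an S3 "object created" event.
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        obj = s3.get_object(Bucket=bucket, Key=key)
        print(f"New object s3://{bucket}/{key}, {obj['ContentLength']} bytes")
    return {"statusCode": 200, "body": json.dumps("ok")}
```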

Recently Amazon introduced Mountpoint for Amazon S3, an open source file client that lets customers mount an S3 bucket and access it through standard file system calls, which Mountpoint translates into S3 API requests.

Amazon also supports the Registry of Open Data on AWS, which hosts many of the canonical public data sets (stored as S3 objects) used for AI training.

At the last re:Invent, Amazon announced S3 Express One Zone, which is a high performance, low latency version of S3 storage. The goal for S3 Express was to get latency down from 40-60 msec to under 10 msec.

They ended up making a number of changes to S3 such as:

  • Redesigned/redeveloped some S3 micro services to reduce latency
  • Restricted S3 Express storage to a single availability zone, reducing replication requirements, while maintaining 11-9s durability
  • Used higher performing storage
  • Redesigned the S3 API to move some authentication/verification work to the start of a session, rather than performing it on every object access call (see the sketch below).
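
Here’s a hedged sketch of what using S3 Express One Zone looks like from code: objects go into a “directory bucket” tied to a single availability zone, and a recent boto3 handles the session-based authentication behind the scenes. The bucket name below is hypothetical, though it follows the directory-bucket naming convention (the suffix embeds an AZ ID).

```python
# Sketch: put/get against a hypothetical S3 Express One Zone directory bucket.
import boto3

s3 = boto3.client("s3")
bucket = "example-express--use1-az4--x-s3"   # hypothetical directory bucket name

s3.put_object(Bucket=bucket, Key="hot/data.bin", Body=b"low-latency bytes")
obj = s3.get_object(Bucket=bucket, Key="hot/data.bin")
print(obj["Body"].read())
```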

Somewhere during our talk, Andy said that, in aggregate, S3 is providing 100TBytes/sec of data bandwidth. How’s that for scale-out storage?

Andy Warfield, VP Distinguished Engineer, Amazon

Andy is a Vice President and Distinguished Engineer in Amazon Web Services. He focuses primarily on data storage and analytics.

Andy holds a PhD from the University of Cambridge, where he was one of the authors of the Xen hypervisor. Xen is an open source hypervisor that was used as the initial virtualization layer in AWS, among multiple other early cloud companies. Andy was a founder at XenSource, a startup based on Xen that was subsequently acquired by Citrix Systems for $500M.

Following XenSource, Andy was a professor at the University of British Columbia (UBC), where he was awarded a Canada Research Chair and a Sloan Research Fellowship. As a professor, Andy did systems research in areas including operating systems, networking, security, and storage.

Andy’s second startup, Coho Data, built a scale-out enterprise storage array that integrated NVMe SSDs with programmable networks. It raised over $80M in funding from VCs including Andreessen Horowitz, Intel Capital, and Ignition Partners.

159: GreyBeards Year End 2023 Wrap Up

Jason and Keith joined Ray for our annual year end wrap up and look ahead to 2024. I planned to discuss infrastructure technical topics but was overruled. Once we started talking AI, we couldn’t stop.

It’s hard to believe that Generative AI, and ChatGPT in particular, haven’t been around that long. We discussed some practical uses Keith and Jason have made of the technology.

Keith mentioned its primary skill is language expertise. He has used it to help write proposals. He often struggles to convince CTO Advisor non-sponsors of the value they can bring, and found that using GenAI has helped him make that case better.

Jason mentioned he uses it to create BASH, Perl, and PowerShell scripts. He says it’s not perfect, but it can get ~80% of the way there and, with a few tweaks, he has something working a lot faster than if he had to write it completely by hand. He also mentioned its skill in translating from one scripting language to another, and how well the code it generates is documented (that hurt).

I was the odd GreyBeard out, having not used any GenAI, proprietary or otherwise. I’m still working to get a reinforcement learning task to work well and consistently. I figured once I mastered that, I’d train an LLM on my body of (text and code) work (assuming, of course, someone gifts me a gang of GPUs).

I agreed GenAI is good at (English) language and some coding tasks (where lots of source code exists, such as Java, scripting languages, Python, etc.).

However, I was on an MLOps Slack channel and someone asked if GenAI could help with IBM RPG II code. I answered, probably not. There’s just not a lot of RPG II code publicly accessible on the web, and the structure of RPG was never line-of-text/command oriented.

We had some heated discussion on where LLMs get the data to train with. Keith was fine with them using his data. I was not. Jason was neutral.

We then turned to what this means to the white collar workers who are coding and writing text. Keith made the point that this has been a concern throughout history, at least since the industrial revolution.

Machines come along, displace work that was done by hand, increase production immensely, and reduce costs. Organizations benefit, but the people doing those jobs need to up-level their skills to take advantage of the new capabilities.

That’s easy for us to say, as we (except for Jason, in his present job) are essentially entrepreneurs, and anything that helps us deliver more value faster, easier, or less expensively is a boon for our businesses.

Jason mentioned that Stephen Wolfram wrote a great blog post discussing LLM technology (see What is ChatGPT doing … and why does it work). Both Jason and Keith thought it did a great job of explaining the science and practice behind LLMs.

We moved on to a topic that’s harder to discuss but of great relevance to our listeners: GenAI’s impact on the enterprise.

It reminds me of when the cloud first became prominent. Back then, “C” suites tasked their staff to adopt “the cloud” any way they could. Today, “C” suites are tasking their staff to determine what their “AI strategy” is and when it will be implemented.

Keith mentioned that this is wrong-headed. The true path forward (for the enterprise) is to focus on what the business problems are and how (Gen)AI can address (some of) them.

AI is so varied, and its capabilities across so many fields are so good nowadays, that organizations should really look at AI as a new facility that can recognize patterns; index, analyze, and transform images; summarize, understand, and transform text and code; etc., in near real time, and see where in the enterprise that could help.

We talked about how enterprises can size AI infrastructure needed to perform these activities. And it’s more than just a gaggle of GPUs.

MLCommons’ MLPerf benchmarks can help show the way for some cases, but they are not exhaustive. Still, it’s a start.

The consensus was to maybe deploy in the cloud first and, once the workload is dialed in there, re-home it later, with the proviso that the needed hardware is available.

Our final topic was the Broadcom VMware acquisition. Keith mentioned that their recent subscription pricing announcements vastly simplified VMware licensing, which had grown way too complex over the decades.

And although everyone hates the expense of VMware solutions, they often forget the real value VMware brings to enterprise IT.

Yes, hyperscalers and their clutch of coders can roll their own hypervisor services stacks using open source virtualization. But the enterprise has other needs for its developers. And the value of VMware virtualization services, now that 128-core CPUs are out, is even higher.

We mentioned the need for hybrid cloud and how VCF can get you part of the way there. Keith said that dev teams really want something like “AWS software” services running on GCP or Azure.

Keith mentioned that IBM Cloud is the closest he’s seen so far to doing what Dev wants in a hybrid cloud.

We all thought that when DNNs came out and became trainable, and reinforcement learning started working well, AI had turned a real corner. Turns out, that was just a start. GenAI has taken DNNs to a whole other level, and DeepMind and others are doing the same with reinforcement learning.

This time AI may actually help advance mankind, if it doesn’t kill us first. On the latter topic, you may want to check out my RayOnStorage AGI series of blog posts (latest … AGI part-8).

Jason Collier, Principal Member Of Technical Staff at AMD, Data Center and Embedded Solutions Business Group

Jason Collier (@bocanuts) is a long time friend, technical guru and innovator who has over 25 years of experience as a serial entrepreneur in technology.

He was founder and CTO of Scale Computing and has been an innovator in the field of hyperconvergence and an expert in virtualization, data storage, networking, cloud computing, data centers, and edge computing for years.

He’s on LinkedIn. He’s currently working with AMD on new technology, and he has been a GreyBeards on Storage co-host since the beginning of 2022.

Keith Townsend, President of The CTO Advisor a Futurum Group Company

Keith Townsend (@CTOAdvisor) is an IT thought leader who has written articles for many industry publications, interviewed many industry heavyweights, worked with Silicon Valley startups, and engineered cloud infrastructure for large government organizations.

Keith is the co-founder of The CTO Advisor, blogs at Virtualized Geek, and can be found on LinkedIn.

155: GreyBeards SDC23 wrap up podcast with Dr. J Metz, Technical Dir. of Systems Design AMD and Chair of SNIA BoD

Dr. J Metz (@drjmetz, blog), Technical Director of Systems Design at AMD and Chair of SNIA BoD, has been on our show before discussing SNIA research directions. We decided this year to add an annual podcast to discuss highlights from their Storage Developers Conference 2023 (SDC23).

Dr. J is working at AMD to help raise their view from a pure components perspective to a systems perspective. At SNIA, on the other hand, we can see them moving beyond just storage interface technology into memory (of all things) and, for the really long term, storage archive technologies.

SDC is SNIA’s main annual conference, which brings storage developers together with storage users to discuss all the technologies underpinning storing the data we all care so much about. Listen to the podcast to learn more.

SNIA is trying to get their hands around trends impacting the IT industry today. These days, storage, compute and networking are all starting to morph into one another and the boundary lines, always tenuous at best, seem to be disappearing.

Aside from the industry standards work SNIA has always been known for, they are also deeply involved in education. One of their more popular artifacts is the SNIA Dictionary (recently moved online only), which provides definitions for probably over 1,000 storage terms. But SDC also has a lot of tutorials and other educational sessions worthy of time and effort. And all SDC sessions will be available online at some point. (Update 10/25/23: they are all available now at the Sessions | SDC 2023 website.)

SNIA also presented at SFD26 while SDC23 was going on. At SFD26, SNIA discussed DNA data storage (a recent technical affiliate) and the new Smart Data Accelerator Interface (SDXI), a software-defined interface for performing memory-to-memory DMA.

First up, DNA storage. The DNA team said that they are pretty much able to store and access GBs of DNA data storage today without breaking a sweat, and are starting to consider how to scale that up to TBs of DNA storage. We’ve discussed DNA data storage before on GBoS podcasts (see: 108: GreyBeards talk DNA storage…).

The talk at SFD26 was pretty detailed. It turns out the DNA data storage team has had to re-invent a lot of standard storage technologies (catalogs/indexes, metadata, ECC, etc.) in order to support a DNA data soup of unstructured data.

For example, ECC for DNA segments (snippets) would be needed to correctly store and retrieve DNA data segments, and these segments could potentially be replicated 1000s of times in a DNA storage cell. And all DNA data segments would be tagged with file-oriented metadata indicating (segment) address within the file, file name or identifier, date created, etc.
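
Purely as an illustration (this is not from any SNIA spec or the DNA team’s design), the kind of per-segment record such a system might keep could look something like this:

```python
# Illustrative only: a hypothetical per-segment record for a DNA data store,
# so that replicated, error-corrected snippets can be reassembled into files.
from dataclasses import dataclass

@dataclass
class DnaSegmentRecord:
    file_id: str          # which logical file this segment belongs to
    segment_index: int    # segment address within that file
    payload: bytes        # encoded data carried by this snippet
    ecc: bytes            # error-correction bits protecting the payload
    created: str          # ISO-8601 creation date
    copies: int           # how many physical replicas were synthesized

rec = DnaSegmentRecord("report.pdf", 42, b"ACGT...", b"\x0f\xa2", "2023-09-20", 1000)
print(rec)
```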

As far as what an application for DNA storage would look like, Dr. J mentioned write once and read VERY infrequently. It turns out that while making 1000s of copies of DNA data segments is straightforward, inexpensive and trivial, reading them is another matter entirely. And as was discussed at SFD26, reading DNA storage, as presently conceived, is destructive. (So maybe having lots of copies is a good and necessary idea.)

But the DNA gurus really have to come up with methods for indexing, searching, and writing/reading data quickly. Today’s disks have file systems that are self-describing. If you hand someone an HDD, it’s fairly straightforward to read information off of it and determine the file system used to create it. These days, with LTFS, the same could be said for LTO tape.

DNA is intended to be used to store data for 1000s of years. Researchers have retrieved intact DNA from a number of organisms that are over 50K years old. Retaining applications that can access, format and process data after 1000 years is yet another serious problem someone will need to solve.

Next up was SDXI, a software-defined DMA solution that any application can use to move data from one memory to another without having to resort to 20 abstraction layers to do it. SDXI is just about moving data between memory banks.
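
To give a feel for the descriptor-style model behind this kind of interface, here’s a deliberately simplified, hypothetical sketch; the names and fields are invented for illustration and do not reflect the actual SDXI specification.

```python
# Hypothetical toy model of a descriptor-based memory-to-memory mover
# (illustrative only; not the SDXI spec).
from dataclasses import dataclass

@dataclass
class CopyDescriptor:
    src: bytearray   # source buffer ("memory bank" A)
    dst: bytearray   # destination buffer ("memory bank" B)
    length: int      # bytes to move

def submit(queue: list, desc: CopyDescriptor) -> None:
    queue.append(desc)                        # post to a submission queue

def process(queue: list) -> None:
    while queue:
        d = queue.pop(0)
        d.dst[:d.length] = d.src[:d.length]   # the "DMA engine" does the copy

src, dst = bytearray(b"hello sdxi"), bytearray(10)
work_queue: list = []
submit(work_queue, CopyDescriptor(src, dst, 10))
process(work_queue)
print(dst.decode())                           # -> "hello sdxi"
```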

Today, this is all within one system/server, but as CXL matures and more and more hardware starts supporting CXL 2.0 and 3.0, shared memory between servers will become more pervasive, all over a CXL memory interface.

Keith tried bringing it home to moving data between containers or VMs, which is possible today within the same memory and, sometime in the future, between shared memory and local memory.

Memory-to-memory transfers have to be done securely. It’s not like accessing memory from some other process hasn’t been fraught with security exposures in the past. And Dr. J assured me that SDXI was built from the ground up with security considerations front and center.

To bring it all back home, SNIA has always been, and always will be, concerned with data, whether that data resides in storage, in memory or, God forbid, in transit somewhere over a network. Keith went as far as to say that the network was storage; I felt that was a step too far.

Dr. J Metz, Technical Director of Systems Design at AMD, Chair of SNIA BoD

J is the Chair of SNIA’s (Storage Networking Industry Association) Board of Directors and Technical Director for Systems Design at AMD, where he works to coordinate and lead strategy on various industry initiatives related to systems architecture. Recognized as a leading storage networking expert, J is an evangelist for all storage-related technology and has a unique ability to dissect and explain complex concepts and strategies. He is passionate about the inner workings and application of emerging technologies.

J has previously held roles in both startups and Fortune 100 companies as a Field CTO,  R&D Engineer, Solutions Architect, and Systems Engineer. He has been a leader in several key industry standards groups, sitting on the Board of Directors for the SNIA, Fibre Channel Industry Association (FCIA), and Non-Volatile Memory Express (NVMe). A popular blogger and active on Twitter, his areas of expertise include NVMe, SANs, Fibre Channel, and computational storage.

J is an entertaining presenter and prolific writer. He has won multiple awards as a speaker and author, writing over 300 articles and giving presentations and webinars attended by over 10,000 people. He earned his PhD from the University of Georgia.

152: GreyBeards talk agent-less data security with Jonathan Halstuch, Co-Founder & CTO, RackTop Systems

Sponsored By:

Once again we return to our ongoing series with RackTop Systems and their Co-Founder & CTO, Jonathan Halstuch (@JAHGT). This time we discuss how agent-less, storage-based security works and how it can help secure organizations with (IoT) end points they may not control or can’t deploy agents on. But agent-less security can also help organizations that do have security agents deployed across their end points. Listen to the podcast to learn more.

The challenge for enterprises with agent-based security is that not all end points support agents. Jonathan mentioned one health care customer with an older electron microscope that couldn’t be modified. These older, outdated systems are often targeted by cyber criminals because they are seldom updated.

But even the newest IoT devices often can’t be modified by the organizations that use them. Agent-less, storage-based security can be a final line of defense for any environment with IoT devices deployed.

But security exposures go beyond IoT devices. Agents can take manual effort to deploy and update, and as such they are sometimes left un-deployed or improperly configured.

The advantage of a storage-based, agent-less security approach is that it’s always on/always present, because it sits in the middle of the data path and is updated by the storage company, where possible. Granted, not every organization allows this, and for those organizations storage agent updates will also require manual effort.

Jonathan mentioned the term Data Firewall. I (a networking novice, at best) have always felt firewalls were a configuration nightmare.

But as we’ve discussed previously in our series, RackTop has a “learning” mode and an “active” mode. During learning, the system automatically configures application/user IO assessors to characterize normal IO activity. Once learning has completed, the RackTop Systems in the environment understand what sorts of IO to expect from users/applications and can then flag anything outside normal IO patterns.
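
As a purely illustrative toy (not RackTop’s implementation), the basic idea of learning a baseline and then flagging deviations might look like this:

```python
# Toy baseline-and-flag anomaly check (illustrative only).
from collections import defaultdict
from statistics import mean, stdev

class IoBaseline:
    def __init__(self):
        self.samples = defaultdict(list)     # workload -> observed ops/sec samples

    def learn(self, workload: str, ops_per_sec: float) -> None:
        self.samples[workload].append(ops_per_sec)

    def is_anomalous(self, workload: str, ops_per_sec: float, k: float = 3.0) -> bool:
        obs = self.samples[workload]
        if len(obs) < 2:
            return False                     # not enough history to judge yet
        mu, sigma = mean(obs), stdev(obs)
        return abs(ops_per_sec - mu) > k * max(sigma, 1e-9)

baseline = IoBaseline()
for rate in (110, 95, 102, 99, 105):         # "learning" mode observations
    baseline.learn("app-server-1", rate)
print(baseline.is_anomalous("app-server-1", 5_000))   # ransomware-like burst -> True
```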

But even during “learning” mode, the system is actively monitoring for known malware signatures and other previously characterized bad actor IO. These assessors are always active.

Keith mentioned that most organizations run special jobs on occasion (quarterly, yearly) which might not have been characterized during learning. Jonathan said these will be flagged and may be halted (depending on RackTop’s configuration). But authorized parties can easily approve that application’s IO activity, using a web link provided in the storage security alert.

Once alerted, authorized personnel can allow that IO activity for a specific time period (say Dec-Jan), or just for a one-time event. When the time period expires, that sort of IO will be flagged again.

Some sophisticated customers have change control and may know, ahead of time, that end-of-quarter or end-of-year processing is coming up. If so, they can easily configure RackTop Systems, ahead of time, to authorize the application’s IO activity. In this case there wouldn’t be any interruption to the application.
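
A hedged sketch (not RackTop’s actual policy engine) of what a time-bounded approval boils down to: the flagged workload is allowed only within its window, after which it is flagged again.

```python
# Illustrative time-bounded approval check (hypothetical, simplified).
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimeBoundedApproval:
    workload: str
    start: datetime
    end: datetime

    def allows(self, workload: str, when: datetime) -> bool:
        return workload == self.workload and self.start <= when <= self.end

approval = TimeBoundedApproval("year-end-close", datetime(2023, 12, 1), datetime(2024, 1, 31))
print(approval.allows("year-end-close", datetime(2023, 12, 28)))  # True: inside the window
print(approval.allows("year-end-close", datetime(2024, 2, 2)))    # False: flagged again
```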

With RackTop Systems, the security agents are centrally located in the data path and are always operating. This has no dependency on your backend storage (SAN, cloud, hybrid storage, etc.) or on any end point. If anything in your environment accesses data, those RackTop System assessors will be active, checking IO activity and securing your data.

Jonathan Halstuch, Co-Founder and CTO, RackTop Systems

Jonathan Halstuch is the Chief Technology Officer and co-founder of RackTop Systems. He holds a bachelor’s degree in computer engineering from Georgia Tech as well as a master’s degree in engineering and technology management from George Washington University.

With over 20 years of experience as an engineer, technologist, and manager for the federal government, he provides organizations with the most efficient and secure data management solutions to accelerate operations while reducing the burden on admins, users, and executives.