135: GreyBeards talk file and object challenges with Theresa Miller & David Jayanathan, Cohesity

Sponsored By:

I’ve known Theresa Miller, Director of the Technology Advocacy Group at Cohesity, for many years now and just met David Jayanathan (DJ), Cohesity Solutions Architect, during the podcast. Theresa could easily qualify as an old timer, if she wished, and DJ was very knowledgeable about traditional file and object storage.

We had a wide ranging discussion covering many of the challenges present in today’s file and object storage solutions. Listen to the podcast to learn more.

IT is becoming more distributed. This is partly due to the move to the cloud, but now it’s a move to multiple clouds, and on prem has never really gone away. Further, the need for IT to support a remote workforce is forcing data, and the systems that use it, to move as well.

Customers need storage that can reside anywhere. Their data must be able to migrate from on prem to cloud(s) and back again. Traditional storage may be able to migrate from one location to a select few others, or replicate to another location (with the same storage systems present at both ends), but migration to and from the cloud is just not easy enough.

Moreover, traditional storage management has not kept up with this widely dispersed data world we live in. With traditional storage, customers may require different products to manage their storage depending on where data resides.

Yes, having storage that performs well and provides data access, resilience, and integrity is important, but that alone is just not enough anymore.

And to top that all off, the issues surrounding data security today have become just too complex for traditional storage to solve alone anymore. One needs storage, data protection, and ransomware scanning/detection/protection that operate together, as one solution, to deal with IT security in today’s world.

Ransomware has rapidly become the critical piece of this storage puzzle needing to be addressed. It’s a significant burden on every IT organization today. Some groups are getting hit every day, while others are hit even more frequently. Traditional storage has very limited capabilities, outside of snapshots and replication, to deal with this ever-increasing threat.

To defeat ransomware, data needs to be vaulted to an immutable, air-gapped repository, whether that be in the cloud or elsewhere. Such vaulting needs to be policy driven and integrated with data protection cycles to be recoverable.

Furthermore, any ransomware recovery needs to be quick, easy, AND securely controlled. RBAC (role-based access control) can help but may not suffice for some organizations. For these environments, multiple admins may need to approve a ransomware recovery, which will wipe out all current data by restoring a good, vaulted copy of the organization’s data.

Edge and IoT systems also need data storage. How much may depend on where the data is being processed/pre-processed in the IoT system. But, as these systems mature, they will have their own storage requirements, which is yet another data location to be managed, protected, and secured.

Theresa and DJ mentioned Cohesity SmartFiles during our talk, which I hadn’t heard about. Turns out that SmartFiles is Cohesity’s file and object storage solution that uses the Cohesity storage cluster. Cohesity data protection and other data management solutions also use the cluster to store their data. Adding SmartFiles to the mix brings a more complete storage solution to support customer data needs.

We also discussed Helios, Cohesity’s next generation data platform that provides a control and management plane for all Cohesity products and services.

Theresa Miller, Director, Technology Advocacy Group, Cohesity

Theresa Miller is the Director, Technology Advocacy Group at Cohesity. She is an IT professional who has worked as a technical expert in IT for over 25 years and has her MBA.

She is uniquely industry recognized as a Microsoft MVP, Citrix CTP, and VMware vExpert.  Her areas of expertise include Cloud, Hybrid-cloud, Microsoft 365, VMware, and Citrix.

David Jayanathan, Solutions Architect, Cohesity

David Jayanathan is a Solutions Architect at Cohesity, currently working on SmartFiles. 

DJ is an IT professional who has specialized in all things related to enterprise storage and data protection for over 15 years.

128: GreyBeards talk containers, K8s, and object storage with AB Periasamy, Co-Founder & CEO, MinIO

Sponsored by:

Once again Keith and I are talking K8s storage, only this time it was object storage. Anand Babu (AB) Periasamy, Co-founder and CEO of MinIO, has been on our show a couple of times now and it’s always an insightful discussion. He’s got an uncommon perspective on IT today and what needs to change.

Although MinIO is an open source, uber-compatible, S3 object store, AB more often talks like a revolutionary, touting the benefits of containerization, scale and automation with K8s. Object storage is just one of the vehicles to help get there. Listen to the podcast to learn more.

We started our discussion on the changing role of object storage in applications. Object storage started out as an archive solution. But then, over time, something happened: modern database startups adopted object storage to hold primary data, then analytics moved over to objects in a big way, and finally AI/ML arrived with an unquenchable thirst for data, and object storage was its only salvation.

Keith questioned the use of objects in analytics. Both AB and I pointed out that Splunk (and Spark) fully support objects. But Keith said R (and Python) data scientists prefer to use the protocols they learned in school, and these are all about files (CSV, JPEG, JSON). AB said what usually happens is that the data is stored as objects and then downloaded onto local disk as files to be processed. That’s not to say that R or Python can’t process objects directly, but even when they don’t, the ultimate source of data truth is object storage.
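To make that pattern concrete, here is a minimal Python sketch of what AB described, with a hypothetical bucket, key, and local file name: the object store remains the source of truth, while the data scientist pulls a copy down to local disk and processes it with familiar file-based tooling (boto3 and pandas here).

    # Minimal sketch: object storage is the source of truth, but the data
    # is copied to a local CSV file before analysis. Bucket, key, and file
    # names are hypothetical placeholders.
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")  # credentials/region come from standard AWS config

    # Download the object to local disk, then process it as an ordinary file
    s3.download_file("analytics-bucket", "datasets/sales.csv", "/tmp/sales.csv")
    df = pd.read_csv("/tmp/sales.csv")
    print(df.describe())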

Somehow, we got onto the multi-cloud question. AB said multi-cloud is really all about containers and K8s. When customers talk multi-cloud, what they really mean is they want applications that can run anywhere: in any cloud, on prem, or anyplace else for that matter.

I thought multi-cloud was a DR solution. But AB reiterated it’s more a solution to vendor lock-in. What containerization gives IT is the option (ability) to run applications anywhere, but IT is not obligated to exercise that option unless it makes sense.

AB said that dev teams today don’t develop apps in the cloud anymore. They develop locally using minikube; once it’s working there, they add CI/CD tool chains and then move it to its final resting place (the cloud or wherever it ultimately needs to run). It turns out containers, YAML files, scripts, etc. are small and trivial to upload, migrate, or move to any internet location. And with ubiquitous K8s support available everywhere, they can move anywhere unchanged.

But where’s the data? AB said anywhere the app executes. It’s never moved; it takes too much time and effort to move that amount of data. But as applications move, any data they generate grows in that location over time.

We next turned to how MinIO is supported in K8s. AB mentioned they have a DirectPV CSI driver that creates distributed PVs on local disks to support MinIO services. In this way, containers needing access to MinIO S3 object storage can allocate data directly on that storage.

Then we asked about opinionated stacks. AB said most customers don’t want these. They may have some value in preserving an infrastructure environment, but customers are better off transitioning to containerization and building any stack within those containers and the K8s cluster services.

On the other hand, MinIO object storage is available with the same S3 API on bare metal, on VMware, OpenShift, K8s, every public cloud, and most private clouds as well. The advantage of a single storage interface, available everywhere, can’t be beat.
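As a hedged illustration of that single-interface point, the boto3 sketch below uses a placeholder endpoint and credentials: the application code is identical whether it points at AWS S3, a MinIO server on bare metal, or a MinIO tenant running under K8s; only the endpoint URL and credentials change.

    # Sketch of "one S3 API everywhere": the same client code works against any
    # S3-compatible endpoint. Endpoint and credentials below are hypothetical.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio.example.internal:9000",  # swap for any S3-compatible endpoint
        aws_access_key_id="MINIO_ACCESS_KEY",
        aws_secret_access_key="MINIO_SECRET_KEY",
    )

    s3.put_object(Bucket="app-data", Key="logs/2022-01-01.json", Body=b"{}")
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])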

MinIO recently closed a new funding round of $103M. AB mentioned they had new investments from Intel and SoftBank, but I was more interested in his plans for the new cash. And Keith asked where the new funding left MinIO with respect to its competitors in this space.

AB said it was never about the money; it was more about what you did with your team that mattered in the long run. AB’s imperative was to enter an existing market with a better product and succeed with that. Creating a new market plus a new product always costs more, takes longer, and is riskier.

As for the new funds, there are really two ways to go: 1) improve the current product or 2) create a new one. My sense is that AB leans towards improving the current product.

For instance, MinIO is often asked to support a different object storage API. But AB’s perspective is that S3 was an early bet that paid off well by becoming the de facto standard for object storage. Supporting another API would divide his resources and probably make their current product worse, not better. AB mentioned they are getting 1.1M downloads of their Docker container version, so they seem to be succeeding well with the current product.

Anand Babu (AB) Periasamy, Co-founder and CEO

AB Periasamy is the co-founder and CEO of MinIO, an open-source provider of high performance, object storage software. In addition to this role, AB is an active investor and advisor to a wide range of technology companies, from H2O.ai and Manetu where he serves on the board to advisor or investor roles with Humio, Isovalent, Starburst, Yugabyte, Tetrate, Postman, Storj, Procurify, and Helpshift. Successful exits include Gitter.im (Gitlab), Treasure Data (ARM) and Fastor (SMART).

AB co-founded Gluster in 2005 to commoditize scalable storage systems. As CTO, he was the primary architect and strategist for the development of the Gluster file system, a pioneer in software defined storage. After the company was acquired by Red Hat in 2011, AB joined Red Hat’s Office of the CTO. Prior to Gluster, AB was CTO of California Digital Corporation, where his work led to scaling commodity cluster computing to supercomputing-class performance. His work there resulted in the development of Lawrence Livermore National Laboratory’s “Thunder” supercomputer, which, at the time, was the second fastest in the world.

AB holds a Computer Science Engineering degree from Annamalai University, Tamil Nadu, India.

124: GreyBeards talk k8s storage orchestration using CNCF Rook Project with Sébastien Han & Travis Nielsen, Red Hat

Stateful containers are becoming a hot topic these days so we thought it a good time to talk to the CNCF (Cloud Native Computing Foundation) Rook team about what they are doing to make storage easier to use for k8s container apps. CNCF put us into contact with Sébastien Han (@leseb_), Ceph Storage Architect, and Travis Nielsen (@STravisNielsen), both Principal Software Engineers at Red Hat and active on the Rook project. Rook is a CNCF “graduated” open source project, just like Kubernetes, Prometheus, containerd, etc., which means it’s mature enough to run production workloads.

Rook is used to configure, deploy, and manage a Red Hat Ceph® Storage cluster under k8s. Rook creates all the k8s deployment scripts needed to set up a Ceph Storage cluster as containers, start it, and monitor its activities. Rook’s monitoring of Ceph operations can restart any Ceph service container or scale any Ceph service up/down as needed by the container apps using its storage. Rook is not in the Ceph data path; rather, it provides a k8s-based Ceph control or management plane for running Ceph storage under k8s.

Readers may recall we talked to SoftIron, an appliance provider for Ceph Storage in the enterprise, on our 120th episode. Rook has another take on using Ceph storage, only this time running it under k8s. Listen to the podcast to learn more.

The main problem Rook is solving is how to easily incorporate storage services and stateful container apps within k8s control. Containerized apps can scale up or down based on activity, and the storage these apps use needs the same capabilities. The other option is to have storage that stands apart or outside the k8s cluster and its control. But then the container apps and their storage have 2 (maybe more) different control environments. Better to have everything under k8s control or nothing at all.

Red Hat Ceph storage has been available as a standalone storage solution for a long time now and has quite the extensive customer list, many with multiple PB of storage. Rook-Ceph and all of its components run as containers underneath k8s.

Ceph supports replication (mirroring) of data 1 to N ways, typically 3-way, or erasure coding for data protection, and also supports file, block, and object protocols or access methods. Ceph normally consumes raw block DAS for its backend, but Ceph can also support a file gateway to NFS storage behind it. Similarly, Ceph can offer an object storage gateway option. But with either of these approaches, the (NFS or object) storage exists outside k8s scaling and resiliency capabilities and Rook management.
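As a rough back-of-the-envelope illustration of that data-protection trade-off (the 4+2 profile below is just one hypothetical erasure-coding layout), 3-way replication stores three full copies of every byte, while a 4+2 erasure-coded pool stores six chunks for every four chunks of data:

    # Back-of-the-envelope comparison of raw capacity needed for 100 TB usable
    usable_tb = 100

    replicated_raw = usable_tb * 3             # 3-way replication keeps 3 full copies
    erasure_raw = usable_tb * (4 + 2) / 4      # 4+2 EC stores 6 chunks per 4 data chunks

    print(f"3-way replication: {replicated_raw} TB raw for {usable_tb} TB usable")
    print(f"4+2 erasure coding: {erasure_raw} TB raw for {usable_tb} TB usable")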

Ceph uses storage pools that can be defined using storage performance levels, storage data protection levels, system affinity, or any combination of the above. Ceph storage pools are mapped to k8s storage classes using the Ceph CSI. Container apps that want to use storage would issue a persistent volume claim (PVC) request specifying a Ceph storage class which would allocate the Ceph storage from the pool to the container.  
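Here is a minimal sketch of that PVC flow using the official Kubernetes Python client. The storage class name rook-ceph-block is the one used in common Rook examples, so treat it, along with the namespace and size, as placeholders for whatever your cluster actually defines.

    # Sketch: request a persistent volume claim against a Ceph-backed storage class
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a pod

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="app-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="rook-ceph-block",  # maps to a Ceph pool via the Ceph CSI driver
            resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
        ),
    )

    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc
    )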

Besides configuring, deploying and monitoring/managing your Ceph storage cluster, Rook can also automatically upgrade your Ceph cluster for you. 

We discussed the difference between running Rook-Ceph within k8s and running Ceph outside k8s. Both approaches depend on the Ceph CSI driver, but with Rook, Ceph and all its software run under k8s control as containers and Rook manages the Ceph cluster for you. When it’s run outside, 1) you manage the Ceph cluster and 2) Ceph storage scaling and resilience are not automatic.

Sébastien Han, Principal Software Engineer, Ceph Architect, Red Hat

Sebastien Han currently serves as a Senior Principal Software Engineer, Storage Architect for Red Hat. He has been involved with Ceph Storage since 2011 and has built strong expertise around it.

Curious and passionate, he loves working on bleeding edge technologies and identifying opportunities where Ceph can enhance the user experience. He has done so with various technologies, such as OpenStack and Docker.

Now on a daily basis, he rotates between Ceph, Kubernetes, and Rook in an effort to strengthen the integration between all three. He is one of the maintainers of Rook-Ceph.

Travis Nielsen, Principal Software Engineer, Red Hat

Travis Nielsen is a Senior Principal Software Engineer at Red Hat with the Ceph distributed storage system team. Travis leads the Rook project and is one of the original maintainers, integrating Ceph storage with Kubernetes.

Prior to Rook, Travis was the storage platform tech lead at Symform, a P2P storage startup, and an engineering lead for the Windows Server group at Microsoft.

123: GreyBeards talk data analytics with Sean Owen, Apache Spark committer/PMC member & Databricks lead data scientist

The GreyBeards move up the stack this month with a talk on big data and data analytics with Sean Owen (@sean_r_owen), Data Science lead at Databricks and Apache Spark committer and PMC member. The focus of the talk was on Apache Spark.

Spark is an Apache Software Foundation open-source data analytics project and has been up and running since 2010. Sean is a long time data scientist and was extremely knowledgeable about data analytics, data science and the role that Spark has played in the analytics ecosystem. Listen to the podcast to learn more.

Spark is not an infrastructure solution as much as an application framework. It seems to be a data analytics solution specifically designed to address Hadoop’s shortcomings. At this point, it has replaced Hadoop and become the go-to solution for data analytics across the world. Essentially, Spark takes data analytics tasks/queries and runs them, very quickly, against massive data sets.

Spark takes analytical tasks or queries and splits them up into stages that are run across a cluster of servers. Spark can use many different cluster managers (see below) to schedule stages across worker nodes attempting to parallelize as many as possible.

Spark has replaced Hadoop mainly because it’s faster and has a better, easier-to-use API. Spark was written in Scala, which runs on the JVM, but its API supports SQL, Java, R (R on Spark), and Python (PySpark). The latter two have become the de facto standard languages for data science and AI, respectively.

Storage for Spark data can reside on HDFS, Apache HBase, Apache Solr, Apache Kudu and (cloud) object storage. HDFS was the original storage protocol for Hadoop. HBase is the Apache Hadoop database. Apache Solr was designed to support high speed, distributed, indexed search. Apache Kudu is a high speed distributed database solution. Spark, where necessary, can also use local disk storage for interim result storage.

Spark supports three data models: RDD (resilient distributed dataset); DataFrames (column headers and rows of data, like distributed CSVs); and Datasets (distributed typed and untyped data). Spark DataFrame data can be quite large; it’s nothing to have a 100M-row DataFrame. Spark Datasets are a typed version of DataFrames, usable only from the JVM (Java/Scala) APIs, as Python and R have no compile-time data typing.

One thing that helped speed up Spark processing over Hadoop is its native support for in-memory data. With Hadoop, intermediate data had to be stored on disk. Spark supports the option of keeping data in memory, speeding up subsequent processing of that data. Spark data can be pinned or cached in memory using API calls. And the availability of bigger servers with Intel Optane or just lots more DRAM has made this option even more viable.
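A small PySpark sketch of that caching behavior appears below; the input path and column name are hypothetical. The DataFrame is read once, cached, and then reused by subsequent actions without re-reading from storage.

    # Sketch: cache a DataFrame in memory so later actions reuse it
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    df = spark.read.csv("s3a://analytics-bucket/events.csv", header=True, inferSchema=True)
    df.cache()                    # pin the DataFrame in memory (spills to disk if needed)

    print(df.count())             # first action materializes and caches the data
    df.groupBy("event_type").count().show()   # subsequent work reuses the cached copy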

Another thing Spark is known for is its support of multiple cluster managers. Spark currently supports Apache Mesos, Kubernetes, Apache Hadoop YARN, and Spark’s own standalone cluster manager. In any of these, Spark has a main driver program that takes in analytics requests, breaks them into stages, and schedules worker nodes to execute them.

Most data analytics work is executed in batch mode, offline, with incoming data stored on disk/flash someplace (see storage options above). But Spark can also run in real-time, streaming mode processing data streams. Indeed, Spark can be combined with Apache Kafka to process Kafka topic streams.
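For the Kafka combination, a hedged Structured Streaming sketch might look like the following; the broker address and topic name are placeholders, and the spark-sql-kafka connector package must be available to the Spark job.

    # Sketch: consume a Kafka topic with Spark Structured Streaming
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka.example.internal:9092")
        .option("subscribe", "sensor-readings")
        .load()
    )

    # Each Kafka record arrives as binary key/value columns; decode and print to console
    query = (
        stream.selectExpr("CAST(value AS STRING) AS payload")
        .writeStream.format("console")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()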

I asked about high availability (HA) characteristics, specifically for data. Sean mentioned that data HA is more of a storage consideration. But Spark does support HA for analytics jobs/tasks as a whole. As stages are essentially stateless tasks, analytics HA can be achieved by monitoring stage execution to completion and, if needed, re-scheduling failed stages to run on other worker nodes.

Regarding Spark usability, it has a CLI and APIs but no GUI. Spark has a number of parameters (I counted over 20 for the driver program alone) that can be used to optimize its execution. So it’s maybe not the easiest solution to configure and optimize by hand, but that’s where other software systems, such as Databricks (see link above), come in. Databricks supplies a managed Spark solution for customers that don’t want/need to deal with all the configuration complexity of Spark.

Sean Owen, Lead Data Scientist, Databricks and Apache Spark PMC member

Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author of Advanced Analytics with Spark.

Previously, Sean was director of Data Science at Cloudera and an engineer at Google.

121: GreyBeards talk Cloud NAS with Peter Thompson, CEO & George Dochev, CTO LucidLink

GreyBeards had an amazing discussion with Peter Thompson (@Lucid_Link), CEO & co-founder and George Dochev (@GDochev), CTO & co-founder of LucidLink. Both Peter and George were very knowledgeable and easy to talk with.

LucidLink’s Cloud NAS creates a NAS storage system out of cloud (any S3 compatible AND Azure Blob) object storage. LucidLink is made up of client software, LucidLink SaaS (metadata service) and data on object storage. Their client software runs on any Linux, MacOS, or Windows desktop/laptop. LucidLink provides streaming, collaborative access to remote users for (file) data on object storage.

Just when 90% of the workforce was sent home for the pandemic, LucidLink emerged to provide all those users secure file access to any and all corporate data in the cloud. Peter mentioned one M&E customer who had just sent 300 video editors home, each with a laptop and a disk drive, which would last them all of 2 weeks. But they needed an ongoing solution after that. The customer started with 300 users and ~100TB of file storage on LucidLink; a few months later, they had 1000 users with a PB+ of LucidLink data and were getting rid of all their NAS boxes. Listen to the podcast to learn more.

They are finding a lot of success in M&E, engineering design, oil & gas exploration, geo-spatial design firms, and just about anywhere user collaboration on file data is required outside a data center.

LucidLink constructs a FileSpace for customer file (object) data, which represents a drive letter or mount point that remote users can use to access files from the cloud. LucidLink supports a POSIX compliant file service for that data.

LucidLink data and user generated metadata is encrypted, using client owned/stored keys. So, data-at-rest (and -in-flight) can always be secure. They also support LDAP security and other standard SSO solutions to secure user access to data.

The LucidLink SaaS (metadata) service runs in a hyperscaler and links clients to file data on object storage. It also supports distributed, byte-range locking of file data across users.

One interesting nuance is that when a client locks a file, the system changes from an eventually consistent to a strongly consistent, POSIX compliant file system. This ensures that the object storage is always the single source of truth.
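To illustrate what byte-range locking looks like from a client application's perspective, here is a generic POSIX sketch (not LucidLink code) against a hypothetical FileSpace mount point; each collaborator can lock and edit a non-overlapping region of the same file.

    # Generic POSIX byte-range locking sketch (Linux/macOS, uses fcntl);
    # the mount point and file path are hypothetical.
    import fcntl
    import os

    fd = os.open("/mnt/filespace/project/scene01.mov", os.O_RDWR)
    try:
        # Exclusively lock the first 1 MiB; other collaborators can still
        # lock and edit non-overlapping byte ranges of the same file.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1024 * 1024, 0, os.SEEK_SET)
        os.pwrite(fd, b"edited header", 0)   # work on the locked region
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, 1024 * 1024, 0, os.SEEK_SET)
        os.close(fd)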

The key that differentiates LucidLink from cloud gateways or file sync & share systems is that they 1) are not intended to operate in a data center (yes, the object storage can be located on prem, but users are remote) and 2) don’t copy files from one user/access point to another.

George said latency is enemy number one. LucidLink’s secret is prefetching. Each client uses a customer configured local persistent cache which can range from 5GB to a TB or more. LucidLink maintains a data and (in the next version) metadata working set for the user in their local cache.

Customer file data is split across multiple objects; that way LucidLink can stream data from all of them, in parallel, if needed. And doing so can supply extreme throughput when needed.

As for GDPR and data compliance, the customer controls who has access to the LucidLink SaaS as well as encryption keys.

LucidLink considers their solution “fault tolerant” or DR ready, because customers can load the client software on any device and access any LucidLink file data. They also consider themselves “highly available” because their metadata/LucidLink SaaS service runs in a hyperscaler and the object backing storage can be configured as highly available.

As mentioned earlier, LucidLink customers can use any S3 compatible or Azure Blob object storage, on prem or in the cloud. But when using cloud object storage, one pays egress charges. LucidLink’s local caching can minimize but cannot eliminate egress charges.

LucidLink offers two licensing models: 1) BYO (bring your own) object storage, where LucidLink provides the software to support your Cloud NAS, or 2) LucidLink supplies both the object storage and the LucidLink service that glues it all together. The latter is a combination of IBM COS and LucidLink that offers less expensive egress charges.

The LucidLink service is billed on capacity under management and user count basis. Capacity is billed on a GB/day, summed over a month. Their minimum solution is 5TB/5 users but they have customers with 1000s of users and PB+ of data. They offer a free 2-week trial period where customers can try LucidLink out.

Peter Thompson, CEO and Co-founder

Peter Thompson, co-founder and CEO of LucidLink, is a passionate and experienced leader and business builder. Thompson has over 30 years of experience in driving business expansion, key programs, and partnerships across regions such as APAC and the Americas, mostly in the storage and file system market.

With over 14 years at DataCore Software, most recently as VP of Emerging and Developing Markets, Thompson drove DataCore’s expansion into China, working with key industry partners, technology alliances, and global teams to develop programs and business focused on emerging markets. Thompson also held the role of Managing Director, APAC, responsible for the bottom line of all Asia operations. He was also President and Representative Director of DataCore Japan, acquiring the majority of ownership and running it as a standalone entity with a beachhead of marquee customers in Japan.

Thompson studied Japanese, history, and economics at Kansai Gaidai and has a BA in International Management, Psychology, and Japanese from Gustavus Adolphus College. He is a graduate of Stanford University Business School’s MSx program, with a focus on entrepreneurial finance, design thinking, and the soft skills required to build and lead world-class, high performing teams.

George Dochev, CTO and Co-Founder

George Dochev, co-founder and CTO of LucidLink, is a storage and file system expert with extensive experience in bringing emerging technologies to market. Dochev has over 20 years of success leading the development of complex virtualization products for the storage industry. He specializes in research and development in the fields of high-performance distributed systems, storage infrastructure software, and cloud technologies. 

Dochev was a co-founder and principal member of the engineering team at DataCore Software for nearly 17 years. While at DataCore, Dochev helped transform the company from a start-up into a global leader in software-defined storage. Underscoring Dochev’s impact as an entrepreneur is the fact that DataCore Software now powers the data centers of 10,000+ large enterprises around the world.

Dochev holds a degree in Mathematics from Sofia University St. Kliment Ohridski in Bulgaria, and an MS in Computer Science from the University of National and World Economy, in Sofia, Bulgaria.