123: GreyBeards talk data analytics with Sean Owen, Apache Spark committer/PMC member & Databricks lead data scientist

The GreyBeards move up the stack this month with a talk on big data and data analytics with Sean Owen (@sean_r_owen), Data Science lead at Databricks and Apache Spark committer and PMC member. The focus of the talk was on Apache Spark.

Spark is an Apache Software Foundation open-source data analytics project and has been up and running since 2010. Sean is a long-time data scientist and was extremely knowledgeable about data analytics, data science and the role that Spark has played in the analytics ecosystem. Listen to the podcast to learn more.

Spark is not an infrastructure solution as much as an application framework. It seems to be a data analytics solution specifically designed to address Hadoop's shortcomings. At the moment, it has largely replaced Hadoop as the go-to solution for data analytics across the world. Essentially, Spark takes data analytics tasks/queries and runs them very quickly against massive data sets.

Spark takes analytical tasks or queries and splits them up into stages that are run across a cluster of servers. Spark can use many different cluster managers (see below) to schedule stages across worker nodes, attempting to parallelize as many stages as possible.

Spark has replaced Hadoop mainly because it's faster and has a better, easier-to-use API. Spark was written in Scala, which runs on the JVM, but its API supports SQL, Java, R (R on Spark) and Python (PySpark). The latter two have become the de facto standard languages for data science and AI, respectively.
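
To make that concrete, here's a minimal PySpark sketch (the file name and column names are made up for illustration) running the same aggregation through the DataFrame API and through SQL:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("greybeards-demo").getOrCreate()

# Hypothetical CSV of access events with "user" and "bytes" columns
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# DataFrame API: total bytes per user
df.groupBy("user").sum("bytes").show()

# The same query expressed in SQL
df.createOrReplaceTempView("events")
spark.sql("SELECT user, SUM(bytes) FROM events GROUP BY user").show()
```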

Storage for Spark data can reside on HDFS, Apache HBase, Apache Solr, Apache Kudu and (cloud) object storage. HDFS was the original storage protocol for Hadoop. HBase is the Apache Hadoop database. Apache Solr was designed to support high-speed, distributed, indexed search. Apache Kudu is a high-speed distributed database solution. Spark, where necessary, can also use local disk storage to store interim results.
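
In practice, pointing Spark at a backing store is mostly a matter of the path scheme handed to the reader. A hedged sketch, with made-up paths (the S3 read assumes the hadoop-aws connector and credentials are in place):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# HDFS, the original Hadoop storage protocol
hdfs_df = spark.read.parquet("hdfs://namenode:8020/warehouse/events")

# Cloud object storage, here S3 via the s3a:// scheme
s3_df = spark.read.parquet("s3a://my-bucket/warehouse/events")
```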

Spark supports three data models: RDDs (resilient distributed datasets); DataFrames (column headers and rows of data, like distributed CSVs); and Datasets (distributed typed and untyped data). Spark DataFrames can be quite large; a 100M-row DataFrame is nothing unusual. Spark Datasets are a typed version of DataFrames and are only usable from the Java and Scala APIs, as Python and R have no compile-time typing.
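
Here's a quick sketch of the first two models side by side; Datasets are omitted since they require the typed Scala or Java APIs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# RDD: a low-level distributed collection of arbitrary Python objects
rdd = spark.sparkContext.parallelize([("alice", 3), ("bob", 5)])
print(rdd.map(lambda kv: kv[1]).sum())  # 8

# DataFrame: the same data with named, schema-carrying columns
df = spark.createDataFrame(rdd, ["user", "visits"])
df.printSchema()
df.show()
```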

One thing that helped speed up Spark processing over Hadoop is its native support for in-memory data. With Hadoop, intermediate data had to be stored on disk. Spark supports the option to keep intermediate data in memory, speeding up subsequent processing of that data. Spark data can be pinned or cached in memory with API calls. And the availability of bigger servers, with Intel Optane or just lots more DRAM, has made this option even more viable.
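
Pinning data is a one-line API call in PySpark. A minimal sketch, using a made-up 100M-row DataFrame like the one mentioned above:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000_000)  # a hypothetical 100M-row DataFrame

df.cache()      # default storage level: MEMORY_AND_DISK
df.count()      # the first action materializes the cached copy
df.count()      # later actions reuse the in-memory data
df.unpersist()  # release the memory when done

# Or pin with an explicit storage level, e.g. memory only:
df.persist(StorageLevel.MEMORY_ONLY)
```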

Another thing that Spark is known for is its support for multiple cluster managers. Spark currently supports Apache Mesos, Kubernetes, Apache Hadoop YARN, and Spark's own standalone cluster manager. In any of these, Spark has a main driver program that takes in analytics requests, breaks them into stages and schedules worker nodes to execute them.
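
Which cluster manager gets used is controlled by the master setting. A sketch with placeholder host names (any one of the master() lines would do):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cluster-demo")
         .master("spark://master-host:7077")    # standalone cluster manager
         # .master("yarn")                      # Hadoop YARN
         # .master("k8s://https://host:6443")   # Kubernetes API server
         .getOrCreate())
```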

Most data analytics work is executed in batch mode, offline, with incoming data stored on disk/flash someplace (see storage options above). But Spark can also run in real-time streaming mode, processing data streams as they arrive. Indeed, Spark can be combined with Apache Kafka to process Kafka topic streams.
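
With Spark's Structured Streaming, reading a Kafka topic looks much like reading a static table. A minimal sketch, with placeholder broker and topic names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a Kafka topic as an unbounded table
# (requires the spark-sql-kafka connector package)
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
          .option("subscribe", "events")                     # placeholder topic
          .load())

# Kafka values arrive as binary; cast the payload to a string
query = (stream.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()
```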

I asked about high availability (HA) characteristics, specifically for data. Sean mentioned that data HA is more of a storage consideration. But Spark does support HA for analytics jobs/tasks as a whole. As stages are essentially stateless tasks, analytics HA can be achieved by monitoring stage execution to completion and, if needed, re-scheduling failed stages to run on other worker nodes.
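
That retry behavior is tunable. For example, Spark's spark.task.maxFailures setting controls how many times a failed task is re-scheduled before the whole job is declared failed:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.task.maxFailures", "4")  # task retries before the job aborts (default is 4)
         .getOrCreate())
```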

Regarding Spark usability, it has a CLI and APIs but no GUI. Spark has a number of parameters (I counted over 20 for the driver program alone) that can be used to optimize its execution. So it's maybe not the easiest solution to configure and optimize by hand, but that's where other software systems, such as Databricks (see link above), come in. Databricks supplies a managed Spark solution for customers that don't want/need to deal with all the configuration complexity of Spark.
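
To give a taste of that tuning surface, here are a few of the driver-side knobs (the values are arbitrary examples; in practice driver memory has to be supplied at launch, e.g. via spark-submit, rather than from inside a running session):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tuned-job")
         .config("spark.driver.memory", "8g")         # driver JVM heap (set at launch in practice)
         .config("spark.driver.cores", "2")           # cores for the driver process
         .config("spark.driver.maxResultSize", "2g")  # cap on results collected to the driver
         .getOrCreate())
```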

Sean Owen, Lead Data Scientist, Databricks and Apache Spark PMC member

Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author of Advanced Analytics with Spark.

Previously, Sean was director of Data Science at Cloudera and an engineer at Google.

111: GreyBeards talk data analytics with Matthew Tyrer, Sr. Mgr. Solutions Mkt & Competitive Intelligence, Commvault

Sponsored by: Commvault

I’ve known Matthew Tyrer, Senior Manager Solutions Marketing and Competitive Intelligence, Commvault for quite a while now and he’s always been knowledgeable about the problems the enterprise has in supporting and backing up large file data repositories. But lately he’s been focused on Commvault Activate, their data analytics solution.

We had a great talk with Matthew. He was easy to talk to and knew a lot about how data analytics can ease the operational burden of the enterprise’s growing file data environments. Remind me not to have two Matthews on the same program ever again. Listen to the podcast to learn more.

Matthew mentioned that Activate was built on the Commvault platform software stack, which has had a rich and long history of development and customer deployments. It seems that Activate data analytics had been an early part of the platform but was recently split out as a separate solution.

One capability that Activate has that many other data analytics solutions do not is the ability to examine both online data as well as data in backups. Most analytics solutions can do one or the other; only a few do both. But if a solution only has access to online or backup data, it’s missing half the story.

In addition, Activate can operate across multiple data centers as well as multiple public cloud environments to provide analytics for an enterprise’s file data wherever it may reside.

Given the proliferation of file data these days, data analytics has become a necessity for most large IT shops. In the past, an admin could track some data over time, but with the volumes of file data today this is no longer tenable. At a PB or more of file data, located in on-prem data centers as well as across multiple clouds, there’s just too much file data to keep track of manually anymore.

Activate also indexes file content to provide more visibility and tracking of the different types of data under management in the enterprise. This is in addition to the extensive metadata that is collected and analyzed so Activate can better understand data access rights, copies and physical locations around the enterprise.

Activate can help organizations govern their data flows in support of industry as well as government data compliance requirements. Activate Data Governance, one of the three Activate solutions, is focused exclusively on providing enterprises the tools needed to manage any and all data that exists under compliance regulation environments.

Matt Leib had worked in eDiscovery before, and it had always been a pain to extract “legally relevant” data from online and backup repositories. With the Activate eDiscovery solution and Activate’s content indexing of all file data, legal can perform their own relevant-data searches to create eDiscovery data sets in support of litigation activities. Self-service legal extracts like this vastly reduce the admin time and cost needed for eDiscovery.

The Activate File Space Optimization solution was deployed in one environment that had ~20PB of data online. By using File Space Optimization, the customer was able to cut 20PB down to 10PB. Any customer could benefit from such a reduction but customers doing data migration would see even more benefit.

At the end of the podcast, Matthew mentioned some videos that show Activate solution use cases.

Matthew Tyrer, Senior Manager, Solutions Marketing and Competitive Intelligence, Commvault

Having worked at Commvault for over twelve years, including 8 years as a Sales Engineer, Matt took that technical knowledge and transitioned to marketing, where he is currently serving as a Senior Manager on Commvault’s Solutions Marketing team. He is also heavily involved in Competitive Intelligence initiatives and actively participates in field enablement programs.

He brings over 20 years’ experience in the IT industry, including within the fields of data and information management, cloud, data governance, enterprise storage, disaster recovery, and ultimately both implementing and supporting those projects and endeavours for public and private sector clients across Canada and around the globe.

Matt’s passion, deep product knowledge, and broad field experiences have enabled him to translate Commvault technology and vision such that their value is easily understood in the market and amongst client and partner families.

A self-described geek-dad, Matt is an avid boardgame enthusiast, firmly believes that Han shot first, and enjoys tormenting his girls with bad dad jokes.

GreyBeards talk data-aware storage with Paula Long & Dave Siles, CEO & CTO, DataGravity

In this podcast we discuss data-aware storage with Paula Long, CEO/Co-Founder and Dave Siles, CTO of DataGravity. Paula comes from EqualLogic and Dave from Veeam so they both have a lot of history in and around the storage industry, almost qualifying them as grey hairs :/

Data-aware storage is a new paradigm in storage that combines primary (block and file) storage, file and data analytics, and text indexing. Just to top it off, they also add data protection in a separate storage partition. Their system is VM-aware and is able to crack open VMDKs to find out what’s inside. With all their file and data analytics, DataGravity is able to supply data leakage detection and a much better understanding of what data is actually being stored on the system.

Paula believes that, in 5 years or so, this new approach to storage will become common. Their system also supports targeted data deduplication and compression, as well as providing self-service restore and a “google-like” rich search experience on their data-aware storage.

DataGravity was designed for the mid-market but is being pulled up market by workgroups as department-level storage for F500 companies. They find that once installed, it usually uncovers some exposure, and then other departments take notice. They’re also discovering an awful lot of dormant data, and moving this off of primary storage can save quite a lot.

DataGravity has a 2U controller with a 24-disk drive shelf, but has SSDs inside the controllers. They use spinning disks for the majority of the data storage.

DataGravity has an interesting twist on the standard active-passive, dual-controller/HA approach to storage, which you will have to listen to the podcast to truly understand.

This month’s episode runs a bit over 44 minutes and wanders over a lot of high ground but dips into technical waters occasionally.

Paula Long, CEO & Co-founder, DataGravity

Paula brings over 30 years of experience to DataGravity in delivering meaningful and game-changing high-tech innovation. Prior to DataGravity, Paula served as vice president of product development at Heartland Robotics. In 2001, Paula co-founded storage provider EqualLogic, resetting the bar on how customers managed and purchased data storage. EqualLogic was acquired by Dell for $1.4 billion in 2008, and Paula remained at Dell as vice president of storage until 2010. Prior to EqualLogic, she served in several engineering management positions at Allaire Corporation and oversaw all aspects of the ClusterCATS product line while at Bright Tiger Technologies.

Her executive and technical leadership has been extensively recognized, including the New Hampshire High Tech Council Entrepreneur of the Year award and the Ernst & Young 2008 Northeast Regional “Entrepreneur of the Year” award, for which she was also a national finalist. Her technical awards span systems designs and enterprise software, including the EqualLogic and ClusterCATS product lines. She is a graduate of Westfield State College.

Paula is also active in the startup community. Outside of high tech, she works with charities creating equality for professional women and girls, as well as with organizations enabling literacy for all children, regardless of economic status.

Dave Siles, CTO DataGravity

With more than 20 years in operations and leadership roles at growth companies, David serves as chief technology officer of DataGravity, responsible for leading the technical strategic vision for the company while guiding its product management teams and research and development efforts to better serve the needs of organizations looking for more from their data storage.

Prior to becoming CTO, David served as vice president of worldwide field operations at DataGravity. Previously, David was a member of the senior leadership team at Veeam Software, a leading data protection software provider for virtualized and cloud environments.

David also served as CTO and VP of professional services for systems integrator Hipskind TSG. He also served as CTO for Kane County, Ill., and has held technology leadership roles with various organizations. A graduate of DeVry University, he is a frequent speaker at top tier technology shows and is a recognized expert in virtualization.