123: GreyBeards talk data analytics with Sean Owen, Apache Spark committee/PMC member & Databricks, lead data scientist

The GreyBeards move up the stack this month with a talk on big data and data analytics with Sean Owen (@sean_r_owen), Data Science lead at Databricks and Apache Spark committee and PMC member. The focus of the talk was on Apache Spark.

Spark is an Apache Software Foundation open-source data analytics project and has been up and running since 2010. Sean is a long time data scientist and was extremely knowledgeable about data analytics, data science and the role that Spark has played in the analytics ecosystem. Listen to the podcast to learn more.

Spark is not an infrastructure solution as much as an application framework. It’s seems to be a data analytics solution specifically designed to address Hadoops shortcomings. At the moment, it has replaced Hadoop and become the go to solution for data analytics across the world. Essentially, Spark takes data analytic tasks/queries and runs them, very quickly against massive data sets.

Spark takes analytical tasks or queries and splits them up into stages that are run across a cluster of servers. Spark can use many different cluster managers (see below) to schedule stages across worker nodes attempting to parallelize as many as possible.

Spark has replaced Hadoop mainly because it’s faster and has a better, easier to use API. Spark was written in Scala which runs on JVM, but its API supports SQL, Java, R (R on Spark) and Python (PySpark). The latter two have become the defacto standard languages for data science and AI, respectively.

Storage for Spark data can reside on HDFS, Apache HBase, Apache Solr, Apache Kudu and (cloud) object storage. HDFS was the original storage protocol for Hadoop. HBase is the Apache Hadoop database. Apache Solr was designed to support high speed, distributed, indexed search. Apache Kudu is a high speed distributed database solution. Spark, where necessary, can also use local disk storage for interim result storage.

Spark supports three data models: RDD (resilient distributed dataset); DataFrames (column headers and rows of data, like distributed CSVs); and DataSets (distributed typed and untyped data). Spark DataFrame data can be quite large, it seems nothing to have a 100M row dataframe. Spark Datasets are a typed version of dataframes which are only usable in Java API as Python and R have no data typing capabilities.

One thing that helped speed up Spark processing over Hadoop, is its native support for in-memory data. With Hadoop, intermediate data had to be stored on disk. With in-memory data, Spark supports the option to keep it in memory, speeding up subsequent processing of this data. Spark data can be pinned or cached in memory using the API calls. And the availability of bigger servers with Intel Optane or just lots more DRAM, have made this option even more viable.

Another thing that Spark is known for is its support of multiple cluster managers. Spark currently supports Apache Mesos, Kubernetes, Apache Hadoop YARN, and Spark’s own, standalone cluster manager. In any of these, Spark has a main driver program that takes in analytics requests, breaks them into stages and schedules worker nodes to execute them..

Most data analytics work is executed in batch mode, offline, with incoming data stored on disk/flash someplace (see storage options above). But Spark can also run in real-time, streaming mode processing data streams. Indeed, Spark can be combined with Apache Kafka to process Kafka topic streams.

I asked about high availability (HA) characteristics, specifically for data. Sean mentioned that data HA is more of a storage consideration. But Spark does support HA for analytics jobs/tasks as a whole. As stages are essentially state-less tasks, analytics HA can be done by monitoring stage execution to completion and if needed, re-scheduling failed stages to run on other worker nodes.

Regarding Spark usability, it has a CLI and APIs but no GUI. Spark has a number of parameters (I counted over 20 for the driver program alone), that can be used to optimize its execution. So it’s maybe not the easiest solution to configure and optimize by hand, but that’s where other software systems, such as Databricks (see link above) comes in. Databricks supplies a managed Spark solution for customers that don’t want/need to deal with all the configuration complexity of Spark.

Sean Owen, Lead Data Scientist, Databricks and Apache Spark PMC member

Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committee and PMC member, and co-author of Advanced Analytics with Spark.

Previously, Sean was director of Data Science at Cloudera and an engineer at Google.