1 Introduction

Big data analytics is one of the most active research areas, with many open challenges and a constant need for new innovations that affect a wide range of industries. To fulfill the computational requirements of massive data analysis, an efficient framework is essential for designing, implementing and managing the required pipelines and algorithms. In this regard, Apache Spark has emerged as a unified engine for large-scale data analysis across a variety of workloads. It has introduced a new approach to data science and engineering in which a wide range of data problems can be solved using a single processing engine with general-purpose languages. Owing to its advanced programming model, Apache Spark has been adopted as a fast and scalable framework in both academia and industry. It has become the most active big data open source project and one of the most active projects in the Apache Software Foundation.

As Apache Spark is an evolving project in the big data community, good references are essential for getting the most out of it and for contributing effectively to its progress. While the official programming guideFootnote 1 is the most up-to-date source about Apache Spark, several books (e.g., [37, 45, 70]) have been published to show how Apache Spark can be used to solve big data problems. In addition, Databricks, the company founded by the creators of Apache Spark, has developed a set of reference applicationsFootnote 2 to demonstrate how Apache Spark can be used for different workloads. Other good sources are the official blogFootnote 3 at Databricks and Spark Hub,Footnote 4 where you can find Spark’s news, events, resources, etc. However, the rapid adoption and development of Apache Spark, coupled with increasing research on using it for big data analytics, make it difficult for beginners to comprehend the full body of development and research behind it. To our knowledge, there is no comprehensive summary of big data analytics using Apache Spark.

In order to fill this gap, help readers get started with Apache Spark and keep up with such an active project,Footnote 5 the goal of this paper is to provide a concise source of information about the key features of Apache Spark. Specifically, we focus on how Apache Spark can enable efficient large-scale machine learning, graph analysis and stream processing. Furthermore, we highlight the key research works behind Apache Spark and some recent research and development directions. However, this paper is not intended to be an in-depth analysis of Apache Spark.

The remainder of this paper is organized as follows. We begin with an overview of Apache Spark in Sect. 2. Then, we introduce the key components of the Apache Spark stack in Sect. 3. Section 4 introduces data and computation abstractions in Apache Spark. In Sect. 5, we focus on Spark’s MLlib for machine learning. Then, we move to GraphX for graph computation in Sect. 6. After that, we show the key features of Spark Streaming in Sect. 7. In Sect. 8, we briefly review some benchmarks for Apache Spark and big data analytics. Building on the previous sections, we highlight some key issues with Apache Spark in Sect. 9. Finally, a summary and conclusions are presented in Sect. 10.

2 Overview of Apache Spark

In this section, we present an overview of the Apache Spark project and Spark’s main components. We highlight some key characteristics which make Apache Spark a next-generation engine for big data analytics after Hadoop’s MapReduce. We also summarize some contributions and case studies from industry.

2.1 Apache Spark project

Spark was initially started by Matei Zaharia at UC Berkeley’s AMPLab in 2009 and open sourced in 2010 under a BSD license. The project was then donated to the Apache Software Foundation in 2013. Several research projects have made essential contributions to building and improving Spark core and the main upper-level libraries [7, 33, 61, 83, 89, 90, 93–95]. For example, the development of Spark’s MLlib began with the MLbaseFootnote 6 project, and other projects started to contribute later (e.g., KeystoneMLFootnote 7). Spark SQL started from the Shark project [84] and then became an essential library in Apache Spark. Also, GraphX started as a research project at the AMPLab.Footnote 8 It later became part of the Apache Spark project, starting with version 0.9.0. Many packagesFootnote 9 have also been contributed to Apache Spark from both academia and industry. Furthermore, the creators of Apache Spark founded Databricks,Footnote 10 a company which is closely involved in the development of Apache Spark.

2.2 Main components and features

The Apache Spark system consists of several main components including Spark core [90, 93, 94] and upper-level libraries: Spark’s MLlib for machine learning [61], GraphX [33, 83, 85] for graph analysis, Spark Streaming [95] for stream processing and Spark SQL [7] for structured data processing. It is evolving rapidly, with changes to its core APIs and additions to its upper-level libraries. Its core data abstraction, the Resilient Distributed Dataset (RDD), opens the door for designing scalable data algorithms and pipelines with better performance. With the RDD’s efficient data sharing and a variety of operators, different workloads can be designed and implemented efficiently. While the RDD was the main abstraction exposed through the RDD API in Spark 1.0, the representation of datasets has been an active area of improvement in the last two years. A new alternative, the DataFrame API, was introduced in Spark 1.3, followed by a preview of the new Dataset API in Spark 1.6. Moreover, a major release (Spark 2.0) came out at the time of writing this paper [81, 91].

2.3 From Hadoop’s MapReduce to Apache Spark

Apache Spark has emerged as the de facto standard for big data analytics after Hadoop’s MapReduce. As a framework, it combines a core engine for distributed computing with an advanced programming model for in-memory processing. Although it has the same linear scalability and fault tolerance capabilities as MapReduce, it comes with a multistage in-memory programming model, compared to the rigid, disk-based map-then-reduce model. With such an advanced model, Apache Spark is much faster and easier to use. It comes with rich APIs in several languages (Scala, Java, Python, SQL and R) for performing complex distributed operations on distributed data. In addition, Apache Spark leverages the memory of a computing cluster to reduce the dependency on the underlying distributed file system, leading to dramatic performance gains in comparison with Hadoop’s MapReduce [31]. It is also considered a general-purpose engine that goes beyond batch applications to combine different types of computations (e.g., batch jobs, iterative algorithms, interactive queries and streaming) which previously required separate distributed systems [45]. It is built upon the Resilient Distributed Datasets (RDDs) abstraction which provides efficient data sharing between computations. Previous data flow frameworks lacked such data-sharing ability, although it is an essential requirement for many workloads [90].

2.4 A unified engine for big data analytics

As the next-generation engine for big data analytics, Apache Spark can alleviate key challenges of data preprocessing, iterative algorithms, interactive analytics and operational analytics, among others. With Apache Spark, data can be processed through a more general directed acyclic graph (DAG) of operators using rich sets of transformations and actions. It automatically distributes the data across the cluster and parallelizes the required operations. It supports a variety of transformations which make data preprocessing easier, especially as examining big datasets becomes more difficult. On the other hand, getting valuable insights from big data requires experimentation over several phases to select the right features, methods, parameters and evaluation metrics. Apache Spark is natively designed to handle this kind of iterative processing, which requires more than one pass over the same dataset (e.g., MLlib for designing and tuning machine learning algorithms and pipelines).

In addition to iterative algorithms, Apache Spark is well suited for interactive analysis which can quickly respond to users’ queries by scanning distributed in-memory datasets. Moreover, Apache Spark is not only a unified engine for solving different data problems instead of learning and maintaining several different tools, but also a general-purpose framework which shortens the way from exploratory analytics in the laboratory to operational analytics in production data applications and frameworks [70]. Consequently, it can lead to higher analyst productivity, especially when its upper-level libraries are combined to implement complex algorithms and pipelines.

Fig. 1 High-level architecture of Apache Spark stack

2.5 Apache Spark in the industry

Since its initial releases, Apache Spark has seen rapid adoption by enterprises across a wide range of industries.Footnote 11 Such fast adoption, together with the potential of Apache Spark as a unified processing engine which integrates with many storage systems (e.g., HDFS, Cassandra, HBase, S3), has led to dozens of community-contributed packages that work with Apache Spark. Apache Spark has been leveraged as a core engine in many world-class companies such as IBM, Huawei [79], Tencent and Yahoo. For example, in addition to contributing the FP-growth and Power Iteration Clustering algorithms, Huawei developed Astro,Footnote 12 which provides native, optimized access to HBase data through the Spark SQL/DataFrame interfaces. With a major commitment to Apache Spark, IBM founded a Spark technology center.Footnote 13 Also, IBM open sourced SystemMLFootnote 14 and plans to collaborate with Databricks to enhance machine learning capabilities in Spark’s MLlib. Furthermore, Microsoft announced a major commitmentFootnote 15 to support Apache Spark through Microsoft’s platforms and services such as Azure HDInsightFootnote 16 and Microsoft R Server.Footnote 17

There are numerous case studies of using Apache Spark for different kinds of applications: e.g., planning and optimization of video advertising campaigns at Eyeview [18], categorizing and prioritizing social media interactions in real time at Toyota Motor Sales, USA [50], predicting the off-lining of digital media at NBC Universal [14] and real-time anomaly detection at ING banking [15]. It is used to manage the largest computing cluster (8000+ nodes) at Tencent and to process the largest Spark jobs (1 PB) at Alibaba and Databricks [92]. Also, the top streaming intake (1.2 TB/h) using Spark Streaming for large-scale neuroscience was recorded at HHMI Janelia Farm Research Campus [30]. In 2014, Apache Spark also set a new record as the fastest open source engine for large-scale on-disk sortingFootnote 18 (1 PB in 4 h) [80].

3 Apache Spark stack

Apache Spark consists of several main components including Spark core and upper-level libraries (Fig. 1). Spark core runs on different cluster managers and can access data in any Hadoop data source. In addition, many packages have been built to work with Spark core and the upper-level libraries. For a general overview of big data frameworks and platforms, including Apache Spark, refer to the big data landscape.Footnote 19

3.1 Spark core

Spark core is the foundation of Apache Spark. It provides a simple programming interface, the RDD API, for processing large-scale datasets. Spark core is implemented in Scala, but it comes with APIs in Scala, Java, Python and R. These APIs support many operations (i.e., data transformations and actions) which are essential for data analysis algorithms in the upper-level libraries. In addition, Spark core offers the main functionalities for in-memory cluster computing including memory management, job scheduling, data shuffling and fault recovery. With these functionalities, a Spark application can be developed using the CPU, memory and storage resources of a computing cluster.

3.2 Upper-level libraries

Several libraries have been built on top of Spark core for handling different workloads: Spark’s MLlib for machine learning [61], GraphX [33, 83] for graph processing, Spark Streaming [95] for streaming analysis and Spark SQL [7] for structured data processing. Improvements in Spark core lead to corresponding improvements in the upper-level libraries as these libraries are built on top of Spark core. The RDD abstraction has extensions for graph representation (i.e., Resilient Distributed Graphs in GraphX) and stream data representation (i.e., Discretized Streams in Spark Streaming). In addition, the DataFrame and Dataset APIs of Spark SQL provide a higher level of abstraction for structured data.

3.3 Cluster managers and data sources

A cluster manager is used to acquire cluster resources for executing jobs. Spark core runs over diverse cluster managers including Hadoop YARN [76], Apache Mesos [39], Amazon EC2 and Spark’s built-in cluster manager (i.e., standalone). The cluster manager handles resource sharing between Spark applications. On the other hand, Spark can access data in HDFS, Cassandra,Footnote 20 HBase, Hive, Alluxio and any Hadoop data source.

3.4 Spark applications

Running a Spark application involves five key entitiesFootnote 21 (Fig. 2): a driver program, a cluster manager, workers, executors and tasks. A driver program is an application that uses Spark as a library and defines a high-level control flow of the target computation. While a worker provides CPU, memory and storage resources to a Spark application, an executor is a JVM (Java Virtual Machine) process that Spark creates on each worker for that application. A job is a set of computations (e.g., a data processing algorithm) that Spark performs on a cluster to get results to the driver program. A Spark application can launch multiple jobs. Spark splits a job into a directed acyclic graph (DAG) of stages where each stage is a collection of tasks. A task is the smallest unit of work that Spark sends to an executor. The main entry point for Spark functionalities is a SparkContext, through which the driver program accesses Spark. A SparkContext represents a connection to a computing cluster.
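
To make these entities concrete, the following minimal Scala sketch shows a driver program that creates a SparkContext and runs a single job; the application name and master URL are placeholder assumptions, not values prescribed by the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  def main(args: Array[String]): Unit = {
    // The SparkConf describes the application and the cluster manager it connects to.
    val conf = new SparkConf()
      .setAppName("DriverExample")                 // hypothetical application name
      .setMaster("spark://master-host:7077")       // assumed standalone cluster manager URL

    // The SparkContext is the entry point through which the driver accesses Spark.
    val sc = new SparkContext(conf)

    // Each action (here, sum) becomes a job that Spark splits into stages and tasks.
    val numbers = sc.parallelize(1 to 1000000)
    println(numbers.map(_ * 2).sum())

    sc.stop()
  }
}
```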

3.5 Spark packages and other projects

Spark packages are open source packages and libraries that integrate with Apache Spark, but are not part of the Apache Spark project. Some packages are built to work directly with Spark core and others to work with upper-level libraries. Currently, there are more than 200 packagesFootnote 22 in different categories such as: Spark core, data sources, machine learning, graph, streaming, pySpark, deployment, applications, examples and other tools.

Fig. 2 Key entities for running a Spark application (Source: https://spark.apache.org/docs/latest/cluster-overview.html)

While there are several projects which have contributed to building key components of Apache Spark project, many other projectsFootnote 23 and applications are built on top of Apache Spark and its upper-level libraries. We list here some supplemental and related projects for Apache Spark:

  • MLbase Footnote 24: a platform for distributed machine learning at scale on top of Spark core. It has led to the current Spark’s MLlib.

  • KeystoneML Footnote 25: a framework for large-scale machine learning pipelines on Spark core. It has contributed to the current Spark’s ML pipelines API.

  • Tungsten: a project for fast in-memory processing in Spark applications. It is currently a key component of Spark’s execution engine [82] and reduces memory overhead by leveraging off-heap memory. Tungsten is expected to become the de facto memory management system [31].

  • Alluxio (formerly Tachyon) Footnote 26: an open source memory-centric distributed storage system [52].

  • SparkR Footnote 27: an R package that provides a frontend to use Spark from R [77]. It is now part of Apache Spark project.

  • BlinkDB Footnote 28: an approximate query engine on Spark SQL [1].

  • Spark Job Server Footnote 29: a RESTful interface for submitting and managing Apache Spark jobs, jars and job contexts.

4 Abstractions of data and computation

Apache Spark introduces several key abstractions for representing data and managing computation. At the low level, data are represented as Resilient Distributed Datasets (RDDs), and computations on these RDDs are represented as either transformations or actions. In addition, there are broadcast variables and accumulators which can be used for sharing variables across a computing cluster.

4.1 Resilient Distributed Datasets

Spark core is built upon the Resilient Distributed Datasets (RDDs) abstraction [94]. An RDD is a read-only, partitioned collection of records. RDDs provide fault-tolerant, parallel data structures that let users store data explicitly on disk or in memory, control its partitioning and manipulate it using a rich set of operators [90]. This enables efficient data sharing across computations, an essential requirement for many workloads. An RDD can be created either from external data sources or from other RDDs.

As a fault-tolerant distributed memory abstraction, an RDD avoids data replication by keeping the graph of operations (i.e., the RDD’s lineage—Fig. 3) that was used to construct it, so data lost on failure can be recomputed efficiently. The partitioning of an RDD can be controlled to keep it consistent across iterations, so that Spark core can co-partition RDDs and co-schedule tasks to avoid data movement. To avoid recomputation, RDDs must be explicitly cached when the application needs to use them multiple times.

Apache Spark offers the RDD abstraction through a simple programming interface. Each RDD is represented through a common interface with five pieces of information: partitions, dependencies, an iterator, preferred locations (data placement) and metadata about its partitioning schema. Such a representation simplifies system design, as a wide range of transformations can be implemented without adding special logic to the Spark scheduler for each one. With this representation, computations can be organized into independent fine-grained tasks, and several cluster computing models that previously required separate frameworks can be expressed efficiently [90]. In addition to the MapReduce model, Table 1 shows examples of models (both existing and new) that are expressible using RDDs.

Fig. 3 Lazy evaluation of RDDs: transformations on RDDs are lazily evaluated, meaning that Spark will not compute RDDs until an action is called. Spark keeps track of the lineage graph of transformations, which is used to compute each RDD on demand and to recover lost data (image adapted from: http://www.slideshare.net/GirishKhanzode/apache-spark-core)

Table 1 Examples of models expressible using RDDs

Moreover, RDDs also enable these models to be combined in applications that require different processing types, which was a challenge with previous systems because it required several separate systems. As many parallel applications naturally perform coarse-grained operations (i.e., bulk transformations) on many records, RDDs are ideal for representing data in such applications.

4.2 Transformations and actions

In addition to the RDD abstraction, Spark supports a collection of parallel operations:Footnote 30 transformations and actions. Transformations are deterministic, but lazy, operations which define a new RDD without immediately computing it (Fig. 3). With a narrow transformation (e.g., map, filter), each partition of the parent RDD is used by at most one partition of the child RDD. On the other hand, multiple child partitions may depend on the same partition of the parent RDD as a result of wide transformations (e.g., join, groupByKey).

An action (e.g., count, first, take) launches a computation on an RDD and then returns the results to the driver program or writes them to external storage. Transformations are only executed when an action is called. At that point, Spark breaks the computation into tasks to run in parallel on separate machines. Each machine runs both its part of the transformations and the called action, returning only its answer to the driver program. With transformations and actions, computations can be organized into multiple stages of a processing pipeline. These stages are separated by distributed shuffle operations for redistributing data.
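
As a brief illustration of this model, the following Scala sketch chains narrow and wide transformations and triggers them with an action; the input path and key extraction logic are assumptions made only for the example.

```scala
// Assumes an existing SparkContext `sc`; "logs.txt" is a hypothetical input file.
val pairs = sc.textFile("logs.txt")
  .map(line => (line.split(" ")(0), 1))            // narrow transformation: map
  .filter { case (key, _) => key.nonEmpty }        // narrow transformation: filter

val counts = pairs.reduceByKey(_ + _)              // wide transformation: requires a shuffle
counts.cache()                                     // keep the result in memory for reuse

val top = counts.take(10)                          // action: triggers the whole lineage
top.foreach(println)
```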

4.3 Shared variables

Although Spark uses a shared-nothing architecture where there is no global memory space shared between the driver program and the tasks, it supports two types of shared variables for two specific use cases: broadcast variables and accumulators.

Broadcast variables are used to keep read-only variables cached on each machine (e.g., a copy of a large input dataset) rather than shipping a copy of them with tasks. Accumulators, on the other hand, are variables that workers can only add to through an associative operation and that the driver can only read. They can be used to implement counters or sums.
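
A minimal Scala sketch of both kinds of shared variables is shown below; the stop-word set, input file and "malformed line" criterion are illustrative assumptions, and the accumulator call uses the Spark 2.0-style API.

```scala
// Assumes an existing SparkContext `sc`.
val stopWords = sc.broadcast(Set("the", "a", "an"))   // read-only value cached on each worker
val malformed = sc.longAccumulator("malformed lines") // workers add to it, the driver reads it

val words = sc.textFile("corpus.txt").flatMap { line =>
  val tokens = line.split("\\s+")
  if (tokens.length < 2) malformed.add(1)             // counter updated on the workers
  tokens.filterNot(stopWords.value.contains)          // broadcast value read on the workers
}

words.count()                                         // action that runs the job
println(s"Malformed lines: ${malformed.value}")       // accumulator value read on the driver
```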

4.4 DataFrame and Dataset APIs

Spark core, and Apache Spark as a whole, is built upon the basic RDD API. However, as a rapidly evolving project, Apache Spark has introduced several improvements to its data abstraction which yield a better computation model as well.

One of these improvements is the DataFrame API which is part of Spark SQL [7]. A DataFrame is conceptually equivalent to a table in a relational database or a data frame in R/Python, but Spark SQL comes with richer optimizations as Spark evaluates transformations lazily. It is a distributed collection of data, like an RDD, but organized into named columns (i.e., a collection of structured records). This provides Spark with more information about the structure of both the data and the computation, which can be used for extra optimizations.

Although the RDD API is general, it provides limited opportunities for automatic optimization because there is no information about the data structure or the semantics of user functions. Moreover, the DataFrame API can perform relational operations on RDDs and external data sources and enables rich relational/functional integration within Spark applications. DataFrames are now the main data representation in Spark’s ML Pipelines API. Other Spark libraries have started to integrate with Spark SQL through the DataFrame API, such as GraphFramesFootnote 31 [24] for GraphX.

Another improvement is the Dataset API, a new experimental interface added in Spark 1.6. It is an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset is a strongly typed, immutable collection of objects that are mapped to a relational schema [9]. The goal is to combine the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine (i.e., Spark’s Catalyst optimizer [8]) and Tungsten’s fast in-memory encoding [82].
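
The following Scala sketch contrasts the two APIs using the Spark 2.0-style SparkSession entry point; the JSON file and the Person schema are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("SqlExample").master("local[*]").getOrCreate()
import spark.implicits._

// DataFrame: a distributed collection of rows organized into named columns.
val df = spark.read.json("people.json")            // hypothetical input file
df.filter($"age" > 21).groupBy("age").count().show()

// Dataset: the same data viewed as strongly typed objects.
val ds = df.as[Person]
ds.filter(_.age > 21).map(_.name).show()
```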

5 Machine learning on Apache Spark

In this section, we investigate Spark’s scalable machine learning library, MLlib. We elaborate on the key features that simplify the design and implementation of machine learning algorithms and pipelines: linear algebra and statistics packages, data preprocessing, model training, model evaluation, ensemble methods and machine learning pipelines. In addition, we summarize some research highlights on large-scale machine learning.

5.1 Spark’s MLlib: key features

Apache Spark enables the development of large-scale machine learning algorithms where data parallelism or model parallelism is essential [61]. These iterative algorithms can be handled efficiently by Spark core which is designed for efficient iterative computations. Implementing machine learning algorithms and pipelines for real applications usually requires common tasks such as feature extraction, feature transformations, model training, model evaluation and tuning. In this regard, Spark’s MLlib is designed as a distributed machine learning library to simplify the design and implementation of such algorithms and pipelines.

Fig. 4 Key features of Spark’s MLlib: (a) spark.mllib is built on top of RDDs, (b) spark.ml is built on top of DataFrames

Spark’s MLlib is divided into two main packages (Fig. 4): spark.mllib and spark.ml.Footnote 32 While spark.mllib is built on top of RDDs, spark.ml is built on top of DataFrames. Both packages come with a variety of common machine learning tasks such as featurization, transformations, model training, model evaluation and optimization. spark.ml provides the pipelines API for building, debugging and tuning machine learning pipelines, whereas spark.mllib includes packages for linear algebra, statistics and other basic utilities for machine learning.

5.2 Data abstraction: RDDs and DataFrames

The basic design philosophy behind spark.mllib is invoking various algorithms and utilities on distributed datasets represented as RDDs. However, machine learning algorithms can be applied to different data types. Thus, spark.ml uses DataFrames to represent datasets. DataFrames can hold a variety of data types and provide intuitive manipulation of distributed structured data. The schema of the data columns is known, and it can be used for runtime checking before actually running the pipeline. In addition, with DataFrames Spark can automatically distinguish between numeric and categorical features, and it can automatically optimize both storage and computation. The DataFrame API is fundamental to spark.ml as it also simplifies data cleaning and preprocessing through a variety of data integration functionalities in Spark SQL.

5.3 Linear algebra and statistics

In order to satisfy the requirements of distributed machine learning, the linear algebra package, linalg [89], provides abstractions and implementations for distributed matrices as well as local matrices and local vectors. It supports both dense and sparse representations of vectors and matrices. Sparse representation is essential in big data analytics because sparse datasets are very common, for different reasons: high-dimensional spaces, feature transformations, missing values, etc. As a result, it is usually recommended to use sparse vectors if at most 10% of the elements are nonzero [45]. This has important effects in terms of performance and memory usage.

As a distributed linear algebra library, linalg supports several types of distributed matrices: RowMatrix, IndexedRowMatrix, CoordinateMatrix and BlockMatrix. On the other hand, for supervised learning algorithms (e.g., classification and regression), a training example is represented as a LabeledPoint, a local feature vector associated with a label. In addition, spark.mllib contains the Rating data type to represent ratings of products for recommendation applications.
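
The sketch below illustrates these local data types in Scala; the values are arbitrary and serve only to show the dense and sparse representations.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Dense vector: all values are stored explicitly.
val dense = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector: size, indices of the nonzero entries and their values.
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A labeled training example for supervised learning (label + feature vector).
val example = LabeledPoint(1.0, sparse)
```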

Another important part of spark.mllib is the statistics package, which is essential not only for machine learning algorithms but also for data analytics in general. For example, the mllib.stat package offers common statistical functions for exploratory data analysis: summary statistics, dependency analysis (i.e., correlation), hypothesis testing, stratified sampling, kernel density estimation and streaming significance testing. In addition, there is a special package for random data generation from various distributions.

5.4 Feature extraction, transformation and selection

Defining the right features is one of the most challenging tasks in machine learning. To simplify this task, Spark’s MLlib supports several methods for feature extraction, transformation and selection. While feature extraction is necessary to extract features from raw data (e.g., TF-IDF, Word2Vec), feature transformers can be used for scaling (e.g., StandardScaler and MinMaxScaler), normalization (e.g., Normalizer), converting features (e.g., PCA), modifying features (e.g., the Hadamard product) and more. The library also contains some utilities for selecting subsets of features from larger feature sets (e.g., chi-squared feature selection, VectorSlicer and RFormula).
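
As a small example of a feature transformer, the following Scala sketch standardizes a set of feature vectors with the spark.mllib StandardScaler; the input RDD of vectors is assumed to exist.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// `features` is an assumed RDD[Vector] of raw feature vectors.
def standardize(features: RDD[Vector]): RDD[Vector] = {
  // Fit the scaler (computes column means and standard deviations) ...
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
  // ... then transform each vector to zero mean and unit variance.
  features.map(v => scaler.transform(v))
}
```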

These methods are helpful if we already have the required dataset. However, sometimes it is not easy to get a real dataset. In this regard, spark.mllib offers several data generation methods to generate synthesized datasets for testing specific algorithms such as k-means, logistic regression, SVM and matrix factorization. In addition, the library also provides a collection of methods to validate data before applying the target algorithm (e.g., binary label validators and multilabel validators). There are also other utilities for data loading, saving and other preprocessing utilities.

5.5 Model training

The essence of any machine learning algorithm is fitting a model to the data being investigated. This is known as model training which returns a model for making predictions on new data. As Spark core has an advanced DAG execution engine and in-memory caching for iterative computations, its benefits will be evident in providing scalable implementations of learning algorithms. Spark’s MLlib comes with a number of machine learning algorithms for classification, regression, clustering, collaborative filtering and dimensionality reduction.

The library comes with two major tree ensemble algorithms which use decision trees as their base models: Gradient Boosted Trees and Random Forests. In addition, spark.ml supports OneVsRest (One-vs-All), a reduction method for performing multiclass classification given a base classifier that can efficiently perform binary classification.

5.6 Model evaluation

In general, different machine learning algorithms require different evaluation metrics according to the type of application (e.g., classification, clustering and collaborative filtering). The currently supported metrics can be classified into three categories: classification metrics (binary classification, multiclass classification and multilabel classification), regression metrics and ranking metrics.

5.7 Machine learning pipelines

The pipelines API, spark.ml, was introduced in Spark 1.2 to facilitate the creation, tuning and inspection of machine learning workflows. A machine learning workflow is represented as a Pipeline, which consists of a sequence of Pipeline Stages to be run in a specific order. Each of these stages can be either a Transformer or an Estimator. While a transformer is an algorithm which can transform one DataFrame into another (e.g., feature transformers and learned models), an estimator is an algorithm which can be fit on a DataFrame to produce a transformer (i.e., a learning algorithm or any algorithm that fits or trains on data). In addition, an evaluation stage in a machine learning workflow is represented as an evaluator which computes metrics from predictions. Both estimators and transformers use a uniform API for specifying parameters.

In general, there are two main phases to learn from data: a training phase where we build a model and a testing phase where we use the fitted model to make predictions on new data. In spark.ml, a Pipeline represents the training phase while the testing phase is represented as a Pipeline Model which is the output of a pipeline. The abstraction of pipelines and pipeline models helps to ensure that both the training and test data go through the same processing steps. As the goal of a pipeline is to fit a learning model, it is considered as an estimator. On the other hand, a pipeline model is considered as a transformer.
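
The following Scala sketch, closely following the spark.ml programming model, chains a tokenizer, a hashing term-frequency transformer and a logistic regression estimator into a pipeline; the training and test DataFrames (with "text" and "label" columns) are assumed to exist.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// A Pipeline (an estimator) chains the stages; fitting it yields a PipelineModel (a transformer).
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)        // trainingDF: assumed DataFrame with "text" and "label"

// The fitted model applies exactly the same processing steps to new data.
val predictions = model.transform(testDF)   // testDF: assumed DataFrame with a "text" column
```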

Pipelines and DataFrames can be used to inspect and debug machine learning workflows. In addition, complex pipelines can be built as compositions (a pipeline within a pipeline) and DAGs. Also, user-defined components can be used in pipelines.

5.8 Optimization and tuning

spark.mllib supports two main optimization methodsFootnote 33: gradient descent methods, including stochastic subgradient descent (SGD), and limited-memory BFGS (L-BFGS). In addition, spark.ml offers built-in parameter tuning techniques to optimize machine learning performance. It also uses an optimization algorithmFootnote 34 called Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) [4], an extension of L-BFGS that can effectively handle L1 regularization and elastic net.

Another key aspect in the current implementation of machine learning algorithms is algorithmic optimization such as level-wise training and binning in training a decision tree model [2]. However, the details of such implementations are beyond the scope of this paper.

5.9 Research highlights

Advanced analytics, such as machine learning, is essential for getting valuable insights from large-scale datasets. However, it is difficult to design, implement, tune, manage and use machine learning algorithms and pipelines at scale. There are several research projects which focus on alleviating such challenges. While some of these projects have essential contributions to Apache Spark project, others depend on Apache Spark as a core framework for solving machine learning problems.

  • MLbase: a platform implementing distributed machine learning at scale using Spark core as a runtime engine [48]. It consists of three components: MLlib, MLI which introduces high-level ML programming abstractions for feature extraction and algorithm development, and ML Optimizer which aims to automate the construction of ML pipelines. While MLlib and MLI target ML developers, ML Optimizer targets end users. MLbase started in 2012 as a research project at UC Berkeley’s AMPLab. It has led to the current Spark’s MLlib, but some MLbase features are not included in Spark’s MLlib. A new component called TuPAQ (Training-supported Predictive Analytic Query Planner) [74, 75] automatically finds and trains models for large-scale machine learning.

  • KeystoneML: a software framework for building and deploying large-scale machine learning pipelines with Apache Spark. KeystoneML also started as a research projectFootnote 35 at UC Berkeley’s AMPLab. It has contributed to the design of spark.ml, but it includes a richer set of operators (e.g., featurizers for images, text and speech) than those currently included in spark.ml.

  • SystemML Footnote 36: a distributed and declarative machine learning platform in which ML algorithms are expressed in a higher-level language. It also provides hybrid parallelization strategies for large-scale machine learning algorithms ranging from single-node, in-memory computations, to distributed computations on Apache Hadoop and Apache Spark [12]. It started as a research project at IBM Watson Research Center to build a platform where the algorithms are compiled and executed in a MapReduce environment [32]. Open source SystemML was announced in June 2015, and it was accepted as an Apache Incubator project in November 2015.

  • Velox: because the design of existing machine learning frameworks lacks the ability to serve and manage large-scale data models, Velox is a research project which aims to transform statistical models trained by Spark into data products by offering a data management system for online model management, maintenance and serving [20]. It manages the lifecycle of machine learning models, from model training on raw data to the actual actions and predictions which produce additional observations, leading to further model training.

  • HeteroSpark: in addition to building scalable machine learning frameworks, another challenge is to leverage the power of GPUs to achieve better performance and energy efficiency with applications that are both data and computation intensive such as machine learning algorithms. HeteroSpark [54] is a GPU-accelerated heterogeneous architecture integrated with Spark. It combines the massive computing power of GPUs and scalability of CPUs and system memory resources.

  • Splash Footnote 37: a general framework built on Apache Spark for parallelizing stochastic learning algorithms on multinode clusters [96]. It can be substantially faster than existing data analytics packages built on Apache Spark. Stochastic algorithms are efficient approaches for solving machine learning and optimization problems with large-scale datasets; parallelizing these algorithms is a key challenge, especially because they are generally defined as sequential procedures.

There are also other works which focus on implementing and testing machine learning algorithms and utilities on Apache Spark such as parallel subspace clustering [98], decision trees for time series [71], among others. However, there are other existing open source frameworks, in addition to MLlib, for machine learning with big data such as Mahout, H2O and SAMOA. As these tools have advantages and drawbacks, and many have overlapping features, deciding on which framework to use is not easy. In this regard, several papers provide comparisons between some of these tools including MLlib [51, 69].

6 Graph analysis on Apache Spark

In this section, we focus on GraphX, an upper-level library for scalable graph analysis. We introduce its key features that simplify the design and implementation of graph algorithms and pipelines: graph data representation, graph operators, graph algorithms, graph builders and other utilities. In addition, we summarize some research highlights in large-scale graph analytics.

6.1 GraphX: key features

GraphX combines the advantages of both previous graph-parallel systems and current Spark’s data-parallel framework to provide a library for large-scale graph analytics [83]. With an extension to the RDD API, GraphX offers an efficient abstraction for representing graph-oriented data. In addition, it comes with various graph transformations, common graph algorithms and a collection of graph builders (Fig. 5). It also includes a variant of the Pregel API for graph-parallel computations. With GraphX, both data transformations from Spark core and graph transformations can be used. Thus, it provides an integrated framework for complete graph analysis pipelines which consist of both data and graph computation steps.

Fig. 5 Key features of GraphX

6.2 Data abstraction: RDGs

GraphX introduces the Resilient Distributed Graph (RDG) [83], an extension of the RDD API for graph abstraction. The core data structure is a property graph: a directed multigraph (i.e., it may contain pairs of vertices connected by two or more parallel edges) with data attached to its vertices and edges. In other words, each vertex and each edge has properties or attributes. Like RDDs, property graphs are immutable, distributed and fault tolerant. Transformations are defined on graphs, and each operation yields a new graph for changes in values and/or structure.

There are five data types for working with property graphs:

  • Graph: an abstraction of a property graph which is conceptually equivalent to a pair of typed RDDs: one RDD is a partitioned collection of vertices and the other one is a partitioned collection of edges.

  • VertexRDD: a distributed collection of vertices in a property graph. Each vertex is represented by a key–value pair, where the key is a unique id and the value is the data associated with the vertex.

  • Edge: an abstraction of a directed edge in a property graph. Each edge is represented by a source vertex id, destination vertex id and edge attributes.

  • EdgeRDD: a distributed collection of the edges in a property graph.

  • EdgeTriplet: a combination of an edge and the two vertices that it connects. A collection of these triplets represents a tabular view of a property graph.

In the end, graph data are represented as a pair of typed RDDs. Therefore, those RDDs can be transformed and analyzed using the basic RDD API. In other words, the same graph data can be accessed and processed as a pair of collections or as a graph. Analogous to Spark core, where the essential data structure is an immutable RDD (i.e., a collection of data), the core data structure in GraphX is an immutable RDG (i.e., a graph).
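
A short Scala sketch of constructing a property graph from vertex and edge RDDs follows; the users and relationships are made-up example data, and an existing SparkContext sc is assumed.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (id, attribute) pairs; here the attribute is a user name.
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

// Edges: source id, destination id and an edge attribute (a relationship label).
val relations = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "likes")))

val graph = Graph(users, relations)

// The same data can be used as collections (RDD view) or as a graph.
println(graph.vertices.count())
println(graph.edges.filter(_.attr == "follows").count())
```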

6.3 Graph operators

In order to make graph analysis more flexible, GraphX extends Spark operators with a collection of specialized operators for property graphs [35]. Those operators produce new graphs with transformed properties or structure. However, the cost of such transformations is reduced by reusing substantial parts of the original graph (e.g., unaffected structure, properties and indices). The supported graph operators can be classified into the following main categories:

  • Property Operators: There are three property transformation operators: mapVertices, mapEdges and mapTriplets. The graph structure is not affected by these operators, which allows the resulting graph to reuse the structural indices of the original graph.

  • Structural Operators: Current structural transformation operators include reverse, subgraph, mask and groupEdges. The associated data are not affected by the structural operators.

  • Join Operators: joinVertices and outerJoinVertices can be used to join data from other RDDs with graphs in order to update existing properties or add new ones.

  • Aggregation Operators: with these operators, data can be aggregated from a vertex’s neighborhood. This is essential to many graph algorithms such as PageRank. aggregateMessages is the core aggregation operator which aggregates values for each vertex from neighboring vertices and connecting edges. Other aggregation operators compute the degree of each vertex (maxInDegree, maxOutDegree, maxDegrees) or collect neighboring vertices and their attributes at each vertex (collectNeighborIds, collectNeighbors).

In addition, GraphX also supports graph-parallel operators which can be used to implement custom iterative graph algorithms, such as pregel (supported through GraphX’s Pregel API). The pregel operator, together with other operators (e.g., aggregateMessages), can be used to implement custom graph algorithms in a few lines of code. Moreover, GraphX comes with built-in implementations of several graph algorithms which are reviewed briefly in the following section.
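
The Scala sketch below applies a few of the operators discussed above to the small property graph constructed in the previous sketch (the graph value); it is an illustration under those assumptions rather than a complete program.

```scala
// Property operator: transform vertex attributes without changing the structure.
val upper = graph.mapVertices((id, name) => name.toUpperCase)

// Structural operator: keep only the "follows" edges.
val follows = graph.subgraph(epred = triplet => triplet.attr == "follows")

// Aggregation operator: count each vertex's incoming "follows" edges.
val inFollows = follows.aggregateMessages[Int](
  sendMsg = ctx => ctx.sendToDst(1),   // send 1 along each edge to its destination vertex
  mergeMsg = _ + _                     // sum the messages received at each vertex
)
inFollows.collect().foreach(println)
```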

6.4 Graph algorithms

GraphX built-in algorithms include:

  • PageRank: In order to measure the importance of each vertex in a graph, GraphX comes with two implementations of the PageRank algorithm. While staticPageRank runs for a fixed number of iterations using the RDG API, the dynamic pageRank uses the Pregel API and runs until the ranks stop changing.

  • Connected Components: The connectedComponents method finds the connected component membership for each vertex.

  • Strongly Connected Components: The stronglyConnectedComponents method finds the strongly connected component for each vertex and returns a graph. A strongly connected component of a graph is a subgraph containing vertices that are reachable from every other vertex in the same subgraph.

  • Triangle Counting: In order to find the number of triangles passing through each vertex, the triangleCount method checks if a vertex has two neighbor vertices with an edge between them, and returns a graph in which a vertex’s property is the number of triangles containing it.

  • Label Propagation: This algorithm can be used for detecting communities in networks.

  • SVDPlusPlus: An implementation of SVD++ based on [47], which is an integrated model between neighborhood models and latent factor models.

  • Shortest Paths: This finds shortest paths from each vertex in a graph to a given set of vertices.
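
Invoking these built-in algorithms is typically a one-line call on a property graph, as in the following Scala sketch, which reuses the assumed graph value from the earlier sketches; the tolerance value is arbitrary.

```scala
// Dynamic PageRank: iterate until the ranks change by less than the given tolerance.
val ranks = graph.pageRank(tol = 0.001).vertices

// Connected components: label each vertex with the smallest vertex id in its component.
val components = graph.connectedComponents().vertices

// Join the two per-vertex results and inspect a few of them.
ranks.join(components).take(5).foreach(println)
```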

6.5 Graph builders and generators

GraphX comes with some utilities for building graph-oriented datasets from a collection of vertices and edges in an RDD or on disk. GraphLoader provides utilities for loading graphs from files (e.g., edge list formatted file). On the other hand, there are other methods for creating graphs from RDDs: creating a graph from RDDs of vertices and edges, creating a graph from only an RDD of edges, and creating a graph from only an RDD of edge tuples.

However, when testing graph algorithms and pipelines, it can be difficult to find real data with the required qualities. To alleviate this problem, GraphX offers the GraphGenerators utility which contains a random edge generator and several other generators for specific types of graphs, such as a log normal graph (a graph whose vertex out-degree distribution is log normal), an R-MAT graph [17], a grid graph and a star graph.

6.6 Research highlights

Graph analytics, like machine learning, is also essential for getting valuable insights from large-scale graph data. A key need nowadays is a reliable framework to simplify the design and implementation of complex graph algorithms and pipelines. We list here some research directions on large graph analysis:

  • Graph-Parallel Frameworks: Several large-scale graph-parallel frameworks were designed to efficiently execute graph algorithms, such as GraphLab [55], Pregel [57], Kineograph [19] and PowerGraph [34]. However, these frameworks take different views on graph computation and lack effective ETL functionality, which is also a key challenge in big graph mining. They also offer limited fault tolerance and support for interactive data mining. GraphX, on the other hand, provides fault-tolerant, distributed graph computation, and it enables ETL and interactive analysis of massive graphs because it is built on top of Spark core and offers a new graph abstraction. This can simplify the design, implementation and application of complex graph algorithms and pipelines. For more information about existing frameworks and techniques for big graph mining, refer to works such as [6].

  • Querying Big Graphs: GraphFrames [24] integrates pattern matching and graph algorithms with Spark SQL to simplify the graph analysis pipeline and enable optimizations across graph and relational queries. This unifies graph algorithms, graph queries and DataFrames. In addition, Portal [62, 63] is a declarative language built on top of GraphX to support efficient querying and exploratory analysis of evolving graphs. Another example is Quegel [86], a distributed system for large-scale graph querying.

  • Graph-based Representation: It is clear that a reliable representation (e.g., RDGs or property graphs) of graph-oriented data is essential for efficient processing of large graphs. There are other works in this direction such as MedGraph [44] which presents a graph-based representation and computation for large sets of images.

7 Stream processing on Apache Spark

In this section, we focus on Spark Streaming, an upper-level library for large-scale stream processing. We elaborate on some key features and components: stream data abstraction, data sources and receivers, streaming computational model. We also review some examples of how Spark Streaming can be used with other Spark libraries. Then, we summarize some research highlights.

7.1 Spark Streaming: key features

Most traditional stream processing systems are designed to process records one at a time. This is known as the continuous operator model, a simple model which works very well at small scales but faces challenges with large-scale real-time analysis. In order to alleviate these challenges, Spark Streaming uses a micro-batch architecture [45] where a stream is treated as a sequence of small batches of data and the streaming computation is done through a continuous series of batch computations on these batches.

Fig. 6 Key features of Spark Streaming

To achieve such a micro-batch architecture, Spark Streaming comes with several packages or components (Fig. 6) for stream processing. The Streaming Context is the main entry point for all streaming functionality. For a Streaming Context, one parameter (the batch interval) must be set based on the latency requirements of the target application and the available cluster resources. DStream is the basic programming abstraction in Spark Streaming for micro-batch stream processing. Data sources represent the different kinds of streaming sources which Spark Streaming can be linked to. A receiver acts as an interface between a data source and Spark Streaming: it gets data and stores it in Spark’s memory for later processing. A scheduler provides a listener interface for receiving information about an ongoing streaming computation, such as receiver status and processing times. In addition, Spark Streaming supports a variety of transformations, output operations and utilities for stream processing.
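
A minimal Spark Streaming sketch in Scala is shown below: it creates a StreamingContext with a 10-second batch interval and counts words from a socket source; the host, port and batch interval are placeholder assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]") // >= 2 threads: one for the receiver
val ssc = new StreamingContext(conf, Seconds(10))    // batch interval = 10 s

val lines = ssc.socketTextStream("localhost", 9999)  // DStream of text lines (assumed source)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                       // output operation

ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // wait for the streaming computation to terminate
```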

7.2 Data abstraction: DStreams

The basic programming abstraction in Spark Streaming is Discretized Streams (DStreams) [95]. A DStream is a high-level abstraction which represents a continuous stream of data as a sequence of small batches of data. Internally, a DStream is a sequence of RDDs, each RDD holds one time slice of the input stream, and these RDDs can be processed using normal Spark jobs. As a result, DStreams have the same fault tolerance properties as RDDs, and streaming data can be processed using Spark core and the other upper-level libraries. The RDD abstraction itself is a convenient way to design computations on data as a sequence of small, independent steps [70]. In this way, computations can be structured as a set of short, stateless, deterministic tasks instead of continuous, stateful operators, which avoids problems in traditional record-at-a-time stream processing.

7.3 Data sources and receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. Each input DStream (except for those coming from file streams) is associated with a receiver.

7.3.1 Streaming sources

Spark Streaming provides two main categories of built-in streaming sources (basic sources and advanced sources) in addition to custom sources.

  • Basic Sources: these sources are directly available in Spark Streaming API: file systems, socket connections and AkkaFootnote 38 actor streams. In addition, a DStream can be created based on a queue of RDDs which is usually used for testing a Spark Streaming application with test data.

  • Advanced Sources: these sources require extra packages for interfacing with external non-Spark libraries (e.g., Kafka, Flume and Twitter).

  • Custom Sources: Spark Streaming supports creating input DStreams from custom data sources by implementing a user-defined receiverFootnote 39 that is customized for receiving data from the target data source.

7.3.2 Receivers

As its name implies, a receiver gets the data from a streaming source and stores it in Spark’s memory for processing. As data sources can be reliable (i.e., they allow the receiving system to acknowledge the received data correctly) or unreliable, there are two kinds of receivers as well. A reliable receiver correctly sends an acknowledgment to a reliable source when the data have been received and stored in Spark with replication. An unreliable receiver, on the other hand, does not send acknowledgment to a source; it can be used for sources that do not support acknowledgment, or even for reliable sources when acknowledgment is not needed. The receiver package has an interface which can be run on worker nodes to receive external data. In addition, there are several packages which provide Spark Streaming receivers for advanced sources such as Kafka,Footnote 40 Flume,Footnote 41 KinesisFootnote 42 and Twitter.Footnote 43

7.4 Discretized stream processing

With the micro-batch architecture, Spark Streaming can receive data from different sources and divide it into small batches. New input batches are created at regular time intervals (i.e., the batch interval parameter). After getting data from a streaming source and storing it in Spark’s memory, Spark core is used as a batch processing engine to process each batch of data. As the computation model behind Spark Streaming is based on DStreams, input data streams are discretized into batches and represented in a DStream which is stored as a sequence of RDDs in Spark’s memory. Then, streaming computations are executed by generating Spark jobs to process those RDDs. This yields other RDDs representing program outputs or intermediate states. The results of such processing can be pushed out to external systems in batches too. In order to achieve this computational model, Spark Streaming depends on the following:

7.4.1 Transformations

There are two categories of transformations on DStreams: stateless and stateful. Stateless transformations (i.e., normal RDD transformations) of each batch do not depend on the data of its previous batches. On the other hand, stateful transformations (i.e., based on sliding windows and on tracking state across time) use data or intermediate results from previous batches to compute the results of the current batch. In addition, stateless transformations are applied on each batch separately (i.e., each RDD) in a DStream. In other words, they are simple RDD transformations that apply to data within each time slice, but not across time slices. However, stateful transformations are operations which can be applied on data across time and can be divided into two main types: windowed transformations and updateStateByKey transformations.

Windowed transformations combine results from multiple batches. As the name indicates, these transformations can compute results within a window (i.e., within a longer time period than the batch interval). These transformations require two main parameters: window duration which controls how many previous batches are considered and sliding duration which controls how frequently the new DStream computes results (i.e., the interval at which the window operation is performed). On the other hand, an updateStateByKey transformation is useful to maintain a state across the batches in a DStream while continuously updating it with new information. For example, if we need to keep track of the last 10 pages each user visited on a Web site, our state object will be a list of the last 10 pages, and we will update it upon each event (i.e., accessing a web page).
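
Building on the StreamingContext sketch above (ssc and Seconds), the following Scala fragment shows both kinds of stateful transformations on an assumed DStream pageViews of (page, 1) pairs; the window sizes and checkpoint directory are illustrative assumptions.

```scala
ssc.checkpoint("checkpoint-dir")     // checkpointing is required for stateful transformations

// Windowed transformation: counts over the last 60 s, recomputed every 20 s.
val windowedCounts = pageViews.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,         // reduce function
  Seconds(60),                       // window duration
  Seconds(20))                       // sliding duration

// updateStateByKey: maintain a running total per page across all batches.
val runningTotals = pageViews.updateStateByKey[Int] {
  (newValues: Seq[Int], state: Option[Int]) => Some(state.getOrElse(0) + newValues.sum)
}
```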

7.4.2 Output operations

The actual execution of all the DStream transformations is triggered by output operations (similar to normal RDD actions). With these operations, we can specify what should be done with the final results of a stream processing, the output DStreams. For example, printing the results is usually used for debugging. In addition, there are other output operations for pushing the results to an external storage, such as saving them as text files or Hadoop files. Moreover, Spark Streaming supports a generic output operator, foreachRDD, which applies a function to each RDD on the DStream.

7.4.3 Backpressure

Backpressure is a mechanism that allows Spark Streaming to dynamically control the data receiving rate when the system is in an unstable state or the processing conditions change (e.g., a large burst in input or a temporary delay in writing output). Spark Streaming has supported this mechanism since Spark 1.5.

7.4.4 DStream checkpointing

If an executor fails, tasks and receivers are restarted by Spark automatically. However, if the driver fails, Spark Streaming recovers the driver by periodically saving the DAG of DStreams to fault-tolerant storage. The failed driver can then be restarted from the checkpoint information.

7.5 Batch, streaming and interactive analytics

As discussed before, Apache Spark provides a single engine for batch, streaming and interactive workloads. This makes it unique compared to traditional streaming systems, especially regarding fast failure, straggler recovery and load balancing. In addition, Apache Spark can be used for applications which work with both streaming and static data, taking advantage of the native support for interactive analysis and the native integration with the advanced-analysis upper-level libraries. As a DStream is just a series of RDDs, the basic data abstraction for a distributed dataset in Spark, Spark Streaming shares the same data abstraction with Spark core and the other Spark libraries. This allows unification of batch, streaming and interactive analysis. Thus, it can simplify building real-time data pipelines, which are a crucial need in many domains for real-time insights. In the following subsections, we list some examples of using Spark Streaming with other Spark libraries.

7.5.1 Spark Streaming and Spark SQL

As RDDs generated by DStreams can be converted to DataFrames, SQL can be used to query streaming data. Some Spark reference applications [23] demonstrate how different Spark libraries can be used together. For example, the log analysis reference application uses both Spark SQL and Spark Streaming.

7.5.2 Spark Streaming and MLlib

There are two main cases where Spark Streaming and MLlib can be used together. Machine learning models generated offline with MLlib can be applied to streaming data (i.e., offline training, online prediction). Alternatively, machine learning models can be trained on labeled data streams (i.e., online training and prediction). One reference application which uses Spark Streaming with Spark MLlib is the Twitter Streaming Language Classifier [23]. Another one is a platform for large-scale neuroscience [28] at the HHMI Janelia Farm Research Campus, where Spark Streaming is integrated with MLlib to develop streaming machine learning algorithms and perform analyses online during experiments. In addition, Spark MLlib supports some streaming machine learning algorithms such as streaming linear regression and streaming k-means [29].
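As an illustration of the second case (online training and prediction), the following sketch uses MLlib's streaming k-means. The two monitored directories are hypothetical, and the input files are assumed to use MLlib's text formats for vectors and labeled points.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical directories monitored for new files
val trainingData = ssc.textFileStream("hdfs:///streams/train").map(Vectors.parse)       // lines like "[x1,x2]"
val testData     = ssc.textFileStream("hdfs:///streams/test").map(LabeledPoint.parse)   // lines like "(y,[x1,x2])"

val model = new StreamingKMeans()
  .setK(3)                      // number of clusters
  .setDecayFactor(1.0)          // how quickly older batches are forgotten
  .setRandomCenters(2, 0.0)     // dimension 2, initial weight 0.0

model.trainOn(trainingData)                                                   // online training
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()    // online prediction
```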

7.5.3 Spark Streaming and GraphX

One example of using Spark Streaming with GraphX is dynamic community detection [40], where Spark Streaming is used for online incremental community detection and GraphX for offline daily updates. GraphTau [43] is a time-evolving graph processing framework built on top of GraphX and Spark Streaming to support efficient computations on dynamic graphs.

7.6 Research highlights

Spark Streaming is considered one of the most widely used libraries in Spark [22]. As streaming analysis is essential in today's big data industry, a reliable framework is needed for building end-to-end analysis pipelines that integrate streaming with other workloads. In addition to the examples listed in the previous section, we list here some recent endeavors in this direction:

  • StreamDM Footnote 44: an open source data mining and machine learning library developed at Huawei Noah’s Ark Lab and designed on top of Spark Streaming for big data stream learning.

  • IncApprox: a data analytics system based on Spark Streaming. It combines incremental and approximate computing for real-time processing of the input data stream and emits the query result along with a confidence interval or error bounds [49].

There are other projects for stream data processing. For example, Apache FlinkFootnote 45 is another open source project for distributed stream and batch data processing. Flink started as a fork of the StratosphereFootnote 46 research project. It became an Apache Incubator project in March 2014 and was accepted as an Apache top-level project in December 2014. Another project is Apache Storm,Footnote 47 an open source distributed real-time computation system. A comparison between Spark Streaming and other stream processing projects is beyond the scope of this paper.

8 Benchmarks for Apache Spark

With such a rapidly evolving big data framework, reliable and comprehensive benchmarks are essential to reveal Spark’s real efficiency with different workloads. The following is a summary of some works in this direction which includes benchmarking for Apache Spark and big data in general.

  • SparkBench [53]: a Spark benchmarking suite from IBM TJ Watson Research CenterFootnote 48 which covers different workloads on Apache Spark.

  • HiBench Footnote 49: a big data microbenchmark suite from Intel to evaluate big data frameworks such as Hadoop’s MapReduce and Apache Spark.

  • Spark-perf Footnote 50: a performance testing framework for Apache Spark from Databricks.

  • BigBench Footnote 51: a specification-based benchmark for big data. It was recently used to evaluate Spark SQL [42].

  • Yahoo Streaming Benchmarks Footnote 52: benchmarks of three stream processing frameworks at YahooFootnote 53: Apache Flink, Apache Spark and Apache Storm.

  • Spark SQL Performance Tests Footnote 54: a performance testing framework from Databricks for Spark SQL in Apache Spark 1.6+.

  • BigDataBench Footnote 55: a benchmark suite from the Chinese Academy of Sciences for evaluating different workloads using Apache Spark and other frameworks.

  • Spark Performance Analysis Footnote 56: a project for quantifying performance bottlenecks in distributed computation frameworks, and using it to analyze Spark’s performance on different benchmarks [66].

While some of these works are technology agnostic benchmarks (e.g., BigDataBench), others are technology-specific benchmarks which focus on Spark or some of its components (e.g., SparkBench).

9 Discussion

Currently, Apache Spark is adopted and supported by both academia and industry. The community of contributors is growing around the world, and dozens of code changes are made to the project every day. A major release, Spark 2.0, was published while writing this paper [81, 91]. This paper provides a concise summary of Apache Spark from both research and development points of view. The key features of Apache Spark and the variety of applications which can be developed with this framework should now be clearer. For those who want to start developing Spark applications or to try some sample programs, the Databricks community editionFootnote 57 is one place to go. However, as Apache Spark is becoming the de facto standard for big data analytics, it is also important to understand the key differences from the earlier Hadoop MapReduce model, important research and development directions, and the related challenges. These issues are discussed briefly in this section.

9.1 In-memory big data processing

It is clear that in-memory data abstraction is fundamental to Spark core and all its upper-level libraries, which is a key difference from the disk-based Hadoop MapReduce model. It allows intermediate data to be stored in memory instead of being written to and re-read from disk for subsequent transformations and actions. However, this makes memory a precious resource for most workloads on Apache Spark. Moreover, although Apache Spark offers a flexible and advanced DAG model compared to the simple map/reduce model, scheduling Spark jobs is much more difficult than scheduling MapReduce jobs. Apache Spark is also more sensitive to data locality, as accessing data in remote memory is more expensive than accessing data on remote disks [53]; this is why data partitioning requires careful settings for complex applications. Finally, optimizing shuffle operations is essential as these operations are expensive. For example, in a benchmark test using SparkBench [53], the majority of workloads spent more than 50 % of their total execution time on shuffle tasks.
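As a small sketch of why partitioning settings matter, the pair RDD below is hash-partitioned once and persisted in memory so that the subsequent key-based aggregation reuses the same partitioning instead of triggering another shuffle. Here sc is assumed to be an existing SparkContext; the path, key format and number of partitions are illustrative.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Hypothetical tab-separated log records keyed by userId
val events = sc.textFile("hdfs:///logs/events")
  .map(line => (line.split('\t')(0), line))

val partitioned = events
  .partitionBy(new HashPartitioner(200))     // choose the partition count to match data size and cluster
  .persist(StorageLevel.MEMORY_ONLY)         // only beneficial if the data fits in memory

// reduceByKey reuses the existing partitioner, so no additional shuffle is needed
val eventsPerUser = partitioned.mapValues(_ => 1L).reduceByKey(_ + _)
```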

9.2 Data analysis workloads

Another key advantage of Apache Spark is its support for a wide range of data applications such as machine learning, graph analysis, streaming and structured data processing. While Apache Spark offers a single framework for all these workloads, different frameworks and platforms were needed with Hadoop's MapReduce model. In addition, some of the main projects which contributed to Spark's MLlib and ML libraries (e.g., MLbase, KeystoneML) have features that have not yet been included in the official releases of Apache Spark. Although Spark Streaming has improved considerably, for truly low-latency, high-throughput applications Spark may not be the right tool unless the new Structured Streaming feature proves efficient in practice. For detailed comparisons between Spark Streaming and other stream analysis frameworks (e.g., Apache Flink), refer to recent works such as [59]. In this regard, it is also worth noting that Apache Spark was fundamentally designed for batch processing (i.e., ETL operations).

9.3 APIs and data abstraction

Apache Spark provides easy-to-use APIs for operating on large datasets across different programming languages (Scala, Java, Python and R) and at different levels of data abstraction. This makes it easier for data engineers and scientists to build data algorithms and workflows with less development effort. There are three main sets of APIs in Apache Spark, each with a different level of abstraction. Two of these, the DataFrame and Dataset APIs, have recently been merged into one API in Spark 2.0.Footnote 58 This will help unify data processing capabilities across the upper-level libraries. The RDD API remains the low-level abstraction and is the best choice for fine-grained control of low-level operations, especially when working with unstructured data. However, RDDs cannot take advantage of Spark's advanced optimizers (i.e., the Catalyst optimizer and the Tungsten execution engine) and do not infer the schema of structured data. It is therefore recommended to use the DataFrame and Dataset APIs when working with structured and semi-structured data. These APIs are built on top of Spark SQL, which uses the Catalyst optimizer (to generate optimized logical and physical query plans) and Tungsten's fast in-memory encoding. For better type safety at compile time, the Dataset API is the best choice.
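The following hedged sketch contrasts the three levels of abstraction on the same data; the file path, schema and queries are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ApiLevels").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// RDD API: low-level, schema-free, no Catalyst/Tungsten optimizations
val rdd = spark.sparkContext
  .textFile("hdfs:///data/people.csv")
  .map(_.split(","))
  .map(a => Person(a(0), a(1).trim.toInt))

// DataFrame API: untyped rows with a schema, optimized by Catalyst
val df = rdd.toDF()
df.filter($"age" > 21).groupBy($"age").count().show()

// Dataset API: the same optimizations plus compile-time type safety
val ds = rdd.toDS()
ds.filter(_.age > 21).map(_.name).show()
```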

DataFrames and Datasets are essential for other libraries such as the ML pipelines API and the new Structured Streaming (i.e., streaming DataFrames) and GraphFrames APIs. The Structured Streaming engine is developed as a core component of Spark 2.0. It is a declarative API that extends DataFrames and Datasets. With this high-level, SQL-like API, various analytic workloads (e.g., ad hoc queries and machine learning algorithms) can be run directly against a stream; for example, one can track state over a stream and then run SQL queries on it, or train a machine learning model and then continuously update it. GraphFrames, on the other hand, is a new API which integrates graph queries and graph algorithms with Spark SQL (or, in other words, with the DataFrame API) [24, 25]. One key component of GraphFrames is a graph-aware query planner. GraphFrames are to DataFrames what GraphX's resilient distributed graphs (RDGs) are to RDDs.
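A minimal Structured Streaming sketch in the spirit of this declarative API, assuming a socket source on localhost:9999 and console output (both are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StructuredWordCount").getOrCreate()
import spark.implicits._

// A streaming DataFrame of text lines, queried like a static DataFrame
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Spark maintains the aggregation state across micro-batches
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")    // emit the full updated result table on each trigger
  .format("console")
  .start()

query.awaitTermination()
```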

All that said, choosing the right API may also depend on the programming language. For example, while the Dataset API is designed to work equally well with both Java and Scala, the DataFrame API is very Scala-centric. In addition, since R and Python have no compile-time type safety, the DataFrame API is the suitable choice when working with these languages [21]. There is no doubt that data abstraction has recently improved in Apache Spark, but having these different levels of abstraction with frequent updates may confuse developers, especially those working on production applications. We believe that these APIs still need time to mature and prove their efficiency on real big data applications.

9.4 Tungsten project for memory management

While the DataFrame and Dataset APIs make Spark more accessible, the Tungsten project [82] aims to make Apache Spark faster by improving the efficiency of memory and CPU usage for Spark applications. This is essential for big data processing, especially as CPU is increasingly becoming the performance bottleneck in data analysis frameworks [66]. Apache Spark has included features from the Tungsten project since Spark 1.4, and the Tungsten engine is now one of the core components of Spark 2.0. It is built upon ideas from modern compilers and massively parallel processing (MPP) databases [81].

9.5 Debugging of Spark applications

Although Apache Spark has evolved as a replacement for MapReduce by providing a framework that simplifies the difficult task of writing parallelizable programs, Spark is not yet a perfectly engineered system [31]. A crucial challenge in such a framework for large-scale data processing is debugging. Developers need to understand the internals of the Spark engine and its low-level architecture in order to identify the root causes of application failures. One recent work on this problem is the BigDebugFootnote 59 project, which aims to provide a set of interactive, real-time debugging primitives for frameworks like Apache Spark [38]. An essential part of this project is Titian [41], a data provenance library for tracking data through transformations in Apache Spark.

9.6 Related research

In addition to the research highlights we presented in the previous sections, there are other research works which have been done using Apache Spark as a core engine for solving data problems in machine learning and data mining [5, 36], graph processing [16], genomic analysis [60, 65], time series data [71], smart grid data [73], spatial data processing [87], scientific computations of satellite data [67], large-scale biological sequence alignment [97] and data discretization [68]. There are also some recent works on using Apache Spark for deep learning [46, 64]. CaffeOnSpark is an open source projectFootnote 60 from YahooFootnote 61 for distributed deep learning on big data with Apache Spark.

Other works compare Apache Spark with other frameworks such as MapReduce [72], study the performance of Apache Spark in specific scenarios such as scale-up configurations [10], analyze the performance of Spark's programming model for large-scale data analytics [78] and identify performance bottlenecks in Apache Spark [11, 66]. In addition, as Apache Spark offers language-integrated APIs, there are efforts to provide APIs in other languages. MobiusFootnote 62 (formerly Spark-CLR) is a cross-company open source project at Microsoft Research that aims to provide C# language bindings for Apache Spark. There is also a considerable body of research on distributed frameworks, including Apache Spark, for solving big data challenges [3, 27].

10 Conclusions

In this paper, we have presented a review of the key features of Apache Spark for big data analytics. Apache Spark is a general-purpose cluster computing framework with an optimized engine that supports advanced execution DAGs and APIs in Java, Scala, Python and R. Spark's MLlib, including the ML pipelines API, provides a variety of functionalities for designing, implementing and tuning machine learning algorithms and pipelines. GraphX is built on top of property graphs (i.e., an extension of RDDs for graph representation) and comes with a collection of operators to simplify graph analysis. Spark Streaming is built on top of DStreams (i.e., an extension of RDDs for streaming data) and supports a wide range of operations and data sources.

While the RDD is the basic abstraction and the RDD API will remain the low-level API, two alternatives are now under active development: the DataFrame API and the Dataset API. These alternatives are becoming the backbone of Apache Spark for better data representation and computation optimization. Current efforts in this regard include, but are not limited to, GraphFrames, Structured Streaming and the Tungsten project as a whole.

Considering the upper-level libraries built on top of Spark core, Apache Spark provides a unified engine which goes beyond batch processing to combine different workloads such as iterative algorithms, streaming and interactive queries. It is apparent that the Apache Spark project, supported by other projects from academia and industry, has already made essential contributions to solving key challenges of big data analytics. However, despite several benchmarking endeavors, the big data community still needs more in-depth analyses of Apache Spark's performance in different scenarios.