Apache Spark

Apache Spark is a cluster computing and in-memory processing framework that extends the MapReduce model to support other types of computation, such as interactive queries and stream processing (Zaharia et al. 2012). Designed to cover a variety of workloads, Spark introduces an abstraction called Resilient Distributed Datasets (RDDs) that enables running computations in memory in a fault-tolerant manner. RDDs, which are immutable and partitioned collections of records, provide a programming interface for performing operations, such as map, filter, and join, over multiple data items. For fault-tolerance purposes, Spark records all transformations carried out to build a dataset, thus forming a lineage graph from which lost partitions can be recomputed.
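The lineage idea can be illustrated with a small toy model. The sketch below is plain Python and does not use Spark's API: the class ToyRDD and its methods (parallelize, map, filter, lineage, compute_partition, collect) are hypothetical names chosen for illustration. It assumes only narrow transformations, in which each output partition depends on exactly one parent partition, so a lost partition can be rebuilt by replaying the recorded transformations on the matching source partition.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Immutable, partitioned dataset that remembers how it was built.

    Hypothetical toy class for illustration; this is not Spark's API.
    """
    op: str                                   # transformation that produced this dataset
    parent: Optional["ToyRDD"] = None         # previous dataset in the lineage, if any
    fn: Optional[Callable] = None             # function applied by map/filter
    source: Optional[Tuple[tuple, ...]] = None  # partitions held by a source dataset

    @staticmethod
    def parallelize(data: list, num_partitions: int) -> "ToyRDD":
        # Split the input into partitions with a simple stride.
        parts = tuple(tuple(data[i::num_partitions]) for i in range(num_partitions))
        return ToyRDD(op="parallelize", source=parts)

    def map(self, fn: Callable) -> "ToyRDD":
        # Nothing is computed here; only the transformation is recorded.
        return ToyRDD(op="map", parent=self, fn=fn)

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(op="filter", parent=self, fn=pred)

    def num_partitions(self) -> int:
        node = self
        while node.parent is not None:
            node = node.parent
        return len(node.source)

    def lineage(self) -> List[str]:
        # Walk back to the source and report the recorded transformations in order.
        chain, node = [], self
        while node is not None:
            chain.append(node.op)
            node = node.parent
        return list(reversed(chain))

    def compute_partition(self, index: int) -> list:
        # Rebuild one partition from the lineage alone; this is what
        # recovery of a lost partition amounts to in this toy model.
        if self.parent is None:
            return list(self.source[index])
        rows = self.parent.compute_partition(index)
        if self.op == "map":
            return [self.fn(r) for r in rows]
        if self.op == "filter":
            return [r for r in rows if self.fn(r)]
        raise ValueError(f"unknown transformation: {self.op}")

    def collect(self) -> list:
        result = []
        for i in range(self.num_partitions()):
            result.extend(self.compute_partition(i))
        return result


numbers = ToyRDD.parallelize(list(range(10)), num_partitions=2)
squares_over_ten = numbers.map(lambda x: x * x).filter(lambda x: x > 10)

print(squares_over_ten.lineage())             # ['parallelize', 'map', 'filter']
print(squares_over_ten.compute_partition(1))  # [25, 49, 81]
print(squares_over_ten.collect())             # [16, 36, 64, 25, 49, 81]
```

Because every ToyRDD is immutable and stores only its parent and the function applied, recomputing partition 1 never touches partition 0. Operations such as join would require each partition to track several parent partitions, which this sketch deliberately leaves out.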


Spark (Zaharia et al. 2016) is an open-source big data framework originally developed at the University of California at Berkeley and later adopted by the Apache Software Foundation, which has maintained it ever since. Spark was designed to address some of the limitations of the...

