Encyclopedia of Big Data Technologies

Living Edition
| Editors: Sherif Sakr, Albert Zomaya

Apache Spark

Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_37-1


Apache Spark is a cluster computing solution and in-memory processing framework that extends the MapReduce model to support other types of computations such as interactive queries and stream processing (Zaharia et al. 2012). Designed to cover a variety of workloads, Spark introduces an abstraction called RDD!s (RDD!s) that enables running computations in memory in a fault-tolerant manner. RDD!s, which are immutable and partitioned collections of records, provide a programming interface for performing operations, such as map, filter, and join, over multiple data items. For fault-tolerance purposes, Spark records all transformations carried out to build a dataset, thus forming a lineage graph.


Spark (Zaharia et al. 2016) is an open-source big data framework originally developed at the University of California at Berkeley and later adopted by the Apache Foundation, which has maintained it ever since. Spark was designed to address some of the limitations of the...

This is a preview of subscription content, log in to check access.


  1. Alsheikh MA, Niyato D, Lin S, Tan H-P, Han Z (2016) Mobile big data analytics using deep learning and Apache Spark. IEEE Netw 30(3):22–29CrossRefGoogle Scholar
  2. Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD’15). ACM, New York, pp 1383–1394Google Scholar
  3. Freeman J, Vladimirov N, Kawashima T, Mu Y, Sofroniew NJ, Bennett DV, Rosen J, Yang C-T, Looger LL, Ahrens MB (2014) Mapping brain activity at scale with cluster computing. Nat Methods 11(9):941–950CrossRefGoogle Scholar
  4. Gonzalez JE, Xin RS, Dave A, Crankshaw D, Franklin MJ, Stoica I (2014) Graphx: graph processing in a distributed dataflow framework. In: OSDI, vol 14, pp 599–613Google Scholar
  5. Ha K, Chen Z, Hu W, Richter W, Pillai P, Satyanarayanan M (2014) Towards wearable cognitive assistance. In: 12th annual international conference on mobile systems, applications, and services, MobiSys’14. ACM, New York, pp 68–81. http://dx.doi.org/10.1145/2594368.2594383 Google Scholar
  6. Hindman B, Konwinski A, Zaharia M, Ghodsi A, Joseph AD, Katz R, Shenker S, Stoica I (2011) Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol 11, pp 22–22Google Scholar
  7. Hu YC, Patel M, Sabella D, Sprecher N, Young V (2015) Mobile edge computing – a key technology towards 5G. ETSI White Paper 11(11):1–16Google Scholar
  8. Karau H, Konwinski A, Wendell P, Zaharia M (2015) Learning Spark: lightning-fast big data analysis. O’Reilly Media, Inc., BeijingGoogle Scholar
  9. Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S et al (2016) Mllib: machine learning in Apache Spark. J Mach Learn Res 17(1):1235–1241MathSciNetzbMATHGoogle Scholar
  10. Ryza S, Laserson U, Owen S, Wills J (2017) Advanced analytics with Spark: patterns for learning from data at scale. O’Reilly Media, Inc., SebastopolGoogle Scholar
  11. Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2003) Flux: an adaptive partitioning operator for continuous query systems. In: 19th international conference on data engineering (ICDE 2003). IEEE Computer Society, pp 25–36Google Scholar
  12. Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache hadoop YARN: yet another resource negotiator. In: 4th annual symposium on cloud computing (SOCC’13). ACM, New York, pp 5:1–5:16. http://dx.doi.org/10.1145/2523616.2523633
  13. Wu Y, Tan KL (2015) ChronoStream: elastic stateful stream computation in the cloud. In: 2015 IEEE 31st international conference on data engineering, pp 723–734Google Scholar
  14. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: 9th USENIX conference on networked systems design and implementation (NSDI’12). USENIX Association, Berkeley, pp 2–2Google Scholar
  15. Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: 24th ACM symposium on operating systems principles (SOSP’13). ACM, New York, pp 423–438CrossRefGoogle Scholar
  16. Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: a unified engine for big data processing. Commun ACM 59(11):56–65CrossRefGoogle Scholar

Authors and Affiliations

  1. 1.Inria Avalon, LIP Laboratory, ENS LyonUniversity of LyonLyonFrance

Section editors and affiliations

  • Rodrigo N. Calheiros
    • 1
  • Marcos Dias de Assuncao
    • 2
  1. 1.School of Computing, Engineering and MathematicsWestern Sydney UniversityPenrithAustralia
  2. 2.Inria, LIP, ENS LyonLyonFrance