Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

SparkBench

  • John Poelman
  • Emily May Curtin
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_300-1

Overview

SparkBench is a flexible framework for benchmarking, simulating, comparing, and testing versions of Apache Spark and Spark applications. It provides three levels of parallelism and a variety of built-in data generators and workloads, which let users fine-tune their setup and obtain the benchmarking results they need.

Definition

A framework for benchmarking Apache Spark.

Historical Background

Apache Spark began in 2010 as a research project by Matei Zaharia and others in the Berkeley AMPLab. Following the landmark success of "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Zaharia et al. (2012), Spark gained popularity and usage as its performance advantages over traditional MapReduce workflows became evident. Its scope grew as well, with the introduction of Python and R APIs, machine learning, graph computation, SQL, and...

References

  1. AMPLab Big Data Benchmark. https://amplab.cs.berkeley.edu/benchmark. Accessed 23 Feb 2018
  2. Apache Airflow. http://airbnb.io/projects/airflow/. Accessed 23 Feb 2018
  3. Apache Spark. https://spark.apache.org/. Accessed 23 Feb 2018
  4. Apache Zeppelin. https://zeppelin.apache.org/. Accessed 23 Feb 2018
  5. Azkaban. https://azkaban.github.io/. Accessed 23 Feb 2018
  6. HOCON (Human-Optimized Config Object Notation). https://github.com/lightbend/config/blob/master/HOCON.md. Accessed 23 Feb 2018
  7. IBM Spark-Tracing. https://github.com/CODAI/spark-tracing. Accessed 23 Feb 2018
  8. Intel HiBench Suite. https://github.com/intel-hadoop/HiBench. Accessed 23 Feb 2018
  9. Li M et al (2015) SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark. https://research.spec.org/fileadmin/user_upload/documents/wg_bd/BD-20150401-spark_benchmark-v1.3-spec.pdf. Accessed 23 Feb 2018
  10. Project Jupyter. http://jupyter.org/. Accessed 23 Feb 2018
  11. TPC Decision Support Benchmark. http://www.tpc.org/tpcds/default.asp. Accessed 23 Feb 2018
  12. YourKit Java Profiler. https://www.yourkit.com/java/profiler/features/. Accessed 23 Feb 2018
  13. Zaharia M et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. http://people.csail.mit.edu/matei/papers/2012/nsdi_spark.pdf. Accessed 23 Feb 2018

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. IBM, New York, USA

Section editors and affiliations

  • Meikel Poess (1)
  • Tilmann Rabl (2)
  1. Server Technologies, Oracle, Redwood Shores, USA
  2. Database Systems and Information Management Group, Technische Universität Berlin, Berlin, Germany