SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics

Li, Min; Tan, Jian; Wang, Yandong; Zhang, Li; Salapura, Valentina

doi:10.1007/s10586-016-0723-1

Published: 31 January 2017

Cluster Computing, Volume 20, pages 2575–2589 (2017)

Min Li¹, Jian Tan², Yandong Wang¹, Li Zhang¹ & Valentina Salapura¹

Abstract

Spark has been increasingly adopted by industry for big data analytics, owing to its resilience, scalability and efficient in-memory distributed programming model. Meanwhile, the rapidly growing community is actively incubating a rich ecosystem around Spark to tackle various big data challenges. Current benchmarks fall short in providing guidance for the development, optimization, configuration and deployment of Spark. To this end, we introduce SparkBench, a Spark-specific benchmarking suite. It comprises a selected set of representative applications that identify various performance bottlenecks and reveal resource consumption behavior across execution phases. Overall, SparkBench covers four critical usage patterns of Spark: machine learning, graph processing, stream computation and SQL query processing. We present a comprehensive characterization of resource consumption, data flows and timing information under different execution patterns, and demonstrate that SparkBench can effectively guide the optimization of data analytics platforms to better suit various workloads.

Author information

Authors and Affiliations

IBM Almaden Research Center, San Jose, USA
Min Li, Yandong Wang, Li Zhang & Valentina Salapura
Ohio State University, Columbus, USA
Jian Tan

Corresponding author

Correspondence to Min Li.

About this article

Cite this article

Li, M., Tan, J., Wang, Y. et al. SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics. Cluster Comput 20, 2575–2589 (2017). https://doi.org/10.1007/s10586-016-0723-1

Received: 07 August 2016
Revised: 09 October 2016
Accepted: 28 December 2016
Published: 31 January 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s10586-016-0723-1
