Skip to main content
Log in

SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

Spark has been increasingly employed by industries for big data analytics recently, due to its resilience, scalability and efficient in-memory distributed programming model. Meanwhile, the rapid growing community is also actively incubating a rich ecosystem around Spark to tackle various big data challenges. The current benchmarks fall short in providing guidance of development, optimization, configuration and deployment of Spark. To this end, we introduce SparkBench, a Spark specific benchmarking suite. It selectively embraces a set of representative applications to identify various performance bottlenecks and reveals the resource consumption behaviors across execution phases. Overall, SparkBench covers four critical usage patterns of Spark, including machine learning, graph processing, stream computations and SQL query processing. We present comprehensive characterization of resource consumptions, data flows and timing information under different execution patterns and demonstrate that SparkBench can effectively guide the optimization of data analytic platforms to better suit for various workloads.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16

Similar content being viewed by others

References

  1. Agrawal, D., Butt, A., Kshitij, D., Larriba-Pey, J.-L., Li, M., Reiss, F.R., Raab, F., Schiefer, B., Xia, Y.: Sparkbench: a spark performance testing suite. In Proceedings of TPCTC (2015)

  2. Amazon Movie Review. http://snap.stanford.edu/data/web-Movies.html

  3. AMPLab Big Data Benchmark. https://amplab.cs.berkeley.edu/benchmark/

  4. Apache GridMix. http://hadoop.apache.org/docs/r1.2.1/gridmix.html

  5. Apache Spark. http://spark.apache.org/

  6. Armstrong, T.G., Ponnekanti, V., Borthakur, D., Callaghan, M.: Linkbench: a database benchmark based on the facebook social graph. In Proceedings of the 2013 ACM SIGMOD, pp. 1185–1196 (2013)

  7. Avery, C.: Giraph: large-scale graph processing infrastructure on hadoop. In: Proceedings of the Hadoop Summit, Santa Clara (2011)

  8. Batarfi, O., El Shawi, R., Fayoumi, A.G., Nouri, R., Barnawi, A., Sakr, S., et al.: Large scale graph processing systems: survey and an experimental evaluation. Clust. Comput. 18(3), 1189–1213 (2015)

    Article  Google Scholar 

  9. Chaimov, N., Malony, A., Canon, S., Iancu, C., Ibrahim, K.Z., Srinivasan, J.: Scaling spark on HPC systems. In: HPDC ’16, pp. 97–110. ACM, New York (2016)

  10. Cooper, B.F., Silberstein, A., Tam, E., Ramakrishnan, R., Sears, R.: Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM SOCC, pp. 143–154 (2010)

  11. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  12. Ferdman, M., Adileh, A., Kocberber, O., Volos, S., Alisafaee, M., Jevdjic, D., Kaynak, C., Popescu, A.D., Ailamaki, A., Falsafi, B.: Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In: Proceedings of the 17th ACM ASPLOS, pp. 37–48 (2012)

  13. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: Bigbench: towards an industry standard benchmark for big data analytics. In: Proc of ACM SIGMOD (2013)

  14. Google Web Graph. http://snap.stanford.edu/data/web-Google.html

  15. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: Proceedings of the 8th IEEE ICDM (2008)

  16. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The hibench benchmark suite: characterization of the mapreduce-based data analysis. In 26th IEEE ICDEW, pp. 41–51 (2010)

  17. IBM. Big Data and Analytics Hub. http://www.ibmbigdatahub.com/infographic/four-vs-big-data

  18. IBM SoftLayer. http://www.softlayer.com/

  19. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning. Springer, New York (2013)

    Book  MATH  Google Scholar 

  20. Kolountzakis, M.N., Miller, G.L., Peng, R., Tsourakakis, C.E.: Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Math. 8(1–2), 161–185 (2012)

    Article  MathSciNet  MATH  Google Scholar 

  21. Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceedings of ACM SIGKDD (2008)

  22. Kryo: a fast and efficient Object Graph Serialization Framework for Java. https://github.com/EsotericSoftware/kryo

  23. Li, M., Tan, J., Wang, Y., Zhang, L., Salapura, V.: Sparkbench: a comprehensive benchmarking suite for in memory data analytic platform spark. In: Proceedings of Workshop on Analytics Platforms for the Cloud (2015)

  24. Ming, Z., Luo, C., Gao, W., Han, R., Yang, Q., Wang, L., Zhan, J.: Bdgs: a scalable big data generator suite in big data benchmarking. In: Advancing Big Data Benchmarks, pp. 138–154. Springer, New York (2014)

  25. Nyberg, C., Shah, M., Govindaraju, N.: Sort Benchmark. http://sortbenchmark.org/

  26. Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.-G., VICSI: Making sense of performance in data analytics frameworks. In: Proceedings of USENIX NSDI (2015)

  27. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical Report 1999-66, Stanford InfoLab (1999)

  28. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of ACM SIGMOD (2009)

  29. Peng, J., Choo, K.-K.R., Ashman, H.: Bit-level n-gram based forensic authorship analysis on social media: identifying individuals from linguistic profiles. J. Netw. Comput. Appl. 70, 171–182 (2016)

    Article  Google Scholar 

  30. pigmix. Apache PigMix. https://cwiki.apache.org/confluence/display/PIG/PigMix

  31. Powered By Spark. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

  32. Quick, D., Choo, K.-K.R.: Big forensic data reduction: digital forensic images and electronic evidence. Clust. Comput. 19(2), 723–740 (2016)

  33. Shi, J., Qui, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Ozcan, F.: Clash of the titans: mapreduce vs. spark for large scale data analytics. In: Proceedings of the VLDB Endowment (2015)

  34. Spark Technology Center. https://github.com/SparkTC

  35. SparkBench: A Comprehensive Spark Benchmarking Suite, Anonymized for double blind review. https://goo.gl/woHxxK

  36. Spark-perf:Spark performance tests. https://github.com/databricks/spark-perf

  37. TPC-DS. http://www.tpc.org/tpcds/

  38. TPC-H. http://www.tpc.org/tpch/

  39. Twitter4j: a Java Library for the Twitter API. http://twitter4j.org

  40. Wang, L., Zhan, J., Luo, C., Zhu, Y., Yang, Q., He, Y., Gao, W., Jia, Z., Shi, Y., Zhang, S., Zheng, C., Lu, G., Zhan, K., Li, X., Qiu, B.: BigDataBench. http://prof.ict.ac.cn/BigDataBench/

  41. Wang, L., Zhan, J., Luo, C., Zhu, Y., Yang, Q., He, Y., Gao, W., Jia, Z., Shi, Y., Zhang, S., Zheng, C., Lu, G., Zhan, K., Li, X., Qiu, B.: Bigdatabench: a big data benchmark suite from internet services. In: IEEE 20th HPCA, pp. 488–499 (2014)

  42. Wikipedia Data Dumps. http://dumps.wikimedia.org/enwiki/

  43. WikiXMLJ. https://code.google.com/p/wikixmlj/

  44. Xiong, W., Yu, Z., Bei, Z., Zhao, J., Zhang, F., Zou, Y., Bai, X., Li, Y., Xu, C.: A characterization of big data benchmarks. In: IEEE International Conference on Big Data, pp. 118–125 (2013)

  45. Xu, Z., Luo, X., Liu, Y., Choo, K.K.R., Sugumaran, V., Yen, N., Mei, L., Hu, C.: From latency, through outbreak, to decline: detecting different states of emergency events using web resources. IEEE Trans. Big Data PP(99):1–1 (2016)

  46. Xu, Z., Xuan, J., Liu, Y., Choo, K.-K.R., Mei, L., Hu, C.: Building spatial temporal relation graph of concepts pair using web repository. In: Information Systems Frontiers, pp. 1–10 (2016)

  47. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX NSDI, Berkeley, CA (2012)

  48. Zhang, F., Liu, M., Gui, F., Shen, W., Shami, A., Ma, Y.: A distributed frequent itemset mining algorithm using spark for big data analytics. Clust. Comput. 18(4), 1493–1501 (2015)

    Article  Google Scholar 

  49. Zhu, J., Xu, C., Li, Z., Fung, G., Lin, X., Huang, J., Huang, C.: An examination of on-line machine learning approaches for pseudo-random generated data. Clust. Comput. 19(3), 1309–1321 (2016)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Min Li.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, M., Tan, J., Wang, Y. et al. SparkBench: a spark benchmarking suite characterizing large-scale in-memory data analytics. Cluster Comput 20, 2575–2589 (2017). https://doi.org/10.1007/s10586-016-0723-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-016-0723-1

Keywords

Navigation