Abstract

Spark has emerged as an easy-to-use, scalable, robust, and fast system for analytics, with a rapidly growing and vibrant community of users and contributors. It is multipurpose, with extensive and modular infrastructure for machine learning, graph processing, SQL, streaming, statistical processing, and more. Its rapid adoption therefore calls for a performance assessment suite that supports agile development, measurement, validation, optimization, configuration, and deployment decisions across a broad range of platform environments and test cases.

Recognizing the need for such comprehensive and agile testing, this paper proposes going beyond existing performance tests for Spark to create an expanded Spark performance testing suite. The proposal describes several desirable properties that follow from the larger scale, greater and evolving variety, and nuanced requirements of Spark's different applications. The paper identifies the major areas of performance characterization and the key methodological aspects that should factor into the design of the proposed suite. The objective is to capture insights from industry and academia on how best to characterize the capabilities of Spark-based analytic platforms and to provide cost-effective, timely assessment of optimization opportunities.

Notes

  1. Current Spark Streaming is not recommended for sub-second response times; however, we discuss it here in anticipation of future improvements.


Acknowledgements

The authors would like to acknowledge all those who contributed suggestions and ideas and provided valuable feedback on earlier drafts of this document. In particular, we would like to thank Alan Bivens, Michael Hind, David Grove, Steve Rees, Shankar Venkataraman, Randy Swanberg, Ching-Yung Lin, and John Poelman.

Author information

Corresponding author

Correspondence to Min Li.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Agrawal, D. et al. (2016). SparkBench – A Spark Performance Testing Suite. In: Nambiar, R., Poess, M. (eds.) Performance Evaluation and Benchmarking: Traditional to Big Data to Internet of Things. TPCTC 2015. Lecture Notes in Computer Science, vol. 9508. Springer, Cham. https://doi.org/10.1007/978-3-319-31409-9_3

  • DOI: https://doi.org/10.1007/978-3-319-31409-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31408-2

  • Online ISBN: 978-3-319-31409-9

  • eBook Packages: Computer Science (R0)
