
A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks

  • Conference paper
  • Published in: Performance Evaluation and Benchmarking (TPCTC 2021)

Abstract

In recent years, Big Data (BD), High Performance Computing (HPC), and Machine Learning (ML) systems have been converging. This convergence is driven by the growing complexity of long data analysis pipelines that span separate software stacks. With this growing complexity comes a need to evaluate such systems in order to make informed decisions about technology selection and about the sizing and scoping of hardware. While many benchmarks exist for each of these domains, the benchmarking efforts themselves have not converged. As a first step, it is therefore necessary to understand how the individual benchmark domains relate to one another.

In this work, we analyze some of the most expressive and recent benchmarks for BD, HPC, and ML systems. We propose a taxonomy of these benchmarks based on domain-specific dimensions, such as accuracy metrics, and on dimensions common to all domains, such as workload type. Moreover, we aim to enable practitioners to use our taxonomy to identify suitable benchmarks for their BD, HPC, and ML systems. Finally, we identify challenges and research directions for the future of converged BD, HPC, and ML system benchmarking.
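To make the idea of such a taxonomy concrete, it could be encoded as a small lookup structure that classifies benchmarks along common dimensions (domain, workload type) and a domain-specific metric, and then queries it during benchmark selection. The sketch below is a hypothetical illustration, not the survey's actual classification: the benchmark names (YCSB, HPL, MLPerf Training) are well-known suites from each domain, but the dimension values shown are our own assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BenchmarkEntry:
    """One benchmark classified along common and domain-specific dimensions."""
    name: str
    domain: str          # common dimension: "BD", "HPC", or "ML"
    workload_type: str   # common dimension shared by all domains
    metric: str          # domain-specific dimension (throughput, FLOP/s, accuracy, ...)

# Illustrative entries only; the dimension values are assumptions,
# not the classification proposed in the survey.
TAXONOMY = [
    BenchmarkEntry("YCSB", "BD", "key-value serving", "operations/s"),
    BenchmarkEntry("HPL", "HPC", "dense linear algebra", "FLOP/s"),
    BenchmarkEntry("MLPerf Training", "ML", "model training", "time-to-accuracy"),
]

def benchmarks_for(domain: str, workload_type: Optional[str] = None):
    """Select candidate benchmarks for a system domain (and optionally a workload)."""
    return [
        b.name for b in TAXONOMY
        if b.domain == domain
        and (workload_type is None or b.workload_type == workload_type)
    ]

print(benchmarks_for("HPC"))                      # ['HPL']
print(benchmarks_for("BD", "key-value serving"))  # ['YCSB']
```

In practice, each entry would carry all of the taxonomy's dimensions, and filtering over the common dimensions is what lets a practitioner narrow down candidates before comparing domain-specific metrics.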




Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 957407 as DAPHNE. This work has also been supported through the German Research Foundation as FONDA.

Author information

Corresponding author: Ilin Tolovski.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Ihde, N., et al. (2022). A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks. In: Nambiar, R., Poess, M. (eds.) Performance Evaluation and Benchmarking. TPCTC 2021. Lecture Notes in Computer Science, vol. 13169. Springer, Cham. https://doi.org/10.1007/978-3-030-94437-7_7


  • DOI: https://doi.org/10.1007/978-3-030-94437-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-94436-0

  • Online ISBN: 978-3-030-94437-7

  • eBook Packages: Computer Science (R0)
