An experimental and comparative benchmark study examining resource utilization in managed Hadoop context

Cluster Computing

Abstract

Transitioning cloud-based Hadoop frameworks from IaaS to PaaS, which are commercially offered as pay-as-you-go or pay-per-use, often reduces the associated system costs. However, managed Hadoop systems obscure the inner performance dynamics of the platform and behave as a black box from the end user's perspective. The aim of this study was to investigate the resource utilization of current managed Hadoop platforms. To this end, we examined three prominent Hadoop-on-PaaS offerings in their out-of-the-box configurations and ran Hadoop-specific workloads using the HiBench Benchmark Suite. During the benchmark executions, system resource utilization data were collected from the worker nodes and analyzed. The results indicated that the same property specifications across cloud services neither guarantee similar performance nor produce consistent results across different workloads within the same service. We anticipate that the managed systems’ architectures and pre-configurations play a crucial role in the performance outcomes.

Data availability

All data generated or analysed during this study are included in this published article: “Huang et al. [18]”. The files and code used in the evaluations, the entire results, and detailed documentation for the experimental environment setup are available to interested researchers in our GitHub repository: https://github.com/emretto/benchmark-hadoop-on-paas.

Notes

  1. https://bit.ly/31IcS7F.

  2. https://bit.ly/3yoQFHS.

  3. https://bit.ly/3m2nntA.

References

  1. Apache Hadoop. https://hadoop.apache.org/. Accessed 22 May 2022

  2. Announcing Amazon Elastic Compute Cloud (Amazon EC2)—beta. https://aws.amazon.com/about-aws/whats-new/2006/08/24/announcing-amazon-elastic-compute-cloud-amazon-ec2---beta/. Accessed 22 May 2022

  3. TPC-History. http://tpc.org/information/about/history5.asp. Accessed 22 May 2022

  4. SPEC—Standard Performance Evaluation Corporation. https://www.spec.org/. Accessed 22 May 2022

  5. Han, R., John, L.K., Zhan, J.: Benchmarking Big Data systems: a review. IEEE Trans. Serv. Comput. 11, 580–597 (2018). https://doi.org/10.1109/TSC.2017.2730882

  6. Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 29–43 (2003). https://doi.org/10.1145/1165389.945450

  7. White, T.: Hadoop: The Definitive Guide. O’Reilly, Beijing (2015)

  8. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Presented at the OSDI 2004—6th Symposium on Operating Systems Design and Implementation (2004)

  9. Hornung, T., Przyjaciel-Zablocki, M., Schätzle, A.: Giant Data: MapReduce and Hadoop. ADMIN Magazine. http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop/. Accessed 30 Oct 2020

  10. Ramel, D.: What are Gartner’s “Cautions” about big 3 cloud providers? Virtualization Review (4 Aug 2021). https://virtualizationreview.com/articles/2021/08/04/gartner-cloud-2021.aspx. Accessed 15 Apr 2021

  11. Azure HDInsight—Hadoop, Spark, & Kafka Service—Microsoft Azure. https://azure.microsoft.com/en-us/services/hdinsight/. Accessed 8 Jan 2021

  12. Announcing general availability of Azure HDInsight 3.6. https://azure.microsoft.com/en-us/blog/announcing-general-availability-of-azure-hdinsight-3-6/. Accessed 14 Jan 2021

  13. Dataproc. https://cloud.google.com/dataproc. Accessed 8 Jan 2021

  14. Compute Engine: Virtual Machines (VMs). https://cloud.google.com/compute. Accessed 8 Jan 2021

  15. What is E-MapReduce?—Product Introduction—Alibaba Cloud Documentation Center. https://www.alibabacloud.com/help/doc-detail/28068.htm?spm=a2c63.l28256.b99.4.65e270b2YXyKDV. Accessed 14 Jan 2021

  16. Elastic Compute Service (ECS): Elastic & Secure Cloud Servers—Alibaba Cloud. https://www.alibabacloud.com/product/ecs. Accessed 17 Jan 2021

  17. Alibaba Cloud Linux OS. https://alibaba.github.io/cloud-kernel/os.html. Accessed 14 Jan 2021

  18. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), pp. 41–51 (2010). https://doi.org/10.1109/ICDEW.2010.5452747

  19. GitHub—Intel-bigdata/HiBench. HiBench is a big data benchmark suite. https://github.com/Intel-bigdata/HiBench. Accessed 8 Jan 2021

  20. Guo, Q., Xie, Y., Li, Q., Zhu, Y.: XDataExplorer: a three-stage comprehensive self-tuning tool for Big Data platforms. Big Data Res. 29, 100329 (2022). https://doi.org/10.1016/j.bdr.2022.100329

  21. Sfaxi, L., Aissa, M.M.B.: Babel: a generic benchmarking platform for Big Data architectures. Big Data Res. 24, 100186 (2021)

  22. Prieto, P., Abad, P., Gregorio, J.A., Puente, V.: Fast, accurate processor evaluation through heterogeneous, sample-based benchmarking. IEEE Trans. Parallel Distrib. Syst. 32(12), 2983–2995 (2021)

  23. Ghazali, R., Adabi, S., Down, D.G., Movaghar, A.: A classification of Hadoop job schedulers based on performance optimization approaches. Clust. Comput. 24(4), 3381–3403 (2021)

  24. Ghafari, R., Kabutarkhani, F.H., Mansouri, N.: Task scheduling algorithms for energy optimization in cloud environment: a comprehensive review. Clust. Comput. 25, 1035–1093 (2022). https://doi.org/10.1007/s10586-021-03512-z

  25. Cheng, D., Wang, Y., Dai, D.: Dynamic resource provisioning for iterative workloads on Apache Spark. IEEE Trans. Cloud Comput. (2021). https://doi.org/10.1109/TCC.2021.3108043

  26. Li, C., Cai, Q., Luo, Y.: Dynamic data replacement and adaptive scheduling policies in spark. Clust. Comput. 25(2), 1421–1439 (2022). https://doi.org/10.1007/s10586-022-03541-2

  27. Costa, R.L.D.C., Moreira, J., Pintor, P., dos Santos, V., Lifschitz, S.: A survey on data-driven performance tuning for big data analytics platforms. Big Data Res. 25, 100206 (2021)

  28. Poggi, N., Montero, A., Carrera, D.: Characterizing BigBench queries, hive, and spark in multi-cloud environments. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10661 LNCS, pp. 55–74 (2018). https://doi.org/10.1007/978-3-319-72401-0_5

  29. Wang, H., Shen, H., Reiss, C., Jain, A., Zhang, Y.: Improved intermediate data management for MapReduce frameworks. Presented at the Proceedings—2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020 (2020). https://doi.org/10.1109/IPDPS47924.2020.00062

  30. Hwang, K., Bai, X., Shi, Y., Li, M., Chen, W.-G., Wu, Y.: Cloud performance modeling with benchmark evaluation of elastic scaling strategies. IEEE Trans. Parallel Distrib. Syst. 27, 130–143 (2016). https://doi.org/10.1109/TPDS.2015.2398438

  31. Ahn, H., Kim, H., You, W.: Performance study of spark on YARN cluster using HiBench. Presented at the 2018 IEEE International Conference on Consumer Electronics—Asia, ICCE-Asia 2018 (2018). https://doi.org/10.1109/ICCE-ASIA.2018.8552137

  32. Han, S., Choi, W., Muwafiq, R., Nah, Y.: Impact of memory size on bigdata processing based on Hadoop and Spark. Presented at the Proceedings of the 2017 Research in Adaptive and Convergent Systems, RACS 2017 (2017). https://doi.org/10.1145/3129676.3129688

  33. Samadi, Y., Zbakh, M., Tadonki, C.: Performance comparison between Hadoop and spark frameworks using HiBench benchmarks. Concurr. Comput. (2018). https://doi.org/10.1002/cpe.4367

  34. Ahmed, N., Barczak, A.L., Rashid, M.A., Susnjak, T.: A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters. J. Big Data 8(1), 1–28 (2021)

  35. Shih, W.C., Yang, C.T., Ranjan, R., Chiang, C.I.: Implementation and evaluation of a container management platform on Docker: Hadoop deployment as an example. Clust. Comput. 24(4), 3421–3430 (2021). https://doi.org/10.1007/s10586-021-03337-w

  36. GitHub Repository of the study. https://github.com/emretto/benchmark-hadoop-on-paas. Accessed 24 May 2022

  37. Jota juliojsb/sarviewer. https://github.com/juliojsb/sarviewer. Accessed 12 Dec 2020

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Contributions

UEO: conceptualization, data collection, development of methodology, programming, writing - draft preparation, writing - review & editing. SA: conceptualization, development of methodology, supervision, writing - review & editing.

Corresponding author

Correspondence to Serkan Ayvaz.

Ethics declarations

Conflict of interest

Author Serkan Ayvaz and Author Uluer Emre Ozdil declare that they have no conflict of interest.

Ethical approval

The authors affirm that this material is their own original work and is not currently being considered for publication elsewhere. This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

This article does not contain any studies with human participants or animals performed by any of the authors. Informed consent is therefore not required for this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Özdil, U.E., Ayvaz, S. An experimental and comparative benchmark study examining resource utilization in managed Hadoop context. Cluster Comput 26, 1891–1915 (2023). https://doi.org/10.1007/s10586-022-03728-7

  • DOI: https://doi.org/10.1007/s10586-022-03728-7
