An experimental and comparative benchmark study examining resource utilization in managed Hadoop context

Özdil, Uluer Emre; Ayvaz, Serkan

doi:10.1007/s10586-022-03728-7

Published: 05 September 2022

Volume 26, pages 1891–1915, (2023)

Cluster Computing

Abstract

Transitioning cloud-based Hadoop frameworks from IaaS to PaaS, commercially offered as pay-as-you-go or pay-per-use services, often reduces the associated system costs. However, managed Hadoop systems obscure the inner performance dynamics of the platform and behave as a black box to end users. The aim of this study was to investigate the resource utilization of current managed Hadoop platforms. To this end, we explored three prominent Hadoop-on-PaaS offerings as they come out of the box and ran Hadoop-specific workloads using the HiBench Benchmark Suite. During the benchmark executions, system resource utilization data were collected from the worker nodes and analyzed. The results indicated that the same property specifications across cloud services neither guarantee similar performance outputs nor produce consistent results across different workloads within the same service. We anticipate that the managed systems’ architectures and pre-configurations play a crucial role in the performance outcomes.

Data availability

All data generated or analysed during this study are included in this published article and in Huang et al. [18]. The files and code used in the evaluations, together with detailed documentation for setting up the experimental environment and the entire results, are available to interested researchers in our GitHub repository: https://github.com/emretto/benchmark-hadoop-on-paas.

References

Apache Hadoop. https://hadoop.apache.org/. Accessed 22 May 2022
Announcing Amazon Elastic Compute Cloud (Amazon EC2)—beta. https://aws.amazon.com/about-aws/whats-new/2006/08/24/announcing-amazon-elastic-compute-cloud-amazon-ec2---beta/. Accessed 22 May 2022
TPC-History. http://tpc.org/information/about/history5.asp. Accessed 22 May 2022
SPEC—Standard Performance Evaluation Corporation. https://www.spec.org/. Accessed 22 May 2022
Han, R., John, L.K., Zhan, J.: Benchmarking Big Data systems: a review. IEEE Trans. Serv. Comput. 11, 580–597 (2018). https://doi.org/10.1109/TSC.2017.2730882
Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 29–43 (2003). https://doi.org/10.1145/1165389.945450
White, T.: Hadoop: The Definitive Guide. O’Reilly, Beijing (2015)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Presented at the OSDI 2004—6th Symposium on Operating Systems Design and Implementation (2004)
Hornung, T., Przyjaciel-Zablocki, M., Schätzle, A.: Giant Data: MapReduce and Hadoop, ADMIN Magazine. http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop/. Accessed 30 Oct 2020
Ramel, D.: What are Gartner’s “Cautions” about big 3 cloud providers? (4 Aug 2021). https://virtualizationreview.com/articles/2021/08/04/gartner-cloud-2021.aspx. Accessed 15 Apr 2021
Azure HDInsight—Hadoop, Spark, & Kafka Service—Microsoft Azure. https://azure.microsoft.com/en-us/services/hdinsight/. Accessed 8 Jan 2021
Announcing general availability of Azure HDInsight 3.6. https://azure.microsoft.com/en-us/blog/announcing-general-availability-of-azure-hdinsight-3-6/. Accessed 14 Jan 2021
Dataproc. https://cloud.google.com/dataproc. Accessed 8 Jan 2021
Compute Engine: Virtual Machines (VMs). https://cloud.google.com/compute. Accessed 8 Jan 2021
What is E-MapReduce?—Product Introduction—Alibaba Cloud Documentation Center. https://www.alibabacloud.com/help/doc-detail/28068.htm?spm=a2c63.l28256.b99.4.65e270b2YXyKDV. Accessed 14 Jan 2021
Elastic Compute Service (ECS): Elastic & Secure Cloud Servers—Alibaba Cloud. https://www.alibabacloud.com/product/ecs. Accessed 17 Jan 2021
Alibaba Cloud Linux OS. https://alibaba.github.io/cloud-kernel/os.html. Accessed 14 Jan 2021
Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), pp. 41–51 (2010). https://doi.org/10.1109/ICDEW.2010.5452747
GitHub—Intel-bigdata/HiBench. HiBench is a big data benchmark suite. https://github.com/Intel-bigdata/HiBench. Accessed 8 Jan 2021
Guo, Q., Xie, Y., Li, Q., Zhu, Y.: XDataExplorer: a three-stage comprehensive self-tuning tool for Big Data platforms. Big Data Res. 29, 100329 (2022). https://doi.org/10.1016/j.bdr.2022.100329
Sfaxi, L., Aissa, M.M.B.: Babel: a generic benchmarking platform for Big Data architectures. Big Data Res. 24, 100186 (2021)
Prieto, P., Abad, P., Gregorio, J.A., Puente, V.: Fast, accurate processor evaluation through heterogeneous, sample-based benchmarking. IEEE Trans. Parallel Distrib. Syst. 32(12), 2983–2995 (2021)
Ghazali, R., Adabi, S., Down, D.G., Movaghar, A.: A classification of Hadoop job schedulers based on performance optimization approaches. Clust. Comput. 24(4), 3381–3403 (2021)
Ghafari, R., Kabutarkhani, F.H., Mansouri, N.: Task scheduling algorithms for energy optimization in cloud environment: a comprehensive review. Clust. Comput. 25, 1035–1093 (2022). https://doi.org/10.1007/s10586-021-03512-z
Cheng, D., Wang, Y., Dai, D.: Dynamic resource provisioning for iterative workloads on Apache Spark. IEEE Trans. Cloud Comput. (2021). https://doi.org/10.1109/TCC.2021.3108043
Li, C., Cai, Q., Luo, Y.: Dynamic data replacement and adaptive scheduling policies in spark. Clust. Comput. 25(2), 1421–1439 (2022). https://doi.org/10.1007/s10586-022-03541-2
Costa, R.L.D.C., Moreira, J., Pintor, P., dos Santos, V., Lifschitz, S.: A survey on data-driven performance tuning for big data analytics platforms. Big Data Res. 25, 100206 (2021)
Poggi, N., Montero, A., Carrera, D.: Characterizing BigBench queries, hive, and spark in multi-cloud environments. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10661 LNCS, pp. 55–74 (2018). https://doi.org/10.1007/978-3-319-72401-0_5
Wang, H., Shen, H., Reiss, C., Jain, A., Zhang, Y.: Improved intermediate data management for MapReduce frameworks. Presented at the Proceedings—2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020 (2020). https://doi.org/10.1109/IPDPS47924.2020.00062
Hwang, K., Bai, X., Shi, Y., Li, M., Chen, W.-G., Wu, Y.: Cloud performance modeling with benchmark evaluation of elastic scaling strategies. IEEE Trans. Parallel Distrib. Syst. 27, 130–143 (2016). https://doi.org/10.1109/TPDS.2015.2398438
Ahn, H., Kim, H., You, W.: Performance study of spark on YARN cluster using HiBench. Presented at the 2018 IEEE International Conference on Consumer Electronics—Asia, ICCE-Asia 2018 (2018). https://doi.org/10.1109/ICCE-ASIA.2018.8552137
Han, S., Choi, W., Muwafiq, R., Nah, Y.: Impact of memory size on bigdata processing based on Hadoop and Spark. Presented at the Proceedings of the 2017 Research in Adaptive and Convergent Systems, RACS 2017 (2017). https://doi.org/10.1145/3129676.3129688
Samadi, Y., Zbakh, M., Tadonki, C.: Performance comparison between Hadoop and spark frameworks using HiBench benchmarks. Concurr. Comput. (2018). https://doi.org/10.1002/cpe.4367
Ahmed, N., Barczak, A.L., Rashid, M.A., Susnjak, T.: A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters. J. Big Data 8(1), 1–28 (2021)
Shih, W.C., Yang, C.T., Ranjan, R., Chiang, C.I.: Implementation and evaluation of a container management platform on Docker: Hadoop deployment as an example. Clust. Comput. 24(4), 3421–3430 (2021). https://doi.org/10.1007/s10586-021-03337-w
GitHub Repository of the study. https://github.com/emretto/benchmark-hadoop-on-paas. Accessed 24 May 2022
Jota juliojsb/sarviewer. https://github.com/juliojsb/sarviewer. Accessed 12 Dec 2020

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Authors and Affiliations

Big Data Analytics and Management, Bahcesehir University, Istanbul, Turkey
Uluer Emre Özdil
Capgemini, Insights & Data, Cologne, Germany
Uluer Emre Özdil
Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
Serkan Ayvaz

Contributions

UEO: conceptualization, data collection, development of methodology, programming, writing (draft preparation), writing (review & editing). SA: conceptualization, development of methodology, supervision, writing (review & editing).

Corresponding author

Correspondence to Serkan Ayvaz.

Ethics declarations

Conflict of interest

Serkan Ayvaz and Uluer Emre Özdil declare that they have no conflict of interest.

Ethical approval

The authors affirm that this material is their own original work and is not currently being considered for publication elsewhere. This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

This article does not contain any studies with human participants or animals performed by any of the authors. Informed consent is therefore not required for this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Özdil, U.E., Ayvaz, S. An experimental and comparative benchmark study examining resource utilization in managed Hadoop context. Cluster Comput 26, 1891–1915 (2023). https://doi.org/10.1007/s10586-022-03728-7

Received: 23 December 2021
Revised: 15 August 2022
Accepted: 25 August 2022
Published: 05 September 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s10586-022-03728-7
