Abstract
Transitioning cloud-based Hadoop frameworks from IaaS to PaaS, which is commercially conceptualized as pay-as-you-go or pay-per-use, often reduces the associated system costs. However, managed Hadoop systems obscure the inner performance dynamics of the platform and present black-box behavior to end users. The aim of this study was to investigate the resource utilization of current managed Hadoop platforms. To this end, we examined three prominent Hadoop-on-PaaS offerings in their out-of-the-box configurations and ran Hadoop-specific workloads using the HiBench Benchmark Suite. During the benchmark executions, system resource utilization data were collected from the worker nodes and analyzed. The results indicated that identical specifications across cloud services neither guarantee similar performance nor produce consistent results across different workloads within the same service. We anticipate that the managed systems’ architectures and pre-configurations play a crucial role in the performance outcomes.
Data availability
All data generated or analysed during this study are included in this published article (Huang et al. [18]). The files and code used in the evaluations, the complete results, and detailed documentation for the experimental environment setup are available to interested researchers in our GitHub repository: https://github.com/emretto/benchmark-hadoop-on-paas.
References
Apache Hadoop. https://hadoop.apache.org/. Accessed 22 May 2022
Announcing Amazon Elastic Compute Cloud (Amazon EC2)—beta. https://aws.amazon.com/about-aws/whats-new/2006/08/24/announcing-amazon-elastic-compute-cloud-amazon-ec2---beta/. Accessed 22 May 2022
TPC-History. http://tpc.org/information/about/history5.asp. Accessed 22 May 2022
SPEC—Standard Performance Evaluation Corporation. https://www.spec.org/. Accessed 22 May 2022
Han, R., John, L.K., Zhan, J.: Benchmarking Big Data systems: a review. IEEE Trans. Serv. Comput. 11, 580–597 (2018). https://doi.org/10.1109/TSC.2017.2730882
Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. In: Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pp. 29–43 (2003). https://doi.org/10.1145/1165389.945450
White, T.: Hadoop: The Definitive Guide. O’Reilly, Beijing (2015)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Presented at the OSDI 2004—6th Symposium on Operating Systems Design and Implementation (2004)
Hornung, T., Przyjaciel-Zablocki, M., Schätzle, A.: Giant Data: MapReduce and Hadoop. ADMIN Magazine. http://www.admin-magazine.com/HPC/Articles/MapReduce-and-Hadoop/. Accessed 30 Oct 2020
Ramel, D.: What are Gartner’s “Cautions” about big 3 cloud providers? (2021). https://virtualizationreview.com/articles/2021/08/04/gartner-cloud-2021.aspx. Accessed 15 Apr 2021
Azure HDInsight—Hadoop, Spark, & Kafka Service—Microsoft Azure. https://azure.microsoft.com/en-us/services/hdinsight/. Accessed 8 Jan 2021
Announcing general availability of Azure HDInsight 3.6. https://azure.microsoft.com/en-us/blog/announcing-general-availability-of-azure-hdinsight-3-6/. Accessed 14 Jan 2021
Dataproc. https://cloud.google.com/dataproc. Accessed 8 Jan 2021
Compute Engine: Virtual Machines (VMs). https://cloud.google.com/compute. Accessed 8 Jan 2021
What is E-MapReduce?—Product Introduction—Alibaba Cloud Documentation Center. https://www.alibabacloud.com/help/doc-detail/28068.htm?spm=a2c63.l28256.b99.4.65e270b2YXyKDV. Accessed 14 Jan 2021
Elastic Compute Service (ECS): Elastic & Secure Cloud Servers—Alibaba Cloud. https://www.alibabacloud.com/product/ecs. Accessed 17 Jan 2021
Alibaba Cloud Linux OS. https://alibaba.github.io/cloud-kernel/os.html. Accessed 14 Jan 2021
Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), pp. 41–51 (2010). https://doi.org/10.1109/ICDEW.2010.5452747
GitHub—Intel-bigdata/HiBench. HiBench is a big data benchmark suite. https://github.com/Intel-bigdata/HiBench. Accessed 8 Jan 2021
Guo, Q., Xie, Y., Li, Q., Zhu, Y.: XDataExplorer: a three-stage comprehensive self-tuning tool for Big Data platforms. Big Data Res. 29, 100329 (2022). https://doi.org/10.1016/j.bdr.2022.100329
Sfaxi, L., Aissa, M.M.B.: Babel: a generic benchmarking platform for Big Data architectures. Big Data Res. 24, 100186 (2021)
Prieto, P., Abad, P., Gregorio, J.A., Puente, V.: Fast, accurate processor evaluation through heterogeneous, sample-based benchmarking. IEEE Trans. Parallel Distrib. Syst. 32(12), 2983–2995 (2021)
Ghazali, R., Adabi, S., Down, D.G., Movaghar, A.: A classification of Hadoop job schedulers based on performance optimization approaches. Clust. Comput. 24(4), 3381–3403 (2021)
Ghafari, R., Kabutarkhani, F.H., Mansouri, N.: Task scheduling algorithms for energy optimization in cloud environment: a comprehensive review. Clust. Comput. 25, 1035–1093 (2022). https://doi.org/10.1007/s10586-021-03512-z
Cheng, D., Wang, Y., Dai, D.: Dynamic resource provisioning for iterative workloads on Apache Spark. IEEE Trans. Cloud Comput. (2021). https://doi.org/10.1109/TCC.2021.3108043
Li, C., Cai, Q., Luo, Y.: Dynamic data replacement and adaptive scheduling policies in spark. Clust. Comput. 25(2), 1421–1439 (2022). https://doi.org/10.1007/s10586-022-03541-2
Costa, R.L.D.C., Moreira, J., Pintor, P., dos Santos, V., Lifschitz, S.: A survey on data-driven performance tuning for big data analytics platforms. Big Data Res. 25, 100206 (2021)
Poggi, N., Montero, A., Carrera, D.: Characterizing BigBench queries, hive, and spark in multi-cloud environments. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 10661 LNCS, pp. 55–74 (2018). https://doi.org/10.1007/978-3-319-72401-0_5
Wang, H., Shen, H., Reiss, C., Jain, A., Zhang, Y.: Improved intermediate data management for MapReduce frameworks. Presented at the Proceedings—2020 IEEE 34th International Parallel and Distributed Processing Symposium, IPDPS 2020 (2020). https://doi.org/10.1109/IPDPS47924.2020.00062
Hwang, K., Bai, X., Shi, Y., Li, M., Chen, W.-G., Wu, Y.: Cloud performance modeling with benchmark evaluation of elastic scaling strategies. IEEE Trans. Parallel Distrib. Syst. 27, 130–143 (2016). https://doi.org/10.1109/TPDS.2015.2398438
Ahn, H., Kim, H., You, W.: Performance study of spark on YARN cluster using HiBench. Presented at the 2018 IEEE International Conference on Consumer Electronics—Asia, ICCE-Asia 2018 (2018). https://doi.org/10.1109/ICCE-ASIA.2018.8552137
Han, S., Choi, W., Muwafiq, R., Nah, Y.: Impact of memory size on bigdata processing based on Hadoop and Spark. Presented at the Proceedings of the 2017 Research in Adaptive and Convergent Systems, RACS 2017 (2017). https://doi.org/10.1145/3129676.3129688
Samadi, Y., Zbakh, M., Tadonki, C.: Performance comparison between Hadoop and spark frameworks using HiBench benchmarks. Concurr. Comput. (2018). https://doi.org/10.1002/cpe.4367
Ahmed, N., Barczak, A.L., Rashid, M.A., Susnjak, T.: A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters. J. Big Data 8(1), 1–28 (2021)
Shih, W.C., Yang, C.T., Ranjan, R., Chiang, C.I.: Implementation and evaluation of a container management platform on Docker: Hadoop deployment as an example. Clust. Comput. 24(4), 3421–3430 (2021). https://doi.org/10.1007/s10586-021-03337-w
GitHub Repository of the study. https://github.com/emretto/benchmark-hadoop-on-paas. Accessed 24 May 2022
Jota juliojsb/sarviewer. https://github.com/juliojsb/sarviewer. Accessed 12 Dec 2020
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Contributions
UEO: conceptualization, data collection, development of methodology, programming, writing-draft preparation, writing—review & editing. SA: conceptualization, development of methodology, supervision, writing—review & editing
Ethics declarations
Conflict of interest
Author Serkan Ayvaz and Author Uluer Emre Ozdil declare that they have no conflict of interest.
Ethical approval
The authors affirm that this material is their own original work and is not currently being considered for publication elsewhere. This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
Because this article does not contain any studies with human participants or animals performed by any of the authors, informed consent was not required.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Özdil, U.E., Ayvaz, S. An experimental and comparative benchmark study examining resource utilization in managed Hadoop context. Cluster Comput 26, 1891–1915 (2023). https://doi.org/10.1007/s10586-022-03728-7
DOI: https://doi.org/10.1007/s10586-022-03728-7