Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach

  • Abolfazl Gandomi
  • Ali Movaghar
  • Midia Reshadi
  • Ahmad Khademzadeh


The MapReduce framework is an effective approach to parallel processing of big data. Improving the performance of MapReduce clusters while reducing their job execution time is a fundamental challenge, and in practice it splits into two problems: maximizing the execution overlap between jobs and finding an optimal job schedule. A key prerequisite for both is a precise model for estimating job execution time, which is difficult to build given the large number and high volume of submitted jobs, the limited consumable resources, and the need for proper Hadoop configuration. This paper presents a model, based on the MapReduce phases, for predicting the execution time of jobs in a heterogeneous cluster, together with a novel heuristic method that significantly reduces the makespan of the jobs. The method first uses a job-profiling tool to extract the execution details of the MapReduce phases through log analysis. Machine-learning methods and statistical analysis are then applied to build a model that predicts runtime. Finally, a job submission and monitoring tool calculates the makespan. Experiments were conducted on standard benchmarks under identical conditions for all jobs; the results show that the average makespan speedup of the proposed method was higher than that of the unoptimized case.
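The workflow the abstract describes can be illustrated with a minimal sketch: per-phase costs obtained from profiling are combined into a runtime prediction, and the predictions feed a greedy heuristic that reduces makespan. All function names, the per-phase coefficients, and the choice of a longest-processing-time-first (LPT) heuristic are illustrative assumptions for this sketch, not the authors' actual model.

```python
# Hypothetical sketch of phase-based runtime prediction plus a greedy
# makespan heuristic. The phase cost coefficients are made-up numbers;
# in the paper's approach they would come from log-based job profiling.

def predict_runtime(input_gb, phase_cost):
    """Estimate job runtime as the sum of per-phase costs, each assumed
    linear in input size: fixed overhead + seconds-per-GB * size."""
    return sum(base + slope * input_gb for base, slope in phase_cost.values())

# Per-phase (fixed overhead in s, s-per-GB) pairs -- illustrative only.
PHASE_COST = {"map": (5.0, 12.0), "shuffle": (3.0, 8.0), "reduce": (4.0, 6.0)}

def lpt_makespan(job_sizes_gb, n_nodes, phase_cost=PHASE_COST):
    """Longest-Processing-Time-first: sort predicted runtimes in
    decreasing order, assign each job to the currently least-loaded
    node, and return the resulting makespan (max node load)."""
    loads = [0.0] * n_nodes
    runtimes = sorted(
        (predict_runtime(s, phase_cost) for s in job_sizes_gb),
        reverse=True,
    )
    for t in runtimes:
        i = loads.index(min(loads))  # least-loaded node
        loads[i] += t
    return max(loads)
```

For example, `lpt_makespan([1.0, 1.0], 2)` schedules two identical 1 GB jobs on two nodes, so the makespan equals a single predicted runtime. A real implementation would replace the linear phase model with the fitted statistical model and account for node heterogeneity with per-node coefficients.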


Keywords: MapReduce · YARN · Hadoop · Scheduling · Modeling · Makespan




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Authors and Affiliations

  1. Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
  2. Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
  3. Iran Telecommunication Research Center (ITRC), Tehran, Iran
