Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach

Abstract

The MapReduce framework is an effective method for parallel processing of big data. Improving the performance of MapReduce clusters and reducing their job execution time is a fundamental challenge for this approach. Two problems arise in particular: how to maximize the execution overlap between jobs, and how to construct an optimal job schedule. Solving either one requires a precise model for estimating job execution time, which is difficult to build because of the large number and volume of submitted jobs, the limited resources available, and the need for proper Hadoop configuration. This paper presents a model, based on the MapReduce phases, for predicting the execution time of jobs in a heterogeneous cluster. It also introduces a novel heuristic method that significantly reduces the makespan of the jobs. The method proceeds in three steps. First, a job profiling tool extracts the execution details of the MapReduce phases through log analysis. Then, machine learning methods and statistical analysis are used to build a model that predicts runtime. Finally, a second tool, the job submission and monitoring tool, is used to calculate the makespan. Experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the proposed method yields an average makespan speedup over the unoptimized case.
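The three steps described in the abstract (profile, predict, order jobs to shrink the makespan) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the profiled samples are made-up numbers, a one-variable least-squares line stands in for the unspecified machine learning model, each job is simplified to a single map stage followed by a single reduce stage, and Johnson's rule (a classic two-stage flow-shop heuristic) stands in for the paper's own heuristic. The function names `fit_line`, `predict`, `makespan`, and `johnson_order` are hypothetical.

```python
from typing import List, Sequence, Tuple

Job = Tuple[float, float]  # (predicted map time, predicted reduce time)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def predict(model: Tuple[float, float], x: float) -> float:
    """Predict a phase duration for input size x from a fitted (a, b) line."""
    a, b = model
    return a * x + b


def makespan(jobs: List[Job]) -> float:
    """Completion time of the last job in a simplified two-stage pipeline.

    Assumption: the map stage of the next job may overlap with the reduce
    stage of the previous one, but each stage runs one job at a time.
    """
    map_done = 0.0
    reduce_done = 0.0
    for map_time, reduce_time in jobs:
        map_done += map_time
        reduce_done = max(reduce_done, map_done) + reduce_time
    return reduce_done


def johnson_order(jobs: List[Job]) -> List[Job]:
    """Johnson's rule: jobs with map <= reduce first (ascending map time),
    then the remaining jobs (descending reduce time)."""
    first = sorted((j for j in jobs if j[0] <= j[1]), key=lambda j: j[0])
    last = sorted((j for j in jobs if j[0] > j[1]), key=lambda j: j[1], reverse=True)
    return first + last


# Hypothetical profiled samples: input size in GB vs. measured map time in seconds.
map_model = fit_line([1.0, 2.0, 4.0], [12.0, 22.0, 42.0])
estimated_map_time = predict(map_model, 3.0)

# Hypothetical predicted (map, reduce) times for three jobs, in submission order.
submitted = [(5.0, 2.0), (1.0, 4.0), (3.0, 3.0)]
reordered = johnson_order(submitted)
baseline_makespan = makespan(submitted)
reordered_makespan = makespan(reordered)
```

Under these assumptions, comparing `baseline_makespan` with `reordered_makespan` shows how predicted phase times can drive a job ordering that increases overlap between jobs. A heterogeneous cluster would need a richer model, for example one fitted line per node type.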




Author information

Corresponding author

Correspondence to Ali Movaghar.


Cite this article

Gandomi, A., Movaghar, A., Reshadi, M. et al. Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach. J Supercomput 76, 7177–7203 (2020). https://doi.org/10.1007/s11227-020-03162-9

Keywords

  • MapReduce
  • YARN
  • Hadoop
  • Scheduling
  • Modeling
  • Makespan