MapReduce scheduling algorithms: a review

  • Ibrahim Abaker Targio Hashem
  • Nor Badrul Anuar
  • Mohsen Marjani
  • Ejaz Ahmed
  • Haruna Chiroma
  • Ahmad Firdaus
  • Muhamad Taufik Abdullah
  • Faiz Alotaibi
  • Waleed Kamaleldin Mahmoud Ali
  • Ibrar Yaqoob
  • Abdullah Gani

Abstract

Recent trends in big data show that the amount of data continues to grow at an exponential rate. Over the past few years, this growth has inspired many researchers to explore new research directions across multiple areas of big data. The widespread popularity of big data processing platforms built on the MapReduce framework has created a growing demand to further optimize their performance for various purposes. In particular, improving resource and job scheduling is becoming critical, since scheduling fundamentally determines whether applications can achieve their performance goals in different use cases. Scheduling plays an important role in big data, mainly in reducing the execution time and cost of processing. This paper surveys the research undertaken in the field of scheduling in big data platforms. It analyzes scheduling in MapReduce from two aspects: taxonomy and performance evaluation. The research progress in MapReduce scheduling algorithms is also discussed. The paper points out the limitations of existing MapReduce scheduling algorithms and identifies future research opportunities so that researchers can locate them easily. For expert researchers, the study can serve as a benchmark when proposing a novel MapReduce scheduling algorithm; for novice researchers, it can be used as a starting point.

Keywords

Big data · Hadoop · MapReduce · Cloud computing · Scheduling algorithms

Notes

Acknowledgements

This work was financially supported by the University of Malaya Research Grant Programme (Equitable Society) under Grant RP032B-16SBS.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Ibrahim Abaker Targio Hashem (1, 2), corresponding author
  • Nor Badrul Anuar (2)
  • Mohsen Marjani (1)
  • Ejaz Ahmed (3)
  • Haruna Chiroma (4)
  • Ahmad Firdaus (5)
  • Muhamad Taufik Abdullah (6)
  • Faiz Alotaibi (6)
  • Waleed Kamaleldin Mahmoud Ali (2)
  • Ibrar Yaqoob (2)
  • Abdullah Gani (1)

  1. School of Computing and Information Technology, Taylor’s University, Subang Jaya, Malaysia
  2. Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
  3. Centre for Mobile Cloud Computing Research, University of Malaya, Kuala Lumpur, Malaysia
  4. Department of Computer Science, Federal College of Education (Technical), Gombe, Nigeria
  5. Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Gambang, Kuantan, Malaysia
  6. Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Serdang, Malaysia
