Multi-objective scheduling of MapReduce jobs in big data processing

  • Ibrahim Abaker Targio Hashem
  • Nor Badrul Anuar
  • Mohsen Marjani
  • Abdullah Gani
  • Arun Kumar Sangaiah
  • Adewole Kayode Sakariyah

Abstract

Data generation has increased drastically over the past few years owing to the rapid development of Internet-based technologies, a period that has come to be called the big data era. Big data offers a paradigm shift in data exploration and utilization. The MapReduce computational paradigm is a well-known framework and is considered the main enabler for distributed, scalable processing of large amounts of data. Despite recent efforts to improve the performance of MapReduce, scheduling MapReduce jobs across multiple nodes is still regarded as a multi-objective optimization problem, and it becomes increasingly complex when virtualized clusters in cloud computing are used to execute a large number of tasks. This study aims to optimize MapReduce job scheduling with respect to completion time and the cost of cloud service models. First, the problem is formulated as a multi-objective model with two objective functions: (i) minimizing completion time and (ii) minimizing cost. Second, a scheduling algorithm based on earliest finish time scheduling, which considers both resource allocation and job scheduling in the cloud, is proposed. Lastly, experimental results show that the proposed scheduler performs better than other well-known schedulers, such as FIFO and Fair.
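The abstract does not give the algorithm itself, so the following is only a minimal illustrative sketch of the general idea of earliest-finish-time assignment with two tracked objectives (completion time and cost). It is not the authors' scheduler. The `VM` class, its `speed` and `price` fields, the tie-breaking rule, and the function name `schedule_eft` are all hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Hypothetical virtual machine model (not from the paper)."""
    name: str
    speed: float            # work units processed per second (assumed)
    price: float            # cost per second of use (assumed)
    ready_at: float = 0.0   # time at which the VM next becomes free


def schedule_eft(task_sizes, vms):
    """Greedily place each task on the VM where it would finish earliest.

    Ties on finish time are broken by the lower cost of running the task.
    Returns (plan, makespan, total_cost), where plan is a list of
    (task_id, vm_name, start_time, finish_time) tuples.
    """
    plan = []
    total_cost = 0.0
    for task_id, size in enumerate(task_sizes):
        def finish_then_cost(vm):
            run_time = size / vm.speed
            return (vm.ready_at + run_time, run_time * vm.price)

        best = min(vms, key=finish_then_cost)
        run_time = size / best.speed
        start = best.ready_at
        best.ready_at = start + run_time      # the VM is busy until the task ends
        total_cost += run_time * best.price   # second objective: monetary cost
        plan.append((task_id, best.name, start, best.ready_at))

    makespan = max(vm.ready_at for vm in vms)  # first objective: completion time
    return plan, makespan, total_cost


if __name__ == "__main__":
    machines = [VM("fast", speed=2.0, price=3.0), VM("slow", speed=1.0, price=1.0)]
    plan, makespan, cost = schedule_eft([4, 2, 2], machines)
    for entry in plan:
        print(entry)
    print("makespan:", makespan, "cost:", cost)
```

In this sketch, finish time drives the placement and cost only breaks ties; a genuinely multi-objective formulation would trade the two off explicitly, for example through a weighted sum or a Pareto front.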

Keywords

Hadoop, MapReduce, Cloud computing, Big data, Scheduling algorithms

Notes

Acknowledgments

This paper is financially supported by the University Malaya Research Grant Programme (Equitable Society) under grant RP032B-16SBS.

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Ibrahim Abaker Targio Hashem (1)
  • Nor Badrul Anuar (1)
  • Mohsen Marjani (1)
  • Abdullah Gani (1)
  • Arun Kumar Sangaiah (2)
  • Adewole Kayode Sakariyah (1)

  1. Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
  2. School of Computing Science and Engineering, VIT University, Vellore, India
