Abstract
Hadoop has emerged as a popular choice for processing big data, and Hadoop clusters are used to run large-scale jobs. The performance of a cluster depends largely on the scheduling policy used for job processing. However, a single scheduling policy may not suit different kinds of jobs, and an inappropriate policy leads to inefficient cluster performance. Existing policies are either too complex or too elementary to account for diverse jobs and their needs. Most of them follow a fixed pattern, which cannot serve as a common solution for different jobs. The effect of such a poorly fitting mechanism is lower resource utilization and poor cluster performance. In this paper, a pluggable scheduling mechanism is proposed for efficient and adaptive processing of jobs. It uses the Matching Market concept for allocation and adaptively accommodates the diverse needs of multiple jobs by taking the varying requirements of their tasks into account. The experimental results show enhanced resource utilization and improved cluster performance, with an overall reduction in makespan. In certain instances, resource utilization improved up to 80% and performance improved up to 60% with the proposed technique. Cluster efficiency increased up to 31%. The evaluation compared various scheduling policies on different Hadoop benchmarks with the same data and identical configurations. The proposed system shows a significant improvement in cluster efficiency.
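The abstract does not spell out the allocation algorithm, so the following is only a minimal sketch of the general matching-market idea, not the EMM scheduler itself. It assumes each task holds a numeric valuation for each node and picks the one-to-one assignment with the highest total valuation by brute force. The function name `best_matching` and the valuation numbers are hypothetical and chosen purely for illustration.

```python
from itertools import permutations


def best_matching(valuations):
    """Return (assignment, total) for the assignment with the highest total valuation.

    valuations[t][n] is a hypothetical score for running task t on node n.
    assignment[t] is the node given to task t; each node hosts at most one task.
    Brute force over all assignments, so only suitable for tiny examples.
    """
    n_tasks = len(valuations)
    n_nodes = len(valuations[0])
    best_total = float("-inf")
    best_assignment = None
    # Try every way of giving each task a distinct node.
    for nodes in permutations(range(n_nodes), n_tasks):
        total = sum(valuations[t][nodes[t]] for t in range(n_tasks))
        if total > best_total:
            best_total = total
            best_assignment = list(nodes)
    return best_assignment, best_total


# Illustrative valuations (assumed, not taken from the paper).
valuations = [
    [9, 8, 1],  # task 0
    [9, 2, 1],  # task 1
    [3, 2, 2],  # task 2
]

assignment, total = best_matching(valuations)
print(assignment, total)  # [1, 0, 2] 19
```

In this toy instance, handing each task its favourite free node in submission order would place task 0 on node 0 and reach a total of only 13, whereas the matching gives node 0 to task 1 and reaches 19. That gap is the kind of inefficiency a fixed-pattern policy can leave behind.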
Acknowledgements
The authors thank Yahoo! for providing access to the cluster computing data.
Cite this article
Singh, B., Verma, H. EMM: Extended matching market based scheduling for big data platform hadoop. Multimed Tools Appl 81, 34823–34847 (2022). https://doi.org/10.1007/s11042-021-11283-3
DOI: https://doi.org/10.1007/s11042-021-11283-3