Load Balancing Approach for a MapReduce Job Running on a Heterogeneous Hadoop Cluster

Bawankule, Kamalakant Laxman; Dewang, Rupesh Kumar; Singh, Anil Kumar

doi:10.1007/978-3-030-65621-8_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12582)

Included in the following conference series:

International Conference on Distributed Computing and Internet Technology

Abstract

Hadoop MapReduce has become the de facto standard for processing large data sets on distributed clusters of commodity hardware. The computing nodes in such a cluster often do not share the same hardware configuration, which makes the cluster heterogeneous. Heterogeneity is now common in industry, research institutes, and academia. Our study shows that the current rules for calculating the required number of Reduce tasks (Reducers) for a MapReduce job are flawed: they lead to significant overutilization of computing resources and degrade the performance of MapReduce jobs running on a heterogeneous Hadoop cluster. Yet there is no definite answer to the question: what is the optimal number of Reduce tasks for a MapReduce job to achieve the best performance on a heterogeneous Hadoop cluster? We propose a new rule that accurately determines the required number of Reduce tasks for a MapReduce job running on a heterogeneous Hadoop cluster. The proposed rule balances the load among the heterogeneous nodes in the Reduce phase of MapReduce. It also minimizes the overutilization of computing resources and improves MapReduce job execution time by an average of 18% and 28% for the TeraSort and PageRank applications, respectively, running on a heterogeneous Hadoop cluster.
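
The abstract does not spell out the existing rules or the proposed one. For orientation only, the sketch below assumes the "current rules" include the rule of thumb from the Hadoop MapReduce tutorial, which suggests 0.95 or 1.75 multiplied by (number of nodes × maximum containers per node). It compares that figure with the same factor applied to the summed per-node capacities. The function names, the capacity-sum variant, and the four-node cluster are hypothetical illustrations, not the authors' rule and not results from the paper.

```python
import math


def rule_of_thumb_reducers(num_nodes, max_containers_per_node, factor=0.95):
    """Hadoop tutorial heuristic: factor * (nodes * max containers per node).

    The tutorial suggests a factor of 0.95 or 1.75.
    """
    return math.floor(factor * num_nodes * max_containers_per_node)


def capacity_sum_reducers(containers_per_node, factor=0.95):
    """Hypothetical variant: apply the same factor to the summed per-node capacity.

    This is an illustration only; it is NOT the rule proposed in the paper.
    """
    return math.floor(factor * sum(containers_per_node))


# Hypothetical heterogeneous cluster: containers each node can run at once.
containers = [8, 8, 4, 2]

# Treating every node as if it had the largest capacity (a uniform assumption).
uniform_estimate = rule_of_thumb_reducers(len(containers), max(containers))

# Accounting for what each node can actually run.
capacity_estimate = capacity_sum_reducers(containers)

print("uniform assumption:", uniform_estimate)
print("summed capacities: ", capacity_estimate)
```

Under these assumed numbers the uniform calculation requests more Reduce tasks than the cluster can run concurrently, which is one plausible way the overutilization described in the abstract could arise on heterogeneous nodes.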

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology Allahabad, Prayagraj, India
Kamalakant Laxman Bawankule, Rupesh Kumar Dewang & Anil Kumar Singh

Corresponding authors

Correspondence to Kamalakant Laxman Bawankule, Rupesh Kumar Dewang or Anil Kumar Singh.

Editor information

Editors and Affiliations

Indian Institute of Technology Guwahati, Guwahati, India
Diganta Goswami
University of Engineering and Technology, Hanoi, Vietnam
Truong Anh Hoang

About this paper

Cite this paper

Bawankule, K.L., Dewang, R.K., Singh, A.K. (2021). Load Balancing Approach for a MapReduce Job Running on a Heterogeneous Hadoop Cluster. In: Goswami, D., Hoang, T.A. (eds) Distributed Computing and Internet Technology. ICDCIT 2021. Lecture Notes in Computer Science, vol 12582. Springer, Cham. https://doi.org/10.1007/978-3-030-65621-8_19

DOI: https://doi.org/10.1007/978-3-030-65621-8_19
Published: 12 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65620-1
Online ISBN: 978-3-030-65621-8
eBook Packages: Computer Science, Computer Science (R0)
