Abstract
Hadoop MapReduce has become the de facto standard in today’s Big Data world for processing large data sets on a distributed cluster of commodity hardware. The computing nodes in a commodity cluster often do not share the same hardware configuration, which leads to heterogeneity. Heterogeneity has become common in industry, research institutes, and academia. Our study shows that the current rules for calculating the required number of Reduce tasks (Reducers) for a MapReduce job are flawed, leading to significant overutilization of computing resources. They also degrade the performance of MapReduce jobs running on a heterogeneous Hadoop cluster. Moreover, there is no definite answer to the question: what is the optimal number of Reduce tasks required for a MapReduce job to get the best performance from Hadoop in a heterogeneous cluster? We propose a new rule that accurately decides the required number of Reduce tasks for a MapReduce job running on a heterogeneous Hadoop cluster. The proposed rule balances the load among the heterogeneous nodes in the Reduce phase of MapReduce. It also minimizes the overutilization of computing resources and improves MapReduce job execution time by an average of 18% and 28% for the TeraSort and PageRank applications, respectively, running on a heterogeneous Hadoop cluster.
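For context on the "current rules" mentioned in the abstract, the Hadoop MapReduce documentation suggests a rule of thumb of 0.95 or 1.75 multiplied by (number of nodes × maximum containers per node). The sketch below only illustrates that conventional rule. It is not the rule proposed in this paper, which the abstract does not spell out. The per-node container counts, the function name, the summing of containers across unequal nodes, and the rounding down are all assumptions made for illustration.

```python
import math


def conventional_reducer_counts(containers_per_node):
    """Return (low, high) reducer counts from the documented rule of thumb.

    containers_per_node: hypothetical list of reduce containers per node.
    Summing per-node values is an assumed generalization of
    "nodes * max containers per node" for nodes that are not identical.
    Rounding down is also an assumption; the rule itself gives no rounding.
    """
    total_containers = sum(containers_per_node)
    low = math.floor(0.95 * total_containers)   # all reduces launch in one wave
    high = math.floor(1.75 * total_containers)  # faster nodes take a second wave
    return low, high


# Hypothetical heterogeneous cluster: two large nodes, one medium, one small.
cluster = [8, 8, 4, 2]
low, high = conventional_reducer_counts(cluster)
print(low, high)
```

Note that this rule counts containers only. It takes no account of how fast each node actually processes a Reduce task, which is the kind of imbalance on heterogeneous clusters that the abstract describes.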
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Bawankule, K.L., Dewang, R.K., Singh, A.K. (2021). Load Balancing Approach for a MapReduce Job Running on a Heterogeneous Hadoop Cluster. In: Goswami, D., Hoang, T.A. (eds) Distributed Computing and Internet Technology. ICDCIT 2021. Lecture Notes in Computer Science, vol 12582. Springer, Cham. https://doi.org/10.1007/978-3-030-65621-8_19
DOI: https://doi.org/10.1007/978-3-030-65621-8_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65620-1
Online ISBN: 978-3-030-65621-8
eBook Packages: Computer Science, Computer Science (R0)