Load Balancing Approach for a MapReduce Job Running on a Heterogeneous Hadoop Cluster

  • Conference paper
Distributed Computing and Internet Technology (ICDCIT 2021)

Abstract

Hadoop MapReduce has become the de facto standard in today’s Big Data world for processing large data sets on a distributed cluster of commodity hardware. The computing nodes in a commodity cluster often do not share the same hardware configuration, which makes the cluster heterogeneous. Such heterogeneity has become common in industry, research institutes, and academia. Our study shows that the current rules for calculating the required number of Reduce tasks (Reducers) for a MapReduce job are inaccurate, leading to significant overutilization of computing resources and degrading the performance of MapReduce jobs running on a heterogeneous Hadoop cluster. Still, there is no definite answer to the question: what is the optimal number of Reduce tasks a MapReduce job needs to get the best performance from Hadoop in a heterogeneous cluster? We propose a new rule that accurately determines the required number of Reduce tasks for a MapReduce job running on a heterogeneous Hadoop cluster. The proposed rule balances the load among the heterogeneous nodes in the Reduce phase of MapReduce. It also minimizes overutilization of computing resources and improves MapReduce job execution time by an average of 18% and 28% for the TeraSort and PageRank applications, respectively, running on a heterogeneous Hadoop cluster.
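The abstract does not say which existing rules it refers to. As an illustration only, the sketch below assumes it means the guideline in the Apache Hadoop MapReduce documentation, which suggests 0.95 or 1.75 multiplied by (number of nodes × maximum containers per node). The function name, the constants, and the choice to round down are assumptions made for this example. This is not the rule proposed in the paper, which the abstract does not state.

```python
from fractions import Fraction
import math

# Factors suggested by the Hadoop documentation guideline (assumed here
# to be the "current rules" the abstract mentions).
LOW_FACTOR = Fraction(95, 100)    # all reduces can launch in a single wave
HIGH_FACTOR = Fraction(175, 100)  # faster nodes run a second wave of reduces


def conventional_reducer_count(num_nodes, max_containers_per_node,
                               factor=LOW_FACTOR):
    """Return factor * (nodes * containers per node), rounded down.

    Hypothetical helper: rounding down and the minimum of 1 are
    assumptions, not part of the guideline itself.
    """
    if num_nodes < 1 or max_containers_per_node < 1:
        raise ValueError("cluster must have at least one node and container")
    total_containers = num_nodes * max_containers_per_node
    # Fraction arithmetic avoids floating-point surprises before flooring.
    return max(1, math.floor(factor * total_containers))


if __name__ == "__main__":
    # 10 nodes with 4 containers each: 40 containers in total.
    print(conventional_reducer_count(10, 4))               # 38
    print(conventional_reducer_count(10, 4, HIGH_FACTOR))  # 70
```

Note that this guideline takes a single containers-per-node figure, so it treats every node as identical. That uniformity is the kind of assumption a heterogeneous cluster violates.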



Author information

Correspondence to Kamalakant Laxman Bawankule, Rupesh Kumar Dewang or Anil Kumar Singh.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Bawankule, K.L., Dewang, R.K., Singh, A.K. (2021). Load Balancing Approach for a MapReduce Job Running on a Heterogeneous Hadoop Cluster. In: Goswami, D., Hoang, T.A. (eds) Distributed Computing and Internet Technology. ICDCIT 2021. Lecture Notes in Computer Science, vol 12582. Springer, Cham. https://doi.org/10.1007/978-3-030-65621-8_19

  • DOI: https://doi.org/10.1007/978-3-030-65621-8_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65620-1

  • Online ISBN: 978-3-030-65621-8

  • eBook Packages: Computer Science, Computer Science (R0)
