The Partition Cost Model for Load Balancing in MapReduce

  • Conference paper
  • In: Cloud Computing and Services Science (CLOSER 2011)

Abstract

The popularity of MapReduce systems for processing large data sets in both industry and science has increased drastically in recent years. While the sample applications often found in the literature, such as word count, are rather simple, e-science applications tend to be complex, thereby posing new challenges to MapReduce systems. The high runtime complexity of e-science applications on the one hand, and the skewed data distributions often encountered in scientific data sets on the other, lead to highly varying reducer execution times. These, in turn, cause high overall execution times and poor resource utilisation.

In this paper, we tackle the challenge of balancing the workload on the reducers, considering both complex reduce tasks and skewed data. We define the partition cost model, which takes into account non-linear reducer tasks, and provide an algorithm for efficient cost estimation in a distributed environment. Finally, we present two load balancing approaches, fine partitioning and dynamic fragmentation, both based on our partition cost model. Both approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We evaluate our solutions using both synthetic and real e-science data.
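
The idea of cost-based balancing can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes that the cost of a partition is the sum of a reducer complexity function applied to each key cluster's size, that this function is quadratic, and that partitions are assigned greedily (most expensive first, to the least loaded reducer). The names `partition_cost` and `assign_partitions` and the sample data are hypothetical.

```python
import heapq
from typing import Callable, Dict, List


def partition_cost(cluster_sizes: List[int],
                   complexity: Callable[[int], float]) -> float:
    """Assumed cost model: sum the reducer complexity over all key clusters."""
    return sum(complexity(n) for n in cluster_sizes)


def assign_partitions(costs: Dict[int, float],
                      num_reducers: int) -> List[List[int]]:
    """Greedy heuristic: hand the most expensive partition to the
    currently least loaded reducer, then repeat."""
    heap = [(0.0, r) for r in range(num_reducers)]  # (load, reducer id)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_reducers)]
    for pid, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, r = heapq.heappop(heap)
        assignment[r].append(pid)
        heapq.heappush(heap, (load + cost, r))
    return assignment


# Hypothetical data: four partitions of 100 tuples each, with different skew.
partitions = {
    0: [100],        # one large cluster
    1: [10] * 10,    # ten small clusters
    2: [50, 50],
    3: [25] * 4,
}
quadratic = lambda n: n ** 2  # assumed non-linear reducer complexity
costs = {pid: partition_cost(sizes, quadratic)
         for pid, sizes in partitions.items()}
assignment = assign_partitions(costs, num_reducers=2)
print(costs)       # {0: 10000, 1: 1000, 2: 5000, 3: 2500}
print(assignment)  # [[0], [2, 3, 1]]
```

All four partitions hold the same number of tuples, so balancing by tuple count would treat them as equal. Under the assumed quadratic cost, partition 0 is ten times as expensive as partition 1, and the cost-based assignment places it alone on one reducer.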

Notes

  1. http://hadoop.apache.org

  2. http://www.g-vo.org/Millennium

Acknowledgements

This work was funded by the German Federal Ministry of Education and Research (BMBF, contract 05A08VHA) in the context of the GAVO-III project and by the Autonomous Province of Bolzano – South Tyrol, Italy, Promotion of Educational Policies, University and Research Department.

Author information

Corresponding author

Correspondence to Benjamin Gufler.

Copyright information

© 2012 Springer Science+Business Media New York

About this paper

Cite this paper

Gufler, B., Augsten, N., Reiser, A., Kemper, A. (2012). The Partition Cost Model for Load Balancing in MapReduce. In: Ivanov, I., van Sinderen, M., Shishkov, B. (eds) Cloud Computing and Services Science. CLOSER 2011. Service Science: Research and Innovations in the Service Economy. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-2326-3_20

  • DOI: https://doi.org/10.1007/978-1-4614-2326-3_20

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-2325-6

  • Online ISBN: 978-1-4614-2326-3

  • eBook Packages: Computer Science, Computer Science (R0)
