The Partition Cost Model for Load Balancing in MapReduce

Gufler, Benjamin; Augsten, Nikolaus; Reiser, Angelika; Kemper, Alfons

doi:10.1007/978-1-4614-2326-3_20

Part of the book series: Service Science: Research and Innovations in the Service Economy (SSRI)

Included in the following conference series:

International Conference on Cloud Computing and Services Science

Abstract

The popularity of MapReduce systems for processing large data sets in both industry and science has increased drastically in recent years. While the sample applications commonly found in the literature, such as word count, are rather simple, e-science applications tend to be complex and thereby pose new challenges to MapReduce systems. The high runtime complexity of e-science applications, combined with the skewed data distributions often encountered in scientific data sets, leads to highly varying reducer execution times. These, in turn, cause high overall execution times and poor resource utilisation.

In this paper, we tackle the challenge of balancing the workload on the reducers, considering both complex reduce tasks and skewed data. We define the partition cost model, which takes non-linear reducer tasks into account, and provide an algorithm for efficient cost estimation in a distributed environment. Finally, we present two load balancing approaches based on our partition cost model: fine partitioning and dynamic fragmentation. Both approaches can be seamlessly integrated into existing MapReduce systems such as Hadoop. We evaluate our solutions using both synthetic and real e-science data.
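
To make the idea of a cost model for non-linear reducers concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes that the cost of a partition is the sum of a complexity function applied to the size of each key cluster in it (quadratic here, purely as an example), and it balances load with a standard greedy heuristic that assigns the most expensive partition first to the least loaded reducer. The names `partition_cost`, `assign_partitions`, and the sample data are hypothetical.

```python
import heapq


def partition_cost(cluster_sizes, complexity=lambda n: n ** 2):
    """Estimated cost of one partition.

    Assumption: the reducer handles each key cluster independently, so the
    partition cost is the sum of complexity(size) over its clusters.
    """
    return sum(complexity(n) for n in cluster_sizes)


def assign_partitions(partitions, num_reducers, complexity=lambda n: n ** 2):
    """Greedy assignment: most expensive partition to the least loaded reducer.

    partitions: dict mapping partition id -> list of cluster sizes.
    Returns (assignment, loads), where assignment maps partition id -> reducer
    index and loads[r] is the total estimated cost placed on reducer r.
    """
    costs = {pid: partition_cost(sizes, complexity)
             for pid, sizes in partitions.items()}

    # Min-heap of (current load, reducer index).
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)

    assignment = {}
    # Most expensive partitions first; ties broken by partition id.
    for pid in sorted(costs, key=lambda p: (-costs[p], p)):
        load, reducer = heapq.heappop(heap)
        assignment[pid] = reducer
        heapq.heappush(heap, (load + costs[pid], reducer))

    loads = [0] * num_reducers
    for pid, reducer in assignment.items():
        loads[reducer] += costs[pid]
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical data: six partitions, one dominated by a large cluster.
    sample = {0: [100, 5], 1: [10, 10, 10], 2: [50],
              3: [20, 20], 4: [3, 3, 3], 5: [40, 1]}
    assignment, loads = assign_partitions(sample, num_reducers=2)
    print(assignment, loads)
```

In this sample, partition 0 holds 105 tuples but its quadratic cost dominates everything else combined, so a balancer that only counts tuples would misjudge the load. Creating more partitions than reducers gives the assignment step room to even out such differences, although a single very expensive partition still bounds what any assignment can achieve.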

Acknowledgements

This work was funded by the German Federal Ministry of Education and Research (BMBF, contract 05A08VHA) in the context of the GAVO-III project and by the Autonomous Province of Bolzano – South Tyrol, Italy, Promotion of Educational Policies, University and Research Department.

Author information

Authors and Affiliations

Fakultät für Informatik, Technische Universität München, Boltzmannstraße 3, D-85748, Garching bei München, Germany
Benjamin Gufler, Angelika Reiser & Alfons Kemper
Faculty of Computer Science, Free University of Bozen-Bolzano, Dominikanerplatz 3, I-39100, Bozen-Bolzano, Italy
Nikolaus Augsten

Corresponding author

Correspondence to Benjamin Gufler.

Editor information

Editors and Affiliations

SUNY Empire State College, Hauppauge, 11788, New York, USA
Ivan Ivanov
Centre for Telematics and Information Te, University of Twente, Enschede, 7500 AE, The Netherlands
Marten van Sinderen
Collaboration & Research on, Enterprise Systems &, Interdisciplinary Institute for, Sofia, Bulgaria
Boris Shishkov

About this paper

Cite this paper

Gufler, B., Augsten, N., Reiser, A., Kemper, A. (2012). The Partition Cost Model for Load Balancing in MapReduce. In: Ivanov, I., van Sinderen, M., Shishkov, B. (eds) Cloud Computing and Services Science. CLOSER 2011. Service Science: Research and Innovations in the Service Economy. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-2326-3_20

DOI: https://doi.org/10.1007/978-1-4614-2326-3_20
Published: 17 March 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-2325-6
Online ISBN: 978-1-4614-2326-3
eBook Packages: Computer Science, Computer Science (R0)