Abstract
The popularity of MapReduce systems for processing large data sets in both industry and science has increased drastically in recent years. While the sample applications often found in the literature, such as word count, are rather simple, e-science applications tend to be complex, thereby posing new challenges to MapReduce systems. The high runtime complexity of e-science applications on the one hand, and the skewed data distributions often encountered in scientific data sets on the other, lead to highly varying reducer execution times. These, in turn, cause high overall execution times and poor resource utilisation.
In this paper, we tackle the challenge of balancing the workload on the reducers, considering both complex reduce tasks and skewed data. We define the partition cost model, which takes into account non-linear reducer tasks, and provide an algorithm for efficient cost estimation in a distributed environment. Finally, we present two load balancing approaches, fine partitioning and dynamic fragmentation, based on our partition cost model. Both of these approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We evaluate our solutions using both synthetic and real e-science data.
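To make the idea of a cost-based assignment concrete, the following is a minimal sketch rather than the algorithm from the paper. It assumes that the cost of a partition is the sum of a reducer complexity function over the sizes of its clusters (key groups), and that more partitions than reducers are created and then assigned greedily, heaviest first, to the least loaded reducer. The names partition_cost and assign_partitions, the quadratic complexity function, and the sample data are all hypothetical.

```python
import heapq
from typing import Callable, List


def partition_cost(cluster_sizes: List[int],
                   complexity: Callable[[int], float]) -> float:
    """Assumed cost of one partition: the reducer complexity summed over its clusters."""
    return sum(complexity(n) for n in cluster_sizes)


def assign_partitions(costs: List[float], num_reducers: int) -> List[List[int]]:
    """Greedy heuristic (an assumption): heaviest partition first, to the least loaded reducer."""
    heap = [(0.0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_reducers)]
    by_cost_desc = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for p in by_cost_desc:
        load, r = heapq.heappop(heap)
        assignment[r].append(p)
        heapq.heappush(heap, (load + costs[p], r))
    return assignment


# Six hypothetical partitions, each holding 100 tuples but with different skew.
partitions = [[100], [10] * 10, [50, 50], [20] * 5, [5] * 20, [30, 30, 40]]


def quadratic(n: int) -> float:
    """Hypothetical non-linear reducer: work grows with the square of the cluster size."""
    return float(n * n)


costs = [partition_cost(p, quadratic) for p in partitions]
assignment = assign_partitions(costs, num_reducers=2)
loads = [sum(costs[p] for p in ps) for ps in assignment]
print(costs)   # per-partition cost estimates
print(loads)   # estimated load per reducer
```

In this example every partition contains the same number of tuples, yet the estimated costs differ by a factor of 20 because of the quadratic reducer. Balancing on tuple counts alone would therefore not balance the reducer run times, which is the motivation for a cost model that accounts for non-linear reduce tasks.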
Acknowledgements
This work was funded by the German Federal Ministry of Education and Research (BMBF, contract 05A08VHA) in the context of the GAVO-III project and by the Autonomous Province of Bolzano – South Tyrol, Italy, Promotion of Educational Policies, University and Research Department.
Copyright information
© 2012 Springer Science+Business Media New York
About this paper
Cite this paper
Gufler, B., Augsten, N., Reiser, A., Kemper, A. (2012). The Partition Cost Model for Load Balancing in MapReduce. In: Ivanov, I., van Sinderen, M., Shishkov, B. (eds) Cloud Computing and Services Science. CLOSER 2011. Service Science: Research and Innovations in the Service Economy. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-2326-3_20
DOI: https://doi.org/10.1007/978-1-4614-2326-3_20
Published:
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-2325-6
Online ISBN: 978-1-4614-2326-3
eBook Packages: Computer Science (R0)