The Partition Cost Model for Load Balancing in MapReduce

  • Benjamin Gufler
  • Nikolaus Augsten
  • Angelika Reiser
  • Alfons Kemper
Part of the Service Science: Research and Innovations in the Service Economy book series (SSRI)


The popularity of MapReduce systems for processing large data sets in both industry and science has increased drastically in recent years. While the sample applications often found in the literature, such as word count, are rather simple, e-science applications tend to be complex and thereby pose new challenges to MapReduce systems. The high runtime complexity of e-science applications on the one hand, and the skewed data distributions often encountered in scientific data sets on the other, lead to highly varying reducer execution times. These, in turn, cause high overall execution times and poor resource utilisation.

In this paper, we tackle the challenge of balancing the workload on the reducers, considering both complex reduce tasks and skewed data. We define the partition cost model, which takes into account non-linear reducer tasks, and provide an algorithm for efficient cost estimation in a distributed environment. Finally, we present two load balancing approaches based on our partition cost model: fine partitioning and dynamic fragmentation. Both approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We evaluate our solutions using both synthetic and real e-science data.
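The following is only an illustrative sketch of why a cost model for non-linear reducers matters, not the algorithm from the chapter. It assumes a quadratic reducer complexity, assumes that a partition's cost is the sum of the costs of its key clusters, and uses a simple greedy heuristic to assign partitions to reducers. The function names and the example cluster sizes are hypothetical.

```python
def partition_cost(cluster_sizes, complexity=lambda n: n ** 2):
    """Estimated cost of one partition: sum of per-cluster costs.

    The quadratic complexity is an assumption made for this sketch.
    """
    return sum(complexity(n) for n in cluster_sizes)


def assign_greedy(partition_costs, num_reducers):
    """Assign partitions to reducers, most expensive partition first,
    always choosing the reducer with the smallest load so far."""
    loads = [0] * num_reducers
    assignment = {}
    ordered = sorted(partition_costs.items(), key=lambda kv: kv[1], reverse=True)
    for pid, cost in ordered:
        target = loads.index(min(loads))
        assignment[pid] = target
        loads[target] += cost
    return assignment, loads


# Hypothetical data: four partitions with 8 tuples each, but different
# cluster (key group) sizes, i.e. different degrees of skew.
partitions = {
    "p0": [8],           # one large cluster
    "p1": [2, 2, 2, 2],
    "p2": [4, 4],
    "p3": [1] * 8,       # eight singleton clusters
}

costs = {pid: partition_cost(sizes) for pid, sizes in partitions.items()}
assignment, loads = assign_greedy(costs, num_reducers=2)

print(costs)       # estimated cost per partition
print(assignment)  # partition -> reducer index
print(loads)       # estimated load per reducer
```

All four partitions hold the same number of tuples, so balancing by tuple count alone (two partitions per reducer) could pair p0 with p1 for a cost of 80 against 40 on the other reducer. Assigning by estimated cost instead yields reducer loads of 64 and 56 in this example.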


Keywords (machine-generated, not supplied by the authors): Execution Time, Load Balance, Skewed Data, Dynamic Fragmentation, Fine Partitioning



This work was funded by the German Federal Ministry of Education and Research (BMBF, contract 05A08VHA) in the context of the GAVO-III project and by the Autonomous Province of Bolzano – South Tyrol, Italy, Promotion of Educational Policies, University and Research Department.



Copyright information

© Springer Science+Business Media New York 2012

Authors and Affiliations

  • Benjamin Gufler (1)
  • Nikolaus Augsten (2)
  • Angelika Reiser (1)
  • Alfons Kemper (1)

  1. Fakultät für Informatik, Technische Universität München, Garching bei München, Germany
  2. Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
