Reducing partition skew on MapReduce: an incremental allocation approach
MapReduce, a parallel computational model, has been widely used to process big data on distributed clusters. Because it consists of alternating map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge in ensuring a balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on the mapped data.
In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers statistics on their sizes during the mapping process. The micro-partitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps: micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize multiple-round micro-partition scheduling for allocation commitment. We present an optimal solution with a time complexity of O(K · N^2), in which K represents the number of allocation rounds and N represents the number of micro-partitions. As an alternative, we also present a greedy but more efficient algorithm with a time complexity of O(K · N ln N). Then, we propose a min-max programming model to handle the allocation mapping between micro-partitions and reducers and, because this problem is NP-complete, present an effective heuristic solution. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that, compared with state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as better overall parallel performance.
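To make the allocation step concrete, the following is a minimal sketch of one round of assigning micro-partitions to reducers so that the maximum reducer load stays small. It uses the classic largest-first greedy rule (each micro-partition goes to the currently least-loaded reducer), which is a standard heuristic for min-max load problems and is not necessarily the heuristic developed in the paper. The function name `allocate_round` and its parameters are hypothetical and chosen for illustration only.

```python
import heapq


def allocate_round(sizes, num_reducers, initial_loads=None):
    """Greedy largest-first allocation of micro-partitions to reducers.

    Illustrative stand-in only; not the paper's heuristic.
    sizes:         sizes of the micro-partitions committed in this round
    num_reducers:  number of reducers
    initial_loads: loads carried over from earlier rounds (assumed; defaults to zeros)
    Returns (assignment, loads): assignment[i] is the reducer chosen for
    micro-partition i, and loads[r] is the resulting load of reducer r.
    """
    loads = list(initial_loads) if initial_loads is not None else [0] * num_reducers
    if len(loads) != num_reducers:
        raise ValueError("initial_loads must have one entry per reducer")

    # Min-heap of (current load, reducer id) so the least-loaded reducer pops first.
    heap = [(load, reducer) for reducer, load in enumerate(loads)]
    heapq.heapify(heap)

    assignment = {}
    # Visit micro-partitions from largest to smallest.
    order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
    for part in order:
        load, reducer = heapq.heappop(heap)
        assignment[part] = reducer
        loads[reducer] = load + sizes[part]
        heapq.heappush(heap, (loads[reducer], reducer))
    return assignment, loads
```

Calling `allocate_round` once per round and passing the returned `loads` back in as `initial_loads` mimics the multiple-round, incremental flavor of the approach: later rounds can compensate for imbalance left by earlier ones. Sorting dominates the cost, so one round takes O(N log N) time for N micro-partitions.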
Keywords: incremental partitioning, data balance, MapReduce
This work was supported by the Ministry of Science and Technology of China, National Key Research and Development Program (2016YFB1000703), the National Natural Science Foundation of China (Grant Nos. 61732014, 61332006, 61672432, 61472321, 61502390), the Natural Science Basic Research Plan in Shaanxi Province of China (2018JM6086) and the Fundamental Research Funds for the Central Universities (3102017jg02002).