The Journal of Supercomputing, Volume 75, Issue 1, pp 228–254

Load balancing in join algorithms for skewed data in MapReduce systems

  • Elaheh Gavagsaz
  • Ali Rezaee
  • Hamid Haj Seyyed Javadi


Join is an essential tool for analyzing data collected from different data sources. MapReduce has emerged as a prominent programming model for processing massive data. However, traditional join algorithms based on MapReduce are not efficient when handling skewed data. The presence of data skew in the input leads to considerable load imbalance and performance degradation. This paper proposes a new skew-insensitive method, called fine-grained partitioning for skew data (FGSD), which improves load balancing across reduce tasks. The proposed method considers the properties of both input and output data through a proposed stream sampling algorithm. FGSD introduces a new approach to distributing input data that efficiently handles both redistribution skew and join product skew. The experimental results confirm that our solution not only achieves better load balance but also reduces the execution time of a job under varying degrees of data skew. Furthermore, FGSD does not require any modification to the MapReduce environment and is applicable to complex joins.
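The load imbalance the abstract refers to can be illustrated with a small sketch. This is not the FGSD method; it only assumes a plain key-modulo partitioner and counts the join output tuples each reducer would produce for an equi-join. The function name `reducer_loads` and the toy datasets are hypothetical, chosen for illustration only.

```python
from collections import Counter

def reducer_loads(r_keys, s_keys, num_reducers):
    """Join output tuples per reducer under plain key-modulo partitioning.

    Assumption: integer join keys, and every tuple with key k is sent to
    reducer k % num_reducers (a stand-in for a default hash partitioner).
    """
    r_counts = Counter(r_keys)
    s_counts = Counter(s_keys)
    loads = [0] * num_reducers
    for key, r_n in r_counts.items():
        # An equi-join on `key` yields r_n * s_n output tuples.
        loads[key % num_reducers] += r_n * s_counts.get(key, 0)
    return loads

# Uniform data: keys 0..7, each appearing 10 times (80 tuples).
uniform = [k for k in range(8) for _ in range(10)]
# Skewed data: key 0 appears 45 times, keys 1..7 five times each (80 tuples).
skewed = [0] * 45 + [k for k in range(1, 8) for _ in range(5)]

uniform_loads = reducer_loads(uniform, uniform, 4)
skewed_loads = reducer_loads(skewed, skewed, 4)
print(uniform_loads)  # [200, 200, 200, 200]
print(skewed_loads)   # [2050, 50, 50, 50]
```

Both inputs hold 80 tuples per relation, yet with the skewed keys one reducer produces almost all of the join output. A key-level partitioner cannot fix this, since all tuples of the hot key land on the same reducer; splitting the work at a finer granularity than the key is the general direction the abstract describes.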


Keywords: Load balancing, Join algorithm, Data skew, MapReduce, Spark



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
  2. Department of Applied Mathematics, Faculty of Mathematics and Computer Science, Shahed University, Tehran, Iran
