Load balancing in join algorithms for skewed data in MapReduce systems

Abstract

Join is an essential tool for analyzing data collected from different data sources. MapReduce has emerged as a prominent programming model for processing massive data. However, traditional MapReduce-based join algorithms are not efficient when handling skewed data: skew in the input leads to considerable load imbalance and performance degradation. This paper proposes a new skew-insensitive method, called fine-grained partitioning for skew data (FGSD), which improves load balancing across reduce tasks. The proposed method considers the properties of both input and output data through a proposed stream sampling algorithm. FGSD introduces a new approach to distributing input data that efficiently handles both redistribution skew and join product skew. The experimental results confirm that our solution not only achieves better load balance but also reduces the execution time of a job under varying degrees of data skew. Furthermore, FGSD does not require any modification to the MapReduce environment and is applicable to complex joins.
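To make the two kinds of skew named above concrete, the following minimal sketch compares plain hash partitioning of join keys with a greedy largest-first assignment of keys to reducers. It is not the FGSD algorithm, whose details are not given in this abstract. The cost model (input tuples plus join output tuples per key), the integer keys, and the function names `key_costs`, `hash_loads`, and `greedy_loads` are all assumptions made for illustration.

```python
from collections import Counter
import heapq


def key_costs(r_keys, s_keys):
    """Assumed per-key cost: input tuples from R and S plus join output tuples."""
    r_count, s_count = Counter(r_keys), Counter(s_keys)
    return {
        k: r_count[k] + s_count[k] + r_count[k] * s_count[k]
        for k in set(r_count) | set(s_count)
    }


def hash_loads(costs, num_reducers):
    """Load per reducer when each (integer) key goes to key % num_reducers."""
    loads = [0] * num_reducers
    for key, cost in costs.items():
        loads[key % num_reducers] += cost
    return loads


def greedy_loads(costs, num_reducers):
    """Load per reducer when keys are assigned largest-first to the
    currently least-loaded reducer (a generic heuristic, not FGSD)."""
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    for _key, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, reducer = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, reducer))
    loads = [0] * num_reducers
    for load, reducer in heap:
        loads[reducer] = load
    return loads
```

For example, with R keys `[0, 0, 0, 0, 2, 2, 1, 3]`, S keys `[0, 0, 2, 1, 3]`, and two reducers, `hash_loads` yields `[19, 6]` while `greedy_loads` yields `[14, 11]`. The heavy key 0 still costs 14 on its own, because this sketch assigns whole keys and never splits one across reducers.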

Author information

Corresponding author

Correspondence to Ali Rezaee.

About this article

Cite this article

Gavagsaz, E., Rezaee, A. & Haj Seyyed Javadi, H. Load balancing in join algorithms for skewed data in MapReduce systems. J Supercomput 75, 228–254 (2019). https://doi.org/10.1007/s11227-018-2578-0

Keywords

  • Load balancing
  • Join algorithm
  • Data skew
  • MapReduce
  • Spark