Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer

  • S. Nagesh Bhattu
  • Avinash Potluri
  • Prashanth Kadari
  • Subramanyam R. B. V.


With the huge increase in the use of smartphones, social media and sensors, large volumes of location-based data are now available. Location-based data carries important signals pertaining to user-intensive information as well as population characteristics. The key analytical tool for location-based analysis is the multi-way spatial join. Unlike conventional join strategies, a multi-way join using map-reduce offers a scalable, distributed computational paradigm and an efficient implementation through communication cost reduction strategies. Controlled Replicate (C-Rep) is a strategy used in the literature to perform the multi-way spatial join efficiently. Though C-Rep outperforms the naive sequential join, careful analysis of its performance reveals that it is plagued by the curse of the last reducer: the skew inherently present in the datasets, together with the skew introduced by the replication operation, causes some reducers to take much longer than others. In this work, we design an algorithm, GEMS (Generalized Communication cost Efficient Multi-Way Spatial Join), to address the skewness inherent in the connectivity of spatial objects while performing a multi-way join. We analyse all the algorithms concerned in terms of I/O and communication costs. We prove that the communication cost of the GEMS approach is better than that of C-Rep by a factor of O(α), where α is the number of reducers in a single row/column of a grid of reducers. Our experimental results on different datasets indicate that the GEMS approach is three times better than C-Rep in terms of turnaround time.
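The "curse of the last reducer" described above can be illustrated with a toy example. The sketch below is not C-Rep or GEMS; it assumes a simple scheme in which an α × α grid of reducers covers the unit square and each rectangle is copied to every cell it overlaps. The function names (`cells_overlapped`, `reducer_loads`, `skew_ratio`) and the sample rectangles are hypothetical, chosen only to show how clustering and replication together produce one overloaded reducer.

```python
from collections import Counter


def cells_overlapped(rect, alpha, extent=1.0):
    """Return the (row, col) grid cells overlapped by a rectangle.

    rect is (xmin, ymin, xmax, ymax) inside [0, extent) x [0, extent);
    the space is split into an alpha x alpha grid of reducers.
    """
    xmin, ymin, xmax, ymax = rect
    side = extent / alpha
    c0 = min(int(xmin / side), alpha - 1)
    c1 = min(int(xmax / side), alpha - 1)
    r0 = min(int(ymin / side), alpha - 1)
    r1 = min(int(ymax / side), alpha - 1)
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def reducer_loads(rects, alpha):
    """Count how many rectangle copies each reducer (grid cell) receives."""
    loads = Counter()
    for rect in rects:
        for cell in cells_overlapped(rect, alpha):
            loads[cell] += 1
    return loads


def skew_ratio(loads, alpha):
    """Load of the busiest reducer divided by the mean load over all cells."""
    mean = sum(loads.values()) / (alpha * alpha)
    return max(loads.values()) / mean


# Hypothetical data: three rectangles clustered in one cell (data skew),
# one rectangle straddling all four cells (replication), one elsewhere.
rects = [
    (0.10, 0.10, 0.20, 0.20),
    (0.15, 0.15, 0.25, 0.25),
    (0.30, 0.30, 0.40, 0.40),
    (0.40, 0.40, 0.60, 0.60),
    (0.70, 0.70, 0.80, 0.80),
]
loads = reducer_loads(rects, alpha=2)
communication_cost = sum(loads.values())  # copies shipped to reducers
print(dict(loads), communication_cost, skew_ratio(loads, alpha=2))
```

In this toy input, five rectangles become eight shipped copies, and the busiest reducer holds twice the mean load, so the job finishes only when that reducer does.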


Big data, Communication cost, Multi-way spatial join, Skewness




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Authors and Affiliations

  1. NIT Andhra Pradesh, Andhra Pradesh, India
  2. IDRBT, Hyderabad, India
  3. NIT Warangal, Warangal, India
