Abstract
With the huge increase in usage of smart mobiles, social media and sensors, large volumes of location-based data is available. Location based data carries important signals pertaining to user intensive information as well as population characteristics. The key analytical tool for location based analysis is multi-way spatial join. Unlike the conventional join strategies, multi-way join using map-reduce offers a scalable, distributed computational paradigm and efficient implementation through communication cost reduction strategies. Controlled Replicate (C-Rep) is a useful strategy used in the literature to perform the multi-way spatial join efficiently. Though C-Rep performance is superior compared to naive sequential join, careful analysis of its performance reveals that such a strategy is plagued by the curse of the last reducer, wherein the skew inherently present in the datasets and the skew introduced by replication operation, causes some of the reducers to take much longer time compared to others. In this work, we design an algorithm GEMS (G eneralized Communication cost E fficient M ulti-Way S patial Join) to address the skewness inherent in the connectivity of spatial objects while performing a multi-way join. We analysed all the algorithms concerned, in terms of I/O and communication costs. We prove that the communication cost of GEMS approach is better than that of C-Rep by a factor O(α) where α is the number of reducers in a single row/column of a grid of reducers. Our experimental results on different datasets indicate that GEMS approach is three times superior(in terms of turn around time) compared to C-Rep.
Similar content being viewed by others
Notes
Here we use the set of cells \(\mathcal {C}_{cr}\) to notionally distinguish between C-Rep and All-Replicate. Otherwise, \(\mathcal {C}_{cr}\) is equivalent to \(\mathcal {C}_{f}\) in the sense that it also represents the cells of the first quadrant.
The code for all the algorithms implemented is available in https://github.com/nageshbhattu/SpatialJoin.git
References
Afrati F, Stasinopoulos N, Ullman J D, Vassilakopoulos A (2018) Sharesskew: An algorithm to handle skew for joins in mapreduce. Information Systems
Afrati FN, Ullman JD (2010) Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, EDBT ’10. ACM, New York, pp 99–110
Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz J (2013) Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proc VLDB Endowment 6(11):1009–1020
Aji A, Hoang V, Wang F (2015) Effective spatial data partitioning for scalable query processing. arXiv:150900910
Arge L, Procopiuc O, Ramaswamy S, Suel T, Vitter J S (1998) Scalable sweeping-based spatial join. In: VLDB, vol 98, pp 570–581
Blanas S, Patel J M, Ercegovac V, Rao J, Shekita E J, Tian Y (2010) A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, pp 975–986
Bouros P, Mamoulis N (2017) A forward scan based plane sweep algorithm for parallel interval joins. Proc VLDB Endowment 10(11):1346–1357
Bozanis P, Foteinos P (2007) Wer-trees. Data Knowl Eng 63(2):397–413
Brinkhoff T, Kriegel H P, Seeger B (1996) Parallel processing of spatial joins using r-trees. In: 1996. Proceedings of the Twelfth International Conference on Data engineering. IEEE, pp 258–265
Chaudhuri S (1998) An overview of query optimization in relational systems. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98. ACM, New York, pp 34–43. https://doi.org/10.1145/275487.275492
Cho E, Myers SA, Leskovec J (2011) Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 1082–1090
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Dittrich JP, Seeger B (2000) Data redundancy and duplicate detection in spatial join processing. In: 2000. Proceedings. 16th International Conference on Data Engineering. IEEE, pp 535–546
Doulkeridis C, NOrvag K (2014) A survey of large-scale analytical query processing in mapreduce. The VLDB J 23(3):355–380. https://doi.org/10.1007/s00778-013-0319-9
Du Z, Zhao X, Ye X, Zhou J, Zhang F, Liu R (2017) An effective high-performance multiway spatial join algorithm with spark. ISPRS Int J Geo-Inf 6 (4):96
Eldawy A, Mokbel MF (2015a) The era of big spatial data. In: 2015 31st IEEE International Conference on Data Engineering Workshops. IEEE, pp 42–49
Eldawy A, Mokbel MF (2015b) Spatialhadoop: A mapreduce framework for spatial data. In: 2015 IEEE 31st International Conference on Data Engineering (ICDE). IEEE, pp 1352–1363
Eldawy A, Li Y, Mokbel MF, Janardan R (2013) Cg_hadoop: computational geometry in mapreduce. In: Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, pp 294–303
Günther O (1993) Efficient computation of spatial joins. In: Proceedings of the Ninth International Conference on Data Engineering. IEEE Computer Society, Washington, pp 50–59. http://dl.acm.org/citation.cfm?id=645478.654973
Gupta H, Chawda B (2014) ε-controlled-replicate: An improvedcontrolled-replicate algorithm for multi-way spatial join processing on map-reduce. In: International Conference on Web Information Systems Engineering. Springer, pp 278–293
Gupta H, Chawda B, Negi S, Faruquie TA, Subramaniam LV, Mohania M (2013) Processing multi-way spatial joins on map-reduce. In: Proceedings of the 16th International Conference on Extending Database Technology, EDBT ’13. ACM, New York, pp 113–124. https://doi.org/10.1145/2452376.2452390
Güting R H (1994) An introduction to spatial database systems. VLDB J Int J Very Large Data Bases 3(4):357–399
Jacox E H, Samet H (2003) Iterative spatial join. ACM Trans Database Syst 28(3):230–256. https://doi.org/10.1145/937598.937600
Jacox E H, Samet H (2007) Spatial join techniques. ACM Trans Database Syst (TODS) 32(1):7
Kipf A, Lang H, Pandey V, Persa RA, Boncz P, Neumann T, Kemper A (2018) Adaptive geospatial joins for modern hardware. arXiv:180209488
Kriegel N B H P, Schneider R, Seeger B (1990) The r*-tree: an e cient and robust access method for points and rectangles. In: Proceedings of the ACM SIGMOD Conference on Management of Data
Leskovec J, Rajaraman A, Ullman JD (2014) Mining of Massive Datasets, 2nd Ed. Cambridge University Press, Cambridge
Lin J, et al. (2009) The curse of zipf and limits to parallelization: a look at the stragglers problem in mapreduce. In: 7Th workshop on large-scale distributed systems for information retrieval. ACM Boston, vol 1, pp 57–62
Liroz-Gistau M, Akbarinia R, Agrawal D, Valduriez P (2016) Fp-hadoop: Efficient processing of skewed mapreduce jobs. Inf Syst 60:69–84
Liu Z, Zhang Q, Ahmed R, Boutaba R, Liu Y, Gong Z (2016) Dynamic resource allocation for mapreduce with partitioning skew. IEEE Trans Comput 65(11):3304–3317. https://doi.org/10.1109/TC.2016.2532860
Lo M L, Ravishankar C V (1996) Spatial hash-joins. In: ACM SIGMOD Record. ACM, vol 25, pp 247–258
Loboz C (2012) Cloud resource usage—heavy tailed distributions invalidating traditional capacity planning models. J Grid Comput 10(1):85–108
Mamoulis N, Papadias D (2001) Multiway spatial joins. ACM Trans Database Syst (TODS) 26(4):424–475
Nishimura S, Das S, Agrawal D, El Abbadi A (2013) Hbase: design and implementation of an elastic data infrastructure for cloud-scale location services. Distrib Parallel Database 31(2):289– 319
Nobari S, Tauheed F, Heinis T, Karras P, Bressan S, Ailamaki A (2013) Touch: in-memory spatial join by hierarchical data-oriented partitioning. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, pp 701–712
Okcan A, Riedewald M (2011) Processing theta-joins using mapreduce. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, pp 949–960
Papadias D, Arkoumanis D (2002) Search algorithms for multiway spatial joins. Int J Geograph Inf Sci 16(7):613–639
Papadias D, Mamoulis N, Delis B (1998) Algorithms for querying by spatial structure In: Proceedings of Very Large Data Bases Conference (VLDB), New York
Papadias D, Mamoulis N, Theodoridis Y (1999) Processing and optimization of multiway spatial joins using r-trees. In: Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, pp 44–55
Papadias D, Mamoulis N, Theodoridis Y (2001) Constraint-based processing of multiway spatial joins. Algorithmica 30(2):188–215
Park HH, Cha GH, Chung CW (1999) Multi-way spatial joins using r-trees: Methodology and performance evaluation. In: Advances in Spatial Databases. Springer, pp 229–250
Patel J M, Patel and DeWitt D J (1996) Partition based spatial-merge join. In: ACM SIGMOD Record. ACM, vol 25, pp 259–270
Patel JM, DeWitt DJ (2000) Clone join and shadow join: two parallel spatial join algorithms. In: Proceedings of the 8th ACM international symposium on Advances in geographic information systems. ACM, pp 54–61
Pavlo A, Paulson E, Rasin A, Abadi DJ, DeWitt DJ, Madden S, Stonebraker M (2009) A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09. ACM, New York, pp 165–178. https://doi.org/10.1145/1559845.1559865
Pearce O, Gamblin T, de Supinski BR, Schulz M, Amato NM (2012) Quantifying the effectiveness of load balance algorithms. In: Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12. ACM, New York, pp 185–194
Sabek I, Mokbel MF (2017) On spatial joins in mapreduce. In: Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, pp 21
Singh H, Bawa S (2017) A survey of traditional and mapreducebased spatial query processing approaches. SIGMOD Rec 46(2):18–29. https://doi.org/10.1145/3137586.3137590
Vassilakopoulos M, Corral A, Karanikolas N (2011) Join-queries between two spatial datasets indexed by a single r*-tree. SOFSEM 2011: Theory and Practice of Computer Science, pp 533–544
Vernica R, Carey M J, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, pp 495–506
Wang K, Han J, Tu B, Dai J, Zhou W, Song X (2010) Accelerating spatial data processing with mapreduce. In: 2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, pp 229–236
Zhang S, Han J, Liu Z, Wang K, Feng S (2009a) Spatial queries evaluation with mapreduce. In: 2009. GCC’09. Eighth International Conference on Grid and cooperative computing. IEEE, pp 287– 292
Zhang S, Han J, Liu Z, Wang K, Xu Z (2009B) Sjmr: Parallelizing spatial join with mapreduce on clusters. In: 2009. CLUSTER’09. IEEE international conference on Cluster computing and workshops. IEEE, pp 1–8
Zhang X, Chen L, Wang M (2012) Efficient multi-way theta-join processing using mapreduce. Proc VLDB Endow 5(11):1184–1195
Zhong Y, Han J, Zhang T, Li Z, Fang J, Chen G (2012) Towards parallel spatial query processing for big spatial data. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, pp 2085–2094
Zhou X, Abel D J, Truffet D (1998) Data partitioning for parallel spatial join processing. Geoinformatica 2(2):175–204
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
A Asymptotic complexity of GEMS vs Diagonal based Replication Schemes
The proof for Theorem 1 is provided below:
Proof
Let the grid structure be as given in Fig. 16. Let the number of rows /columns be α. Let the spatial dataset of size D be uniformly distirbuted. Each cell gets \(P = \frac {D}{\alpha ^{2}}\) of the spatial data. Using lemma 1, we can see that the diagonal based replication schemes have a replication cost of O(α4 ∗ P). Using lemma 2, we can also see that the GEMS approach has communication cost of O(α3 ∗ P). As a consequence we can observe that the GEMS approach improves over diagonal based replication schemes by O(α) □
Lemma 1
The communication cost of diagonal based replication schemes is O(α4 ∗ P)
Proof
The diagonal based replication schemes perform the replication along the diagonal of the grid structure. Let (i,i) be a cell on the diagonal where i is counted from top right corner of the grid structure as depicted in Fig. 16. Consider the replication cost of all the cells of the grid structure which are on i’th row to the right of cell (i,i). Let j be the index varying from 1 to i-1 where j indicates the position of the cell counted from the right end of the grid structure. The replication cost of j’th cell on the i’th row is j*i*P as marked by the hashed vertical rectangle. Here, P is the amount of data to be replicated. The replication cost of all the cells on the i’th row to the right of cell (i,i) is \({\sum }_{j=1}^{i-1} i*j*P\). The replication cost of cells on the i’th column which are on the top of cell (i,i) can be similarly computed as \({\sum }_{j=1}^{i-1} i*j*P\). The replication cost (Cd) of all the cells in the grid structure can be computed by varying the position of i from 1 to α. Such computation is summarized as:
The first term on right hand side of the Eq. 2 refers to the communication cost of cells on the diagonal of the grid structure which is (i2 ∗ P). □
Lemma 2
The communication cost of GEMS based replication scheme is O(α3 ∗ P)
Proof
There are altogether α2 cells in a grid of α rows/columns. If a cell is present in i’th row and j’th column of the grid (as marked in Fig. 17), the replication will be done to all the cells in i’th row and j’th column. Hence, each cell is replicated to 2 ∗ α − 1 cells using GEMS approach. So the total replication cost of GEMS approach is
□
Rights and permissions
About this article
Cite this article
Bhattu, S.N., Potluri, A., Kadari, P. et al. Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer. Geoinformatica 24, 557–589 (2020). https://doi.org/10.1007/s10707-019-00387-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10707-019-00387-6