SliceNBound: Solving Closest Pairs and Distance Join Queries in Apache Spark

Mavrommatis, George; Moutafis, Panagiotis; Vassilakopoulos, Michael; García-García, Francisco; Corral, Antonio

doi:10.1007/978-3-319-66917-5_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10509)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

The (K) Closest-Pair(s) Query (KCPQ) consists of finding the (K) closest pair(s) of objects between two spatial datasets. Several recently presented systems enhance Apache Spark with spatial awareness and provide a variety of spatial queries, but not the KCPQ. Because queries differ in nature and no single processing technique fits all cases, specialized algorithms are needed that exploit the power of parallel systems such as Apache Spark for specific queries. This paper addresses the problem of answering the KCPQ in Apache Spark by presenting such a specialized, fast algorithm that can easily be imported into any Spark-based system, whether spatial-oriented or general. It also presents a variant of this algorithm that solves the Distance Join Query. Experiments and comparisons with other solutions indicate that our method is fast and efficient.
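
To make the two query types concrete, the following minimal sketch computes both by brute force on small in-memory point lists. It is not the SliceNBound algorithm and does not use Spark; it only illustrates what the queries return. The function names and the parameter `eps` are hypothetical, and the Distance Join Query is assumed here to mean "all pairs whose distance is at most a threshold", a definition the abstract itself does not spell out.

```python
import heapq
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def kcpq_brute_force(p, q, k):
    """Return the k closest (distance, a, b) triples with a in p and b in q.

    Brute force over all |p| * |q| pairs; illustration only.
    """
    pairs = ((dist(a, b), a, b) for a, b in product(p, q))
    return heapq.nsmallest(k, pairs)


def distance_join_brute_force(p, q, eps):
    """Return all (distance, a, b) triples with distance <= eps, sorted.

    Assumes the distance join is defined by a single threshold eps.
    """
    result = []
    for a, b in product(p, q):
        d = dist(a, b)
        if d <= eps:
            result.append((d, a, b))
    return sorted(result)
```

Calling `kcpq_brute_force(P, Q, 2)` on two lists of coordinate tuples returns the two closest cross-dataset pairs in ascending order of distance. The quadratic cost of this baseline is the reason specialized parallel algorithms are of interest for large datasets.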

Author information

Authors and Affiliations

Data Structuring & Engineering Lab, Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
George Mavrommatis, Panagiotis Moutafis & Michael Vassilakopoulos
Department of Informatics, University of Almeria, Almeria, Spain
Francisco García-García & Antonio Corral

Corresponding author

Correspondence to George Mavrommatis.

Editor information

Editors and Affiliations

Riga Technical University, Riga, Latvia
Mārīte Kirikova
Norwegian University of Science and Technology, Trondheim, Norway
Kjetil Nørvåg
University of Cyprus, Nicosia, Cyprus
George A. Papadopoulos

Cite this paper

Mavrommatis, G., Moutafis, P., Vassilakopoulos, M., García-García, F., Corral, A. (2017). SliceNBound: Solving Closest Pairs and Distance Join Queries in Apache Spark. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_14

DOI: https://doi.org/10.1007/978-3-319-66917-5_14
Published: 25 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66916-8
Online ISBN: 978-3-319-66917-5
eBook Packages: Computer Science, Computer Science (R0)
