Abstract
The (K) Closest-Pair(s) Query (KCPQ) consists of finding the (K) closest pair(s) of objects between two spatial datasets. Several recently presented systems extend Apache Spark with spatial awareness and provide a variety of queries for spatial computation, but not the KCPQ. Because queries differ in nature and no single processing technique fits all cases, specialized algorithms are needed for specific queries, exploiting the power of parallel systems such as Apache Spark. This paper addresses the problem of answering the KCPQ in Apache Spark by presenting such a specialized, fast algorithm that can easily be imported into any Spark-based system, whether spatial-oriented or general. It also presents a variant of this algorithm that solves the Distance Join Query. Experiments and comparisons with other solutions indicate that our method is fast and efficient.
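To make the two query definitions concrete, the following is a minimal single-machine sketch that evaluates them by brute force over 2D points. It is not the algorithm presented in the paper and does not use Spark; the function names `k_closest_pairs` and `distance_join`, the Euclidean metric, and the threshold parameter `eps` are assumptions chosen for illustration only.

```python
import heapq
from itertools import product
from math import hypot


def k_closest_pairs(P, Q, k):
    """Brute-force KCPQ: the k pairs (p, q), p in P and q in Q,
    with the smallest Euclidean distance, in ascending order.
    Each result is a tuple (distance, p, q)."""
    candidates = (
        (hypot(p[0] - q[0], p[1] - q[1]), p, q) for p, q in product(P, Q)
    )
    # nsmallest keeps only k candidates in memory at a time.
    return heapq.nsmallest(k, candidates)


def distance_join(P, Q, eps):
    """Brute-force distance join: every pair (p, q) whose Euclidean
    distance is at most eps (an assumed threshold parameter)."""
    return [
        (p, q)
        for p, q in product(P, Q)
        if hypot(p[0] - q[0], p[1] - q[1]) <= eps
    ]


if __name__ == "__main__":
    P = [(0, 0), (5, 5)]
    Q = [(1, 0), (5, 7), (10, 10)]
    print(k_closest_pairs(P, Q, 2))
    print(distance_join(P, Q, 2.0))
```

Both functions examine all |P| x |Q| pairs, which is the cost a specialized parallel algorithm aims to avoid; the sketch is useful mainly as a correctness reference on small inputs.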
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Mavrommatis, G., Moutafis, P., Vassilakopoulos, M., García-García, F., Corral, A. (2017). SliceNBound: Solving Closest Pairs and Distance Join Queries in Apache Spark. In: Kirikova, M., Nørvåg, K., Papadopoulos, G. (eds) Advances in Databases and Information Systems. ADBIS 2017. Lecture Notes in Computer Science, vol 10509. Springer, Cham. https://doi.org/10.1007/978-3-319-66917-5_14
DOI: https://doi.org/10.1007/978-3-319-66917-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66916-8
Online ISBN: 978-3-319-66917-5
eBook Packages: Computer Science (R0)