SliceNBound: Solving Closest Pairs and Distance Join Queries in Apache Spark

  • George Mavrommatis
  • Panagiotis Moutafis
  • Michael Vassilakopoulos
  • Francisco García-García
  • Antonio Corral
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10509)


The (K) Closest-Pair(s) Query, KCPQ, consists of finding the (K) closest pair(s) of objects between two spatial datasets. Recently, several systems that enhance Apache Spark with spatial awareness have been presented, providing a variety of queries for spatial computation, but not the KCPQ. Because queries differ in nature and no single processing technique fits all cases, specialized algorithms are needed for specific queries, so that they exploit the power of parallel systems such as Apache Spark. This paper addresses the problem of answering the KCPQ in Apache Spark by presenting such a specialized, fast algorithm that can easily be integrated into any Spark-based system, whether spatially oriented or general-purpose. Furthermore, it presents a variant of this algorithm that solves the Distance Join Query. Experiments and comparisons with other solutions indicate that our method is fast and efficient.
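
To make the two query definitions concrete, the following is a minimal single-machine sketch in Python. It is a brute-force reference that only illustrates what the queries return; it is not the SliceNBound algorithm and does not use Spark. The function names, the sample points, the use of Euclidean distance, and the threshold parameter `eps` are assumptions made for illustration, with the Distance Join Query taken in its usual sense of returning every pair whose distance does not exceed a given threshold.

```python
import heapq
import math
from itertools import product


def kcpq_brute_force(p, q, k):
    """Return the k closest (distance, p_point, q_point) triples, nearest first.

    Hypothetical reference implementation: it examines every pair, so it
    costs O(|p| * |q|) distance computations.
    """
    pairs = ((math.dist(a, b), a, b) for a, b in product(p, q))
    return heapq.nsmallest(k, pairs)


def distance_join_brute_force(p, q, eps):
    """Return every (distance, p_point, q_point) triple with distance <= eps."""
    result = []
    for a, b in product(p, q):
        d = math.dist(a, b)  # Euclidean distance (assumed metric)
        if d <= eps:
            result.append((d, a, b))
    return sorted(result)


# Two tiny illustrative datasets (made-up points).
P = [(0.0, 0.0), (5.0, 5.0)]
Q = [(1.0, 0.0), (5.0, 6.0), (9.0, 9.0)]

closest_two = kcpq_brute_force(P, Q, 2)
within_six = distance_join_brute_force(P, Q, 6.0)
```

Here `closest_two` holds the two pairs at distance 1.0, and `within_six` additionally contains the pair (5.0, 5.0) and (9.0, 9.0), whose distance is about 5.66. The quadratic cost of this approach on large datasets is what motivates a specialized parallel algorithm.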


Spatial Query Processing, Closest-pairs query, Distance Join Query, Data partitioning, Apache Spark



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • George Mavrommatis (1, corresponding author)
  • Panagiotis Moutafis (1)
  • Michael Vassilakopoulos (1)
  • Francisco García-García (2)
  • Antonio Corral (2)
  1. Data Structuring & Engineering Lab, Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
  2. Department of Informatics, University of Almeria, Almeria, Spain
