A Comparison of Distributed Spatial Data Management Systems for Processing Distance Join Queries

  • Francisco García-García
  • Antonio Corral
  • Luis Iribarne
  • George Mavrommatis
  • Michael Vassilakopoulos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10509)

Abstract

Due to the ubiquitous use of spatial data applications and the large amounts of spatial data that these applications generate, the processing of large-scale distance joins in distributed systems is becoming increasingly popular. Two of the most studied distance join queries are the K Closest Pair Query (KCPQ) and the \(\varepsilon\) Distance Join Query (\(\varepsilon\)DJQ). The KCPQ finds the K closest pairs of points from two datasets, and the \(\varepsilon\)DJQ finds all pairs of points from two datasets that are within a distance threshold \(\varepsilon\) of each other. Distributed cluster-based computing systems can be classified into Hadoop-based and Spark-based systems. Based on this classification, in this paper we compare two of the most recent and leading distributed spatial data management systems, namely SpatialHadoop and LocationSpark, by evaluating the performance of existing and newly proposed parallel and distributed distance join query algorithms in different situations with big real-world datasets. As a general conclusion, while SpatialHadoop is a more mature and robust system, LocationSpark is the winner with respect to total execution time.
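
To make the two query definitions concrete, the following is a minimal single-machine sketch in Python. It is a brute-force illustration of what each query returns, not the distributed algorithms evaluated in the paper. The function names `kcpq` and `edjq` are hypothetical, Euclidean distance is assumed, and the threshold is assumed to be inclusive (distance at most \(\varepsilon\)).

```python
import heapq
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def kcpq(P, Q, k):
    """Brute-force K Closest Pair Query (hypothetical helper).

    Returns up to k tuples (distance, p, q), nearest first,
    over all pairs with p in P and q in Q.
    """
    pairs = ((dist(p, q), p, q) for p, q in product(P, Q))
    return heapq.nsmallest(k, pairs)


def edjq(P, Q, eps):
    """Brute-force epsilon Distance Join Query (hypothetical helper).

    Returns every pair (p, q) whose distance is at most eps.
    The inclusive comparison is an assumption of this sketch.
    """
    return [(p, q) for p, q in product(P, Q) if dist(p, q) <= eps]


# Two tiny illustrative datasets (made up for this example).
P = [(0.0, 0.0), (5.0, 5.0)]
Q = [(1.0, 0.0), (5.0, 7.0), (9.0, 9.0)]

closest = kcpq(P, Q, 2)   # the two nearest pairs, with their distances
within = edjq(P, Q, 2.0)  # all pairs at distance <= 2.0
```

Both helpers examine every pair, so the cost grows with |P| × |Q|. That quadratic cost is the reason large-scale distance joins are partitioned and processed in distributed systems such as those compared here.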

Keywords

  • Spatial data processing
  • Distance joins
  • SpatialHadoop
  • LocationSpark

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Francisco García-García (1)
  • Antonio Corral (1, corresponding author)
  • Luis Iribarne (1)
  • George Mavrommatis (2)
  • Michael Vassilakopoulos (2)

  1. Department of Informatics, University of Almeria, Almeria, Spain
  2. DaSE Lab, Department of Electrical and Computer Engineering, University of Thessaly, Volos, Greece
