, Volume 22, Issue 2, pp 171–209 | Cite as

Efficient large-scale distance-based join queries in spatialhadoop

  • Francisco García-García
  • Antonio Corral
  • Luis Iribarne
  • Michael Vassilakopoulos
  • Yannis Manolopoulos


Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and known DBJQs are the K Closest Pairs Query (KCPQ) and the ε Distance Join Query (εDJQ). These types of join queries are characterized by a number of desired pairs (K) or a distance threshold (ε) between the components of the pairs in the final result, over two spatial datasets. Both are expensive operations, since two spatial datasets are combined with additional constraints. Given the increasing volume of spatial data originating from multiple sources and stored in distributed servers, it is not always efficient to perform DBJQs on a centralized server. For this reason, this paper addresses the problem of computing DBJQs on big spatial datasets in SpatialHadoop, an extension of Hadoop that supports efficient processing of spatial queries in a cloud-based setting. We propose novel algorithms, based on plane-sweep, to perform efficient parallel DBJQs on large-scale spatial datasets in SpatialHadoop. We evaluate the performance of the proposed algorithms in several situations with large real-world as well as synthetic datasets. The experiments demonstrate the efficiency and scalability of our proposed methodologies.


Distance-based join queries Spatial data processing SpatialHadoop MapReduce Spatial query evaluation 



Work of all authors funded by the MINECO research project [TIN2013-41576-R]. We would like to thank Prof. Goce Trajcevski for providing us interesting comments to enrich the article, and we would like also thank the anonymous reviewers for their constructive remarks.


  1. 1.
    García-García F, Corral A, Iribarne L, Vassilakopoulos M, Manolopoulos Y (2016) Enhancing spatialhadoop with closest pair queries. In: ADBIS Conference, pp 212–225Google Scholar
  2. 2.
    Shekhar S, Chawla S (2003) Spatial databases - a tour. Prentice Hall, New JerseyGoogle Scholar
  3. 3.
    Samet H (1990) Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, BostonGoogle Scholar
  4. 4.
    Schiller JH, Voisard A (eds) (2004) Location-Based Services. Morgan Kaufmann, BurlingtonGoogle Scholar
  5. 5.
    Rigaux P, Scholl M, Voisard A (2002) Spatial databases - with applications to GIS. Elsevier, San FranciscoGoogle Scholar
  6. 6.
    Leong Hou U, Mamoulis N, Yiu ML (2008) Computation and monitoring of exclusive closest pairs. Trans Knowl Data Eng 20(12):1641–1654CrossRefGoogle Scholar
  7. 7.
    Ahmadi E, Nascimento MA (2016) K-closest pairs queries in road networks. In: MDM Conference, pp 232–241Google Scholar
  8. 8.
    Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Algorithms for processing k-closest-pair queries in spatial databases. Data Knowl Eng 49(1):67–104CrossRefGoogle Scholar
  9. 9.
    Roumelis G, Corral A, Vassilakopoulos M, Manolopoulos Y (2014) A new plane-sweep algorithm for the k-closest-pairs query. In: SOFSEM Conference, pp 478–490Google Scholar
  10. 10.
    Gao Y, Chen L, Li X, Yao B, Chen G (2015) Efficient k-closest pair queries in general metric spaces. VLDB J 24(3):415–439CrossRefGoogle Scholar
  11. 11.
    Roumelis G, Vassilakopoulos M, Corral A, Manolopoulos Y (2016) New plane-sweep algorithms for distance-based join queries in spatial databases. GeoInformatica 20(4):571–628CrossRefGoogle Scholar
  12. 12.
    Zhang C, Li F, Jestes J (2012) Efficient parallel kNN joins for large data in MapReduce. In: EDBT Conference, pp 38–49Google Scholar
  13. 13.
    Lu W, Shen Y, Chen S, Ooi BC (2012) Efficient processing of k nearest neighbor joins using MapReduce. PVLDB 5(10):1016–1027Google Scholar
  14. 14.
    Wang K, Han J, Tu B, Dai J, Zhou W, Song X (2010) Accelerating spatial data processing with MapReduce. In: ICPADS Conference, pp 229–236Google Scholar
  15. 15.
    Nodarakis N, Pitoura E, Sioutas S, Tsakalidis AK, Tsoumakos D, Tzimas G (2016) kdann+: A rapid aknn classifier for big data. Trans Large-Scale Data-Knowl-Centered Syst 24:139–168Google Scholar
  16. 16.
    Silva YN, Reed JM (2012) Exploiting mapreduce-based similarity joins. In: SIGMOD Conference, pp 693–696Google Scholar
  17. 17.
    Dean J, Ghemawat S (2004) Mapreduce: Simplified data processing on large clusters. In: 137–150Google Scholar
  18. 18.
    Li F, Ooi BC, Özsu MT, Wu S (2014) Distributed data management using mapreduce. ACM Comput Surv 46(3):31:1–31:42Google Scholar
  19. 19.
    Chen CLP, Zhang C (2014) Data-intensive applications, challenges, techniques and technologies: A survey on big data. Inf Sci 275:314–347CrossRefGoogle Scholar
  20. 20.
    Giachetta R (2015) A framework for processing large scale geospatial and remote sensing data in mapreduce environment. Comput Graph 49:37–46CrossRefGoogle Scholar
  21. 21.
    Gani A, Siddiqa A, Shamshirband S, Hanum F (2016) A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowl Inf Syst 46(2):241–284CrossRefGoogle Scholar
  22. 22.
    Doulkeridis C, Nørvåg K (2014) A survey of large-scale analytical query processing in mapreduce. VLDB J 23(3):355–380CrossRefGoogle Scholar
  23. 23.
    Eldawy A, Mokbel MF (2015) Spatialhadoop: A mapreduce framework for spatial data. In: ICDE Conference, pp 1352–1363Google Scholar
  24. 24.
    Shi J, Qiu Y, Minhas UF, Jiao L, Wang C, Reinwald B, Ȯzcan F (2015) Clash of the titans: Mapreduce vs. spark for large scale data analytics. PVLDB 8(13):2110–2121Google Scholar
  25. 25.
    Lu J, Güting RH (2012) Parallel secondo: Boosting database engines with Hadoop. In: ICPADS Conference, pp 738–743Google Scholar
  26. 26.
    Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz JH (2013) Hadoop-GIS: A high performance spatial data warehousing system over MapReduce. PVLDB 6(11):1009–1020Google Scholar
  27. 27.
    Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive - A warehousing solution over a MapReduce framework. PVLDB 2(2):1626–1629Google Scholar
  28. 28.
    You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in cloud. In: ICDE Workshops, pp 34–41Google Scholar
  29. 29.
    Yu J, Wu J, Sarwat M (2015) Geospark: a cluster computing framework for processing large-scale spatial data. In: SIGSPATIAL Conference, pp 70:1–70:4Google Scholar
  30. 30.
    Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: Efficient in-memory spatial analytics. In: SIGMOD Conference, pp 1071–1085Google Scholar
  31. 31.
    Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) Locationspark: A distributed in-memory data management system for big spatial data. PVLDB 9(13):1565–1568Google Scholar
  32. 32.
    Li Z, Huang Q, Carbone GJ, Hu F (2017) A high performance query analytical framework for supporting data-intensive climate studies, Computers. Comput Environ Urban Syst 62:210–221CrossRefGoogle Scholar
  33. 33.
    Buck JB, Watkins N, LeFevre J, Ioannidou K, Maltzahn C, Polyzotis N, Brandt SA (2011) Scihadoop: array-based query processing in hadoop. In: SC Conference, pp 66:1–66:11Google Scholar
  34. 34.
    Eldawy A, Mokbel MF, Al-Harthi S, Alzaidy A, Tarek K, Ghani S (2015) SHAHED: A mapreduce-based system for querying and visualizing spatio-temporal satellite data. In: ICDE Conference, pp 1585–1596Google Scholar
  35. 35.
    Palamuttam R, Mogrovejo RM, Mattmann C, Wilson B, Whitehall K, Verma R, McGibbney LJ, Ramirez PM (2015) Scispark: Applying in-memory distributed computing to weather event detection and tracking. In: Conference on Big Data, pp 2020–2026Google Scholar
  36. 36.
    Zhang S, Han J, Liu Z, Wang K, Feng S (2009) Spatial queries evaluation with MapReduce. In: GCC Conference, pp 287–292Google Scholar
  37. 37.
    Ma Q, Yang B, Qian W, Zhou A (2009) Query processing of massive trajectory data based on MapReduce. In: CloudDb Conference, pp 9–16Google Scholar
  38. 38.
    Akdogan A, Demiryurek U, Demiryurek FB, Shahabi C (2010) Voronoi-based geospatial query processing with MapReduce. In: CloudCom Conference, pp 9–16Google Scholar
  39. 39.
    Maillo J, Triguero I, Herrera F (2015) A mapreduce-based k-nearest neighbor approach for big data classification. In: TrustCom/BigDataSE/ISPA Conference, pp 167–172Google Scholar
  40. 40.
    Park Y, Min J, Shim K (2013) Parallel computation of skyline and reverse skyline queries using mapreduce. PVLDB 6(14):2002–2013Google Scholar
  41. 41.
    Zhang J, Jiang X, Ku W, Qin X (2016) Efficient parallel skyline evaluation using mapreduce. IEEE Trans Parallel Distrib Syst 27(7):1996–2009CrossRefGoogle Scholar
  42. 42.
    Ji C, Li Z, Qu W, Xu Y, Li Y (2014) Scalable nearest neighbor query processing based on inverted grid index. J Netw Comput Appl 44:172–182CrossRefGoogle Scholar
  43. 43.
    Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with MapReduce on clusters. In: CLUSTER Conference, pp 1–8Google Scholar
  44. 44.
    Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. In: SIGMOD Conference, pp 259–270Google Scholar
  45. 45.
    Kim Y, Shim K (2012) Parallel top-k similarity join algorithms using MapReduce. In: ICDE Conference, pp 510–521Google Scholar
  46. 46.
    Jacox EH, Samet H (2008) Metric space similarity joins. ACM Trans Database Syst 33(2):1–38CrossRefGoogle Scholar
  47. 47.
    Gupta H, Chawda B, Negi S, Faruquie TA, Subramaniam LV, Mohania MK (2013) Processing multi-way spatial joins on map-reduce. In: EDBT Conference, pp 113–124Google Scholar
  48. 48.
    Wang H, Belhassena A (2017) Parallel trajectory search based on distributed index. Inf Sci 388-399:62–83CrossRefGoogle Scholar
  49. 49.
    Eldawy A, Li Y, Mokbel MF, Janardan R (2013) Cg_hadoop: computational geometry in mapreduce. In: SIGSPATIAL Conference, pp 284–293Google Scholar
  50. 50.
    Pertesis D, Doulkeridis C (2015) Efficient skyline query processing in spatialhadoop. Inf Syst 54:325–335CrossRefGoogle Scholar
  51. 51.
    Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2000) Closest pair queries in spatial databases. In: SIGMOD Conference, pp 189–200Google Scholar
  52. 52.
    Hjaltason GR, Samet H (1998) Incremental distance join algorithms for spatial databases. In: SIGMOD Conference, pp 237–248Google Scholar
  53. 53.
    Shin H, Moon B, Lee S (2003) Adaptive and incremental processing for distance join queries. IEEE Trans Knowl Data Eng 15(6):1561–1578CrossRefGoogle Scholar
  54. 54.
    Yang C, Lin K (2002) An index structure for improving closest pairs and related join queries in spatial databases. In: IDEAS Conference, pp 140–149Google Scholar
  55. 55.
    Gutierrez G, Sȧez P (2013) The k closest pairs in spatial databases - when only one set is indexed. GeoInformatica 17(4):543–565CrossRefGoogle Scholar
  56. 56.
    Eldawy A, Alarabi L, Mokbel MF (2015) Spatial partitioning techniques in spatial hadoop. PVLDB 8(12):1602–1613Google Scholar
  57. 57.
    Preparata FP, Shamos MI (1985) Computational Geometry - An Introduction. Springer, BerlinCrossRefGoogle Scholar
  58. 58.
    Corral A, Almendros-Jimėnez JM (2007) A performance comparison of distance-based query algorithms using r-trees in spatial databases. Inf Sci 177(11):2207–2237CrossRefGoogle Scholar
  59. 59.
    Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to Algorithms, 3rd edn. MIT Press, CambridgeGoogle Scholar
  60. 60.
    Chaudhuri S, Motwani R, Narasayya VR (1999) On random sampling over joins. In: SIGMOD Conference, pp 263–274Google Scholar
  61. 61.
    Corral A, Vassilakopoulos M (2005) On approximate algorithms for distance-based queries using r-trees. Comput J 48(2):220–238CrossRefGoogle Scholar
  62. 62.
    Leutenegger ST, Edgington JM, Lopez MA (1997) Str: A simple and efficient algorithm for r-tree packing. In: ICDE Conference, pp 497–506Google Scholar
  63. 63.
    Papadopoulos AN, Nanopoulos A, Manolopoulos Y (2006) Processing distance join queries with constraints. Comput J 49(3):281–296CrossRefGoogle Scholar
  64. 64.
    Mamoulis N, Papadias D, Multiway spatial joins ACM (2001) Trans. Database Syst 26(4):424–475CrossRefGoogle Scholar
  65. 65.
    Corral A, Manolopoulos Y, Theodoridis Y, Vassilakopoulos M (2004) Multi-way distance join queries in spatial databases. GeoInformatica 8(4):373–402CrossRefGoogle Scholar
  66. 66.
    Vo H, Aji A, Wang F (2014) SATO: a spatial data partitioning framework for scalable query processing. In: SIGSPATIAL Conference, pp 545–548Google Scholar
  67. 67.
    Aji A, Vo H, Wang F Effective spatial data partitioning for scalable query processing. arXiv:1509.00910

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. 1.Department on InformaticsUniversity of AlmeriaAlmeriaSpain
  2. 2.Department of Electrical and Computer EngineeringUniversity of ThessalyVolosGreece
  3. 3.Department of InformaticsAristotle UniversityThessalonikiGreece

Personalised recommendations