Skip to main content
Log in

Spatial data management in apache spark: the GeoSpark perspective and beyond

  • Published:
GeoInformatica Aims and scope Submit manuscript

Abstract

The paper presents the details of designing and developing GeoSpark, which extends the core engine of Apache Spark and SparkSQL to support spatial data types, indexes, and geometrical operations at scale. The paper also gives a detailed analysis of the technical challenges and opportunities of extending Apache Spark to support state-of-the-art spatial data partitioning techniques: uniform grid, R-tree, Quad-Tree, and KDB-Tree. The paper also shows how building local spatial indexes, e.g., R-Tree or Quad-Tree, on each Spark data partition can speed up the local computation and hence decrease the overall runtime of the spatial analytics program. Furthermore, the paper introduces a comprehensive experiment analysis that surveys and experimentally evaluates the performance of running de-facto spatial operations like spatial range, spatial K-Nearest Neighbors (KNN), and spatial join queries in the Apache Spark ecosystem. Extensive experiments on real spatial datasets show that GeoSpark achieves up to two orders of magnitude faster run time performance than existing Hadoop-based systems and up to an order of magnitude faster performance than Spark-based systems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22

Similar content being viewed by others

Notes

  1. source code: https://github.com/DataSystemsLab/GeoSpark

  2. Runnable example: https://github.com/jiayuasu/GeoSparkTemplateProject

References

  1. NRC (2001) Committee on the science of climate change, climate change science: an analysis of some key questions, National Academies Press, Washington

  2. Zeng N, Dickinson RE, Zeng X (1996) Climatic impact of amazon Deforestation? A mechanistic model study. Journal of Climate 9:859–883

    Article  Google Scholar 

  3. Chen C, Burton M, Greenberger E, Dmitrieva J (1999) Population migration and the variation of dopamine D4 receptor (DRD4) allele frequencies around the globe. Evol Hum Behav 20(5):309–324

    Article  Google Scholar 

  4. Woodworth PL, Menéndez M, Gehrels WR (2011) Evidence for century-timescale acceleration in mean sea levels and for recent changes in extreme sea levels. Surv Geophys 32(4-5):603–618

    Article  Google Scholar 

  5. Dhar S, Varshney U (2011) Challenges and business models for mobile location-based services and advertising. Commun ACM 54(5):121–128

    Article  Google Scholar 

  6. PostGIS Postgis. http://postgis.net/

  7. Open Geospatial Consortium. http://www.opengeospatial.org/

  8. Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz JH (2013) Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. Proc Int Conf on Very Large Data Bases, VLDB 6(11):1009–1020

    Google Scholar 

  9. Eldawy A, Mokbel MF (2015) Spatialhadoop: a mapreduce framework for spatial data. In: Proceedings of the IEEE International Conference on Data Engineering, ICDE, pp 1352–1363

  10. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the USENIX symposium on Networked Systems Design and Implementation, NSDI, pp 15–28

  11. Ashworth M (2016) Information technology – database languages – sql multimedia and application packages – part 3: Spatial, standard, International organization for standardization, Geneva, Switzerland

  12. Pagel B-U, Six H-W, Toben H, Widmayer P (1993) Towards an analysis of range query performance in spatial data structures. In: Proceedings of the Twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of Database Systems PODS ’93

  13. Patel JM, DeWitt DJ (1996) Partition based spatial-merge join. In: Proceedings of the ACM international conference on management of data, SIGMOD, pp 259–270

  14. Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the ACM international conference on management of data, SIGMOD, pp 47–57

  15. Samet H (1984) The quadtree and related hierarchical data structures. ACM Comput Surv (CSUR) 16(2):187–260

    Article  Google Scholar 

  16. Eldawy A, Alarabi L, Mokbel MF (2015) Spatial partitioning techniques in spatial hadoop. Proc Int Conf on Very Large Data Bases, VLDB 8(12):1602–1605

    Google Scholar 

  17. Eldawy A, Mokbel MF, Jonathan C (2016) Hadoopviz: A mapreduce framework for extensible visualization of big spatial data. In: Proceedings of the IEEE International Conference on Data Engineering, ICDE, pp 601–612

  18. Eldawy A, Mokbel MF (2014) Pigeon: a spatial mapreduce language. In: IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, pp 1242–1245

  19. Lu J, Guting RH (2012) Parallel secondo: boosting database engines with Hadoop. In: International conference on parallel and distributed systems, pp 738 –743

  20. Vo H, Aji A, Wang F (2014) SATO: a spatial data partitioning framework for scalable query processing. In: Proceedings of the ACM international conference on advances in geographic information systems, ACM SIGSPATIAL, pp 545–548

  21. Thusoo A, Sen JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R (2009) Hive: a warehousing solution over a Map-Reduce framework. In: Proceedings of the International Conference on Very Large Data Bases, VLDB, pp 1626–1629

  22. Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in spark. In: Proceedings of the ACM international conference on management of data, SIGMOD, pp 1383–1394

  23. Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: efficient in-memory spatial analytics. In: Proceedings of the ACM international conference on management of data, SIGMOD

  24. Sriharsha R Geospatial analytics using spark. https://github.com/harsha2010/magellan

  25. You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in cloud. In: Proceedings of the IEEE International Conference on Data Engineering Workshop, ICDEW, pp 34–41

  26. Hughes NJ, Annex A, Eichelberger CN, Fox A, Hulbert A, Ronquest M (2015) Geomesa: a distributed architecture for spatio-temporal fusion. In: SPIE defense+ security, pp 94730F–94730F, International society for optics and photonics

  27. Finkel RA, Bentley JL (1974) Quad trees a data structure for retrieval on composite keys. Acta informatica 4(1):1–9

    Article  Google Scholar 

  28. Herring JR (2006) Opengis implementation specification for geographic information-simple feature access-part 2: Sql option, Open Geospatial Consortium Inc

  29. Apache Accumulo. https://accumulo.apache.org/

  30. Hunt P, Konar M, Junqueira FP, Reed B (2010) Zookeeper: Wait-free coordination for internet-scale systems. In: USENIX annual technical conference, Boston, MA, USA June 23-25

  31. Butler H, Daly M, Doyle A, Gillies S, Schaub T, Schmidt C (2014) Geojson, Electronic. http://geojson.org

  32. Perry M, Herring J (2012) Ogc geosparql-a geographic query language for rdf data, OGC Implementation Standard Sept

  33. Group H et al (2014) Hierarchical data format version 5

  34. ESRI E (1998) Shapefile technical description, an ESRI white paper

  35. Yu J, Sarwat M (2016) Two birds, one stone: A fast, yet lightweight, indexing scheme for modern database systems. Proc Int Conf on Very Large Data Bases, VLDB 10(4):385–396

    Google Scholar 

  36. Yu J, Sarwat M (2017) Indexing the pickup and drop-off locations of NYC taxi trips in postgresql - lessons from the road. In: Proceedings of the international symposium on advances in spatial and temporal databases, SSTD, pp 145–162

  37. Taxi NYC, Commission L Nyc tlc trip data. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

  38. Robinson JT (1981) The k-d-b-tree: a search structure for large multidimensional dynamic indexes. In: Proceedings of the 1981 ACM SIGMOD international conference on management of data, Ann Arbor, Michigan, April 29 - May 1, 1981, pp 10–18

  39. Opyrchal L, Prakash A (1999) Efficient object serialization in java. In: Proceedings of the 19th IEEE international conference on distributed computing systems workshops on electronic commerce and web-based applications/middleware, 1999, IEEE, pp 96–101

  40. Cao P, Wang Z (2004) Efficient top-k query calculation in distributed networks. In: Proceedings of the twenty-third annual ACM symposium on principles of distributed computing, PODC 2004, St. John’s, Newfoundland, Canada, July 25-28, 2004, pp 206–215

  41. Roussopoulos N, Kelley S, Vincent F (1995) Nearest neighbor queries. In: Proceedings of the ACM international conference on management of data, SIGMOD, pp 71–79

  42. Zhou X, Abel DJ, Truffet D (1998) Data partitioning for parallel spatial join processing. Geoinformatica 2(2):175–204

    Article  Google Scholar 

  43. Luo G, Naughton JF, Ellmann CJ (2002) A non-blocking parallel spatial join algorithm

  44. Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with mapreduce on clusters. In: Proceedings of the 2009 IEEE international conference on cluster computing, August 31 - September 4, 2009, New Orleans, Louisiana, USA, pp 1–8

  45. Dittrich J, Seeger B (2000) Data redundancy and duplicate detection in spatial join processing. In: Proceedings of the 16th international conference on data engineering, San Diego, California, USA, February 28 - March 3, 2000, pp 535–546

  46. Consortium OG (2010) Opengis web map tile service implementation standard, tech. rep., Tech. Rep. OGC 07-057r7. In: Masó J, Pomakis K, Julià N (eds) Open Geospatial Consortium. Available at http://portal.opengeospatial.org/files

  47. Ripley BD (2005) Spatial statistics, vol 575, Wiley, New York

  48. Haklay MM, Weber P (2008) Openstreetmap: User-generated street maps. IEEE Pervasive Computing 7(4):12–18

    Article  Google Scholar 

  49. TIGER data. https://www.census.gov/geo/maps-data/data/tiger.html

  50. OpenStreetMap. http://www.openstreetmap.org/

  51. OpenStreetMap. Open street map zoom levels. http://wiki.openstreetmap.org/wiki/Zoom_levels

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jia Yu.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yu, J., Zhang, Z. & Sarwat, M. Spatial data management in apache spark: the GeoSpark perspective and beyond. Geoinformatica 23, 37–78 (2019). https://doi.org/10.1007/s10707-018-0330-9

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10707-018-0330-9

Keywords

Navigation