ST-Hadoop: a MapReduce framework for spatio-temporal data

Abstract

This paper presents ST-Hadoop, the first full-fledged open-source MapReduce framework with native support for spatio-temporal data. ST-Hadoop is a comprehensive extension to Hadoop and SpatialHadoop that injects spatio-temporal data awareness into each of their layers, mainly the language, indexing, and operations layers. In the language layer, ST-Hadoop provides built-in spatio-temporal data types and operations. In the indexing layer, ST-Hadoop spatio-temporally loads and divides data across computation nodes in the Hadoop Distributed File System in a way that mimics spatio-temporal index structures, which results in orders of magnitude better performance than Hadoop and SpatialHadoop when dealing with spatio-temporal data and queries. In the operations layer, ST-Hadoop ships with support for three fundamental spatio-temporal queries, namely spatio-temporal range, top-k nearest neighbor, and join queries. The extensibility of ST-Hadoop allows others to add features and operations easily using approaches similar to those described in the paper. Extensive experiments conducted on a large-scale dataset of size 10 TB that contains over 1 billion spatio-temporal records show that ST-Hadoop achieves orders of magnitude better performance than Hadoop and SpatialHadoop when dealing with spatio-temporal data and operations. The key idea behind the performance gain in ST-Hadoop is its ability to index spatio-temporal data within the Hadoop Distributed File System.
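To make the indexing idea concrete, the following is a minimal single-machine sketch in Python of how dividing records by time slice and spatial grid cell lets a spatio-temporal range query skip most partitions. It is an illustration only, not ST-Hadoop's actual implementation or API: the names `Record`, `partition_key`, `build_index`, and `range_query`, the uniform grid, the fixed one-day slice, and the constants `CELL`, `SLICE`, and `EPOCH` are all assumptions introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical parameters chosen for illustration only.
CELL = 10.0                   # side length of one spatial grid cell
SLICE = timedelta(days=1)     # length of one temporal slice
EPOCH = datetime(2017, 1, 1)  # origin used to number the temporal slices


@dataclass(frozen=True)
class Record:
    x: float
    y: float
    t: datetime


def partition_key(rec):
    """Map a record to (time slice, grid column, grid row)."""
    return ((rec.t - EPOCH) // SLICE, int(rec.x // CELL), int(rec.y // CELL))


def build_index(records):
    """Group records into partitions, one per spatio-temporal key."""
    index = defaultdict(list)
    for rec in records:
        index[partition_key(rec)].append(rec)
    return index


def range_query(index, x_min, y_min, x_max, y_max, t_min, t_max):
    """Return matching records and the number of partitions actually read."""
    s_lo, s_hi = (t_min - EPOCH) // SLICE, (t_max - EPOCH) // SLICE
    c_lo, c_hi = int(x_min // CELL), int(x_max // CELL)
    r_lo, r_hi = int(y_min // CELL), int(y_max // CELL)
    hits, scanned = [], 0
    # Visit only the partitions whose key overlaps the query box.
    for s in range(s_lo, s_hi + 1):
        for c in range(c_lo, c_hi + 1):
            for r in range(r_lo, r_hi + 1):
                part = index.get((s, c, r))
                if not part:
                    continue
                scanned += 1
                # Refine: partitions on the query border may hold non-matches.
                for rec in part:
                    if (x_min <= rec.x <= x_max and y_min <= rec.y <= y_max
                            and t_min <= rec.t <= t_max):
                        hits.append(rec)
    return hits, scanned


records = [
    Record(1.0, 1.0, datetime(2017, 1, 1, 8)),
    Record(12.0, 3.0, datetime(2017, 1, 1, 9)),
    Record(2.0, 2.0, datetime(2017, 1, 5, 8)),
    Record(95.0, 95.0, datetime(2017, 1, 1, 10)),
]
index = build_index(records)
hits, scanned = range_query(
    index, 0.0, 0.0, 15.0, 5.0,
    datetime(2017, 1, 1), datetime(2017, 1, 1, 23, 59),
)
print(f"{len(hits)} hits from {scanned} of {len(index)} partitions")
```

In this toy setting the query reads only the partitions whose time slice and grid cell overlap the query box, whereas a non-indexed scan would have to read every record. The same pruning principle, applied to file blocks in a distributed file system, is what the abstract credits for the performance gain.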




Author information


Corresponding author

Correspondence to Louai Alarabi.


Cite this article

Alarabi, L., Mokbel, M.F. & Musleh, M. ST-Hadoop: a MapReduce framework for spatio-temporal data. Geoinformatica 22, 785–813 (2018). https://doi.org/10.1007/s10707-018-0325-6


Keywords

  • MapReduce-based systems
  • Spatio-temporal systems
  • Spatio-temporal range query
  • Spatio-temporal nearest neighbor query
  • Spatio-temporal join query