Abstract
This paper presents ST-Hadoop, the first full-fledged open-source MapReduce framework with native support for spatio-temporal data. ST-Hadoop is a comprehensive extension to Hadoop and SpatialHadoop that injects spatio-temporal data awareness into each of their layers, mainly the language, indexing, and operations layers. In the language layer, ST-Hadoop provides built-in spatio-temporal data types and operations. In the indexing layer, ST-Hadoop spatio-temporally loads and divides data across computation nodes in the Hadoop Distributed File System in a way that mimics spatio-temporal index structures, which results in orders of magnitude better performance than Hadoop and SpatialHadoop when dealing with spatio-temporal data and queries. In the operations layer, ST-Hadoop ships with support for three fundamental spatio-temporal queries, namely spatio-temporal range, top-k nearest neighbor, and join queries. The extensibility of ST-Hadoop allows others to add features and operations easily using approaches similar to those described in the paper. Extensive experiments conducted on a large-scale dataset of 10 TB that contains over 1 billion spatio-temporal records show that ST-Hadoop achieves orders of magnitude better performance than Hadoop and SpatialHadoop when dealing with spatio-temporal data and operations. The key idea behind the performance gain in ST-Hadoop is its ability to index spatio-temporal data within the Hadoop Distributed File System.
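To make the indexing idea concrete, the following is a minimal single-machine sketch of how partitioning records by time and space lets a range query skip most of the data. It is not ST-Hadoop's implementation: the fixed one-day time slices, the uniform one-degree grid, and the names `partition_key`, `build_index`, and `range_query` are all assumptions chosen for illustration, and an in-memory dictionary stands in for file-system partitions.

```python
from collections import defaultdict
from math import floor

SLICE_SECONDS = 86_400  # assumption: one temporal slice per day
CELL_DEGREES = 1.0      # assumption: uniform 1-degree spatial grid


def partition_key(x, y, t):
    """Map a record to its (time slice, grid column, grid row) partition."""
    return (floor(t / SLICE_SECONDS), floor(x / CELL_DEGREES), floor(y / CELL_DEGREES))


def build_index(records):
    """Group (x, y, t) records by partition (a stand-in for on-disk partitions)."""
    index = defaultdict(list)
    for x, y, t in records:
        index[partition_key(x, y, t)].append((x, y, t))
    return index


def range_query(index, x_min, x_max, y_min, y_max, t_min, t_max):
    """Return the matching records and the number of partitions scanned."""
    lo = partition_key(x_min, y_min, t_min)
    hi = partition_key(x_max, y_max, t_max)
    results, scanned = [], 0
    # Visit only partitions whose slice and cell can overlap the query box.
    for s in range(lo[0], hi[0] + 1):
        for cx in range(lo[1], hi[1] + 1):
            for cy in range(lo[2], hi[2] + 1):
                part = index.get((s, cx, cy))
                if part is None:
                    continue
                scanned += 1
                # Refine: filter the records inside each candidate partition.
                results.extend(
                    r for r in part
                    if x_min <= r[0] <= x_max
                    and y_min <= r[1] <= y_max
                    and t_min <= r[2] <= t_max
                )
    return results, scanned


records = [(0.5, 0.5, 10), (0.6, 0.4, 90_000), (5.2, 3.1, 20), (0.7, 0.2, 50)]
index = build_index(records)
matches, scanned = range_query(index, 0, 0.9, 0, 0.9, 0, 100)
```

In this toy run the query touches one of the three partitions instead of scanning every record, which is the pruning effect the abstract attributes to indexing. A real deployment would size slices and cells to the data distribution rather than use fixed constants.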
Cite this article
Alarabi, L., Mokbel, M.F. & Musleh, M. ST-Hadoop: a MapReduce framework for spatio-temporal data. Geoinformatica 22, 785–813 (2018). https://doi.org/10.1007/s10707-018-0325-6
DOI: https://doi.org/10.1007/s10707-018-0325-6