Spatial big data is spatio-temporal data that is too large, or requires data-intensive computation that is too demanding, for traditional computing architectures. Stream processing in this context is the processing of spatio-temporal data in motion. The data is observational; it is produced by sensors – moving or otherwise. Computations on the data are made as the data is produced or received. A distributed processing cluster is a networked collection of computers that communicate and process data in a coordinated manner. Computers in the cluster are coordinated to solve a common problem. A lambda architecture is a scalable, fault-tolerant data-processing architecture that is designed to handle large quantities of data by exploiting both stream and batch processing methods. Data partitioning involves physically dividing a dataset into separate data stores on a distributed processing cluster. This is done to achieve improved scalability, performance, availability, and fault tolerance. Distributed file systems, in the context of big data architectures, are similar to traditional distributed file systems but are intended to persist large datasets on commodity hardware in a fault-tolerant manner with simple coherency models. The MapReduce programming model is intended for large-scale distributed data processing and is based upon simple concepts involving iterating over data, performing a computation on key/value pairs, grouping the intermediary values by key, iterating over the resulting groups, and reducing each group to provide a result. A GPU-accelerated distributed processing framework is an extension to a traditional distributed processing framework that supports offloading tasks to GPUs for further acceleration.
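The data partitioning described above can be illustrated with a minimal sketch; the node count, record format, and hash scheme here are illustrative assumptions rather than any particular system's implementation:

```python
# Minimal sketch of hash-based data partitioning across a cluster.
# Node count and record format are illustrative assumptions.

def partition(records, num_nodes):
    """Assign each (key, value) record to a node by hashing its key."""
    nodes = {i: [] for i in range(num_nodes)}
    for key, value in records:
        nodes[hash(key) % num_nodes].append((key, value))
    return nodes

records = [("sensor-a", 1.2), ("sensor-b", 3.4), ("sensor-a", 5.6)]
placement = partition(records, num_nodes=3)
# Records sharing a key always land on the same node,
# so per-key processing needs no cross-node communication.
```

Because all records with the same key land on the same node, per-key computations can proceed independently on each node, which is the property that makes partitioning a foundation for scalability.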
Spatial big data architectures are intended to address requirements for spatio-temporal data that is too large or computationally demanding for traditional computing architectures (Shekhar et al. 2012). This includes operations such as real-time data ingest, stream processing, batch processing, storage, and spatio-temporal analytical processing.
Real-time processing of observational data (data in motion). This commonly involves monitoring and tracking dynamic assets in real-time; this can include vehicles, aircraft, and vessels, as well as stationary assets such as weather and environmental monitoring sensors.
Batch processing of persisted spatio-temporal data (data at rest). Workflows with data at rest incorporate tabular and spatial processing (e.g., summarizing data, analyzing patterns, and proximity analysis), geoenrichment and geoenablement (adding spatial capabilities to non-spatial data, adding information from contextual spatial data collections), and machine learning and predictive analytics (clustering, classification, and prediction).
Spatial big data systems must be scalable. Scalability means the system can accommodate growing volumes of data, process them, and allocate computational resources without degrading cost or efficiency. To meet this requirement, data sets and their processing must be distributed across multiple computing and storage nodes.
Spatial big data systems must be able to process large streams of data in a short period of time, returning results to users as efficiently as possible. In addition, the system should support computation-intensive spatio-temporal analytics.
Big data systems must be able to manage the continuous flow of data and its processing in real time, facilitating decision-making.
Big data systems must support data consistency, heterogeneity, and exploitation. Different data formats must also be managed so that they represent useful information for the system.
Big data systems must ensure the security of data and its manipulation within the architecture, supporting information integrity, secure data exchange, multilevel policy-driven access control, and prevention of unauthorized access.
Big data systems must ensure high data availability through data replication and horizontal scaling (i.e., distributing a data set over clusters of computers and storage nodes). The system must support replication and tolerate hardware failures.
Big data systems must support transparent intercommunication, allowing information to be exchanged among machines, processes, interfaces, and people.
Key Research Findings
Spatial big data architectures have existed since the early 2000s, with some of the original implementations at Google and Microsoft. Key research issues related to spatial big data architectures include data storage (repositories, distributed storage, NoSQL databases), distributed spatio-temporal analytic processing, stream processing, scalable cloud computing, and GPU-enabled distributed processing frameworks.
Parallel Database Systems
Big data repositories have existed in many forms, frequently built by corporations and governmental agencies with special requirements. Beginning in the 1990s, commercial vendors began offering parallel database management systems to address the needs of big data (DeWitt and Gray 1992). Parallel database systems are classified as being shared memory (processing elements sharing memory), shared disk (processing elements do not share memory, but do share disk storage), or shared nothing (neither memory nor disk storage is shared between processing elements). Significant efforts included those by Teradata, IBM, Digital, Microsoft, Oracle, Google, and Amazon. Additionally, beginning in the 1980s, there were numerous research systems that contributed to these efforts (e.g., GAMMA and Bubba; DeWitt et al. 1986, Alexander and Copeland 1988).
Distributed File Stores
Hadoop Distributed File System (HDFS) is a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster (Shvachko et al. 2010). It was inspired by the Google File System (GFS; Ghemawat et al. 2003). HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require redundant array of independent disks (RAID) storage on hosts (but to increase input-output (I/O) performance, RAID configurations may be employed). With the default replication value of three, data is stored on three nodes: two on the same rack and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
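The default rack-aware placement described above (with a replication factor of three: two copies on one rack, a third on a different rack) can be sketched as follows; the rack and node names are illustrative assumptions:

```python
# Sketch of HDFS-style rack-aware replica placement for replication
# factor 3: two replicas on one rack, the third on a different rack.
# Rack and node names are illustrative assumptions.

def place_replicas(racks):
    """racks: dict mapping rack name -> list of node names.
    Returns three (rack, node) placements for one block."""
    rack_names = sorted(racks)
    first_rack, second_rack = rack_names[0], rack_names[1]
    return [
        (first_rack, racks[first_rack][0]),   # replica 1
        (first_rack, racks[first_rack][1]),   # replica 2, same rack, other node
        (second_rack, racks[second_rack][0]), # replica 3, different rack
    ]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas(cluster)
# -> two replicas on rack1, one on rack2
```

Placing the third replica on a separate rack is what allows the data to survive the loss of an entire rack, while keeping two copies rack-local limits cross-rack write traffic.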
NoSQL Databases
A NoSQL (originally referencing “non-SQL” or “non-relational”) database is a mechanism for the storage and retrieval of data that is modeled differently from standard relational databases (NoSQL 2009; Pavlo and Aslett 2016). NoSQL databases are often considered next generation databases; they are intended to address weaknesses of traditional relational databases such as being readily distributable, simpler in design, open-source, and horizontally scalable (often problematic for relational databases). Many databases supporting these characteristics originated in the late 1960s; the “NoSQL” description was employed beginning in the late 1990s with the requirements imposed by companies such as Facebook, Google, and Amazon. NoSQL databases are commonly used with big data applications. NoSQL systems are also sometimes called “Not only SQL” to emphasize that they may support SQL-like query languages.
Key-value: Apache Ignite, Couchbase, Dynamo, Oracle NoSQL Database, Redis, Riak
Columnar: Accumulo, Cassandra, Druid, HBase, Vertica
Document: Apache CouchDB, Cosmos DB, IBM Domino, MarkLogic, MongoDB
Graph: AllegroGraph, Apache Giraph, MarkLogic, Neo4J
Multi-model: Apache Ignite, Couchbase, MarkLogic
Spatial Batch Processing
Spatio-temporal analysis in a batch context involves a very wide scope of functionality. In academia, much of the research has focused on the spatial join (or spatio-temporal join) function. In commercial systems, spatial analysis also includes summarizing data, incident and similar location detection, proximity analysis, pattern analysis, and data management (import, export, cleansing, etc.).
Spatial joins have been widely studied in the standard sequential environment (Jacox and Samet 2007), as well as in parallel (Brinkhoff et al. 1996) and distributed environments (Abel et al. 1995). For over 20 years, algorithms have been developed to take advantage of parallel and distributed processing architectures and software frameworks. The recent resurgence of interest in spatial join processing is the result of newfound interest in distributed, fault-tolerant computing frameworks such as Apache Hadoop, as well as the explosion in observational and IoT data.
With distributed processing architectures, there are two principal approaches that are employed when performing spatial joins. The first, termed a broadcast (or mapside) spatial join, is designed for joining a large dataset with another small dataset (e.g., political boundaries). The large dataset is partitioned across the processing nodes and the complete small dataset is broadcast to each of the nodes. This allows significant optimization opportunities. The second approach, termed a partitioned (or reduce side) spatial join, is a more general technique that is used when joining two large datasets. Partitioned joins use a divide-and-conquer approach (Aji et al. 2013). The two large datasets are divided into small pieces via a spatial decomposition, and each small piece is processed independently.
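The broadcast (map-side) approach can be sketched in a few lines. Here axis-aligned rectangles stand in for real polygon geometries, and the partition layout is an illustrative assumption; the point is only that the full small dataset is available locally to every partition of the large one:

```python
# Sketch of a broadcast (map-side) spatial join: the small dataset of
# regions is conceptually shipped whole to every partition of the large
# point dataset. Rectangles stand in for real polygons (an assumption).

def contains(rect, point):
    xmin, ymin, xmax, ymax = rect
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax

def broadcast_join(point_partitions, regions):
    """regions: dict name -> rect, broadcast to every partition."""
    results = []
    for partition in point_partitions:          # each runs on its own node
        for point in partition:
            for name, rect in regions.items():  # full small dataset locally
                if contains(rect, point):
                    results.append((point, name))
    return results

regions = {"A": (0, 0, 4, 5), "B": (5, 0, 10, 5)}
partitions = [[(1, 1), (6, 2)], [(3, 4)]]
matches = broadcast_join(partitions, regions)
```

Because every partition holds the complete region set, no shuffle of the large dataset is needed; a partitioned (reduce-side) join would instead spatially decompose both datasets and process each cell independently.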
SJMR (Spatial Join with MapReduce) introduced the first distributed spatial join on Hadoop using the MapReduce programming model (Dean and Ghemawat 2008; Zhang et al. 2009). SpatialHadoop (Eldawy and Mokbel 2015) optimized SJMR with a precomputed, persistent spatial index (it supports grid files, R-trees, and R+-trees). Hadoop-GIS (Aji et al. 2013), which is utilized in medical pathology imaging, features both 2D and 3D spatial joins. GIS Tools for Hadoop (Whitman et al. 2014) is an open-source library that implements range and distance queries and k-NN; it also supports a distributed PMR quadtree-based spatial index. GeoSpark (Yu et al. 2015) is a framework for performing spatial joins, range queries, and k-NN queries; it supports quadtree and R-tree indexing of the source data. Magellan (Sriharsha 2015) is an open-source library for geospatial analytics that uses Spark (Zaharia et al. 2010). It supports a broadcast join and a reduce-side optimized join and is integrated with Spark SQL for a traditional SQL user experience. SpatialSpark (You et al. 2015) supports both a broadcast spatial join and a partitioned spatial join on Spark; the partitioning uses either a fixed grid, binary space partitioning, or a sort-tile approach. STARK (Hagedorn et al. 2017) is a Spark-based framework that supports spatial joins, k-NN, and range queries on both spatial and spatio-temporal data. STARK supports three temporal operators (contains, containedBy, and intersects) and also supports the DBSCAN density-based spatial clusterer (Ester et al. 1996). MSJS (multi-way spatial join with Spark; Du et al. 2017) addresses the problem of performing multi-way spatial joins using the common technique of cascading sequences of pairwise spatial joins. Simba (Xie et al. 2016) offers range, distance (circle range), and k-NN queries as well as distance and k-NN joins.
Two-level indexing, global and local, is employed, similar to the various indexing work on Hadoop MapReduce. LocationSpark (Tang et al. 2016) supports range queries, k-NN, spatial joins, and k-NN joins; it uses global and local indices (grid, R-tree, quadtree, and IR-tree). GeoMesa is an open-source, distributed spatio-temporal index built on top of Bigtable-style databases (Chang et al. 2008) using an implementation of the Geohash algorithm written in Scala (Hughes et al. 2015). The Esri GeoAnalytics Server (Whitman et al. 2017) supports many types of spatial analysis in a distributed environment (leveraging the Spark framework). It provides functionality for summarizing data (e.g., aggregation, spatio-temporal join, polygon overlay), incident and similar location detection, proximity analysis, and pattern analysis (hot spot analysis, NetCDF generation).
MapReduce Programming Model
MapReduce is a programming model and an associated implementation for processing big data sets with a parallel, distributed algorithm (Sakr et al. 2013). A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). A MapReduce framework manages the processing by marshalling the distributed cluster nodes, running the various tasks and algorithms in parallel, managing communications and data transfers between cluster nodes, while supporting fault tolerance and redundancy.
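The student example above can be sketched directly as a single-process MapReduce: the map phase emits (name, 1) pairs, a shuffle step groups values by key, and the reduce phase sums each group. In a real framework, each phase would run in parallel across cluster nodes:

```python
# Minimal single-process sketch of the MapReduce model using the
# students-by-name example: map emits (name, 1) pairs, shuffle groups
# the pairs by key, reduce counts each group.
from collections import defaultdict

def map_phase(students):
    return [(name, 1) for name in students]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

students = ["Ada", "Grace", "Ada", "Alan"]
frequencies = reduce_phase(shuffle(map_phase(students)))
# -> {"Ada": 2, "Grace": 1, "Alan": 1}
```

The framework's real contribution is not this logic but the machinery around it: distributing the map and reduce tasks, moving intermediate pairs between nodes, and recovering from node failures.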
The MapReduce model is inspired by the map and reduce functions commonly used in functional programming (note that their purpose in the MapReduce framework is not the same as in their original forms). The key contributions of the MapReduce model are not the actual map and reduce functions, which resemble the Message Passing Interface (MPI) standard’s reduce and scatter operations; rather, they are the scalability and fault tolerance supported through optimization of the execution engine. A single-threaded implementation of MapReduce is commonly slower than a traditional (non-MapReduce) implementation; gains are typically realized with multi-node or multi-threaded implementations.
MapReduce libraries have been written in many programming languages, with different levels of optimization. The most popular open-source implementation is found in Apache Hadoop.
Stream Processing
Stream processing is a computer programming paradigm, equivalent to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing (Gedik et al. 2008). Such applications can use multiple computational units, such as the floating point unit on a graphics processing unit or field-programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.
The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a sequence of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Kernel functions are usually pipelined, and optimal local on-chip memory reuse is attempted, in order to minimize the loss in bandwidth, accredited to external memory interaction. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks.
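Uniform streaming can be sketched with a pipeline of kernel functions applied, in order, to every element of a stream. Python generators model the pipelined, element-at-a-time execution; the kernels themselves are illustrative assumptions:

```python
# Sketch of uniform streaming: a pipeline of kernel functions applied,
# in order, to every element of a stream. Generators model the
# pipelined execution; the kernels are illustrative.

def apply_kernel(stream, kernel):
    for element in stream:
        yield kernel(element)

def pipeline(stream, kernels):
    for kernel in kernels:
        stream = apply_kernel(stream, kernel)
    return stream

readings = iter([1.0, 2.0, 3.0])
kernels = [lambda x: x * 10,   # scale
           lambda x: x + 1]    # offset
result = list(pipeline(readings, kernels))
# -> [11.0, 21.0, 31.0]
```

Because each kernel sees one element at a time and the stages are chained, elements flow through the pipeline without the whole stream ever being materialized, which is the property hardware stream processors exploit for on-chip memory reuse.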
Lambda Architecture
The Lambda architecture (Marz and Warren 2015) depends on a data model with an append-only, immutable data source that serves as a system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
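The append-only data model can be sketched as follows: events are only ever appended to the log, and current state is derived by folding over the log in time order rather than by overwriting records. The event field names and asset identifiers are illustrative assumptions:

```python
# Sketch of the Lambda data model: an append-only, immutable log of
# timestamped events; state is derived from the log's time ordering,
# never by overwriting. Field names are illustrative assumptions.

events = []  # system of record: append-only

def record(timestamp, asset, position):
    events.append({"ts": timestamp, "asset": asset, "pos": position})

def current_state(log):
    state = {}
    for event in sorted(log, key=lambda e: e["ts"]):  # natural time order
        state[event["asset"]] = event["pos"]          # later events win
    return state

record(2, "truck-7", (10.0, 20.0))
record(1, "truck-7", (0.0, 0.0))  # late-arriving earlier event
state = current_state(events)
# -> the later position wins; the earlier one is retained in the log
```

Because derived state is a pure function of the immutable log, it can always be recomputed from scratch by the batch layer, which is what gives the architecture its fault tolerance.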
GPU-Accelerated Distributed Frameworks
GPU-accelerated distributed frameworks extend traditional distributed processing frameworks such as Spark with the ability to offload tasks to GPUs for further acceleration. Examples include Spark-GPU, an accelerated in-memory data processing engine for clusters (Yuan et al. 2016), and SWAT, a programmable, in-memory, high-performance computing platform (Grossman and Sarkar 2016); such frameworks have also been applied to real-time analysis on CPU/GPU heterogeneous clusters, for example in meteorology (Hassaan and Elghandour 2016).
Examples of Application
The application of technologies related to spatial big data architectures is broad, given the rapidly growing interest in spatial data that has emerged during the twenty-first century. Notable among this family of technologies in terms of significance and application are distributed processing frameworks, geospatial stream processing, and the numerous implementations of platform as a service (PaaS).
Apache Hadoop (Apache 2006) is an open-source software framework and associated utilities that facilitate using a network of commodity computers to solve problems involving large amounts of data and computation. Inspired by the seminal work at Google on MapReduce and the Google File System (GFS), Hadoop provides a software framework for both the distributed storage and processing of big data using the MapReduce programming model.
Similar to the efforts at Google, Hadoop was designed for computer clusters built from commodity hardware (still the common usage pattern). Hadoop has also been employed on large clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), a resource manager, a collection of utilities, and a processing framework that is an implementation of the MapReduce programming model that runs against large clusters of machines. HDFS splits very large files (including those of size gigabytes and larger) into blocks that are distributed across multiple nodes in a cluster. Reliability is achieved by replicating the blocks across multiple nodes (with a default replication factor of 3). Hadoop distributes packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
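The block-splitting and placement just described can be sketched in miniature. The 4-byte block size and node names here are illustrative assumptions (HDFS blocks default to 128 MB), and round-robin stands in for the real placement policy:

```python
# Sketch of splitting a large file into fixed-size blocks and spreading
# them across nodes, as HDFS does. The tiny block size, node names, and
# round-robin placement are illustrative assumptions.

def split_into_blocks(data, block_size):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, nodes):
    """Round-robin placement; computation then runs where each block lives."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

data = b"0123456789abcdef"
blocks = split_into_blocks(data, block_size=4)
placement = assign_blocks(blocks, nodes=["node1", "node2", "node3"])
# blocks -> [b"0123", b"4567", b"89ab", b"cdef"]
```

Data locality follows from this layout: the framework ships the packaged map code to node1 for block 0, node2 for block 1, and so on, instead of moving the blocks to the computation.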
Hadoop has been deployed in traditional datacenters as well as in the cloud. The cloud allows organizations to deploy Hadoop without the need to acquire hardware or specific setup expertise. Vendors who currently have an offering for the cloud that incorporate Hadoop include Microsoft, Amazon, IBM, Google, and Oracle. Most of the Fortune 50 companies currently deploy Hadoop clusters.
Spark Streaming is an extension to the Spark API that supports scalable, high-throughput, fault-tolerant stream processing of real-time data streams (Garillot and Maas 2018). Data can be ingested from many sources (e.g., Kafka, Flume, or TCP sockets) and can be processed using temporally aware algorithms expressed with high-level functions like map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. Spark’s machine learning (Spark ML) and graph processing (GraphX) algorithms can be applied to these data streams.
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
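The discretization idea can be sketched without Spark itself: the continuous stream is chopped into micro-batches by time interval, and each batch is then processed with the same high-level operations. The records, batch interval, and kernel functions below are illustrative assumptions:

```python
# Sketch of a discretized stream: the continuous stream is chopped into
# per-interval micro-batches, and each batch is processed with the same
# map/reduce operations. Records and interval are illustrative.

def discretize(stream, batch_interval):
    """Group an iterable of (time, value) records into interval batches."""
    batches = {}
    for t, value in stream:
        batches.setdefault(t // batch_interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

def process(batches, map_fn, reduce_fn):
    return [reduce_fn(map(map_fn, batch)) for batch in batches]

stream = [(0, 1), (1, 2), (2, 3), (3, 4)]
batches = discretize(stream, batch_interval=2)   # -> [[1, 2], [3, 4]]
totals = process(batches, map_fn=lambda v: v * v, reduce_fn=sum)
# -> [5, 25]
```

Each batch here plays the role of one RDD in the DStream's internal sequence: the same batch operations are reapplied as each new interval of data arrives.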
Big Data as a Service
Big Data as a Service (BDaaS) is a new concept that combines Software as a Service (SaaS), Platform as a Service (PaaS), and Data as a Service (DaaS) in order to address the requirements of working with massively large data sets. BDaaS offerings commonly incorporate the Hadoop stack (e.g., HDFS, Hive, MapReduce, Pig, Storm, and Spark), NoSQL data stores, and stream processing capabilities. BDaaS is typically delivered in one of three forms:
As a public cloud service from a provider
As a private service (software or appliance) inside the firewall
As software deployed on a public infrastructure as a service
Microsoft Azure is a cloud computing service utilizing Microsoft-managed data centers that supports both software as a service (SaaS) and platform as a service (PaaS). It provides data storage capabilities including Cosmos DB (a NoSQL database), the Azure Data Lake, and SQL Server-based databases. Azure supports a scalable event processing engine and a machine learning service that supports predictive analytics and data science applications.
The Google Cloud is a PaaS offering that supports big data with data warehousing, batch and stream processing, data exploration, and support for the Hadoop/Spark framework. Key components include BigQuery, a managed data warehouse supporting analytics at scale; Cloud Dataflow, which supports both stream and batch processing; and Cloud Dataproc, a framework for running Apache MapReduce and Spark processes.
Amazon AWS is commonly considered an Infrastructure as a Service (IaaS) offering, where the user is responsible for configuration, but it also provides PaaS functionality. Amazon supports Elastic MapReduce (EMR), which works in conjunction with EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service). Data storage is provided through DynamoDB (NoSQL), Redshift (columnar), and RDS (relational data store). Machine learning and real-time data processing infrastructures are also supported.
Other significant examples of BDaaS providers include the IBM Cloud and the Oracle Data Cloud. Big data Infrastructure as a Service (IaaS) offerings (that work with other clouds such as AWS, Azure, and Oracle) are available from Hortonworks, Cloudera, Esri, and Databricks.
Future Directions for Research
Spatio-temporally enabling distributed and NoSQL databases such as Accumulo, Cassandra, HBase, Dynamo, and Elasticsearch. This involves not only supporting spatial types but also incorporating rich collections of topological, spatial, and temporal operators.
Spatio-temporal analytics is another area requiring attention. Much research to date has focused on supporting spatial (or spatio-temporal) joins on distributed frameworks such as MapReduce or Spark. While beneficial, spatio-temporal analytics is a far richer domain that also includes geostatistics (e.g., kriging), spatial statistics, proximity analysis, and pattern analysis.
Spatially enabling machine learning algorithms that run in a distributed cluster (e.g., extending Spark ML or Scikit-learn (Pedregosa et al. 2011)) is another significant research area given the growing interest and importance of machine learning, predictive analytics, and deep learning. To date, research has primarily focused on density-based clustering algorithms such as DBSCAN, HDBSCAN (McInnes and Healy 2017), and OPTICS.
Recently, much attention has been paid to incorporating GPU processing capabilities into distributed processing frameworks such as Spark. While some basic spatial capabilities can currently be supported (e.g., aggregation and visualization of point data), much work needs to be done to further streamline and optimize the integration of GPU processors and extend the native spatio-temporal capabilities.
- Alexander W, Copeland G (1988) Process and dataflow control in distributed data-intensive systems. In: Proceedings of the 1988 ACM SIGMOD international conference on management of data (SIGMOD ’88), pp 90–98. https://doi.org/10.1145/50202.50212
- Apache (2006) Welcome to Apache Hadoop!. http://hadoop.apache.org. Accessed 26 Mar 2018
- Brinkhoff T, Kriegel HP, Seeger B (1996) Parallel processing of spatial joins using R-trees. In: Proceedings of the 12th international conference on data engineering, New Orleans, Louisiana, pp 258–265
- DeWitt DJ, Gerber RH, Graefe G, Heytens ML, Kumar KB, Muralikrishna M (1986) GAMMA – a high performance dataflow database machine. In: Proceedings of the 12th international conference on very large data bases (VLDB ’86), Kyoto, Japan, pp 228–237
- Eldawy A, Mokbel MF (2015) SpatialHadoop: a MapReduce framework for spatial data. In: IEEE 31st international conference on data engineering (ICDE), Seoul, South Korea, pp 1352–1363
- Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, pp 226–231
- Garillot F, Maas G (2018) Stream processing with apache spark: best practices for scaling and optimizing Apache spark. O’Reilly Media, Sebastopol. http://shop.oreilly.com/product/0636920047568.do
- Gedik B, Andrade H, Wu K-L, Yu PS, Doo M (2008) SPADE: the system s declarative stream processing engine. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data (SIGMOD ’08), pp 1123–1134. https://doi.org/10.1145/1376616.1376729
- Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: Proceedings of the 19th ACM symposium on operating systems principles, Oct 2003, pp 29–43. https://doi.org/10.1145/945445.945450
- Grossman M, Sarkar, V (2016) SWAT: a programmable, in-memory, distributed, high-performance computing platform. In: Proceedings of the 25th ACM international symposium on high-performance parallel and distributed computing (HPDC ’16). ACM, New York, pp 81–92. https://doi.org/10.1145/2907294.2907307
- Hagedorn S, Götze P, Sattler KU (2017) The STARK framework for spatio-temporal data analytics on Spark. In: Proceedings of the 17th conference on database systems for business, technology, and the web (BTW 2017), Stuttgart
- Hassaan M, Elghandour I (2016) A real-time big data analysis framework on a CPU/GPU heterogeneous cluster: a meteorological application case study. In: Proceedings of the 3rd IEEE/ACM international conference on big data computing, applications and technologies (BDCAT ’16). ACM, New York, pp 168–177. https://doi.org/10.1145/3006299.3006304
- Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems, 1st edn. Manning Publications, Greenwich
- McInnes L, Healy J (2017) Accelerated hierarchical density based clustering. In: IEEE international conference on data mining workshops (ICDMW), New Orleans, Louisiana, pp 33–42
- Mysore D, Khupat S, Jain S (2013) Big data architecture and patterns. IBM, White Paper, 2013. http://www.ibm.com/developerworks/library/bdarchpatterns1. Accessed 26 Mar 2018
- NoSQL (2009) NoSQL definition. http://nosql-database.org. Accessed 26 Mar 2018
- Sena B, Allian AP, Nakagawa EY (2017) Characterizing big data software architectures: a systematic mapping study. In: Proceeding of the 11th Brazilian symposium on software components, architectures, and reuse (SBCARS ’17). https://doi.org/10.1145/3132498.3132510
- Shekhar S, Gunturi V, Evans MR, Yang KS (2012) Spatial big-data challenges intersecting mobility and cloud computing. In: Proceedings of the eleventh ACM international workshop on data engineering for wireless and mobile access (MobiDE ’12), pp 1–6. https://doi.org/10.1145/2258056.2258058
- Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). https://doi.org/10.1109/MSST.2010.5496972
- Sriharsha R (2015) Magellan: geospatial analytics on spark. https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/. Accessed June 2017
- Whitman RT, Park MB, Ambrose SM, Hoel EG (2014) Spatial indexing and analytics on Hadoop. In: Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL ’14), pp 73–82. https://doi.org/10.1145/2666310.2666387
- Whitman RT, Park MB, Marsh BG, Hoel EG (2017) Spatio-temporal join on Apache spark. In: Hoel E, Newsam S, Ravada S, Tamassia R, Trajcevski G (eds) Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL’17). https://doi.org/10.1145/3139958.3139963
- Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: efficient in-memory spatial analytics. In: Proceedings of the 2016 international conference on management of data (SIGMOD ’16), pp 1071–1085. https://doi.org/10.1145/2882903.2915237
- You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in Cloud. In: 2015 31st IEEE international conference on data engineering workshops, Seoul, 13–17 April 2015, pp 34–41
- Yu J, Wu J, Sarwat M (2015) GeoSpark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems, Seattle, WA
- Yuan Y, Salmi MF, Huai Y, Wang K, Lee R, Zhang X (2016) Spark-GPU: an accelerated in-memory data processing engine on clusters. In: Proceedings of the 2016 IEEE international conference on big data (Big Data 2016), Washington, DC, pp 273–283
- Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing (HotCloud’10), Boston, MA
- Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with MapReduce on clusters. In: IEEE international conference on cluster computing (CLUSTER’09), New Orleans, Louisiana, pp 1–8