Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Architectures

  • Erik G. Hoel
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_216-1

Definitions

Spatial big data is spatio-temporal data that is too large, or that requires data-intensive computation too demanding, for traditional computing architectures.

Stream processing in this context is the processing of spatio-temporal data in motion. The data is observational; it is produced by sensors, moving or otherwise. Computations on the data are made as the data is produced or received.

A distributed processing cluster is a networked collection of computers that communicate and process data in a coordinated manner; the computers are coordinated to solve a common problem.

A lambda architecture is a scalable, fault-tolerant data-processing architecture designed to handle large quantities of data by exploiting both stream and batch processing methods.

Data partitioning physically divides a dataset into separate data stores on a distributed processing cluster; this is done to improve scalability, performance, availability, and fault tolerance.

Distributed file systems, in the context of big data architectures, are similar to traditional distributed file systems but are intended to persist large datasets on commodity hardware in a fault-tolerant manner with simple coherency models.

The MapReduce programming model is intended for large-scale distributed data processing and is based upon simple concepts: iterating over data, performing a computation on key/value pairs, grouping the intermediate values by key, iterating over the resulting groups, and reducing each group to produce a result.

A GPU-accelerated distributed processing framework extends a traditional distributed processing framework to support offloading tasks to GPUs for further acceleration.

Overview

Spatial big data architectures are intended to address requirements for spatio-temporal data that is too large or computationally demanding for traditional computing architectures (Shekhar et al. 2012). This includes operations such as real-time data ingest, stream processing, batch processing, storage, and spatio-temporal analytical processing.

Spatial big data poses additional challenges beyond those commonly faced in the big data space, including advanced indexing, querying, analytical processing, visualization, and machine learning. Examples of spatial big data include moving vehicles (peer-to-peer ridesharing, delivery vehicles, ships, airplanes, etc.), stationary sensor data (e.g., SCADA, AMI), cell phones (call detail records), IoT devices, as well as spatially enabled content from social media (Fig. 1).
Fig. 1

Generic big data architecture

Spatial big data architectures are used in a variety of workflows:
  • Real-time processing of observational data (data in motion). This commonly involves monitoring and tracking dynamic assets in real time; these can include vehicles, aircraft, and vessels, as well as stationary assets such as weather and environmental monitoring sensors.

  • Batch processing of persisted spatio-temporal data (data at rest). Workflows with data at rest incorporate tabular and spatial processing (e.g., summarizing data, analyzing patterns, and proximity analysis), geoenrichment and geoenablement (adding spatial capabilities to non-spatial data, adding information from contextual spatial data collections), and machine learning and predictive analytics (clustering, classification, and prediction).

Architectural Requirements

When designing and developing scalable software systems that can address the high-level workflows encountered in spatial big data systems, a number of basic requirements must be identified. It is important to note that most of these also generally apply to traditional non-spatial big data systems (Mysore et al. 2013; Klein et al. 2016; Sena et al. 2017):
  • Scalable

    Spatial big data systems must be scalable. Scalability encompasses the ability to support growing volumes of data and their processing, allocating additional computational resources without compromising cost or efficiency. Meeting this requirement entails distributing data sets and their processing across multiple computing and storage nodes.

  • Performant

    Spatial big data systems must be able to process large streams of data in a short period of time, returning results to users as efficiently as possible. In addition, the system should support computation-intensive spatio-temporal analytics.

  • Real-time

    Big data systems must be able to manage the continuous flow of data and its processing in real time, facilitating decision-making.

  • Consistent

    Big data systems must support data consistency, heterogeneity, and exploitation. Different data formats must also be managed so that they can be represented as useful information for the system.

  • Secure

    Big data systems must ensure the security of the data and of its manipulation within the architecture, supporting information integrity, secure data exchange, and multilevel, policy-driven access control, while preventing unauthorized access.

  • Available

    Big data systems must ensure high data availability through data replication and horizontal scaling (i.e., distributing a data set over clusters of computers and storage nodes). The system must support replication and tolerate hardware failures.

  • Interoperable

    Big data systems must intercommunicate transparently, allowing information to be exchanged among machines, processes, interfaces, and people.

Key Research Findings

Spatial big data architectures have existed since the early 2000s, with some of the original implementations at Google and Microsoft. Key research issues related to spatial big data architectures include data storage (repositories, distributed storage, NoSQL databases), distributed spatio-temporal analytic processing, stream processing, scalable cloud computing, and GPU-enabled distributed processing frameworks.

Data Storage

Parallel Database Systems

Big data repositories have existed in many forms, frequently built by corporations and governmental agencies with special requirements. Beginning in the 1990s, commercial vendors began offering parallel database management systems to address the needs of big data (DeWitt and Gray 1992). Parallel database systems are classified as being shared memory (processing elements sharing memory), shared disk (processing elements do not share memory, but do share disk storage), or shared nothing (neither memory nor disk storage is shared between processing elements). Significant efforts included those by Teradata, IBM, Digital, Microsoft, Oracle, Google, and Amazon. Additionally, beginning in the 1980s, there were numerous research systems that contributed to these efforts (e.g., GAMMA and Bubba; DeWitt et al. 1986, Alexander and Copeland 1988).

Distributed File Stores

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster (Shvachko et al. 2010). It was inspired by the Google File System (GFS; Ghemawat et al. 2003). HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts and hence, in theory, does not require redundant array of independent disks (RAID) storage on hosts (although RAID configurations may be employed to increase input/output (I/O) performance). With the default replication value of three, data is stored on three nodes: two on the same rack and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
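
As a loose illustration of this default placement policy, consider the following sketch; the rack layout, node names, and helper function are illustrative assumptions, not HDFS internals:

```python
import random

def place_block_replicas(racks):
    """Toy model of the default HDFS placement described above: with a
    replication factor of 3, two replicas land on distinct nodes of one
    rack and the third on a node of a different rack."""
    rack_a, rack_b = random.sample(list(racks), 2)  # two distinct racks
    remote = random.choice(racks[rack_a])           # single replica
    pair = random.sample(racks[rack_b], 2)          # two replicas, same rack
    return [remote] + pair

# Hypothetical cluster topology: rack id -> data nodes.
cluster = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}
print(place_block_replicas(cluster))  # e.g., ['dn2', 'dn4', 'dn6']
```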

NoSQL Databases

A NoSQL (originally referencing “non-SQL” or “non-relational”) database is a mechanism for the storage and retrieval of data that is modeled differently from standard relational databases (NoSQL 2009; Pavlo and Aslett 2016). NoSQL databases are often considered next-generation databases; they are intended to address weaknesses of traditional relational databases by being readily distributable, simpler in design, open source, and horizontally scalable (often problematic for relational databases). Many databases supporting these characteristics originated in the late 1960s; the “NoSQL” label came into use in the late 1990s, driven by the requirements of companies such as Facebook, Google, and Amazon. NoSQL databases are commonly used with big data applications. NoSQL systems are also sometimes called “Not only SQL” to emphasize that they may support SQL-like query languages.

In order to achieve increased performance and scalability, NoSQL databases commonly use data structures (e.g., key-value, columnar, document, or graph) that differ from those used in relational databases. NoSQL databases vary in their applicability to particular problem domains and are often classified by their primary data structures; examples of each class include (a toy sketch of these data models follows the list):
  • Key-value: Apache Ignite, Couchbase, Dynamo, Oracle NoSQL Database, Redis, Riak

  • Columnar: Accumulo, Cassandra, Druid, HBase, Vertica

  • Document: Apache CouchDB, Cosmos DB, IBM Domino, MarkLogic, MongoDB

  • Graph: AllegroGraph, Apache Giraph, MarkLogic, Neo4J

  • Multi-model: Apache Ignite, Couchbase, MarkLogic
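
The following minimal, in-memory Python structures sketch how the same hypothetical vehicle observation might be shaped under each of the primary data models above; real systems add distribution, replication, and indexing on top of these structures:

```python
# Key-value: an opaque value addressed by a unique key.
kv_store = {"vehicle:42:last_fix": "2018-03-26T12:00:00Z|-117.19|34.05"}

# Columnar (column-family): rows keyed by id, values grouped by family.
columnar = {"vehicle:42": {"location": {"x": -117.19, "y": 34.05},
                           "meta": {"type": "truck", "fleet": "west"}}}

# Document: self-describing nested records queried by content.
document = {"_id": 42, "type": "truck",
            "track": [{"t": 0, "x": -117.19, "y": 34.05}]}

# Graph: nodes plus labeled edges (here, an adjacency list).
graph = {"nodes": {"a": {}, "b": {}}, "edges": [("a", "b", "near")]}
```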

Spatial Batch Processing

Spatio-temporal analysis in a batch context involves a very wide scope of functionality. In academia, much of the research has focused on the spatial join (or spatio-temporal join) function. In commercial systems, spatial analysis also includes summarizing data, incident and similar location detection, proximity analysis, pattern analysis, and data management (import, export, cleansing, etc.).

Spatial Joins

Spatial joins have been widely studied in the standard sequential environment (Jacox and Samet 2007) as well as in parallel (Brinkhoff et al. 1996) and distributed environments (Abel et al. 1995). For over 20 years, algorithms have been developed to take advantage of parallel and distributed processing architectures and software frameworks. The recent resurgence of interest in spatial join processing is the result of newfound interest in distributed, fault-tolerant computing frameworks such as Apache Hadoop, as well as the explosion in observational and IoT data.

With distributed processing architectures, there are two principal approaches employed when performing spatial joins. The first, termed a broadcast (or map-side) spatial join, is designed for joining a large dataset with a small dataset (e.g., political boundaries). The large dataset is partitioned across the processing nodes, and the complete small dataset is broadcast to each of the nodes. This allows significant optimization opportunities. The second approach, termed a partitioned (or reduce-side) spatial join, is a more general technique that is used when joining two large datasets. Partitioned joins use a divide-and-conquer approach (Aji et al. 2013): the two large datasets are divided into small pieces via a spatial decomposition, and each small piece is processed independently.
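
A minimal, single-process sketch of the partitioned approach follows. It uses a uniform grid over bounding boxes (with deduplication of pairs found in multiple cells); the systems below use true spatial decompositions and exact geometric predicates, and distribute the per-cell work across the cluster:

```python
from collections import defaultdict

def grid_cells(box, cell):
    """Yield the grid cells overlapped by a box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    for i in range(int(xmin // cell), int(xmax // cell) + 1):
        for j in range(int(ymin // cell), int(ymax // cell) + 1):
            yield (i, j)

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def partitioned_join(left, right, cell=10.0):
    """Divide and conquer: replicate each box into every overlapping cell,
    join within each cell, and deduplicate pairs found in several cells."""
    parts = defaultdict(lambda: ([], []))
    for rec in left:
        for c in grid_cells(rec[1], cell):
            parts[c][0].append(rec)
    for rec in right:
        for c in grid_cells(rec[1], cell):
            parts[c][1].append(rec)
    results = set()
    for ls, rs in parts.values():       # each cell is independent work
        for lid, lbox in ls:
            for rid, rbox in rs:
                if overlaps(lbox, rbox):
                    results.add((lid, rid))
    return results

# Hypothetical inputs: (id, (xmin, ymin, xmax, ymax)).
left = [("a", (0, 0, 5, 5)), ("b", (12, 12, 18, 18))]
right = [("x", (4, 4, 9, 9)), ("y", (40, 40, 41, 41))]
print(partitioned_join(left, right))  # {('a', 'x')}
```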

SJMR (Spatial Join with MapReduce) introduced the first distributed spatial join on Hadoop using the MapReduce programming model (Dean and Ghemawat 2008; Zhang et al. 2009). SpatialHadoop (Eldawy and Mokbel 2015) optimized SJMR with a precomputed, persistent spatial index (it supports grid files, R-trees, and R+-trees). Hadoop-GIS (Aji et al. 2013), which is utilized in medical pathology imaging, features both 2D and 3D spatial joins. GIS Tools for Hadoop (Whitman et al. 2014) is an open source library that implements range and distance queries and k-NN; it also supports a distributed PMR quadtree-based spatial index. GeoSpark (Yu et al. 2015) is a framework for performing spatial joins, range queries, and k-NN queries; it supports quadtree and R-tree indexing of the source data. Magellan (Sriharsha 2015) is an open source library for geospatial analytics that uses Spark (Zaharia et al. 2010). It supports a broadcast join and an optimized reduce-side join and is integrated with Spark SQL for a traditional SQL user experience. SpatialSpark (You et al. 2015) supports both a broadcast spatial join and a partitioned spatial join on Spark; the partitioning uses either a fixed grid, binary space partitioning, or a sort-tile approach. STARK (Hagedorn et al. 2017) is a Spark-based framework that supports spatial joins, k-NN, and range queries on both spatial and spatio-temporal data. STARK supports three temporal operators (contains, containedBy, and intersects) as well as the DBSCAN density-based spatial clusterer (Ester et al. 1996). MSJS (multi-way spatial join algorithm with Spark; Du et al. 2017) addresses the problem of performing multi-way spatial joins using the common technique of cascading sequences of pairwise spatial joins. Simba (Xie et al. 2016) offers range, distance (circle range), and k-NN queries as well as distance and k-NN joins. Two-level (global and local) indexing is employed, similar to the various indexing work on Hadoop MapReduce. LocationSpark (Tang et al. 2016) supports range queries, k-NN, spatial joins, and k-NN joins, using global and local indices (grid, R-tree, quadtree, and IR-tree). GeoMesa is an open-source, distributed spatio-temporal index built on top of Bigtable-style databases (Chang et al. 2008) using an implementation of the Geohash algorithm written in Scala (Hughes et al. 2015). The Esri GeoAnalytics Server (Whitman et al. 2017) supports many types of spatial analysis in a distributed environment (leveraging the Spark framework). It provides functionality for summarizing data (e.g., aggregation, spatio-temporal join, polygon overlay), incident and similar location detection, proximity analysis, and pattern analysis (hot spot analysis, NetCDF generation).

MapReduce Programming Model

MapReduce is a programming model and an associated implementation for processing big data sets with a parallel, distributed algorithm (Sakr et al. 2013). A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). A MapReduce framework manages the processing by marshalling the distributed cluster nodes, running the various tasks and algorithms in parallel, and managing communications and data transfers between cluster nodes, all while supporting fault tolerance and redundancy.
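
A minimal, single-threaded sketch of these stages, using the name-counting example above (a real framework distributes each stage across the cluster and handles the shuffle itself):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs; here, key = first name, value = 1.
    for student in records:
        yield (student["first_name"], 1)

def shuffle(pairs):
    # Group intermediate values by key (done by the framework).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(key, values):
    # Reduce: summarize each group; here, a count per name.
    return (key, sum(values))

students = [{"first_name": n} for n in ["Ada", "Alan", "Ada", "Grace"]]
result = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(students)))
print(result)  # {'Ada': 2, 'Alan': 1, 'Grace': 1}
```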

The MapReduce model is inspired by the map and reduce functions commonly used in functional programming (although their purpose in the MapReduce framework is not the same as in their original forms). The key contribution of the MapReduce model is not the actual map and reduce functions, which resemble the Message Passing Interface (MPI) standard’s reduce and scatter operations; the major contributions are the scalability and fault tolerance that are achieved through an optimized execution engine. A single-threaded implementation of MapReduce is commonly slower than a traditional (non-MapReduce) implementation; gains are typically realized only with multi-node or multi-threaded implementations.

MapReduce libraries have been written in many programming languages, with different levels of optimization. The most popular open-source implementation is found in Apache Hadoop.

Stream Processing

Stream processing is a computer programming paradigm, closely related to dataflow programming, event stream processing, and reactive programming, that allows some applications to more easily exploit a limited form of parallel processing (Gedik et al. 2008). Such applications can use multiple computational units, such as the floating-point units on a graphics processing unit or field-programmable gate arrays (FPGAs), without explicitly managing allocation, synchronization, or communication among those units.

The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed. Given a sequence of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Kernel functions are usually pipelined, and optimal local on-chip memory reuse is attempted in order to minimize the loss in bandwidth attributable to external memory interaction. Uniform streaming, where one kernel function is applied to all elements in the stream, is typical. Since the kernel and stream abstractions expose data dependencies, compiler tools can fully automate and optimize on-chip management tasks.
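
Uniform streaming can be loosely illustrated with a chain of Python generators standing in for pipelined kernel functions; the kernels and readings below are arbitrary examples, and a real streaming system would fuse such kernels onto hardware rather than interpret them:

```python
def kernel_scale(stream, factor):
    # First kernel: a uniform transform applied to every element.
    for x in stream:
        yield x * factor

def kernel_clamp(stream, lo, hi):
    # Second kernel: consumes the first's output element by element,
    # without ever materializing the intermediate stream.
    for x in stream:
        yield max(lo, min(hi, x))

sensor_readings = iter([0.2, 5.0, -3.1, 1.7])  # the input stream
pipeline = kernel_clamp(kernel_scale(sensor_readings, 10.0), 0.0, 20.0)
print(list(pipeline))  # [2.0, 20.0, 0.0, 17.0]
```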

Lambda Architecture

A lambda architecture is intended to process large volumes of data by incorporating both batch and real-time processing techniques (Marz and Warren 2015). The approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. In addition, historic analysis is used to tune the real-time analytical processing as well as to build models for prediction (machine learning). The rise of the lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of MapReduce (Fig. 2).
Fig. 2

Generic lambda architecture

The Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
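
A toy sketch of the pattern: an append-only log feeds both a periodically recomputed batch view and an incremental speed view, which are merged at query time. The names and in-memory “views” are illustrative assumptions; production systems back each layer with batch and stream frameworks:

```python
master_log = []   # append-only, immutable system of record
batch_view = {}   # recomputed periodically over the entire log
speed_view = {}   # incremental view of events since the last batch run

def ingest(event):
    # Timestamped events are appended, never overwritten.
    master_log.append(event)
    speed_view[event["key"]] = speed_view.get(event["key"], 0) + 1

def run_batch():
    # Batch layer: a comprehensive recomputation from the master log.
    batch_view.clear()
    for event in master_log:
        batch_view[event["key"]] = batch_view.get(event["key"], 0) + 1
    speed_view.clear()  # real systems expire speed-layer state gradually

def query(key):
    # Serving layer: join the batch view with the real-time view.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

for k in ["truck", "ship", "truck"]:
    ingest({"key": k, "t": len(master_log)})
run_batch()
ingest({"key": "truck", "t": len(master_log)})
print(query("truck"))  # 3: two from the batch view, one from the speed view
```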

GPU-Accelerated Distributed Frameworks

Distributed processing frameworks such as Spark (which supports in-memory processing) have been extended and enhanced with GPUs for key computationally intensive operations (e.g., machine learning and graph theoretic algorithms; Prasad et al. 2015). Researchers have observed that a few Spark nodes with GPUs can outperform a much larger cluster of non-GPU nodes (Grossman and Sarkar 2016; Hassaan and Elghandour 2016; Yuan et al. 2016; Hong et al. 2017). The main bottlenecks when incorporating GPUs in hybrid architectures involve data communication, memory and resource management, and differences in programming models. Different approaches to these problems have employed GPU-wrapper APIs (e.g., PyCUDA); hybrid RDDs (resilient distributed datasets), where the RDD is stored in CPU memory; generation of native GPU code from high-level source code written for the distributed framework (e.g., Scala, Java, or Python code with Spark); and native GPU RDDs, where data is processed and stored in GPU device memory (Fig. 3).
Fig. 3

GPU-accelerated Spark framework
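
The hybrid-RDD pattern can be sketched with Spark's mapPartitions, staging each CPU-resident partition and offloading it to a device kernel. Here `gpu_aggregate` is a hypothetical stand-in for a kernel invoked through a wrapper API such as PyCUDA, not a real library call:

```python
# Sketch only: assumes a running SparkContext `sc`.

def gpu_aggregate(batch):
    # Placeholder: a real kernel would copy `batch` to device memory,
    # launch the computation on the GPU, and copy the result back.
    return [sum(batch)]

def process_partition(rows):
    batch = list(rows)                   # stage the CPU-resident partition
    if batch:
        yield from gpu_aggregate(batch)  # offload the heavy computation

# partials = sc.parallelize(range(1_000_000), numSlices=8) \
#              .mapPartitions(process_partition).collect()
# total = sum(partials)
```

The per-partition staging step is where the data-communication bottleneck noted above appears: each batch must cross the CPU/GPU memory boundary, which native GPU RDDs avoid by keeping data resident on the device.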

Examples of Application

The application of technologies related to spatial big data architectures is broad, given the rapidly growing interest in spatial data that has emerged during the twenty-first century. Notable members of this family of technologies, in terms of significance and application, include distributed processing frameworks, geospatial stream processing, and the numerous implementations of platform as a service (PaaS).

Apache Hadoop

Apache Hadoop (Apache 2006) is an open-source software framework and associated utilities that facilitate using a network of commodity computers to solve problems involving large amounts of data and computation. Inspired by the seminal work at Google on MapReduce and the Google File System (GFS), Hadoop provides a software framework for both the distributed storage and processing of big data using the MapReduce programming model.

Similar to the efforts at Google, Hadoop was designed for computer clusters built from commodity hardware (still the common usage pattern). Hadoop has also been employed on large clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), a resource manager, a collection of utilities, and a processing framework that implements the MapReduce programming model against large clusters of machines. HDFS splits very large files (gigabytes and larger) into blocks that are distributed across multiple nodes in a cluster. Reliability is achieved by replicating the blocks across multiple nodes (with a default replication factor of 3). Hadoop distributes packaged code to the nodes so that they process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have local access to. This allows the dataset to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking.

Hadoop has been deployed in traditional datacenters as well as in the cloud. The cloud allows organizations to deploy Hadoop without the need to acquire hardware or specific setup expertise. Vendors with cloud offerings that incorporate Hadoop include Microsoft, Amazon, IBM, Google, and Oracle. Most of the Fortune 50 companies currently deploy Hadoop clusters.

Spark Streaming

Spark Streaming is an extension to the Spark API that supports scalable, high-throughput, fault-tolerant stream processing of real-time data streams (Garillot and Maas 2018). Data can be ingested from many sources (e.g., Kafka, Flume, or TCP sockets) and can be processed using temporally aware algorithms expressed with high-level functions such as map, reduce, join, and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. Spark’s machine learning (Spark ML) and graph processing (GraphX) algorithms can also be applied to these data streams.

Internally, Spark Streaming receives live input data streams and divides the data into micro-batches, which are then processed by the Spark engine to generate the final stream of results in batches (Fig. 4).
Fig. 4

Spark streaming micro-batch architecture

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
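
The canonical DStream pattern from the Spark Streaming programming guide illustrates this model: a socket source is discretized into one-second batches, and word counts are computed per batch (the host, port, and application name are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 1)               # 1-second micro-batches

# Each one-second batch of lines from the socket becomes one RDD
# in the lines DStream.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                             # push results to the console

ssc.start()              # begin receiving and processing
ssc.awaitTermination()   # run until stopped or failed
```

Each transformation above is applied to every RDD in the DStream as its batch arrives, which is how the micro-batch architecture of Fig. 4 surfaces in the API.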

Big Data as a Service

Platform as a Service (PaaS) is a category of cloud computing services that provides a platform allowing customers to run and manage applications without the complexity of building and maintaining the infrastructure usually associated with developing and launching an application (Chang et al. 2010). PaaS is commonly delivered in one of three ways:
  • As a public cloud service from a provider

  • As a private service (software or appliance) inside the firewall

  • As software deployed on a public infrastructure as a service

Big Data as a Service (BDaaS) is a newer concept that combines Software as a Service (SaaS), Platform as a Service (PaaS), and Data as a Service (DaaS) in order to address the requirements of working with massively large data sets. BDaaS offerings commonly incorporate the Hadoop stack (e.g., HDFS, Hive, MapReduce, Pig, Storm, and Spark), NoSQL data stores, and stream processing capabilities.

Microsoft Azure is a cloud computing service utilizing Microsoft-managed data centers that supports both software as a service (SaaS) and platform as a service (PaaS). It provides data storage capabilities including Cosmos DB (a NoSQL database), the Azure Data Lake, and SQL Server-based databases. Azure supports a scalable event processing engine and a machine learning service that supports predictive analytics and data science applications.

The Google Cloud is a PaaS offering that supports big data with data warehousing, batch and stream processing, data exploration, and support for the Hadoop/Spark framework. Key components include BigQuery, a managed data warehouse supporting analytics at scale; Cloud Dataflow, which supports both stream and batch processing; and Cloud Dataproc, a framework for running Hadoop MapReduce and Spark processes.

Amazon AWS is commonly considered Infrastructure as a Service (IaaS), where the user is responsible for configuration, but it also provides PaaS functionality. Amazon supports Elastic MapReduce (EMR), which works in conjunction with EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service). Data storage is provided through DynamoDB (NoSQL), Redshift (columnar), and RDS (relational data store). Machine learning and real-time data processing infrastructures are also supported.

Other significant examples of BDaaS providers include the IBM Cloud and the Oracle Data Cloud. Big data Infrastructure as a Service (IaaS) offerings (that work with other clouds such as AWS, Azure, and Oracle) are available from Hortonworks, Cloudera, Esri, and Databricks.

Future Directions for Research

Despite the significant advancements that have been made over the past decade on key topics related to spatial big data architectures, much further research is necessary in order to democratize these capabilities and extend their application to broader problem domains. Some of the more significant areas needing attention include:
  • Spatio-temporally enabling distributed and NoSQL databases such as Accumulo, Cassandra, HBase, Dynamo, and Elasticsearch. This involves not only supporting spatial types but also incorporating rich collections of topological, spatial, and temporal operators.

  • Spatio-temporal analytics is another area requiring attention. Much research to date has focused on supporting spatial (or spatio-temporal) joins on distributed frameworks such as MapReduce or Spark. While beneficial, spatio-temporal analytics is a far richer domain that also includes geostatistics (e.g., kriging), spatial statistics, proximity analysis, and pattern analysis.

  • Spatially enabling machine learning algorithms that run in a distributed cluster (e.g., extending Spark ML or Scikit-learn (Pedregosa et al. 2011)) is another significant research area, given the growing interest in and importance of machine learning, predictive analytics, and deep learning. To date, research has primarily focused on density-based clustering algorithms such as DBSCAN, HDBSCAN (McInnes and Healy 2017), and OPTICS (a minimal single-machine example follows this list).

  • Recently, much attention has been paid to incorporating GPU processing capabilities into distributed processing frameworks such as Spark. While some basic spatial capabilities can currently be supported (e.g., aggregation and visualization of point data), much work needs to be done to further streamline and optimize the integration of GPU processors and extend the native spatio-temporal capabilities.
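
As a baseline for the density-based clustering work referenced above, the single-machine case is already well supported; the following minimal scikit-learn sketch clusters hypothetical GPS fixes (the coordinates and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical GPS observations as (lat, lon) in degrees.
coords = np.array([
    [34.05, -117.19], [34.06, -117.20], [34.05, -117.21],
    [40.71, -74.01], [40.72, -74.00],
])

# The haversine metric expects radians; eps is a great-circle distance,
# here roughly 5 km expressed as a fraction of Earth's radius (6371 km).
db = DBSCAN(eps=5.0 / 6371.0, min_samples=2,
            metric="haversine", algorithm="ball_tree")
labels = db.fit_predict(np.radians(coords))
print(labels)  # e.g., [0 0 0 1 1]: two spatial clusters, no noise points
```

Distributing exactly this computation, so that the neighborhood queries span partition boundaries, is the open problem the bullet above describes.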

References

  1. Abel DJ, Ooi BC, Tan K-L, Power R, Yu JX (1995) Spatial join strategies in distributed spatial DBMS. In: Advances in spatial databases – 4th international symposium, SSD’95. Lecture notes in computer science, vol 1619. Springer, Portland, pp 348–367
  2. Aji A, Wang F, Vo H, Lee R, Liu Q, Zhang X, Saltz J (2013) Hadoop-GIS: a high performance spatial data warehousing system over mapreduce. Proc VLDB Endow 6(11):1009–1020
  3. Alexander W, Copeland G (1988) Process and dataflow control in distributed data-intensive systems. In: Proceedings of the 1988 ACM SIGMOD international conference on management of data (SIGMOD ’88), pp 90–98. https://doi.org/10.1145/50202.50212
  4. Apache (2006) Welcome to Apache Hadoop! http://hadoop.apache.org. Accessed 26 Mar 2018
  5. Brinkhoff T, Kriegel HP, Seeger B (1996) Parallel processing of spatial joins using R-trees. In: Proceedings of the 12th international conference on data engineering, New Orleans, Louisiana, pp 258–265
  6. Chang F, Dean J, Ghemawat S, Hsieh W, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 26(2). https://doi.org/10.1145/1365815.1365816
  7. Chang WY, Abu-Amara H, Sanford JF (2010) Transforming enterprise cloud services. Springer, London, pp 55–56
  8. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. https://doi.org/10.1145/1327452.1327492
  9. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6). https://doi.org/10.1145/129888.129894
  10. DeWitt DJ, Gerber RH, Graefe G, Heytens ML, Kumar KB, Muralikrishna M (1986) GAMMA – a high performance dataflow database machine. In: Proceedings of the 12th international conference on very large data bases (VLDB ’86), Kyoto, Japan, pp 228–237
  11. Du Z, Zhao X, Ye X, Zhou J, Zhang F, Liu R (2017) An effective high-performance multiway spatial join algorithm with spark. ISPRS Int J Geo-Inf 6(4):96
  12. Eldawy A, Mokbel MF (2015) SpatialHadoop: a mapreduce framework for spatial data. In: IEEE 31st international conference on data engineering (ICDE), Seoul, South Korea, pp 1352–1363
  13. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, pp 226–231
  14. Garillot F, Maas G (2018) Stream processing with Apache Spark: best practices for scaling and optimizing Apache Spark. O’Reilly Media, Sebastopol. http://shop.oreilly.com/product/0636920047568.do
  15. Gedik B, Andrade H, Wu K-L, Yu PS, Doo M (2008) SPADE: the System S declarative stream processing engine. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data (SIGMOD ’08), pp 1123–1134. https://doi.org/10.1145/1376616.1376729
  16. Ghemawat S, Gobioff H, Leung S (2003) The Google file system. In: Proceedings of the 19th ACM symposium on operating systems principles, Oct 2003, pp 29–43. https://doi.org/10.1145/945445.945450
  17. Grossman M, Sarkar V (2016) SWAT: a programmable, in-memory, distributed, high-performance computing platform. In: Proceedings of the 25th ACM international symposium on high-performance parallel and distributed computing (HPDC ’16). ACM, New York, pp 81–92. https://doi.org/10.1145/2907294.2907307
  18. Hagedorn S, Götze P, Sattler KU (2017) The STARK framework for spatio-temporal data analytics on spark. In: Proceedings of the 17th conference on database systems for business, technology, and the web (BTW 2017), Stuttgart
  19. Hassaan M, Elghandour I (2016) A real-time big data analysis framework on a CPU/GPU heterogeneous cluster: a meteorological application case study. In: Proceedings of the 3rd IEEE/ACM international conference on big data computing, applications and technologies (BDCAT ’16). ACM, New York, pp 168–177. https://doi.org/10.1145/3006299.3006304
  20. Hong S, Choi W, Jeong W-K (2017) GPU in-memory processing using spark for iterative computation. In: Proceedings of the 17th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid ’17), pp 31–41. https://doi.org/10.1109/CCGRID.2017.41
  21. Hughes JN, Annex A, Eichelberger CN, Fox A, Hulbert A, Ronquest M (2015) GeoMesa: a distributed architecture for spatio-temporal fusion. In: Proceedings of SPIE defense and security. https://doi.org/10.1117/12.2177233
  22. Jacox EH, Samet H (2007) Spatial join techniques. ACM Trans Database Syst 32(1):7
  23. Klein J, Buglak R, Blockow D, Wuttke T, Cooper B (2016) A reference architecture for big data systems in the national security domain. In: Proceedings of the 2nd international workshop on big data software engineering (BIGDSE ’16). https://doi.org/10.1145/2896825.2896834
  24. Marz N, Warren J (2015) Big data: principles and best practices of scalable realtime data systems, 1st edn. Manning Publications, Greenwich
  25. McInnes L, Healy J (2017) Accelerated hierarchical density based clustering. In: IEEE international conference on data mining workshops (ICDMW), New Orleans, Louisiana, pp 33–42
  26. Mysore D, Khupat S, Jain S (2013) Big data architecture and patterns. IBM white paper. http://www.ibm.com/developerworks/library/bdarchpatterns1. Accessed 26 Mar 2018
  27. NoSQL (2009) NoSQL definition. http://nosql-database.org. Accessed 26 Mar 2018
  28. Pavlo A, Aslett M (2016) What’s really new with NewSQL? SIGMOD Rec 45(2):45–55. https://doi.org/10.1145/3003665.3003674
  29. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
  30. Prasad S, McDermott M, Puri S, Shah D, Aghajarian D, Shekhar S, Zhou X (2015) A vision for GPU-accelerated parallel computation on geo-spatial datasets. SIGSPATIAL Spec 6(3):19–26. https://doi.org/10.1145/2766196.2766200
  31. Sakr S, Liu A, Fayoumi AG (2013) The family of mapreduce and large-scale data processing systems. ACM Comput Surv 46(1):1. https://doi.org/10.1145/2522968.2522979
  32. Sena B, Allian AP, Nakagawa EY (2017) Characterizing big data software architectures: a systematic mapping study. In: Proceedings of the 11th Brazilian symposium on software components, architectures, and reuse (SBCARS ’17). https://doi.org/10.1145/3132498.3132510
  33. Shekhar S, Gunturi V, Evans MR, Yang KS (2012) Spatial big-data challenges intersecting mobility and cloud computing. In: Proceedings of the eleventh ACM international workshop on data engineering for wireless and mobile access (MobiDE ’12), pp 1–6. https://doi.org/10.1145/2258056.2258058
  34. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). https://doi.org/10.1109/MSST.2010.5496972
  35. Sriharsha R (2015) Magellan: geospatial analytics on spark. https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/. Accessed June 2017
  36. Tang M, Yu Y, Malluhi QM, Ouzzani M, Aref WG (2016) LocationSpark: a distributed in-memory data management system for big spatial data. Proc VLDB Endow 9(13):1565–1568. https://doi.org/10.14778/3007263.3007310
  37. Whitman RT, Park MB, Ambrose SM, Hoel EG (2014) Spatial indexing and analytics on Hadoop. In: Proceedings of the 22nd ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL ’14), pp 73–82. https://doi.org/10.1145/2666310.2666387
  38. Whitman RT, Park MB, Marsh BG, Hoel EG (2017) Spatio-temporal join on Apache Spark. In: Hoel E, Newsam S, Ravada S, Tamassia R, Trajcevski G (eds) Proceedings of the 25th ACM SIGSPATIAL international conference on advances in geographic information systems (SIGSPATIAL ’17). https://doi.org/10.1145/3139958.3139963
  39. Xie D, Li F, Yao B, Li G, Zhou L, Guo M (2016) Simba: efficient in-memory spatial analytics. In: Proceedings of the 2016 international conference on management of data (SIGMOD ’16), pp 1071–1085. https://doi.org/10.1145/2882903.2915237
  40. You S, Zhang J, Gruenwald L (2015) Large-scale spatial join query processing in cloud. In: 2015 31st IEEE international conference on data engineering workshops, Seoul, 13–17 April 2015, pp 34–41
  41. Yu J, Wu J, Sarwat M (2015) GeoSpark: a cluster computing framework for processing large-scale spatial data. In: Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems, Seattle, WA
  42. Yuan Y, Salmi MF, Huai Y, Wang K, Lee R, Zhang X (2016) Spark-GPU: an accelerated in-memory data processing engine on clusters. In: Proceedings of the 2016 IEEE international conference on big data (Big Data 2016), Washington, DC, pp 273–283
  43. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on hot topics in cloud computing (HotCloud’10), Boston, MA
  44. Zhang S, Han J, Liu Z, Wang K, Xu Z (2009) SJMR: parallelizing spatial join with mapreduce on clusters. In: IEEE international conference on cluster computing (CLUSTER’09), New Orleans, Louisiana, pp 1–8

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Environmental Systems Research Institute, Redlands, USA

Section editors and affiliations

  • Timos Sellis
  • Aamir Cheema
  1. Data Science Research Institute, Swinburne University of Technology, Melbourne, Australia