Overview

Data analytics is undergoing a revolution: the volume and velocity of data sources are increasing rapidly. Across a number of application domains from web, social, and energy analytics to scientific computing, large quantities of data are generated continuously in the form of posts, tweets, logs, sensor readings, and more. A modern analytics service must provide real-time analysis of these data streams to extract meaningful and timely information. As a result, there has been a growing interest in streaming analytics, reflected in the recent development of several distributed analytics platforms (Apache Storm 2015; Boykin et al. 2014; Zaharia et al. 2013).

In many streaming analytics domains, inputs originate from diverse sources including users, devices, and sensors located around the globe. As a result, the distributed infrastructure of a typical analytics service (e.g., Google Analytics, Akamai Media Analytics, etc.) has a hub-and-spoke model. Data sources send streams of data to nearby “edge” servers. These geographically distributed edge servers send data to a central location that can process the data further, store summaries, and present those summaries in visual form to users of the analytics service. While the central hub is typically located in a well-provisioned data center, resources may be limited at the edge locations. In particular, the available WAN bandwidth between the edges and the center is limited.

A traditional approach to analytics processing is the centralized model where no processing is performed at the edges and all the data is sent to a dedicated centralized location. This approach is generally suboptimal, because it strains the scarce WAN bandwidth available between the edges and the center, leading to delayed results. Further, it fails to make use of the available compute and storage resources at the edge. An alternative is a decentralized approach (Rabkin et al. 2014) that utilizes the edge for much of the processing in order to minimize WAN traffic. However, neither approach is optimal for modern analytics requirements. Instead, analytics processing must utilize both edge and central resources in a carefully coordinated manner in order to meet the stringent requirements of an analytics service for network traffic, user-perceived delay, and accuracy of results.

A crucial question for a geo-distributed analytics system is how best to utilize resources at both the edges and the center in order to deliver timely results. In particular, optimizing geo-distributed streaming analytics requires an analytics system to determine how much computation to perform at the edges and how much to leave for the center (i.e., where to compute), as well as when to send partial results from edges to the center. These questions must be addressed in the context of the trade-off between multiple metrics: cost (e.g., WAN traffic), timeliness (e.g., latency), and data quality (e.g., result accuracy).

Key Research Findings

Recent work (Heintz et al. 2015, 2016a, 2017) has addressed the problem of optimizing geo-distributed streaming analytics by analyzing these trade-offs in the context of windowed grouped aggregation, an important primitive in any stream analytics system. This work systematically analyzes these trade-offs both in the context of exact computation (where all data must be completely processed) and approximate computation (where some error can be tolerated).

Abstractions for grouped aggregation are provided in most data analytics frameworks, for example, as the Reduce operation in MapReduce or Group By in SQL and LINQ. A useful variant in stream computing is windowed grouped aggregation, where each group is further broken down into finite time windows before being summarized. Windowed grouped aggregation is one of the most frequently used primitives in an analytics service and underlies queries that aggregate a metric of interest over a time window. Example queries that utilize windowed grouped aggregation include computing content popularity for a web analytics application or the average network load distribution for a network monitoring application.
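The primitive can be made concrete with a minimal Python sketch (the record stream, key names, and window length below are illustrative, not drawn from the cited work): records are bucketed by (window, key) pairs, and each bucket is summarized by an aggregate such as Sum.

```python
from collections import defaultdict

# Hypothetical record stream: (timestamp, key, value) tuples, e.g.,
# page-view events for a web analytics workload.
records = [
    (0, "page_a", 1), (2, "page_b", 1), (3, "page_a", 1),
    (6, "page_a", 1), (7, "page_b", 1), (9, "page_b", 1),
]

WINDOW = 5  # window length in time units

def windowed_grouped_aggregation(records, window):
    """Group records into (window index, key) buckets and sum their values."""
    agg = defaultdict(int)
    for ts, key, value in records:
        agg[(ts // window, key)] += value
    return dict(agg)

print(windowed_grouped_aggregation(records, WINDOW))
# {(0, 'page_a'): 2, (0, 'page_b'): 1, (1, 'page_a'): 1, (1, 'page_b'): 2}
```

A content-popularity query, for instance, is exactly this computation with page identifiers as keys and view counts as values.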

Exact Computation

Recent work (Heintz et al. 2015, 2017) focuses on designing algorithms for performing windowed grouped aggregation in order to optimize the two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result for a time window). The aggregation algorithm runs on each edge, aggregating the records in the input stream and sending (partial) aggregates to the center over the WAN. The scheduling problem is to determine when to send which aggregates to the center.

This work examines baseline approaches such as pure streaming, a centralized approach where all data is immediately sent from edge to center without any edge processing, and pure batching, a decentralized approach where all data during a time window is aggregated at the edge, with only the aggregated results being sent to the center at the end of the window. It shows that such approaches do not jointly optimize traffic and staleness and, hence, are suboptimal. It presents a family of optimal offline algorithms that jointly minimize both staleness and traffic. One offline optimal algorithm is the eager optimal algorithm, which flushes the update for each distinct key immediately after the final arrival for that key within a window. Another is the lazy optimal algorithm, which schedules its updates to start at the last possible time that would still enable it to flush all its updates with optimal staleness. There is a family of optimal algorithms whose schedules consist of update times that lie between those of the eager and lazy algorithms for each key.
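The two extremes of this family can be illustrated with a small offline sketch in Python. This is a toy model, not the paper's algorithm: it assumes per-key arrival times are known in advance (hence offline) and that the edge-to-center link ships one update per fixed send cost.

```python
def eager_flush_times(arrivals):
    """Eager policy (offline): flush each key's aggregate immediately
    after its final arrival within the window."""
    # arrivals: {key: [arrival timestamps within the window]}
    return {k: max(ts) for k, ts in arrivals.items()}

def lazy_flush_times(arrivals, deadline, send_cost=1.0):
    """Lazy policy (toy link model): with a link that ships one update
    per `send_cost` time units, start sending as late as possible while
    still finishing all per-key updates by `deadline`."""
    keys = list(arrivals)
    start = deadline - send_cost * len(keys)
    return {k: start + i * send_cost for i, k in enumerate(keys)}

arrivals = {"a": [1, 4], "b": [2], "c": [3, 6]}
print(eager_flush_times(arrivals))  # {'a': 4, 'b': 2, 'c': 6}
```

Both policies send exactly one update per key per window (hence the same traffic); they differ in when those updates are scheduled, which is what the family of optimal algorithms interpolates between.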

Using these offline optimal algorithms as a foundation, practical online aggregation algorithms are developed that emulate the offline optimal algorithms. The key insight here is that windowed grouped aggregation can be modeled as a caching problem where the cache size varies over time. This insight allows the decomposition of the scheduling problem into two subproblems: determining the (dynamic) cache size and defining a cache eviction policy. While the first subproblem can be solved by using insights gained from the optimal offline algorithms, the second subproblem lends itself to using the vast prior work on cache replacement policies (Podlipnig et al. 2003). To overcome potential problems with pure emulation of either the eager or the lazy offline optimal algorithms in practice, a hybrid algorithm is proposed that computes cache size as a linear combination of eager and lazy cache sizes. Concretely, a hybrid algorithm with a laziness parameter α – denoted by hybrid(α) – estimates the cache size c(t) at time t as:

$$\displaystyle \begin{aligned}c(t) = \alpha \cdot c_l(t) + (1 - \alpha) \cdot c_e(t),\end{aligned}$$

where $c_l(t)$ and $c_e(t)$ are the lazy and eager cache size estimates, respectively. By selecting an appropriate value for α, the hybrid algorithm can achieve the desired trade-off between traffic and staleness.
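A minimal Python sketch of this caching formulation follows. The class name, the callback interface, and the constant size estimates fed to it are illustrative assumptions; in the actual algorithms, the eager and lazy estimates are derived from the offline optimal schedules and observed arrival statistics, and the eviction policy can be any suitable cache replacement policy (LRU is used here for concreteness).

```python
from collections import OrderedDict

class HybridAggregationCache:
    """Sketch of the caching-based online algorithm: partial aggregates
    are held in a cache whose target size interpolates between the eager
    and lazy estimates; excess entries are flushed to the center using
    an LRU-style eviction policy."""

    def __init__(self, alpha, send):
        self.alpha = alpha          # laziness parameter in [0, 1]
        self.send = send            # callback: ships (key, aggregate) to center
        self.cache = OrderedDict()  # key -> partial aggregate, in LRU order

    def target_size(self, eager_est, lazy_est):
        # c(t) = alpha * c_l(t) + (1 - alpha) * c_e(t)
        return self.alpha * lazy_est + (1 - self.alpha) * eager_est

    def on_arrival(self, key, value, eager_est, lazy_est):
        self.cache[key] = self.cache.get(key, 0) + value  # aggregate (Sum)
        self.cache.move_to_end(key)  # an update counts as a "use" for LRU
        while len(self.cache) > self.target_size(eager_est, lazy_est):
            old_key, agg = self.cache.popitem(last=False)  # evict LRU entry
            self.send(old_key, agg)  # eviction == flushing to the center

sent = []  # updates flushed to the center
cache = HybridAggregationCache(alpha=0.5, send=lambda k, v: sent.append((k, v)))
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    cache.on_arrival(key, value, eager_est=1, lazy_est=3)
print(sent)  # [('a', 1)] -- "a" was flushed to respect the target size of 2
```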

The practicality of these algorithms is demonstrated through an implementation in Apache Storm (2015), deployed on the PlanetLab (2015) testbed. The experiments are driven by workloads derived from anonymized traces of a popular web analytics service offered by Akamai (Nygren et al. 2010), a large content delivery network. The results of the experiments show that the proposed online hybrid aggregation algorithms simultaneously achieve traffic close to optimal while reducing staleness significantly compared to baseline algorithms such as batching. Further, these algorithms are robust to a variety of system configurations (number of edges), stream arrival rates, and query types.

Approximate Computation

In a geo-distributed setting, due to constrained WAN bandwidth, it is not always feasible to produce exact results in a timely manner. In such cases, applications must either sacrifice timeliness by allowing delayed – i.e., late – results or sacrifice accuracy by tolerating some error in the results. This is a fundamental trade-off: it is not always feasible to compute exact results with bounded staleness (Rabkin et al. 2014). Further, many real-world applications can tolerate some staleness or inaccuracy in their final results.

Recent work (Heintz et al. 2016a) has studied the staleness-error trade-off for windowed grouped aggregation in geo-distributed streaming analytics, recognizing that applications have diverse requirements: some may tolerate higher staleness in order to achieve lower error and vice versa. As in the exact computation scenario, the aggregation algorithm runs on each edge, partially aggregating the input data and sending results to the center over the WAN. In this case, the scheduling problem is to determine when, and if, aggregates need to be sent to the center. Aggregation algorithms are devised to solve two complementary problems: minimize staleness under an error constraint and minimize error under a staleness constraint.

This work first designs theoretically optimal offline algorithms to solve each of these problems and then develops practical online algorithms based on the intuition derived from the offline optimal algorithms. The optimal offline algorithms allow minimizing staleness (resp., error) under an error (resp., staleness) constraint. Intuitively, the error-bound problem is solved by transmitting only the aggregates that are essential to meet the error constraint. For the staleness-bound problem, error is minimized by prioritizing keys for transmission, based on their potential error (if not sent), while fully utilizing the available network bandwidth.
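The two selection rules can be sketched in Python under simplifying assumptions (not the paper's algorithms): each key has a single scalar "error if not sent," errors add up across omitted keys, and the bandwidth budget is a fixed number of updates.

```python
def staleness_bound_selection(pending, budget):
    """Staleness-bound problem (toy sketch): with capacity to send only
    `budget` updates before the deadline, send the keys whose omission
    would contribute the most error; return the residual error."""
    # pending: {key: error incurred if this key's update is not sent}
    by_error = sorted(pending, key=pending.get, reverse=True)
    to_send = by_error[:budget]
    residual = sum(pending[k] for k in by_error[budget:])
    return to_send, residual

def error_bound_selection(pending, max_error):
    """Error-bound problem (toy sketch): drop the cheapest keys while
    the accumulated error stays within `max_error`; send the rest."""
    dropped_error, to_send = 0.0, []
    for k in sorted(pending, key=pending.get):  # cheapest keys first
        if dropped_error + pending[k] <= max_error:
            dropped_error += pending[k]          # safe to omit this key
        else:
            to_send.append(k)                    # must be transmitted
    return to_send, dropped_error

pending = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 2.0}
print(staleness_bound_selection(pending, budget=2))   # (['a', 'c'], 3.0)
print(error_bound_selection(pending, max_error=3.0))  # (['c', 'a'], 3.0)
```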

Using these offline algorithms as references, practical online algorithms are designed to efficiently trade off staleness and error. These practical algorithms are based on the key insight of representing grouped aggregation at the edge as a two-part cache. This formulation generalizes the caching-based framework for exact windowed grouped aggregation (Heintz et al. 2015) by introducing cache partitioning and cache eviction policies to identify which partial results must be sent and which ones can be discarded.

The practicality and efficacy of these algorithms are demonstrated through both trace-driven simulations and implementation in Apache Storm (2015), deployed on a PlanetLab (Peterson et al. 2003) testbed. Using workloads derived from traces of a popular web analytics service offered by Akamai (Nygren et al. 2010), experiments show that the proposed algorithms reduce staleness and error significantly compared to a practical aggregation algorithm that effectively combines batching with streaming using random sampling. The proposed techniques apply across a diverse set of aggregates, from distributive and algebraic aggregates (Gray et al. 1997) such as Sum and Max to sketch-based approximation of holistic aggregates such as unique count (via sketch data structures such as HyperLogLog; Flajolet et al. 2007). Further, they can handle diverse network bandwidth and workload conditions and are adaptive to variable network conditions.
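What makes edge-side partial aggregation safe for these aggregate types is that their partial states are mergeable at the center. A brief Python sketch (with hypothetical edge states; the HyperLogLog merge shown operates on raw registers rather than a full sketch implementation):

```python
# Per-key partial sums computed independently at two hypothetical edges.
edge1 = {"page_a": 10, "page_b": 4}
edge2 = {"page_a": 7, "page_c": 2}

def merge_sums(*partials):
    """Center-side merge for a distributive aggregate (Sum): partial
    states from different edges combine by key-wise addition."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

def merge_hll_registers(regs1, regs2):
    """HyperLogLog states merge by taking the element-wise maximum of
    their registers, so unique counts can also be aggregated edge-side."""
    return [max(a, b) for a, b in zip(regs1, regs2)]

print(merge_sums(edge1, edge2))  # {'page_a': 17, 'page_b': 4, 'page_c': 2}
```

Holistic aggregates like exact unique count are not mergeable from plain partial results, which is why sketches such as HyperLogLog are used in their place.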

Related Work

Numerous stream computing systems (Akidau et al. 2013; Chandrasekaran et al. 2003; Kulkarni et al. 2015; Qian et al. 2013; Zaharia et al. 2013; Apache Flink 2016; Chen et al. 2016) have been proposed in recent years. The Google Dataflow model (Akidau et al. 2015) and the Apache Beam project (2016) have presented a unified abstraction for computing over both bounded and unbounded datasets. These systems provide many useful ideas for new analytics systems to build upon, but they are primarily designed for tightly coupled computing environments such as clusters and data centers and do not fully consider the challenges in a geo-distributed environment.

Wide-area computing has received increased research attention in recent years, due in part to the widening gap between data processing and communication costs. Much of this attention has been paid to batch computing (Pu et al. 2015; Vulimiri et al. 2015a,b; Heintz et al. 2016b). Relatively little work on streaming computation has focused on wide-area deployments or associated questions such as where to place computation. Pietzuch et al. (2006) optimize operator placement in geo-distributed settings to balance between system-level bandwidth usage and latency. Hwang et al. (2008) rely on replication across the wide area in order to achieve fault tolerance and reduce straggler effects.

JetStream (Rabkin et al. 2014) considers wide-area streaming computation and addresses the tension between timeliness and accuracy, focusing at a higher level on the appropriate abstractions for navigating this trade-off. Meanwhile BlinkDB (Agarwal et al. 2013) provides mechanisms to trade accuracy and response time, though it does not focus on processing streaming data. Das et al. (2014) consider trade-offs between throughput and latency in Spark Streaming, but they focus on exact computation and consider only a uniform batching interval for the entire stream.

Aggregation is a key operator in analytics, and grouped aggregation is supported by many data-parallel programming models (Boykin et al. 2014; Gray et al. 1997; Yu et al. 2009). Larson et al. (2002) explore the benefits of performing partial aggregation prior to a join operation. While they also recognize similarities to caching, they consider only a simple fixed-size cache. In sensor networks, aggregation is often performed over a hierarchical topology to improve energy efficiency and network longevity (Madden et al. 2002; Rajagopalan and Varshney 2006). Amur et al. (2013) study grouped aggregation, focusing on the design and implementation of efficient data structures for batch and streaming computation, though they do not consider staleness, a key performance metric in a geo-distributed setting.

Future Directions for Research

As more and more data is generated by sensors and IoT devices near the edges, research on optimizing geo-distributed streaming analytics can be taken in several directions. One area of research would be to optimize streaming analytics for a heterogeneous, geo-distributed computational environment. This could include edge devices and computing resources, in-network processing elements, as well as centralized data centers. These elements can differ in their computing, storage, and networking capabilities, and their interactions can lead to interesting resource provisioning and scheduling issues. Another area of research is to achieve desired cost-latency-accuracy trade-offs in streaming analytics by utilizing alternate approaches such as sampling, sketching, as well as compression. For instance, sketching techniques can allow trading off accuracy for latency, and compression techniques can help minimize WAN bandwidth usage. Optimizing different types of queries and multi-query optimization is another area of future research. While multi-query optimization in traditional database systems has focused on minimizing execution time and memory usage, the metrics of interest in a geo-distributed streaming analytics setting are different and will result in new research issues.