Overview

Data analytics is undergoing a revolution: the volume and velocity of data sources are increasing rapidly. Across a number of application domains from web, social, and energy analytics to scientific computing, large quantities of data are generated continuously in the form of posts, tweets, logs, sensor readings, and more. A modern analytics service must provide real-time analysis of these data streams to extract meaningful and timely information. As a result, there has been a growing interest in streaming analytics, reflected in the recent development of several distributed analytics platforms (Apache Storm 2015; Boykin et al. 2014; Zaharia et al. 2013).

In many streaming analytics domains, inputs originate from diverse sources including users, devices, and sensors located around the globe. As a result, the distributed infrastructure of a typical analytics service (e.g., Google Analytics, Akamai Media Analytics, etc.) has a hub-and-spoke model. Data sources send streams of data to nearby “edge” servers. These geographically distributed edge servers send data to a central location that can process the data further, store summaries, and present those summaries in visual form to users of the analytics service. While the central hub is typically located in a well-provisioned data center, resources may be limited at the edge locations. In particular, the available WAN bandwidth between the edges and the center is limited.

A traditional approach to analytics processing is the centralized model where no processing is performed at the edges and all the data is sent to a dedicated centralized location. This approach is generally suboptimal, because it strains the scarce WAN bandwidth available between the edges and the center, leading to delayed results. Further, it fails to make use of the available compute and storage resources at the edge. An alternative is a decentralized approach (Rabkin et al. 2014) that utilizes the edge for much of the processing in order to minimize WAN traffic. However, neither approach is optimal for modern analytics requirements. Instead, analytics processing must utilize both edge and central resources in a carefully coordinated manner in order to meet the stringent requirements of an analytics service for network traffic, user-perceived delay, and accuracy of results.

A crucial question for a geo-distributed analytics system is how best to utilize resources at both the edges and the center in order to deliver timely results. In particular, optimizing geo-distributed streaming analytics requires an analytics system to determine how much computation to perform at the edges and how much to leave for the center (i.e., where to compute), as well as when to send partial results from edges to the center. These questions must be addressed in the context of the trade-off between multiple metrics: cost (e.g., WAN traffic), timeliness (e.g., latency), and data quality (e.g., result accuracy).

Key Research Findings

Recent work (Heintz et al. 2015, 2016a, 2017) has addressed the problem of optimizing geo-distributed streaming analytics by analyzing these trade-offs in the context of windowed grouped aggregation, an important primitive in any stream analytics system. This work systematically analyzes these trade-offs both in the context of exact computation (where all data must be completely processed) and approximate computation (where some error can be tolerated).

Abstractions for grouped aggregation are provided in most data analytics frameworks, for example, as the Reduce operation in MapReduce or Group By in SQL and LINQ. A useful variant in stream computing is windowed grouped aggregation, where each group is further broken down into finite time windows before being summarized. Windowed grouped aggregation is one of the most frequently used primitives in an analytics service and underlies queries that aggregate a metric of interest over a time window. Example queries that utilize windowed grouped aggregation include computing content popularity for a web analytics application or the average network load distribution for a network monitoring application.
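The primitive can be made concrete with a minimal Python sketch (the record stream, key names, and window length below are illustrative, not drawn from the cited work): records are bucketed by (window, key) pairs, and each bucket is summarized by an aggregate such as Sum.

```python
from collections import defaultdict

# Hypothetical record stream: (timestamp, key, value) tuples, e.g.,
# page-view events for a web analytics workload.
records = [
    (0, "page_a", 1), (2, "page_b", 1), (3, "page_a", 1),
    (6, "page_a", 1), (7, "page_b", 1), (9, "page_b", 1),
]

WINDOW = 5  # window length in time units

def windowed_grouped_aggregation(records, window):
    """Group records into (window index, key) buckets and sum their values."""
    agg = defaultdict(int)
    for ts, key, value in records:
        agg[(ts // window, key)] += value
    return dict(agg)

print(windowed_grouped_aggregation(records, WINDOW))
# {(0, 'page_a'): 2, (0, 'page_b'): 1, (1, 'page_a'): 1, (1, 'page_b'): 2}
```

A content-popularity query, for instance, is exactly this computation with page identifiers as keys and view counts as values.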

Exact Computation

Recent work (Heintz et al. 2015, 2017) focuses on designing algorithms for performing windowed grouped aggregation in order to optimize the two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result for a time window). The aggregation algorithm runs on each edge, aggregating the records in the input stream and sending (partial) aggregates to the center over the WAN. The scheduling problem is to determine when to send which aggregates to the center.

This work examines baseline approaches such as pure streaming, a centralized approach where all data is immediately sent from edge to center without any edge processing, and pure batching, a decentralized approach where all data during a time window is aggregated at the edge, with only the aggregated results being sent to the center at the end of the window. It shows that such approaches do not jointly optimize traffic and staleness and, hence, are suboptimal. It presents a family of optimal offline algorithms that jointly minimize both staleness and traffic. One offline optimal algorithm is the eager optimal algorithm, which flushes the update for each distinct key immediately after the final arrival for that key within a window. Another is the lazy optimal algorithm, which schedules its updates to start at the last possible time that would still enable it to flush all its updates with optimal staleness. There is a family of optimal algorithms whose schedules consist of update times that lie between those of the eager and lazy algorithms for each key.
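The two extremes of this family can be illustrated with a small offline sketch in Python. This is a toy model, not the paper's algorithm: it assumes per-key arrival times are known in advance (hence offline) and that the edge-to-center link ships one update per fixed send cost.

```python
def eager_flush_times(arrivals):
    """Eager policy (offline): flush each key's aggregate immediately
    after its final arrival within the window."""
    # arrivals: {key: [arrival timestamps within the window]}
    return {k: max(ts) for k, ts in arrivals.items()}

def lazy_flush_times(arrivals, deadline, send_cost=1.0):
    """Lazy policy (toy link model): with a link that ships one update
    per `send_cost` time units, start sending as late as possible while
    still finishing all per-key updates by `deadline`."""
    keys = list(arrivals)
    start = deadline - send_cost * len(keys)
    return {k: start + i * send_cost for i, k in enumerate(keys)}

arrivals = {"a": [1, 4], "b": [2], "c": [3, 6]}
print(eager_flush_times(arrivals))  # {'a': 4, 'b': 2, 'c': 6}
```

Both policies send exactly one update per key per window (hence the same traffic); they differ in when those updates are scheduled, which is what the family of optimal algorithms interpolates between.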

Using these offline optimal algorithms as a foundation, practical online aggregation algorithms are developed that emulate the offline optimal algorithms. The key insight here is that windowed grouped aggregation can be modeled as a caching problem where the cache size varies over time. This insight allows the decomposition of the scheduling problem into two subproblems: determining the (dynamic) cache size and defining a cache eviction policy. While the first subproblem can be solved by using insights gained from the optimal offline algorithms, the second subproblem lends itself to using the vast prior work on cache replacement policies (Podlipnig et al. 2003). To overcome potential problems with pure emulation of either the eager or the lazy offline optimal algorithms in practice, a hybrid algorithm is proposed that computes cache size as a linear combination of eager and lazy cache sizes. Concretely, a hybrid algorithm with a laziness parameter α – denoted by hybrid(α) – estimates the cache size c(t) at time t as:

$$\displaystyle \begin{aligned}c(t) = \alpha \cdot c_l(t) + (1 - \alpha) \cdot c_e(t),\end{aligned}$$

where $c_l(t)$ and $c_e(t)$ are the lazy and eager cache size estimates, respectively. By selecting an appropriate value for α, the hybrid algorithm can achieve the desired trade-off between traffic and staleness.
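A minimal Python sketch of this caching formulation follows. The class name, the callback interface, and the constant size estimates fed to it are illustrative assumptions; in the actual algorithms, the eager and lazy estimates are derived from the offline optimal schedules and observed arrival statistics, and the eviction policy can be any suitable cache replacement policy (LRU is used here for concreteness).

```python
from collections import OrderedDict

class HybridAggregationCache:
    """Sketch of the caching-based online algorithm: partial aggregates
    are held in a cache whose target size interpolates between the eager
    and lazy estimates; excess entries are flushed to the center using
    an LRU-style eviction policy."""

    def __init__(self, alpha, send):
        self.alpha = alpha          # laziness parameter in [0, 1]
        self.send = send            # callback: ships (key, aggregate) to center
        self.cache = OrderedDict()  # key -> partial aggregate, in LRU order

    def target_size(self, eager_est, lazy_est):
        # c(t) = alpha * c_l(t) + (1 - alpha) * c_e(t)
        return self.alpha * lazy_est + (1 - self.alpha) * eager_est

    def on_arrival(self, key, value, eager_est, lazy_est):
        self.cache[key] = self.cache.get(key, 0) + value  # aggregate (Sum)
        self.cache.move_to_end(key)  # an update counts as a "use" for LRU
        while len(self.cache) > self.target_size(eager_est, lazy_est):
            old_key, agg = self.cache.popitem(last=False)  # evict LRU entry
            self.send(old_key, agg)  # eviction == flushing to the center

sent = []  # updates flushed to the center
cache = HybridAggregationCache(alpha=0.5, send=lambda k, v: sent.append((k, v)))
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    cache.on_arrival(key, value, eager_est=1, lazy_est=3)
print(sent)  # [('a', 1)] -- "a" was flushed to respect the target size of 2
```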

The practicality of these algorithms is demonstrated through an implementation in Apache Storm (2015), deployed on the PlanetLab (2015) testbed. The experiments are driven by workloads derived from anonymized traces of a popular web analytics service offered by Akamai (Nygren et al. 2010), a large content delivery network. The results of the experiments show that the proposed online hybrid aggregation algorithms simultaneously achieve traffic close to optimal while reducing staleness significantly compared to baseline algorithms such as batching. Further, these algorithms are robust to a variety of system configurations (number of edges), stream arrival rates, and query types.

Approximate Computation

In a geo-distributed setting, due to constrained WAN bandwidth, it is not always feasible to produce exact results in a timely manner. In such cases, applications must either sacrifice timeliness by allowing delayed – i.e., late – results or sacrifice accuracy by tolerating some error in the results. This is a fundamental trade-off: it is not always feasible to compute exact results with bounded staleness (Rabkin et al. 2014). Further, many real-world applications can tolerate some staleness or inaccuracy in their final results.

Recent work (Heintz et al. 2016a) has studied the staleness-error trade-off for windowed grouped aggregation in geo-distributed streaming analytics, recognizing that applications have diverse requirements: some may tolerate higher staleness in order to achieve lower error and vice versa. As in the exact computation scenario, the aggregation algorithm runs on each edge, partially aggregating the input data and sending results to the center over the WAN. In this case, the scheduling problem is to determine when, and if, aggregates need to be sent to the center. Aggregation algorithms are devised to solve two complementary problems: minimize staleness under an error constraint and minimize error under a staleness constraint.

This work first designs theoretically optimal offline algorithms to solve each of these problems and then develops practical online algorithms based on the intuition derived from the offline optimal algorithms. The optimal offline algorithms allow minimizing staleness (resp., error) under an error (resp., staleness) constraint. Intuitively, the error-bound problem is solved by transmitting only the aggregates that are essential to meet the error constraint. For the staleness-bound problem, error is minimized by prioritizing keys for transmission, based on their potential error (if not sent), while fully utilizing the available network bandwidth.
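The two selection rules can be sketched in Python under simplifying assumptions (not the paper's algorithms): each key has a single scalar "error if not sent," errors add up across omitted keys, and the bandwidth budget is a fixed number of updates.

```python
def staleness_bound_selection(pending, budget):
    """Staleness-bound problem (toy sketch): with capacity to send only
    `budget` updates before the deadline, send the keys whose omission
    would contribute the most error; return the residual error."""
    # pending: {key: error incurred if this key's update is not sent}
    by_error = sorted(pending, key=pending.get, reverse=True)
    to_send = by_error[:budget]
    residual = sum(pending[k] for k in by_error[budget:])
    return to_send, residual

def error_bound_selection(pending, max_error):
    """Error-bound problem (toy sketch): drop the cheapest keys while
    the accumulated error stays within `max_error`; send the rest."""
    dropped_error, to_send = 0.0, []
    for k in sorted(pending, key=pending.get):  # cheapest keys first
        if dropped_error + pending[k] <= max_error:
            dropped_error += pending[k]          # safe to omit this key
        else:
            to_send.append(k)                    # must be transmitted
    return to_send, dropped_error

pending = {"a": 5.0, "b": 1.0, "c": 3.0, "d": 2.0}
print(staleness_bound_selection(pending, budget=2))   # (['a', 'c'], 3.0)
print(error_bound_selection(pending, max_error=3.0))  # (['c', 'a'], 3.0)
```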

Using these offline algorithms as references, practical online algorithms are designed to efficiently trade off staleness and error. These practical algorithms are based on the key insight of representing grouped aggregation at the edge as a two-part cache. This formulation generalizes the caching-based framework for exact windowed grouped aggregation (Heintz et al. 2015) by introducing cache partitioning and cache eviction policies to identify which partial results must be sent and which ones can be discarded.

The practicality and efficacy of these algorithms are demonstrated through both trace-driven simulations and implementation in Apache Storm (2015), deployed on a PlanetLab (Peterson et al. 2003) testbed. Using workloads derived from traces of a popular web analytics service offered by Akamai (Nygren et al. 2010), experiments show that the proposed algorithms reduce staleness and error significantly compared to a practical aggregation algorithm that effectively combines batching with streaming using random sampling. The proposed techniques apply across a diverse set of aggregates, from distributive and algebraic aggregates (Gray et al. 1997) such as Sum and Max to sketch-based approximation of holistic aggregates such as unique count (via sketch data structures such as HyperLogLog; Flajolet et al. 2007). Further, they can handle diverse network bandwidth and workload conditions and are adaptive to variable network conditions.
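What makes edge-side partial aggregation safe for these aggregate types is that their partial states are mergeable at the center. A brief Python sketch (with hypothetical edge states; the HyperLogLog merge shown operates on raw registers rather than a full sketch implementation):

```python
# Per-key partial sums computed independently at two hypothetical edges.
edge1 = {"page_a": 10, "page_b": 4}
edge2 = {"page_a": 7, "page_c": 2}

def merge_sums(*partials):
    """Center-side merge for a distributive aggregate (Sum): partial
    states from different edges combine by key-wise addition."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

def merge_hll_registers(regs1, regs2):
    """HyperLogLog states merge by taking the element-wise maximum of
    their registers, so unique counts can also be aggregated edge-side."""
    return [max(a, b) for a, b in zip(regs1, regs2)]

print(merge_sums(edge1, edge2))  # {'page_a': 17, 'page_b': 4, 'page_c': 2}
```

Holistic aggregates like exact unique count are not mergeable from plain partial results, which is why sketches such as HyperLogLog are used in their place.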

Related Work

Numerous stream computing systems (Akidau et al. 2013; Chandrasekaran et al. 2003; Kulkarni et al. 2015; Qian et al. 2013; Zaharia et al. 2013; Apache Flink 2016; Chen et al. 2016) have been proposed in recent years. The Google Dataflow model (Akidau et al. 2015) and the Apache Beam project (2016) have presented a unified abstraction for computing over both bounded and unbounded datasets. These systems provide many useful ideas for new analytics systems to build upon, but they are primarily designed for tightly coupled computing environments such as clusters and data centers and do not fully consider the challenges in a geo-distributed environment.

Wide-area computing has received increased research attention in recent years, due in part to the widening gap between data processing and communication costs. Much of this attention has been paid to batch computing (Pu et al. 2015; Vulimiri et al. 2015a,b; Heintz et al. 2016b). Relatively little work on streaming computation has focused on wide-area deployments or associated questions such as where to place computation. Pietzuch et al. (2006) optimize operator placement in geo-distributed settings to balance between system-level bandwidth usage and latency. Hwang et al. (2008) rely on replication across the wide area in order to achieve fault tolerance and reduce straggler effects.

JetStream (Rabkin et al. 2014) considers wide-area streaming computation and addresses the tension between timeliness and accuracy, focusing at a higher level on the appropriate abstractions for navigating this trade-off. Meanwhile BlinkDB (Agarwal et al. 2013) provides mechanisms to trade accuracy and response time, though it does not focus on processing streaming data. Das et al. (2014) consider trade-offs between throughput and latency in Spark Streaming, but they focus on exact computation and consider only a uniform batching interval for the entire stream.

Aggregation is a key operator in analytics, and grouped aggregation is supported by many data-parallel programming models (Boykin et al. 2014; Gray et al. 1997; Yu et al. 2009). Larson et al. (2002) explore the benefits of performing partial aggregation prior to a join operation. While they also recognize similarities to caching, they consider only a simple fixed-size cache. In sensor networks, aggregation is often performed over a hierarchical topology to improve energy efficiency and network longevity (Madden et al. 2002; Rajagopalan and Varshney 2006). Amur et al. (2013) study grouped aggregation, focusing on the design and implementation of efficient data structures for batch and streaming computation, though they do not consider staleness, a key performance metric in a geo-distributed setting.

Future Directions for Research

As more and more data is generated by sensors and IoT devices near the edges, research on optimizing geo-distributed streaming analytics can be taken in several directions. One area of research would be to optimize streaming analytics for a heterogeneous, geo-distributed computational environment. This could include edge devices and computing resources, in-network processing elements, as well as centralized data centers. These elements can differ in their computing, storage, and networking capabilities, and their interactions can lead to interesting resource provisioning and scheduling issues. Another area of research is to achieve desired cost-latency-accuracy trade-offs in streaming analytics by utilizing alternate approaches such as sampling, sketching, as well as compression. For instance, sketching techniques can allow trading off accuracy for latency, and compression techniques can help minimize WAN bandwidth usage. Optimizing different types of queries and multi-query optimization is another area of future research. While multi-query optimization in traditional database systems has focused on minimizing execution time and memory usage, the metrics of interest in a geo-distributed streaming analytics setting are different and will result in new research issues.