Optimizing Geo-Distributed Streaming Analytics

  • Living reference work entry
Encyclopedia of Big Data Technologies

Abstract

Rapid data streams are generated continuously from diverse sources including users, devices, and sensors located around the globe. Modern analytics services must analyze large quantities of such data streams derived from disparate geo-distributed sources. Further, the analytics requirements can be complex, forcing trade-offs among cost, performance, and accuracy. A typical geo-distributed analytics service uses a hub-and-spoke model, comprising multiple edges connected by a wide area network (WAN) to a central data warehouse, which raises the question of how much computation should be performed at the edges versus the center. The traditional approach to analytics processing is to send all the data to a dedicated centralized location; the alternative is to push all computing to the edges for in situ processing. Neither extreme is optimal for modern analytics requirements. Instead, the optimal solution often entails carefully orchestrating the analytics processing at both the center and the edges, and it is driven by factors such as application, data, and resource characteristics.
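
The edge-versus-center trade-off described above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the method of this entry: it assumes a grouped count aggregation, treats a "window" as a fixed number of consecutive records (a stand-in for a time window), and measures WAN cost as the number of updates an edge sends to the center. All function and variable names are hypothetical.

```python
from collections import Counter


def centralized_traffic(edge_streams):
    """WAN updates when every raw record is shipped to the center."""
    return sum(len(stream) for stream in edge_streams.values())


def edge_aggregated_traffic(edge_streams, window):
    """WAN updates when each edge pre-aggregates counts per key.

    Assumption: a window is `window` consecutive records. At the end of
    each window the edge sends one (key, count) update per distinct key.
    """
    updates = 0
    for stream in edge_streams.values():
        for start in range(0, len(stream), window):
            updates += len(Counter(stream[start:start + window]))
    return updates


def center_merge(edge_streams, window):
    """Merge the partial counts at the center; the result is exact."""
    total = Counter()
    for stream in edge_streams.values():
        for start in range(0, len(stream), window):
            total.update(Counter(stream[start:start + window]))
    return total


# Two hypothetical edges, each observing a short stream of keys.
streams = {
    "edge-a": ["x", "x", "y", "x"],
    "edge-b": ["y", "y", "y", "z"],
}

raw = centralized_traffic(streams)                      # 8 updates
small = edge_aggregated_traffic(streams, window=2)      # 6 updates
large = edge_aggregated_traffic(streams, window=4)      # 4 updates
merged = center_merge(streams, window=4)                # x: 3, y: 4, z: 1
```

In this toy setting a larger window sends fewer updates over the WAN, but the center sees each result later, which is one instance of the cost versus performance tension the abstract refers to. A window of one record degenerates to the centralized approach.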

Corresponding author

Correspondence to Abhishek Chandra.

Copyright information

© 2018 Springer International Publishing AG

Cite this entry

Chandra, A., Heintz, B., Sitaraman, R. (2018). Optimizing Geo-Distributed Streaming Analytics. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_155-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_155-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering
