Encyclopedia of Big Data Technologies

Living Edition
| Editors: Sherif Sakr, Albert Zomaya

Approximate Computing for Stream Analytics

  • Do Le Quoc
  • Ruichuan Chen
  • Pramod Bhatotia
  • Christof Fetzer
  • Volker Hilt
  • Thorsten Strufe
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_153-1

Abstract

Approximate computing has become a promising mechanism to trade off accuracy for efficiency. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing – based on the chosen sample size – can make a systematic trade-off between the output accuracy and computation efficiency. Unfortunately, the state-of-the-art systems for approximate computing primarily target batch analytics, where the input data remains unchanged during the course of computation. Thus, they are not well-suited for stream analytics. This motivated the design of StreamApprox– a stream analytics system for approximate computing. To realize this idea, an online stratified reservoir sampling algorithm is designed to produce approximate output with rigorous error bounds. Importantly, the proposed algorithm is generic and can be applied to two prominent types of stream processing systems: (1) batched stream processing such as Apache Spark Streaming, and (2) pipelined stream processing such as Apache Flink.

This is a preview of subscription content, log in to check access.

References

  1. Agarwal S, Mozafari B, Panda A, Milner H, Madden S, Stoica I (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In: Proceedings of the ACM European conference on computer systems (EuroSys)Google Scholar
  2. Al-Kateb M, Lee BS (2010) Stratified reservoir sampling over heterogeneous data streams. In: Proceedings of the 22nd international conference on scientific and statistical database management (SSDBM)Google Scholar
  3. Angel S, Ballani H, Karagiannis T, O’Shea G, Thereska E (2014) End-to-end performance isolation through virtual datacenters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)Google Scholar
  4. Bhatotia P (2015) Incremental parallel and distributed systems. PhD thesis, Max Planck Institute for Software Systems (MPI-SWS)Google Scholar
  5. Bhatotia P, Wieder A, Akkus IE, Rodrigues R, Acar UA (2011a) Large-scale incremental data processing with change propagation. In: Proceedings of the conference on hot topics in cloud computing (HotCloud)Google Scholar
  6. Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquini R (2011b) Incoop: MapReduce for incremental computations. In: Proceedings of the ACM symposium on cloud computing (SoCC)Google Scholar
  7. Bhatotia P, Dischinger M, Rodrigues R, Acar UA (2012a) Slider: incremental sliding-window computations for large-scale data analysis. Technical report MPI-SWS-2012-004, MPI-SWS. http://www.mpi-sws.org/tr/2012-004.pdf
  8. Bhatotia P, Rodrigues R, Verma A (2012b) Shredder: GPU-accelerated incremental storage and computation. In: Proceedings of USENIX conference on file and storage technologies (FAST)Google Scholar
  9. Bhatotia P, Acar UA, Junqueira FP, Rodrigues R (2014) Slider: incremental sliding window analytics. In: Proceedings of the 15th international middleware conference (Middleware)Google Scholar
  10. Bhatotia P, Fonseca P, Acar UA, Brandenburg B, Rodrigues R (2015) iThreads: a threading library for parallel incremental computation. In: Proceedings of the 20th international conference on architectural support for programming languages and operating systems (ASPLOS)Google Scholar
  11. Blum A, Dwork C, McSherry F, Nissim K (2005) Practical privacy: the sulq framework. In: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems (PODS)Google Scholar
  12. Blum A, Ligett K, Roth A (2008) A learning theory approach to non-interactive database privacy. In: Proceedings of the fortieth annual ACM symposium on theory of computing (STOC)Google Scholar
  13. Charles R, Alexey T, Gregory G, Randy HK, Michael K (2012) Towards understanding heterogeneous clouds at scale: Google trace analysis. Techical reportGoogle Scholar
  14. Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Foundations and Trends in Databases. Now, BostonGoogle Scholar
  15. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of the USENIX conference on operating systems design and implementation (OSDI)Google Scholar
  16. Doucet A, Godsill S, Andrieu C (2000) On sequential monte carlo sampling methods for bayesian filtering. Stat Comput 10:197–208Google Scholar
  17. Dziuda DM (2010) Data mining for genomics and proteomics: analysis of gene and protein expression data. Wiley, HobokenGoogle Scholar
  18. Foundation AS (2017a) Apache flink. https://flink.apache.org
  19. Foundation AS (2017b) Apache spark streaming. https://spark.apache.org/streaming
  20. Foundation AS (2017c) Kafka – a high-throughput distributed messaging system. https://kafka.apache.org
  21. Garofalakis MN, Gibbon PB (2001) Approximate query processing: taming the terabytes. In: Proceedings of the international conference on very large data bases (VLDB)Google Scholar
  22. Hellerstein JM, Haas PJ, Wang HJ (1997) Online aggregation. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)Google Scholar
  23. Krishnan DR, Quoc DL, Bhatotia P, Fetzer C, Rodrigues R (2016) IncApprox: a data analytics system for incremental approximate computing. In: Proceedings of the 25th international conference on World Wide Web (WWW)Google Scholar
  24. Masud MM, Woolam C, Gao J, Khan L, Han J, Hamlen KW, Oza NC (2012) Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33:213–244Google Scholar
  25. Murray DG, McSherry F, Isaacs R, Isard M, Barham P, Abadi M (2013) Naiad: a timely dataflow system. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles (SOSP)Google Scholar
  26. Natarajan S (1995) Imprecise and approximate computation. Kluwer Academic Publishers, BostonGoogle Scholar
  27. Quoc DL, Martin A, Fetzer C (2013) Scalable and real-time deep packet inspection. In: Proceedings of the 2013 IEEE/ACM 6th international conference on utility and cloud computing (UCC)Google Scholar
  28. Quoc DL, Yazdanov L, Fetzer C (2014) Dolen: user-side multi-cloud application monitoring. In: International conference on future internet of things and cloud (FICLOUD)Google Scholar
  29. Quoc DL, D’Alessandro V, Park B, Romano L, Fetzer C (2015a) Scalable network traffic classification using distributed support vector machines. In: Proceedings of the 2015 IEEE 8th international conference on cloud computing (CLOUD)Google Scholar
  30. Quoc DL, Fetzer C, Felber P, Étienne Rivière, Schiavoni V, Sutra P (2015b) Unicrawl: a practical geographically distributed web crawler. In: Proceedings of the 2015 IEEE 8th international conference on cloud computing (CLOUD)Google Scholar
  31. Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017a) Privacy preserving stream analytics: the marriage of randomized response and approximate computing. https://arxiv.org/abs/1701.05403
  32. Quoc DL, Beck M, Bhatotia P, Chen R, Fetzer C, Strufe T (2017b) PrivApprox: privacy-preserving stream analytics. In: Proceedings of the 2017 USENIX conference on USENIX annual technical conference (USENIX ATC)Google Scholar
  33. Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017c) Approximate stream analytics in apache flink and apache spark streaming. CoRR, abs/1709.02946Google Scholar
  34. Quoc DL, Chen R, Bhatotia P, Fetzer C, Hilt V, Strufe T (2017d) StreamApprox: approximate computing for stream analytics. In: Proceedings of the international middleware conference (middleware)Google Scholar
  35. Srikanth K, Anil S, Aleksandar V, Matthaios O, Robert G, Surajit C, Ding B (2016) Quickr: lazily approximating complex ad-hoc queries in big data clusters. In: Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD)Google Scholar
  36. Thompson SK (2012) Sampling. Wiley series in probability and statistics. The Australasian Institute of Mining and Metallurgy, CarltonGoogle Scholar
  37. Wieder A, Bhatotia P, Post A, Rodrigues R (2010a) Brief announcement: modelling mapreduce for optimal execution in the cloud. In: Proceedings of the 29th ACM SIGACT-SIGOPS symposium on principles of distributed computing (PODC)Google Scholar
  38. Wieder A, Bhatotia P, Post A, Rodrigues R (2010b) Conductor: orchestrating the clouds. In: Proceedings of the 4th international workshop on large scale distributed systems and middleware (LADIS)Google Scholar
  39. Wieder A, Bhatotia P, Post A, Rodrigues R (2012) Orchestrating the deployment of computations in the cloud with conductor. In: Proceedings of the 9th USENIX symposium on networked systems design and implementation (NSDI)Google Scholar
  40. Wikipedia (2017) 68-95-99.7 RuleGoogle Scholar
  41. Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI)Google Scholar
  42. Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I (2013) Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the twenty-fourth ACM symposium on operating systems principles (SOSP)Google Scholar

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Do Le Quoc
    • 1
  • Ruichuan Chen
    • 2
  • Pramod Bhatotia
    • 3
  • Christof Fetzer
    • 1
  • Volker Hilt
    • 2
  • Thorsten Strufe
    • 1
  1. 1.TU DresdenDresdenGermany
  2. 2.Nokia Bell LabsStuttgartGermany
  3. 3.University of Edinburgh and Alan Turing InstituteEdinburghUK

Section editors and affiliations

  • Asterios Katsifodimos
    • 1
  • Pramod Bhatotia
    • 2
  1. 1.Delft University of TechnologyDelftNetherlands
  2. 2.School of InformaticsUniversity of EdinburghEdinburghUnited Kingdom