Encyclopedia of Big Data Technologies

2019 Edition | Editors: Sherif Sakr, Albert Y. Zomaya

Introduction to Stream Processing Algorithms

  • Nicoló Rivetti
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_192

Definitions

Data streaming focuses on estimating functions over streams, an important task in data-intensive applications. It aims at approximating functions or statistical measures over massive (possibly distributed) streams using space that is poly-logarithmic in the size of the stream(s) and/or in their domain size (i.e., the number of distinct items).
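
One common way to make "approximating" precise is the (ε, δ) guarantee sketched below. This formalization is an assumption added for illustration, not taken from the entry itself. The symbols are likewise assumed: σ denotes the stream, m its length, n its domain size, φ the target function, and φ̂ the estimate returned by the algorithm.

```latex
\Pr\bigl[\, \lvert \hat{\phi}(\sigma) - \phi(\sigma) \rvert > \varepsilon \, \phi(\sigma) \,\bigr] \le \delta,
\qquad
\text{space} = O\bigl(\mathrm{polylog}(m, n)\bigr)
```

In words: with probability at least 1 − δ, the estimate is within a relative error ε of the true value, while the memory used grows only poly-logarithmically in m and n. In practice the space bound typically also depends on 1/ε and log(1/δ).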

Overview

Stream analysis matters in many domains, including machine learning, data mining, databases, information retrieval, and network monitoring. All of these fields need to process huge amounts of data quickly and accurately. The same holds for data produced by distributed applications such as social networks or sensor networks. In these settings, real-time analysis of large streams with full-space algorithms is often infeasible. Two main approaches exist for monitoring massive data streams in real time with a small amount of resources: sampling and summaries.
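
To make the two approaches concrete, here is a minimal Python sketch with one well-known representative of each: reservoir sampling for sampling and a Count-Min sketch for summaries. The choice of these two techniques is an assumption made for illustration; the entry may present others. All names (`reservoir_sample`, `CountMinSketch`, `width`, `depth`) are hypothetical, and the salted BLAKE2b hashing is a stand-in for the pairwise-independent hash functions that the Count-Min analysis assumes.

```python
import hashlib
import random


def reservoir_sample(stream, k, rng=None):
    """Sampling: keep a uniform random sample of k items from a stream
    whose length is not known in advance, using O(k) memory."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir.
            sample.append(item)
        else:
            # Item n is kept with probability k / n, replacing a random slot.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


class CountMinSketch:
    """Summary: a depth x width table of counters for frequency estimation.
    Estimates never fall below the true count; they may overestimate."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Assumption: one salted BLAKE2b digest per row stands in for a
        # family of pairwise-independent hash functions.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum over rows is the counter least inflated by collisions.
        return min(
            self.rows[row][self._index(row, item)] for row in range(self.depth)
        )


if __name__ == "__main__":
    stream = ["a"] * 500 + ["b"] * 300 + list(range(200))
    print(reservoir_sample(stream, 5, random.Random(42)))

    sketch = CountMinSketch(width=256, depth=4)
    for x in stream:
        sketch.add(x)
    print(sketch.estimate("a"))  # never below the true count of 500
```

The sample keeps actual items and can answer many kinds of queries approximately, whereas the sketch answers only the query it was designed for (here, item frequency) but does so in memory that is fixed by `width` and `depth`, independent of the stream length.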

Computing information...

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Faculty of Industrial Engineering and Management, Technion – Israel Institute of Technology, Haifa, Israel