Large-Scale Data Stream Processing Systems

Carbone, Paris; Gévay, Gábor E.; Hermann, Gábor; Katsifodimos, Asterios; Soto, Juan; Markl, Volker; Haridi, Seif

doi:10.1007/978-3-319-49340-4_7

Abstract

In our data-centric society, online services, decision making, and many other activities increasingly depend on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems treat data as infinite streams and are designed to satisfy these requirements. They have evolved substantially, both in the expressiveness of their programming models and in efficient, durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare properties of continuous data and the computations applied to it, while hiding the details of how data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models, allowing scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large-scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address, from a systems point of view, the main challenges of the disruptive applications that large-scale data streaming enables.

Notes

1. Generally in functional programming, higher-order functions may also produce functions as their outputs, but this does not arise in stream processing.
2. Note that sum in this example is a predefined aggregation function; however, a user-defined function (UDF) can typically also be provided to declare an incremental computation.
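
The distinction drawn in the second note can be illustrated with a minimal sketch. It assumes a hypothetical helper, `incremental_aggregate`, which is not part of any stream processor's API; it folds each arriving element into a running value, once with a function equivalent to a built-in sum and once with a user-defined function. The input values are made up for illustration.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
A = TypeVar("A")


def incremental_aggregate(
    stream: Iterable[T], udf: Callable[[A, T], A], initial: A
) -> Iterator[A]:
    """Yield the running aggregate after each element of the stream.

    Hypothetical helper for illustration only: `udf` combines the current
    aggregate with one new element, so past elements need not be stored.
    """
    acc = initial
    for element in stream:
        acc = udf(acc, element)
        yield acc


def running_sum(acc: int, x: int) -> int:
    # Behaves like a predefined sum aggregation.
    return acc + x


def running_max(acc: float, x: float) -> float:
    # A user-defined alternative: track the largest value seen so far.
    return max(acc, x)


readings = [3, 1, 4, 1, 5]  # made-up input values
sums = list(incremental_aggregate(readings, running_sum, 0))
maxima = list(incremental_aggregate(readings, running_max, float("-inf")))
```

Each emitted value depends only on the previous aggregate and the newly arrived element, which is what makes the computation incremental.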

Author information

Authors and Affiliations

KTH Royal Institute of Technology, Stockholm, Sweden
Paris Carbone & Seif Haridi
TU Berlin, Berlin, Germany
Gábor E. Gévay, Gábor Hermann, Asterios Katsifodimos, Juan Soto & Volker Markl

Corresponding author

Correspondence to Paris Carbone.

Editor information

Editors and Affiliations

School of Information Technologies, The University of Sydney, Sydney, New South Wales, Australia
Albert Y. Zomaya
The School of Computer Science, The University of New South Wales, Eveleigh, New South Wales, Australia
Sherif Sakr

About this chapter

Cite this chapter

Carbone, P. et al. (2017). Large-Scale Data Stream Processing Systems. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_7

DOI: https://doi.org/10.1007/978-3-319-49340-4_7
Published: 26 February 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49339-8
Online ISBN: 978-3-319-49340-4
eBook Packages: Computer Science, Computer Science (R0)
