Handbook of Big Data Technologies, pp. 219-260
Large-Scale Data Stream Processing Systems
Abstract
In our data-centric society, online services, decision making, and many other activities increasingly depend on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems treat data as infinite streams and are designed to satisfy these requirements. They have evolved substantially, both in their support for expressive programming models and in efficient, durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare the properties of continuous data and the computations applied to it, while hiding the details of how data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models, allowing scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large-scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address, from a systems point of view, the main challenges posed by the disruptive applications that large-scale data streaming enables.
Keywords
Stream Processing, Hadoop Distributed File System, Dataflow Graph, Complex Event Processing, Stream Processor