Abstract
Unlike data stream management systems, which are mostly intended for analyzing structured information through declarative query languages, stream processing systems expose generic, imperative (i.e., non-declarative) programming interfaces for working with structured, semi-structured, and entirely unstructured data. Rather than yet another approach to querying data, stream processing can thus be seen as the latency-oriented counterpart to batch processing. In this chapter, we provide an overview of some of the most popular distributed stream processing systems currently available and highlight similarities, differences, and trade-offs made in their respective designs.
Notes
- 1.
- 2. On top of RDDs, Spark provides DataFrames and Datasets as even more abstract APIs that impose a schema on the otherwise unstructured RDD tuples [Dat18].
References
Tyler Akidau, Alex Balikov, Kaya Bekiroglu, et al. “MillWheel: Fault-Tolerant Stream Processing at Internet Scale”. In: Very Large Data Bases. 2013, pp. 734–746.
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al. “The Stratosphere Platform for Big Data Analytics”. In: The VLDB Journal (2014). issn: 1066-8888. url: https://doi.org/10.1007/s00778-014-0357-y.
Tyler Akidau et al. “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing”. In: Proceedings of the VLDB Endowment 8 (2015), pp. 1792–1803.
Rajagopal Ananthanarayanan et al. “Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams”. In: SIGMOD ’13. 2013. url: http://dl.acm.org/citation.cfm?doid=2463676.2465272
Alain Biem, Eric Bouillet, Hanhua Feng, et al. “IBM Infosphere Streams for Scalable, Real-time, Intelligent Transportation Services”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. Indianapolis, Indiana, USA, 2010. ISBN: 978-1-4503-0032-2. url: https://doi.org/10.1145/1807167.1807291.
Oscar Boykin et al. “Summingbird: A Framework for Integrating Batch and Online MapReduce Computations”. In: Proc. VLDB Endow. 7.13 (Aug. 2014), pp. 1441–1451. issn: 2150-8097. url: https://doi.org/10.14778/2733004.2733016.
Cole Brown. “Introducing Concord”. In: Concord Blog (2015). Accessed: 2016-09-21. URL: http://concord.io/posts/introducing_concord.
Craig Chambers et al. “FlumeJava: Easy, Efficient Data-Parallel Pipelines”. In: ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2 Penn Plaza, Suite 701 New York, NY 10121-0701, 2010, pp. 363–375. url: http://dl.acm.org/citation.cfm?id=1806638.
Badrish Chandramouli et al. “Trill: A High-performance Incremental Query Processor for Diverse Analytics”. In: Proc. VLDB Endow. 8.4 (Dec. 2014), pp. 401–412. issn: 2150-8097. url: https://doi.org/10.14778/2735496.2735503.
Badrish Chandramouli et al. “Quill: Efficient, Transferable, and Rich Analytics at Scale”. In: International Conference on Very Large Databases (PVLDB Vol. 9, Issue 14). 2016. url: https://www.microsoft.com/en-us/research/publication/quill-efficient-transferable-rich-analytics-scale/.
Sanket Chintapalli et al. “Benchmarking Streaming Computation Engines at Yahoo!” In: Yahoo! Engineering Blog (2015). Accessed: 2016-10-17. url: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at.
K. Mani Chandy and Leslie Lamport. “Distributed Snapshots: Determining Global States of Distributed Systems”. In: ACM Trans. Comput. Syst. 3.1 (Feb. 1985), pp. 63–75. issn: 0734-2071. url: https://doi.org/10.1145/214451.214456.
Ufuk Celebi, Kostas Tzoumas, and Stephan Ewen. “How Apache Flink™ handles backpressure”. In: data Artisans Blog (Aug. 2015). Accessed: 2017-09-12. url: http://data-artisans.com/how-flink-handles-backpressure/.
Google Cloud Dataflow: Resource Quotas. Accessed: 2016-10-17. Google. 2016. url: https://cloud.google.com/dataflow/quotas.
Ericsson. “Trident – benchmarking performance”. In: Ericsson Research Blog (2014). Accessed: 2016-01-12. URL: http://www.ericsson.com/research-blog/data-knowledge/trident-benchmarking-performance/.
Stephan Ewen. “FLIP-6 - Flink Deployment and Process Model - Standalone, Yarn, Mesos, Kubernetes, etc.” In: Flink Improvement Proposals (Aug. 2016). Accessed: 2017-11-17. url: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
Felix Gessert et al. “Quaestor: Query Web Caching for Database-as-a-Service Providers”. In: Proceedings of the 43rd International Conference on Very Large Data Bases (2017).
Mark Grover et al. Hadoop Application Architectures. Beijing: O’Reilly, 2015. ISBN: 978-1-4919-0008-6. url: http://my.safaribooksonline.com/9781491900086.
Benjamin Hindman et al. “Mesos: A Platform for Fine-grained Resource Sharing in the Data Center”. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. NSDI’11. Boston, MA: USENIX Association, 2011, pp. 295–308. url: http://dl.acm.org/citation.cfm?id=1972457.1972488.
Fabian Hueske. “Apache Flink 1.5.0 Release Announcement”. In: Apache Flink Blog (May 2018). Accessed: 2018-08-18. url: https://flink.apache.org/news/2018/05/25/release-1.5.0.html
Patrick Hunt et al. “ZooKeeper: Wait-free Coordination for Internet-scale Systems”. In: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference. USENIXATC’10. Boston, MA: USENIX Association, 2010. url: http://dl.acm.org/citation.cfm?id=1855840.1855851.
Fabian Hueske, Shaoxuan Wang, and Xiaowei Jiang. “Continuous Queries on Dynamic Tables”. In: Flink Blog (Apr. 2017). Accessed: 2017-10-27. url: https://flink.apache.org/news/2017/4/04/dynamic-tables.html.
Jay Kreps, Neha Narkhede, and Jun Rao. “Kafka: a Distributed Messaging System for Log Processing”. In: NetDB’11. 2011.
Jay Kreps. “Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)”. In: LinkedIn Engineering Blog (Apr. 2014). Accessed: 2016-10-17. url: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines.
Jay Kreps. “Questioning the Lambda Architecture”. In: O’Reilly Media (July 2014). Accessed: 2015-12-17. url: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html.
Jay Kreps. “Introducing Kafka Streams: Stream Processing Made Simple”. In: Confluent Blog (2016). Accessed: 2016-09-19. url: http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/.
Sanjeev Kulkarni et al. “Twitter Heron: Stream Processing at Scale”. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD ’15. Melbourne, Victoria, Australia: ACM, 2015, pp. 239–250. ISBN: 978-1-4503-2758-9. url: https://doi.org/10.1145/2723372.2742788.
Douglas Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. Tech. rep. META Group, 2001. url: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
Wang Lam, Lu Liu, Sts Prasad, et al. “Muppet: MapReduce-style Processing of Fast Data”. In: VLDB 2012 (2012). issn: 2150-8097. url: https://doi.org/10.14778/2367502.2367520.
Nathan Marz. “Preview of Storm: The Hadoop of Realtime Processing”. In: BackType Technology Blog (May 2012). Accessed: 2015-12-17. url: http://web.archive.org/web/20120509023348/http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-processing.
Nathan Marz. “History of Apache Storm and lessons learned”. In: Thoughts from the Red Planet (Oct. 2014). Accessed: 2015-12-17. url: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html.
Nathan Marz and James Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. 1st. Greenwich, CT, USA: Manning Publications Co., 2015. ISBN: 1617290343, 9781617290343.
Leonardo Neumeyer et al. “S4: Distributed Stream Computing Platform”. In: Proceedings of the 2010 IEEE International Conference on Data Mining Workshops. ICDMW ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 170–177. ISBN: 978-0-7695-4257-7. url: https://doi.org/10.1109/ICDMW.2010.172.
Shadi A. Noghabi et al. “Samza: Stateful Scalable Stream Processing at LinkedIn”. In: Proc. VLDB Endow. 10.12 (Aug. 2017), pp. 1634–1645. issn: 2150-8097. url: https://doi.org/10.14778/3137765.3137770.
Pat Patterson and Ted Malaska. “Ingest & Stream Processing – What Will You Choose?” In: QCon (Aug. 2016). Accessed: 2018-05-25. url: https://www.infoq.com/presentations/ingest-stream-processing.
Navina Ramesh. “Apache Samza, LinkedIn’s Framework for Stream Processing”. In: thenewstack.io (2015). Accessed: 2016-09-21. url: http://thenewstack.io/apache-samza-linkedins-framework-for-stream-processing/.
Karthik Ramasamy. “Open Sourcing Twitter Heron”. In: Twitter Blog (May 2016). Accessed: 2017-01-15. url: https://blog.twitter.com/2016/open-sourcing-twitter-heron.
Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable Cross-Language Services Implementation. Tech. rep. Accessed: 2018-08-19. Facebook Inc., Apr. 2007. url: http://thrift.apache.org/static/files/thrift-20070401.pdf.
Matthias J. Sax. “Storm Compatibility in Apache Flink: How to run existing Storm topologies on Flink”. In: Apache Flink Blog (Dec. 2015).
Konstantin Shvachko et al. “The Hadoop Distributed File System”. In: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). MSST ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–10. ISBN: 978-1-4244-7152-2. url: https://doi.org/10.1109/MSST.2010.5496972.
Guaranteeing Message Processing. Accessed: 2018-08-19. Apache Software Foundation. 2018. url: http://storm.apache.org/releases/1.2.2/Guaranteeing-message-processing.html.
Bharat Venkat et al. “Can Spark Streaming survive Chaos Monkey?” In: Netflix Tech Blog (2015). Accessed: 2016-01-11. url: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html.
Timo Walther. “From Streams to Tables and Back Again: An Update on Flink’s Table & SQL API”. In: Flink Blog (Mar. 2017). Accessed: 2017-10-27. url: https://flink.apache.org/news/2017/03/29/table-sql-api-update.html.
Fan Yang et al. Sonora: A Platform for Continuous Mobile-Cloud Computing. Tech. rep. MSR-TR-2012-34. Microsoft Research, 2012. url: http://research.microsoft.com/apps/pubs/default.aspx?id=161446.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al. “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI’12. San Jose, CA: USENIX Association, 2012, pp. 2–2. url: http://dl.acm.org/citation.cfm?id=2228298.2228301.
Matei Zaharia, Tathagata Das, Haoyuan Li, et al. “Discretized Streams: Fault-tolerant Streaming Computation at Scale”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP ’13. Farmington, Pennsylvania: ACM, 2013, pp. 423–438. ISBN: 978-1-4503-2388-8. url: https://doi.org/10.1145/2517349.2522737.
Amazon Kinesis. Amazon Kinesis. Accessed: 2018-08-19. 2018. url: https://aws.amazon.com/kinesis/.
Apache Software Foundation. “The Apache Software Foundation Announces Apache® Flink™ as a Top-Level Project”. In: Apache Software Foundation Blog (Jan. 2015). Accessed: 2016-11-25. url: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces69.
Apache Software Foundation. “Announcing Apache Flink 1.0.0”. In: Apache Flink Blog (Mar. 2016). Accessed: 2017-01-15. url: https://flink.apache.org/news/2016/03/08/release-1.0.0.html.
Apache Software Foundation. “Apache Flink: Powered By Flink”. In: Apache Flink website (2016). Accessed: 2016-10-17. url: https://flink.apache.org/poweredby.html.
Apache Software Foundation. Flink. Accessed: 2016-09-18. 2016. url: https://flink.apache.org/.
Apache Software Foundation. Flume. Accessed: 2016-10-17. 2016. url: https://flume.apache.org/.
Apache Software Foundation. “Level of Parallelism in Data Processing”. In: Spark Streaming – 2.0.0 Documentation (2016). Accessed: 2016-09-23. url: https://spark.apache.org/docs/2.0.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving.
Apache Software Foundation. “Powered By Spark”. In: Apache Spark Website (2016). Accessed: 2016-10-17. url: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
Apache Software Foundation. “The Apache Software Foundation Announces Apache® Apex™ as a Top-Level Project”. In: Apache Software Foundation Blog (Apr. 2016). Accessed: 2016-11-25. url: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces90.
Apache Software Foundation. YARN. Accessed: 2016-10-17. 2016. url: http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html.
Apache Software Foundation. Apex. Accessed: 2018-08-18. 2018. url: http://apex.apache.org/.
Apache Software Foundation. Beam. Accessed: 2018-05-10. 2018. url: https://beam.apache.org/.
Apache Software Foundation. GitHub: Apache Apex Core. Accessed: 2018-08-18. 2018. url: https://github.com/apache/apex-core.
Databricks Inc. “Resilient Distributed Dataset (RDD)”. In: Databricks Glossary (2018). Accessed: 2018-07-22. url: https://databricks.com/glossary/what-is-rdd.
IBM Corporation. Of Streams and Storms. Tech. rep. IBM Software Group, 2014.
Copyright information
© 2019 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Wingerath, W., Ritter, N., Gessert, F. (2019). General-Purpose Stream Processing. In: Real-Time & Stream Data Management. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-10555-6_5
DOI: https://doi.org/10.1007/978-3-030-10555-6_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-10554-9
Online ISBN: 978-3-030-10555-6
eBook Packages: Computer Science (R0)