General-Purpose Stream Processing

Wingerath, Wolfram; Ritter, Norbert; Gessert, Felix

doi:10.1007/978-3-030-10555-6_5

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

Unlike data stream management systems, which are mostly intended for analyzing structured information through declarative query languages, stream processing systems expose generic and imperative (i.e., non-declarative) programming interfaces for working with structured, semi-structured, and entirely unstructured data. Rather than yet another approach to querying data, stream processing can thus be seen as the latency-oriented counterpart to batch processing. In this chapter, we provide an overview of some of the most popular distributed stream processing systems currently available and highlight the similarities, differences, and trade-offs made in their respective designs.

Notes

1. In 2016, a native stream processor was introduced to Kafka: Kafka Streams [Kre16] is not only conceptually similar to Samza, but was also built by the same people, reusing portions of the Samza source code [PM16]. Kafka Streams can therefore be considered an unofficial Samza successor.
2. On top of RDDs, Spark provides DataFrames and Datasets as even more abstract APIs that impose a schema on the otherwise unstructured RDD tuples [Dat18].

References

Tyler Akidau, Alex Balikov, Kaya Bekiroglu, et al. “MillWheel: Fault-Tolerant Stream Processing at Internet Scale”. In: Very Large Data Bases. 2013, pp. 734–746.
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al. “The Stratosphere Platform for Big Data Analytics”. In: The VLDB Journal (2014). issn: 1066-8888. url: https://doi.org/10.1007/s00778-014-0357-y.
Tyler Akidau et al. “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing”. In: Proceedings of the VLDB Endowment 8 (2015), pp. 1792–1803.
Rajagopal Ananthanarayanan et al. “Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams”. In: SIGMOD ’13. 2013. url: http://dl.acm.org/citation.cfm?doid=2463676.2465272.
Alain Biem, Eric Bouillet, Hanhua Feng, et al. “IBM Infosphere Streams for Scalable, Real-time, Intelligent Transportation Services”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. Indianapolis, Indiana, USA, 2010. ISBN: 978-1-4503-0032-2. url: https://doi.org/10.1145/1807167.1807291.
Oscar Boykin et al. “Summingbird: A Framework for Integrating Batch and Online MapReduce Computations”. In: Proc. VLDB Endow. 7.13 (Aug. 2014), pp. 1441–1451. issn: 2150-8097. url: https://doi.org/10.14778/2733004.2733016.
Cole Brown. “Introducing Concord”. In: Concord Blog (2015). Accessed: 2016-09-21. url: http://concord.io/posts/introducing_concord.
Craig Chambers et al. “FlumeJava: Easy, Efficient Data-Parallel Pipelines”. In: ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2 Penn Plaza, Suite 701 New York, NY 10121-0701, 2010, pp. 363–375. url: http://dl.acm.org/citation.cfm?id=1806638.
Badrish Chandramouli et al. “Trill: A High-performance Incremental Query Processor for Diverse Analytics”. In: Proc. VLDB Endow. 8.4 (Dec. 2014), pp. 401–412. issn: 2150-8097. url: https://doi.org/10.14778/2735496.2735503.
Badrish Chandramouli et al. “Quill: Efficient, Transferable, and Rich Analytics at Scale”. In: International Conference on Very Large Databases (PVLDB Vol. 9, Issue 14). 2016. url: https://www.microsoft.com/en-us/research/publication/quill-efficient-transferable-rich-analytics-scale/.
Sanket Chintapalli et al. “Benchmarking Streaming Computation Engines at Yahoo!” In: Yahoo! Engineering Blog (2015). Accessed: 2016-10-17. url: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at.
K. Mani Chandy and Leslie Lamport. “Distributed Snapshots: Determining Global States of Distributed Systems”. In: ACM Trans. Comput. Syst. 3.1 (Feb. 1985), pp. 63–75. issn: 0734-2071. url: https://doi.org/10.1145/214451.214456.
Ufuk Celebi, Kostas Tzoumas, and Stephan Ewen. “How Apache Flink™ handles backpressure”. In: data Artisans Blog (Aug. 2015). Accessed: 2017-09-12. url: http://data-artisans.com/how-flink-handles-backpressure/.
Google Cloud Dataflow: Resource Quotas. Accessed: 2016-10-17. Google. 2016. url: https://cloud.google.com/dataflow/quotas.
Ericsson. “Trident – benchmarking performance”. In: Ericsson Research Blog (2014). Accessed: 2016-01-12. url: http://www.ericsson.com/research-blog/data-knowledge/trident-benchmarking-performance/.
Stephan Ewen. “FLIP-6 - Flink Deployment and Process Model - Standalone, Yarn, Mesos, Kubernetes, etc.” In: Flink Improvement Proposals (Aug. 2016). Accessed: 2017-11-17. url: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
Felix Gessert et al. “Quaestor: Query Web Caching for Database-as-a-Service Providers”. In: Proceedings of the 43rd International Conference on Very Large Data Bases (2017).
Mark Grover et al. Hadoop Application Architectures. Beijing: O’Reilly, 2015. ISBN: 978-1-4919-0008-6. url: http://my.safaribooksonline.com/9781491900086.
Benjamin Hindman et al. “Mesos: A Platform for Fine-grained Resource Sharing in the Data Center”. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. NSDI’11. Boston, MA: USENIX Association, 2011, pp. 295–308. url: http://dl.acm.org/citation.cfm?id=1972457.1972488.
Fabian Hueske. “Apache Flink 1.5.0 Release Announcement”. In: Apache Flink Blog (May 2018). Accessed: 2018-08-18. url: https://flink.apache.org/news/2018/05/25/release-1.5.0.html
Patrick Hunt et al. “ZooKeeper: Wait-free Coordination for Internet-scale Systems”. In: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference. USENIXATC’10. Boston, MA: USENIX Association, 2010. url: http://dl.acm.org/citation.cfm?id=1855840.1855851.
Fabian Hueske, Shaoxuan Wang, and Xiaowei Jiang. “Continuous Queries on Dynamic Tables”. In: Flink Blog (Apr. 2017). Accessed: 2017-10-27. url: https://flink.apache.org/news/2017/4/04/dynamic-tables.html.
Jay Kreps, Neha Narkhede, and Jun Rao. “Kafka: a Distributed Messaging System for Log Processing”. In: NetDB’11. 2011.
Jay Kreps. “Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)”. In: LinkedIn Engineering Blog (Apr. 2014). Accessed: 2016-10-17. url: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines.
Jay Kreps. “Questioning the Lambda Architecture”. In: O’Reilly Media (July 2014). Accessed: 2015-12-17. url: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html.
Jay Kreps. “Introducing Kafka Streams: Stream Processing Made Simple”. In: Confluent Blog (2016). Accessed: 2016-09-19. url: http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/.
Sanjeev Kulkarni et al. “Twitter Heron: Stream Processing at Scale”. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD ’15. Melbourne, Victoria, Australia: ACM, 2015, pp. 239–250. ISBN: 978-1-4503-2758-9. url: https://doi.org/10.1145/2723372.2742788.
Douglas Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. Tech. rep. META Group, 2001. url: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
Wang Lam, Lu Liu, Sts Prasad, et al. “Muppet: MapReduce-style Processing of Fast Data”. In: VLDB 2012 (2012). issn: 2150-8097. url: https://doi.org/10.14778/2367502.2367520.
Nathan Marz. “Preview of Storm: The Hadoop of Realtime Processing”. In: BackType Technology Blog (May 2012). Accessed: 2015-12-17. url: http://web.archive.org/web/20120509023348/http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-processing.
Nathan Marz. “History of Apache Storm and lessons learned”. In: Thoughts from the Red Planet (Oct. 2014). Accessed: 2015-12-17. url: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html.
Nathan Marz and James Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. 1st. Greenwich, CT, USA: Manning Publications Co., 2015. ISBN: 1617290343, 9781617290343.
Leonardo Neumeyer et al. “S4: Distributed Stream Computing Platform”. In: Proceedings of the 2010 IEEE International Conference on Data Mining Workshops. ICDMW ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 170–177. ISBN: 978-0-7695-4257-7. url: https://doi.org/10.1109/ICDMW.2010.172.
Shadi A. Noghabi et al. “Samza: Stateful Scalable Stream Processing at LinkedIn”. In: Proc. VLDB Endow. 10.12 (Aug. 2017), pp. 1634–1645. issn: 2150-8097. url: https://doi.org/10.14778/3137765.3137770.
Pat Patterson and Ted Malaska. “Ingest & Stream Processing – What Will You Choose?” In: QCon (Aug. 2016). Accessed: 2018-05-25. url: https://www.infoq.com/presentations/ingest-stream-processing.
Navina Ramesh. “Apache Samza, LinkedIn’s Framework for Stream Processing”. In: thenewstack.io (2015). Accessed: 2016-09-21. url: http://thenewstack.io/apache-samza-linkedins-framework-for-stream-processing/.
Karthik Ramasamy. “Open Sourcing Twitter Heron”. In: Twitter Blog (May 2016). Accessed: 2017-01-15. url: https://blog.twitter.com/2016/open-sourcing-twitter-heron.
Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable Cross-Language Services Implementation. Tech. rep. Accessed: 2018-08-19. Facebook Inc., Apr. 2007. url: http://thrift.apache.org/static/files/thrift-20070401.pdf.
Matthias J. Sax. “Storm Compatibility in Apache Flink: How to run existing Storm topologies on Flink”. In: Apache Flink Blog (Dec. 2015).
Konstantin Shvachko et al. “The Hadoop Distributed File System”. In: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). MSST ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–10. ISBN: 978-1-4244-7152-2. url: https://doi.org/10.1109/MSST.2010.5496972.
Guaranteeing Message Processing. Accessed: 2018-08-19. Apache Software Foundation. 2018. url: http://storm.apache.org/releases/1.2.2/Guaranteeing-message-processing.html.
Bharat Venkat et al. “Can Spark Streaming survive Chaos Monkey?” In: Netflix Tech Blog (2015). Accessed: 2016-01-11. url: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html.
Timo Walther. “From Streams to Tables and Back Again: An Update on Flink’s Table & SQL API”. In: Flink Blog (Mar. 2017). Accessed: 2017-10-27. url: https://flink.apache.org/news/2017/03/29/table-sql-api-update.html.
Fan Yang et al. Sonora: A Platform for Continuous Mobile-Cloud Computing. Tech. rep. MSR-TR-2012-34. Microsoft Research, 2012. url: http://research.microsoft.com/apps/pubs/default.aspx?id=161446.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al. “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI’12. San Jose, CA: USENIX Association, 2012, pp. 2–2. url: http://dl.acm.org/citation.cfm?id=2228298.2228301.
Matei Zaharia, Tathagata Das, Haoyuan Li, et al. “Discretized Streams: Fault-tolerant Streaming Computation at Scale”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP ’13. Farmington, Pennsylvania: ACM, 2013, pp. 423–438. ISBN: 978-1-4503-2388-8. url: https://doi.org/10.1145/2517349.2522737.
Amazon Kinesis. Amazon Kinesis. Accessed: 2018-08-19. 2018. url: https://aws.amazon.com/kinesis/.
Apache Software Foundation. “The Apache Software Foundation Announces Apache® Flink™ as a Top-Level Project”. In: Apache Software Foundation Blog (Jan. 2015). Accessed: 2016-11-25. url: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces69.
Apache Software Foundation. “Announcing Apache Flink 1.0.0”. In: Apache Flink Blog (Mar. 2016). Accessed: 2017-01-15. url: https://flink.apache.org/news/2016/03/08/release-1.0.0.html.
Apache Software Foundation. “Apache Flink: Powered By Flink”. In: Apache Flink website (2016). Accessed: 2016-10-17. url: https://flink.apache.org/poweredby.html.
Apache Software Foundation. Flink. Accessed: 2016-09-18. 2016. url: https://flink.apache.org/.
Apache Software Foundation. Flume. Accessed: 2016-10-17. 2016. url: https://flume.apache.org/.
Apache Software Foundation. “Level of Parallelism in Data Processing”. In: Spark Streaming – 2.0.0 Documentation (2016). Accessed: 2016-09-23. url: https://spark.apache.org/docs/2.0.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving.
Apache Software Foundation. “Powered By Spark”. In: Apache Spark Website (2016). Accessed: 2016-10-17. url: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
Apache Software Foundation. “The Apache Software Foundation Announces Apache® Apex™ as a Top-Level Project”. In: Apache Software Foundation Blog (Apr. 2016). Accessed: 2016-11-25. url: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces90.
Apache Software Foundation. YARN. Accessed: 2016-10-17. 2016. url: http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html.
Apache Software Foundation. Apex. Accessed: 2018-08-18. 2018. url: http://apex.apache.org/.
Apache Software Foundation. Beam. Accessed: 2018-05-10. 2018. url: https://beam.apache.org/.
Apache Software Foundation. GitHub: Apache Apex Core. Accessed: 2018-08-18. 2018. url: https://github.com/apache/apex-core.
Databricks Inc. “Resilient Distributed Dataset (RDD)”. In: Databricks Glossary (2018). Accessed: 2018-07-22. url: https://databricks.com/glossary/what-is-rdd.
IBM Corporation. Of Streams and Storms. Tech. rep. IBM Software Group, 2014.

Author information

Authors and Affiliations

Baqend GmbH, Hamburg, Hamburg, Germany
Wolfram Wingerath & Felix Gessert
University of Hamburg, Hamburg, Hamburg, Germany
Norbert Ritter

About this chapter

Cite this chapter

Wingerath, W., Ritter, N., Gessert, F. (2019). General-Purpose Stream Processing. In: Real-Time & Stream Data Management. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-10555-6_5

DOI: https://doi.org/10.1007/978-3-030-10555-6_5
Published: 03 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-10554-9
Online ISBN: 978-3-030-10555-6
eBook Packages: Computer Science, Computer Science (R0)
