General-Purpose Stream Processing

Real-Time & Stream Data Management

Part of the book series: SpringerBriefs in Computer Science (BRIEFSCOMPUTER)

Abstract

Unlike data stream management systems, which are mostly intended for analyzing structured information through declarative query languages, stream processing systems expose generic, imperative (i.e., non-declarative) programming interfaces for working with structured, semi-structured, and entirely unstructured data. Rather than yet another approach to querying data, stream processing can thus be seen as the latency-oriented counterpart to batch processing. In this chapter, we provide an overview of some of the most popular distributed stream processing systems currently available and highlight similarities, differences, and trade-offs made in their respective designs.

Notes

  1. In 2016, a native stream processor was introduced to Kafka: Kafka Streams [Kre16] is not only conceptually similar to Samza, but was also built by the same people, reusing portions of the Samza source code [PM16]. Kafka Streams can therefore be considered an unofficial Samza successor.

  2. On top of RDDs, Spark provides DataFrames and Datasets as even more abstract APIs that impose a schema on the otherwise unstructured RDD tuples [Dat18].

References

  1. Tyler Akidau, Alex Balikov, Kaya Bekiroglu, et al. “MillWheel: Fault-Tolerant Stream Processing at Internet Scale”. In: Very Large Data Bases. 2013, pp. 734–746.

  2. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al. “The Stratosphere Platform for Big Data Analytics”. In: The VLDB Journal (2014). issn: 1066-8888. url: https://doi.org/10.1007/s00778-014-0357-y.

  3. Tyler Akidau et al. “The Dataflow Model: A Practical Approach to Balancing Correctness, Latency and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing”. In: Proceedings of the VLDB Endowment 8 (2015), pp. 1792–1803.

  4. Rajagopal Ananthanarayanan et al. “Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams”. In: SIGMOD ’13. 2013. url: http://dl.acm.org/citation.cfm?doid=2463676.2465272.

  5. Alain Biem, Eric Bouillet, Hanhua Feng, et al. “IBM Infosphere Streams for Scalable, Real-time, Intelligent Transportation Services”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. Indianapolis, Indiana, USA, 2010. ISBN: 978-1-4503-0032-2. url: https://doi.org/10.1145/1807167.1807291.

  6. Oscar Boykin et al. “Summingbird: A Framework for Integrating Batch and Online MapReduce Computations”. In: Proc. VLDB Endow. 7.13 (Aug. 2014), pp. 1441–1451. issn: 2150-8097. url: https://doi.org/10.14778/2733004.2733016.

  7. Cole Brown. “Introducing Concord”. In: Concord Blog (2015). Accessed: 2016-09-21. url: http://concord.io/posts/introducing_concord.

  8. Craig Chambers et al. “FlumeJava: Easy, Efficient Data-Parallel Pipelines”. In: ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 2 Penn Plaza, Suite 701 New York, NY 10121-0701, 2010, pp. 363–375. url: http://dl.acm.org/citation.cfm?id=1806638.

  9. Badrish Chandramouli et al. “Trill: A High-performance Incremental Query Processor for Diverse Analytics”. In: Proc. VLDB Endow. 8.4 (Dec. 2014), pp. 401–412. issn: 2150-8097. url: https://doi.org/10.14778/2735496.2735503.

  10. Badrish Chandramouli et al. “Quill: Efficient, Transferable, and Rich Analytics at Scale”. In: International Conference on Very Large Databases (PVLDB Vol. 9, Issue 14). 2016. url: https://www.microsoft.com/en-us/research/publication/quill-efficient-transferable-rich-analytics-scale/.

  11. Sanket Chintapalli et al. “Benchmarking Streaming Computation Engines at Yahoo!” In: Yahoo! Engineering Blog (2015). Accessed: 2016-10-17. url: http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at.

  12. K. Mani Chandy and Leslie Lamport. “Distributed Snapshots: Determining Global States of Distributed Systems”. In: ACM Trans. Comput. Syst. 3.1 (Feb. 1985), pp. 63–75. issn: 0734-2071. url: https://doi.org/10.1145/214451.214456.

  13. Ufuk Celebi, Kostas Tzoumas, and Stephan Ewen. “How Apache Flink™ handles backpressure”. In: data Artisans Blog (Aug. 2015). Accessed: 2017-09-12. url: http://data-artisans.com/how-flink-handles-backpressure/.

  14. Google Cloud Dataflow: Resource Quotas. Accessed: 2016-10-17. Google. 2016. url: https://cloud.google.com/dataflow/quotas.

  15. Ericsson. “Trident – benchmarking performance”. In: Ericsson Research Blog (2014). Accessed: 2016-01-12. url: http://www.ericsson.com/research-blog/data-knowledge/trident-benchmarking-performance/.

  16. Stephan Ewen. “FLIP-6 - Flink Deployment and Process Model - Standalone, Yarn, Mesos, Kubernetes, etc.” In: Flink Improvement Proposals (Aug. 2016). Accessed: 2017-11-17. url: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

  17. Felix Gessert et al. “Quaestor: Query Web Caching for Database-as-a-Service Providers”. In: Proceedings of the 43rd International Conference on Very Large Data Bases (2017).

  18. Mark Grover et al. Hadoop Application Architectures. Beijing: O’Reilly, 2015. ISBN: 978-1-4919-0008-6. url: http://my.safaribooksonline.com/9781491900086.

  19. Benjamin Hindman et al. “Mesos: A Platform for Fine-grained Resource Sharing in the Data Center”. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. NSDI’11. Boston, MA: USENIX Association, 2011, pp. 295–308. url: http://dl.acm.org/citation.cfm?id=1972457.1972488.

  20. Fabian Hueske. “Apache Flink 1.5.0 Release Announcement”. In: Apache Flink Blog (May 2018). Accessed: 2018-08-18. url: https://flink.apache.org/news/2018/05/25/release-1.5.0.html

  21. Patrick Hunt et al. “ZooKeeper: Wait-free Coordination for Internet-scale Systems”. In: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference. USENIXATC’10. Boston, MA: USENIX Association, 2010. url: http://dl.acm.org/citation.cfm?id=1855840.1855851.

  22. Fabian Hueske, Shaoxuan Wang, and Xiaowei Jiang. “Continuous Queries on Dynamic Tables”. In: Flink Blog (Apr. 2017). Accessed: 2017-10-27. url: https://flink.apache.org/news/2017/4/04/dynamic-tables.html.

  23. Jay Kreps, Neha Narkhede, and Jun Rao. “Kafka: a Distributed Messaging System for Log Processing”. In: NetDB’11. 2011.

  24. Jay Kreps. “Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)”. In: LinkedIn Engineering Blog (Apr. 2014). Accessed: 2016-10-17. url: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines.

  25. Jay Kreps. “Questioning the Lambda Architecture”. In: O’Reilly Media (July 2014). Accessed: 2015-12-17. url: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html.

  26. Jay Kreps. “Introducing Kafka Streams: Stream Processing Made Simple”. In: Confluent Blog (2016). Accessed: 2016-09-19. url: http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/.

  27. Sanjeev Kulkarni et al. “Twitter Heron: Stream Processing at Scale”. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD ’15. Melbourne, Victoria, Australia: ACM, 2015, pp. 239–250. ISBN: 978-1-4503-2758-9. url: https://doi.org/10.1145/2723372.2742788.

  28. Douglas Laney. 3D Data Management: Controlling Data Volume, Velocity and Variety. Tech. rep. META Group, 2001. url: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.

  29. Wang Lam, Lu Liu, Sts Prasad, et al. “Muppet: MapReduce-style Processing of Fast Data”. In: VLDB 2012 (2012). issn: 2150-8097. url: https://doi.org/10.14778/2367502.2367520.

  30. Nathan Marz. “Preview of Storm: The Hadoop of Realtime Processing”. In: BackType Technology Blog (May 2012). Accessed: 2015-12-17. url: http://web.archive.org/web/20120509023348/http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-processing.

  31. Nathan Marz. “History of Apache Storm and lessons learned”. In: Thoughts from the Red Planet (Oct. 2014). Accessed: 2015-12-17. url: http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html.

  32. Nathan Marz and James Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. 1st. Greenwich, CT, USA: Manning Publications Co., 2015. ISBN: 1617290343, 9781617290343.

  33. Leonardo Neumeyer et al. “S4: Distributed Stream Computing Platform”. In: Proceedings of the 2010 IEEE International Conference on Data Mining Workshops. ICDMW ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 170–177. ISBN: 978-0-7695-4257-7. url: https://doi.org/10.1109/ICDMW.2010.172.

  34. Shadi A. Noghabi et al. “Samza: Stateful Scalable Stream Processing at LinkedIn”. In: Proc. VLDB Endow. 10.12 (Aug. 2017), pp. 1634–1645. issn: 2150-8097. url: https://doi.org/10.14778/3137765.3137770.

  35. Pat Patterson and Ted Malaska. “Ingest & Stream Processing – What Will You Choose?” In: QCon (Aug. 2016). Accessed: 2018-05-25. url: https://www.infoq.com/presentations/ingest-stream-processing.

  36. Navina Ramesh. “Apache Samza, LinkedIn’s Framework for Stream Processing”. In: thenewstack.io (2015). Accessed: 2016-09-21. url: http://thenewstack.io/apache-samza-linkedins-framework-for-stream-processing/.

  37. Karthik Ramasamy. “Open Sourcing Twitter Heron”. In: Twitter Blog (May 2016). Accessed: 2017-01-15. url: https://blog.twitter.com/2016/open-sourcing-twitter-heron.

  38. Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. Thrift: Scalable Cross-Language Services Implementation. Tech. rep. Accessed: 2018-08-19. Facebook Inc., Apr. 2007. url: http://thrift.apache.org/static/files/thrift-20070401.pdf.

  39. Matthias J. Sax. “Storm Compatibility in Apache Flink: How to run existing Storm topologies on Flink”. In: Apache Flink Blog (Dec. 2015).

  40. Konstantin Shvachko et al. “The Hadoop Distributed File System”. In: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). MSST ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–10. ISBN: 978-1-4244-7152-2. url: https://doi.org/10.1109/MSST.2010.5496972.

  41. Guaranteeing Message Processing. Accessed: 2018-08-19. Apache Software Foundation. 2018. url: http://storm.apache.org/releases/1.2.2/Guaranteeing-message-processing.html.

  42. Bharat Venkat et al. “Can Spark Streaming survive Chaos Monkey?” In: Netflix Tech Blog (2015). Accessed: 2016-01-11. url: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html.

  43. Timo Walther. “From Streams to Tables and Back Again: An Update on Flink’s Table & SQL API”. In: Flink Blog (Mar. 2017). Accessed: 2017-10-27. url: https://flink.apache.org/news/2017/03/29/table-sql-api-update.html.

  44. Fan Yang et al. Sonora: A Platform for Continuous Mobile-Cloud Computing. Tech. rep. MSR-TR-2012-34. Microsoft Research, 2012. url: http://research.microsoft.com/apps/pubs/default.aspx?id=161446.

  45. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al. “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. NSDI’12. San Jose, CA: USENIX Association, 2012, pp. 2–2. url: http://dl.acm.org/citation.cfm?id=2228298.2228301.

  46. Matei Zaharia, Tathagata Das, Haoyuan Li, et al. “Discretized Streams: Fault-tolerant Streaming Computation at Scale”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP ’13. Farmington, Pennsylvania: ACM, 2013, pp. 423–438. ISBN: 978-1-4503-2388-8. url: https://doi.org/10.1145/2517349.2522737.

  47. Amazon Kinesis. Amazon Kinesis. Accessed: 2018-08-19. 2018. url: https://aws.amazon.com/kinesis/.

  48. Apache Software Foundation. “The Apache Software Foundation Announces Apache® Flink™ as a Top-Level Project”. In: Apache Software Foundation Blog (Jan. 2015). Accessed: 2016-11-25. url: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces69.

  49. Apache Software Foundation. “Announcing Apache Flink 1.0.0”. In: Apache Flink Blog (Mar. 2016). Accessed: 2017-01-15. url: https://flink.apache.org/news/2016/03/08/release-1.0.0.html.

  50. Apache Software Foundation. “Apache Flink: Powered By Flink”. In: Apache Flink website (2016). Accessed: 2016-10-17. url: https://flink.apache.org/poweredby.html.

  51. Apache Software Foundation. Flink. Accessed: 2016-09-18. 2016. url: https://flink.apache.org/.

  52. Apache Software Foundation. Flume. Accessed: 2016-10-17. 2016. url: https://flume.apache.org/.

  53. Apache Software Foundation. “Level of Parallelism in Data Processing”. In: Spark Streaming – 2.0.0 Documentation (2016). Accessed: 2016-09-23. url: https://spark.apache.org/docs/2.0.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving.

  54. Apache Software Foundation. “Powered By Spark”. In: Apache Spark Website (2016). Accessed: 2016-10-17. url: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.

  55. Apache Software Foundation. “The Apache Software Foundation Announces Apache® Apex™ as a Top-Level Project”. In: Apache Software Foundation Blog (Apr. 2016). Accessed: 2016-11-25. url: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces90.

  56. Apache Software Foundation. YARN. Accessed: 2016-10-17. 2016. url: http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html.

  57. Apache Software Foundation. Apex. Accessed: 2018-08-18. 2018. url: http://apex.apache.org/.

  58. Apache Software Foundation. Beam. Accessed: 2018-05-10. 2018. url: https://beam.apache.org/.

  59. Apache Software Foundation. GitHub: Apache Apex Core. Accessed: 2018-08-18. 2018. url: https://github.com/apache/apex-core.

  60. Databricks Inc. “Resilient Distributed Dataset (RDD)”. In: Databricks Glossary (2018). Accessed: 2018-07-22. url: https://databricks.com/glossary/what-is-rdd.

  61. IBM Corporation. Of Streams and Storms. Tech. rep. IBM Software Group, 2014.

Copyright information

© 2019 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Wingerath, W., Ritter, N., Gessert, F. (2019). General-Purpose Stream Processing. In: Real-Time & Stream Data Management. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-10555-6_5

  • DOI: https://doi.org/10.1007/978-3-030-10555-6_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-10554-9

  • Online ISBN: 978-3-030-10555-6

  • eBook Packages: Computer Science, Computer Science (R0)
