General-Purpose Stream Processing

  • Wolfram Wingerath
  • Norbert Ritter
  • Felix Gessert
Chapter
Part of the SpringerBriefs in Computer Science book series (BRIEFSCOMPUTER)

Abstract

Unlike data stream management systems, which are mostly intended for analyzing structured information through declarative query languages, stream processing systems expose generic, imperative (i.e., non-declarative) programming interfaces for working with structured, semi-structured, and entirely unstructured data. Rather than yet another approach to querying data, stream processing can thus be seen as the latency-oriented counterpart to batch processing. In this chapter, we provide an overview of some of the most popular distributed stream processing systems currently available and highlight the similarities, differences, and trade-offs in their respective designs.

Copyright information

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Wolfram Wingerath (1)
  • Norbert Ritter (2)
  • Felix Gessert (1)

  1. Baqend GmbH, Hamburg, Germany
  2. University of Hamburg, Hamburg, Germany
