Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Apache Samza

Living reference work entry


DOI: https://doi.org/10.1007/978-3-319-63962-8_197-2


Apache Samza is an open source framework for distributed processing of high-volume event streams. Its primary design goal is to support high throughput for a wide range of processing patterns, while providing operational robustness at the massive scale required by Internet companies. Samza achieves this goal through a small number of carefully designed abstractions: partitioned logs for messaging, fault-tolerant local state, and cluster-based task scheduling.


Stream processing is playing an increasingly important role in meeting the data management needs of many organizations. Event streams can represent many kinds of data: for example, the activity of users on a website, the movement of goods or vehicles, or the writes of records to a database.

Stream processing jobs are long-running processes that continuously consume one or more event streams, invoking some application logic on every event, producing derived output streams, and potentially writing output to databases for...
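The processing pattern described above can be illustrated with a minimal sketch in plain Python. It does not use Samza's actual API. The names `partition_for`, `CountingTask`, `run_job`, and `NUM_PARTITIONS`, the event fields, and the choice of CRC32 as the partitioning hash are all assumptions made for illustration. Events are routed to a partition by key, each task applies its logic to every event it receives while keeping its state locally, and the results form a derived output stream. A real job would run continuously over an unbounded stream; a short finite list stands in for it here.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 2  # hypothetical partition count, chosen for illustration


def partition_for(key: str) -> int:
    """Map a key to a partition using a stable hash, so that the same key
    always lands in the same partition."""
    return crc32(key.encode("utf-8")) % NUM_PARTITIONS


class CountingTask:
    """Hypothetical task that owns one partition and keeps its state locally."""

    def __init__(self, partition: int) -> None:
        self.partition = partition
        self.counts = defaultdict(int)  # local state: page views per user

    def process(self, event: dict) -> dict:
        """Application logic invoked once per event; returns a derived event."""
        self.counts[event["user"]] += 1
        return {"user": event["user"], "views": self.counts[event["user"]]}


def run_job(input_stream) -> list:
    """Consume the input stream and collect the derived output stream."""
    tasks = [CountingTask(p) for p in range(NUM_PARTITIONS)]
    output = []
    for event in input_stream:
        task = tasks[partition_for(event["user"])]
        output.append(task.process(event))
    return output


events = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/jobs"},
    {"user": "alice", "page": "/feed"},
]
derived = run_job(events)
```

Because all events for a given user are routed to the same task, each per-user count is maintained in exactly one place, which is what lets the state stay local to the task in this sketch.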




Authors and Affiliations

  1. University of Cambridge, Cambridge, UK

Section editors and affiliations

  • Alessandro Margara (1)
  • Tilmann Rabl (2)

  1. Politecnico di Milano
  2. Database Systems and Information Management Group, Technische Universität Berlin, Berlin, Germany