Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Apache Kafka

Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_196-1


Apache Kafka (Apache Software Foundation 2017b; Kreps et al. 2011; Goodhope et al. 2012; Wang et al. 2015; Kleppmann and Kreps 2015) is a scalable, fault-tolerant, and highly available distributed streaming platform that can be used to store and process data streams.

Kafka consists of three main components:
  • the Kafka cluster,

  • the Connect framework (Connect API),

  • and the Streams programming library (Streams API).

The Kafka cluster stores data streams, which are sequences of messages/events continuously produced by applications and sequentially and incrementally consumed by other applications. The Connect API is used to ingest data into Kafka and export data streams to external systems like distributed file systems, databases, and others. For data stream processing, the Streams API allows developers to specify sophisticated stream processing pipelines that read input streams from the Kafka cluster and write results back to Kafka.
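As an illustration of the Streams API described above, the following sketch shows a word-count pipeline in the Streams DSL: it reads a stream of text lines from one Kafka topic, splits each line into words, maintains a continuously updated count per word, and writes the resulting changelog back to another topic. The topic names, broker address, and application ID are placeholders chosen for this example, not part of the original entry.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountPipeline {
    public static void main(String[] args) {
        // Basic configuration; broker address and topic names are placeholders.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Build the processing topology: read an input stream, split each line
        // into words, count occurrences per word, and write the result back.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("input-topic");
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)
            .count();
        counts.toStream().to("output-topic",
            Produced.with(Serdes.String(), Serdes.Long()));

        // Start the pipeline; it runs until the process is shut down.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Running this sketch requires the `kafka-streams` library on the classpath and a reachable Kafka cluster; note that both the input and the output of the pipeline live in Kafka topics, which is exactly the read-from-Kafka, write-back-to-Kafka pattern the entry describes.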

Kafka supports many different use cases...



  1. Apache Software Foundation (2017a) Apache Hadoop project web page. https://hadoop.apache.org/
  2. Apache Software Foundation (2017b) Apache Kafka project web page. https://kafka.apache.org/
  3. Apache Software Foundation (2017c) Apache Samza project web page. https://samza.apache.org/
  4. Apache Software Foundation (2017d) Apache ZooKeeper project web page. https://zookeeper.apache.org/
  5. Facebook Inc (2017) RocksDB project web page. http://rocksdb.org/
  6. Goodhope K, Koshy J, Kreps J, Narkhede N, Park R, Rao J, Ye VY (2012) Building Linkedin’s real-time activity data pipeline. IEEE Data Eng Bull 35(2):33–45. http://sites.computer.org/debull/A12june/pipeline.pdf
  7. Hunt P, Konar M, Junqueira FP, Reed B (2010) ZooKeeper: wait-free coordination for internet-scale systems. In: Proceedings of the 2010 USENIX conference on USENIX annual technical conference, USENIX ATC’10. USENIX Association, Berkeley, p 11. http://dl.acm.org/citation.cfm?id=1855840.1855851
  8. Kleppmann M (2016) Making sense of stream processing, 1st edn. O’Reilly Media Inc., 183 pages
  9. Kleppmann M (2017) Designing data-intensive applications. O’Reilly Media Inc., Sebastopol
  10. Kleppmann M, Kreps J (2015) Kafka, Samza and the Unix philosophy of distributed data. IEEE Data Eng Bull 38(4):4–14. http://sites.computer.org/debull/A15dec/p4.pdf
  11. Kreps J, Narkhede N, Rao J (2011) Kafka: a distributed messaging system for log processing. In: Proceedings of the NetDB, pp 1–7
  12. Noghabi SA, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, Campbell RH (2017) Samza: stateful scalable stream processing at LinkedIn. Proc VLDB Endow 10(12):1634–1645. https://doi.org/10.14778/3137765.3137770
  13. Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E (2013) Apache Hadoop YARN: yet another resource negotiator. In: 4th ACM symposium on cloud computing (SoCC). https://doi.org/10.1145/2523616.2523633
  14. Wang G, Koshy J, Subramanian S, Paramasivam K, Zadeh M, Narkhede N, Rao J, Kreps J, Stein J (2015) Building a replicated logging system with Apache Kafka. PVLDB 8(12):1654–1655. http://www.vldb.org/pvldb/vol8/p1654-wang.pdf

Authors and Affiliations

  1. Confluent Inc., Palo Alto, USA

Section editors and affiliations

  • Alessandro Margara, Politecnico di Milano
  • Tilmann Rabl, Database Systems and Information Management Group, Technische Universität Berlin, Berlin, Germany