Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day. Originally conceived as a messaging queue, Kafka has evolved into a full event streaming platform used for building real-time data pipelines and streaming applications. It combines messaging, storage, and stream processing to allow organizations to publish, subscribe to, store, and process streams of records in real time and at scale.

Kafka’s architecture is built around the concepts of topics (categories of records), partitions (ordered, immutable sequences of records within a topic), producers (clients that publish records), and consumers (clients that read records). Records are persisted to disk and replicated across configurable numbers of broker nodes for fault tolerance. Kafka’s append-only commit log design enables extremely high throughput, often exceeding millions of messages per second on modest hardware. The Kafka Connect framework provides pre-built connectors for integrating with databases, key-value stores, search indexes, and file systems, while Kafka Streams offers a lightweight client library for building stream processing applications.

Kafka has become the backbone of event-driven architectures at organizations of all sizes. Companies like LinkedIn (where it originated), Netflix, Uber, Twitter, and Goldman Sachs rely on Kafka for mission-critical data infrastructure. The Confluent Platform, created by Kafka’s original developers, extends open-source Kafka with enterprise features, a managed cloud service, and a schema registry.

Key Capabilities

  • High-Throughput Messaging - Handles millions of messages per second with low latency through sequential disk I/O and zero-copy transfers
  • Kafka Connect - Framework with hundreds of pre-built connectors for change data capture and integration with external systems
  • Kafka Streams - Lightweight stream processing library that requires no separate processing cluster
  • Durable Storage - Append-only commit log with configurable retention, enabling replay and time-travel queries on event history

Cloud Equivalents

Apache Kafka is the open-source foundation behind AWS MSK (Managed Streaming for Apache Kafka), Azure Event Hubs (Kafka-compatible), and Confluent Cloud. Managed services reduce operational burden of broker management and ZooKeeper/KRaft coordination, but self-hosted Kafka provides full control over configuration, retention policies, and multi-datacenter topologies.

Origins and History

Apache Kafka was created at LinkedIn by Jay Kreps, Neha Narkhede, and Jun Rao in 2011. It was open-sourced and donated to the Apache Software Foundation, becoming a top-level project in 2012. Kafka is licensed under the Apache License 2.0. Kreps, Narkhede, and Rao founded Confluent in 2014 to provide commercial support. A landmark architectural change came with KIP-500, which replaced the ZooKeeper dependency with an internal Raft-based consensus protocol called KRaft, production-ready as of Kafka 3.3 (2022).

Sources

  1. https://kafka.apache.org/
  2. Kreps, J., Narkhede, N., and Rao, J. “Kafka: a Distributed Messaging System for Log Processing.” NetDB Workshop, 2011.