Apache Kafka and Event Streaming


Introduction

Apache Kafka is an open-source distributed event streaming platform.

Traditional message brokers are based on standards like JMS and AMQP. These brokers focus on a pub/sub model where publishers write messages to a queue and the queue is consumed by subscribers. When a subscriber consumes a message, it acknowledges the consumption back to the broker, which then removes the message from the queue. In this sense, traditional message brokers are not like databases, since messages are deleted after delivery rather than durably retained. In addition, ordering guarantees are weak: with multiple competing consumers, or with redelivery after a failure, messages may be processed out of order.

Kafka deviates from these traditional message brokers - it is a log-based message broker. Log-based message brokers will, as the name implies, append entries to a log on disk, meaning they are durable. This is what allows Kafka (and other log-based message brokers) to replay events that might already have been consumed by another client.

In addition, Kafka scales these logs across multiple machines by splitting each topic's log into partitions, where each partition is hosted by one machine (and replicated to others). This makes Kafka a highly scalable, distributed message broker with fault tolerance and replay capabilities. A sketch of the replay capability follows.
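
To make replay concrete, here is a minimal sketch in Java using the official kafka-clients library. It manually assigns one partition and seeks back to the beginning of the log, re-reading events that may already have been consumed. The topic name (purchases) and broker address are assumptions for illustration.

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ReplayExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Hypothetical topic; assign partition 0 directly instead of joining a group.
                TopicPartition partition = new TopicPartition("purchases", 0);
                consumer.assign(List.of(partition));

                // Rewind to offset 0: the log is still on disk, so old events can be re-read.
                consumer.seekToBeginning(List.of(partition));

                // A real application would poll in a loop; one poll suffices for a sketch.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }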

Characteristics

  • Distributed
    • You can add more machines to the cluster to distribute the load among all of them
  • Durable
    • Kafka is specifically a log-based message broker. This means that logs are stored on disk. This also means events can be replayed to consumers.
  • Event Streaming
    • You can stream data in a pub/sub fashion, or have distributed queueing
  • Fault Tolerant
    • If one machine dies, the other machines will cover for it until the machine has been replaced. Kafka controls this redundancy through a topic's replication factor.
    • Kafka can recover data quickly because it persists all of its messages on disk for a configurable retention period (see the sketch after this list)
  • Decoupling
    • Services that need data from Kafka's event stream can request it whenever they want, e.g. even after a failed service host has been replaced
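
To make the retention period concrete: it is controlled per topic by the retention.ms config. Here is a minimal sketch, assuming a topic named purchases and a local broker, that sets retention to 7 days via the Java AdminClient:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class RetentionExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Keep messages on disk for 7 days (value is in milliseconds).
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "purchases");
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "604800000"),
                        AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
            }
        }
    }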

Basic Concepts

There are a few key terms and concepts that need to be explained:

Data

  • Topics
    • Events go into topics, which are akin to channels. For example, if you want to send events for purchase order IDs, you might have a topic for purchases. If you also need to send events for transaction logs, you might send them to a topic for logs.
    • You can specify the replication factor and the number of partitions when a topic is created (see the sketch after this list). A higher replication factor adds redundancy by allowing more hosts to hold the same copy of the data in the event that a particular host dies.
  • Partitions
    • When events go into topics, they actually go into a partition inside the topic. You can have \([1...n]\) partitions inside a topic. Each partition is an ordered, append-only log, similar to a queue.
    • Each partition has a single leader machine that handles publishing to that partition.
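
As a concrete illustration, here is a minimal sketch using the Java AdminClient to create a topic; the topic name, partition count, and replication factor are assumptions for illustration:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Properties;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions to spread load; replication factor 3 keeps copies on 3 brokers.
                NewTopic purchases = new NewTopic("purchases", 3, (short) 3);
                admin.createTopics(List.of(purchases)).all().get(); // block until created
            }
        }
    }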

The benefit of having only one partition per topic is to maintain the order of events across the entire topic, since Kafka only guarantees ordering within a single partition. This is crucial if you have events that need to be consumed in the correct order.

The benefit of having more than one partition per topic is to distribute the load, allowing for higher throughput. If you only need ordering per entity (say, per user), you can keep multiple partitions and publish each event with a message key: all events with the same key are hashed to the same partition, preserving their relative order, as the sketch below shows.
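
Here is a minimal producer sketch, again with an assumed topic name and broker address, that uses a key to pin related events to one partition:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class OrderedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Both records share the key "user-42", so Kafka hashes them to the
                // same partition and their relative order is preserved.
                producer.send(new ProducerRecord<>("purchases", "user-42", "cart-created"));
                producer.send(new ProducerRecord<>("purchases", "user-42", "order-placed"));
            } // close() flushes any buffered records
        }
    }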

Entities

These core entities are vital to the Kafka ecosystem. Each of them usually runs as multiple machines: brokers form a cluster, producers can run on many hosts, and consumers are organized into consumer groups.

  • Broker
    • The broker holds topics (or chunks of them - partitions) in a distributed fashion. It serves publish requests from producers and fetch requests from consumers, and coordinates consumer group membership.
    • The broker is Kafka's own server implementation - it is not interchangeable with other brokers such as RabbitMQ, so choosing between Kafka and an alternative is a system-level decision, not a pluggable swap
  • Producer
    • The producer pushes events to the topics and their partitions
  • Consumer
    • The consumer pulls events from the topics
    • The consumer also needs to commit its offsets (similar to an ACK) so that, after a restart or rebalance, it resumes where it left off instead of reprocessing old events. Committing does not delete events from the log - they remain available for replay
    • Partitions in a topic can be load-balanced within a group of consumers (a consumer group), with each partition assigned to exactly one consumer in the group (see the sketch after this list)
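
Here is a minimal consumer sketch under the same assumptions (topic purchases, local broker, and a hypothetical group name); it polls for events and commits offsets manually:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ConsumerGroupExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "purchase-processors");     // hypothetical group name
            props.put("enable.auto.commit", "false");         // commit manually below
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("purchases"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                    consumer.commitSync(); // record progress; a restart resumes from here
                }
            }
        }
    }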

Use Cases

There are numerous real-world use cases for Kafka, such as:

  • Logging Service
    • Emit logging events on every transaction. The consumer can store these events into a database for housekeeping or analytical purposes.
  • Notification Service (Mobile)
    • If you want to notify users of important events, such as an outage, you can emit a notification event into a topic. Consumers of the topic, such as a Notification Service, can then send push notifications to users' mobile phones via APNs (Apple) or FCM (Android). Or maybe you might want to email users instead.
  • Chat Service
    • If you create your topics with a partition count of 1, messages appear in the correct chronological order. You can leverage this by having consumers listen to the topics and provide chat messages to the front-end website or app. The nice part about Kafka is that you can replay messages that were previously sent to a topic, in case a consumer host goes down and loses data it had stored on disk.