Learn about Message Streams and Kafka

A message stream is conceptually similar to a message queue but has some critical differences, particularly in how data is consumed and managed. Let's explore this through an example scenario.


Scenario: Building a Blog Website

Imagine we are building a blog platform like Hashnode or Medium. Every time a blog is published, two follow-up actions need to be performed:

  1. Index the blog in a search engine.

  2. Increment the total post count in the database.

Now, let’s analyze various approaches to implement this workflow.


Approach 1: Using a Message Queue

In this approach:

  1. The user publishes a blog via the API.

  2. The API server saves the blog in the main database.

  3. The API server sends a message to a RabbitMQ message queue.

  4. Multiple consumers pull messages from the queue. For instance:

    • One consumer indexes the blog in Elasticsearch.

    • Another consumer increments the total post count in the main database.
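
To make this concrete, here is a minimal sketch of the producer side using the pika client, assuming a local RabbitMQ broker and a hypothetical queue named "blog_events":

```python
import json
import pika

# Connect to a local RabbitMQ broker (assumed to be running on localhost).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="blog_events")  # hypothetical queue name

def publish_blog(blog):
    # 1. Save the blog in the main database (omitted here).
    # 2. Send an event to the queue for the consumers to pick up.
    channel.basic_publish(
        exchange="",
        routing_key="blog_events",
        body=json.dumps({"event": "blog_published", "blog_id": blog["id"]}),
    )
```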

Issue:

What happens if one consumer succeeds while the other fails? For example, if the total post count is incremented in the database but the blog is not indexed in the search engine, the data becomes inconsistent.


Approach 2: Using Two Brokers and Two Sets of Consumers

To address the issue, one might think of having two separate message queues and sets of consumers:

  1. The API server writes messages to both queues.

  2. Consumers read messages from their respective queues and perform tasks.
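
To see why this is fragile, consider a sketch of the dual write (same hypothetical pika setup as above; the queue names are made up). Nothing ties the two publishes together, so a crash between them leaves the system half-updated:

```python
import json

def publish_blog_dual_write(channel, blog):
    event = json.dumps({"event": "blog_published", "blog_id": blog["id"]})
    # Write to the first queue; suppose this succeeds.
    channel.basic_publish(exchange="", routing_key="search_index_queue", body=event)
    # If the process or network fails right here, the second write never
    # happens: the blog is indexed but the post count is never updated,
    # and there is no atomic rollback across the two queues.
    channel.basic_publish(exchange="", routing_key="post_count_queue", body=event)
```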

Issue:

This isolates the two consumers from each other, but it introduces a new problem on the producer side:

  • What if the API server succeeds in writing to one queue but fails to write to the other? Again, inconsistency arises.

Approach 3: Using Message Streams

Here’s where message streaming platforms like Apache Kafka or AWS Kinesis shine. With a message stream:

  1. The API server writes data to a single stream.

  2. Multiple consumers read data from the stream, each tailored to perform specific tasks (e.g., indexing or updating the database).

Advantages:

  1. Write Once, Read Many: Writing to one stream ensures a single source of truth. Any number of consumers can read from it.

  2. Data Persistence: Unlike traditional message queues, streams retain data for a specified duration, allowing consumers to reprocess data if needed.

  3. Consumer Independence: Each consumer tracks its own position in the stream and performs its own specific task, so a slow or failed consumer does not hold back the others.

In a traditional message queue, a message is removed once a consumer pulls and acknowledges it. In a message stream, the data persists and each consumer iterates over it at its own pace; retention policies determine when data is finally removed.
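
Here is a minimal sketch of Approach 3 using the kafka-python client (the broker address, topic name, and group ids are assumptions). The API server writes each event once; the two consumers belong to different consumer groups, so each independently receives every event:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: the API server writes each event once to a single topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("blog-events", {"event": "blog_published", "blog_id": 42})
producer.flush()

# Consumer 1: indexes blogs in Elasticsearch (its own consumer group).
indexer = KafkaConsumer(
    "blog-events",
    bootstrap_servers="localhost:9092",
    group_id="search-indexer",
    auto_offset_reset="earliest",
)

# Consumer 2: increments the post count. Because it uses a different
# group id, it receives every event independently of the indexer.
counter = KafkaConsumer(
    "blog-events",
    bootstrap_servers="localhost:9092",
    group_id="post-counter",
    auto_offset_reset="earliest",
)
```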


Apache Kafka Essentials

Apache Kafka is a robust message streaming platform used to build real-time data pipelines and streaming applications. Its components are as follows:

a. Producers

Producers are applications that send (or "produce") data to Kafka.

Example: A web app logs user activities and sends these logs to Kafka.
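
A minimal producer sketch with kafka-python (the broker address is an assumption):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Send one user-activity log to the "UserLogs" topic.
producer.send("UserLogs", b'{"user": "alice", "action": "page_view"}')
producer.flush()  # block until the message is actually delivered
```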

b. Topics

  • Topics are categories or channels where data is stored.

  • Producers send data to specific topics.

Example: A "UserLogs" topic stores user activity logs.

c. Partitions

  • Topics are divided into smaller chunks called partitions.

  • Partitions allow parallelism, enabling Kafka to handle large volumes of data efficiently.

  • Each partition has a unique ID (e.g., Partition 0, Partition 1).

Example: The "UserLogs" topic has 3 partitions, distributing logs among them.
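
A sketch of creating such a topic with kafka-python's admin client (the broker address and replication factor are assumptions). Messages sent with the same key always land on the same partition, which preserves their relative order:

```python
from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# Create the "UserLogs" topic with 3 partitions.
admin.create_topics([NewTopic(name="UserLogs", num_partitions=3, replication_factor=1)])

# Keyed messages: all of alice's logs hash to the same partition.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("UserLogs", key=b"alice", value=b'{"action": "page_view"}')
producer.flush()
```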

d. Consumers

Consumers are applications that read (or "consume") data from Kafka topics.

Example: A data analytics app consumes logs from the "UserLogs" topic to generate insights.
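
A minimal consumer sketch with kafka-python (the broker address is an assumption):

```python
from kafka import KafkaConsumer

# Read user-activity logs from the beginning of the "UserLogs" topic.
consumer = KafkaConsumer(
    "UserLogs",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```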

e. Brokers

  • Brokers are Kafka servers that store and manage data.

  • A Kafka cluster consists of multiple brokers working together.

Example: A cluster with 3 brokers (Broker 1, Broker 2, Broker 3) distributes topic partitions among the brokers.
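
In client code, the cluster shows up only as the bootstrap list: a client connects to any one broker and discovers the rest from it (the hostnames below are hypothetical):

```python
from kafka import KafkaProducer

# The client needs only a subset of brokers to bootstrap; it then learns
# the full cluster layout and each partition's leader automatically.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"],
)
```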

f. Consumer Groups

  • Consumers can collaborate in a group to read data from a topic.

  • Each partition is assigned to one consumer in the group, ensuring parallel processing.

Example: If a topic has 3 partitions and there are 3 consumers in a group, each consumer processes one partition.
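
A sketch of one group member (the group id is an assumption). Running three copies of this script gives each instance one of the three partitions; Kafka rebalances assignments automatically as members join or leave:

```python
from kafka import KafkaConsumer

# Every instance started with this group_id shares the topic's partitions.
consumer = KafkaConsumer(
    "UserLogs",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
)
for message in consumer:
    print(f"partition {message.partition}: {message.value}")
```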


How Kafka Works

  1. Producers Send Data: Producers write data to a specific topic. Kafka appends this data to a partition within the topic.

  2. Kafka Stores Data: Kafka brokers store the data in partitions. For fault tolerance, data is replicated across brokers. For example, if a partition's replication factor is 3, its data is stored on 3 brokers.

  3. Consumers Read Data: Consumers fetch data from topics. Kafka tracks an offset per consumer group and partition, recording where each group last read.
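
A sketch of explicit offset management with kafka-python (disabling auto-commit and the process() helper are assumptions made for illustration):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "UserLogs",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    enable_auto_commit=False,  # commit only after processing succeeds
)
for message in consumer:
    process(message.value)  # hypothetical processing function
    consumer.commit()       # record progress; a restart resumes from here
```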

Limitation:

Within a consumer group, each partition is assigned to at most one consumer. A group therefore cannot usefully contain more consumers than the topic has partitions; any extra consumers sit idle.


Conclusion

Message streams like Kafka offer a more consistent and scalable solution compared to traditional message queues. By enabling “write once, read many” capabilities, Kafka ensures data consistency, fault tolerance, and parallel processing—key requirements for modern, real-time applications like our blog platform.