KAFKA for Mere Mortals : Topics and Partitions

Introduction

Well, we have already learned a lot about KAFKA essentials, and it is time to dive into the details of “Kafka topics and partitions”.

PS: If you want to learn more about the basics of KAFKA, use the links below in the given order:

  1. Introduction to KAFKA
  2. KAFKA and ETL
  3. Installing KAFKA and Zookeeper
  4. Cluster and Brokers

As we already know from the previous articles, KAFKA adds the following attributes to our software architecture.

  1. Communication complexity isolation
  2. Avoiding communication complexity duplication
  3. Horizontal scaling
  4. High performance
  5. Fault tolerance
  6. Event streaming
  7. Data and log aggregation
  8. Data transformation and processing, etc.

So, how exactly does Kafka achieve, from a technical point of view, the decoupling of the Source and the Target?

Of course, using the Kafka Core APIs' Producer -> Kafka cluster -> Consumer combination.

Kafka Distribution

The responsibility of the KAFKA Producer is to produce data. The Kafka Cluster, on the other hand, acts as an isolator and “storage” between the Producer and the Consumer.

It is a kind of Target for the Producer, but a Source for the Consumer. We described the “KAFKA Cluster and Broker” concept in more detail in our previous article.

I’m also planning to write a detailed article about the Producer and Consumer. But this article's main focus is to explain what Topics and Partitions are and how to interact with them.

Before producing data, KAFKA brokers need some kind of storage boxes to store it. These storage boxes are called topics.

A topic is a stream of data that acts as a logical isolator over partitions.

The topic is important from the user's point of view because, when reading/writing data, we refer mostly to the topic rather than to partitions. (Of course, when defining partitions in the producing/consuming process, it is mandatory to point to the topic name; but in general, it is possible to produce/consume data without directly indicating partitions.)
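Here is a minimal sketch of producing by topic name only, using the Java client (the broker address localhost:9092 and the topic name orders are assumptions for illustration):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TopicOnlyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Only the topic name and the payload are given; the client's partitioner
            // picks the partition itself (sticky/round-robin when the key is null).
            producer.send(new ProducerRecord<>("orders", "order-created"));
        }
    }
}
```

If a key is supplied, all messages with the same key land in the same partition, which is how Kafka preserves per-key ordering without the producer ever naming a partition.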

Kafka Broker

The topic concept helps us – mere mortals – to interact with KAFKA without worrying about the internal storage mechanism.

Every topic should have a unique name, because topics are identified by their names. You can create as many topics as you want / as your business requires.

In production systems, a topic is not something that lives on a single broker. Instead, using partitions, topics are spread out across brokers. In other words, thanks to partitions, one topic lives on multiple brokers. This helps KAFKA to be a fault-tolerant, scalable, and distributed system.

Topics are durable, meaning that the data in them is persisted on disk. This makes Kafka a good choice for applications that need to reliably store and process data streams.

But how about Partitions?

Under the hood, KAFKA uses partitions to store data. Every topic in production consists of multiple partitions. Kafka uses the topic concept for mainly two purposes:

  1. To group partitions under one box for storing “one business point” of data.
  2. To help users interact with KAFKA without worrying about its internal structure.

Let's explain these 2 concepts by comparing them with namespaces/classes in .NET/C#.

Namespaces act as an isolator for grouped classes. A namespace can contain a single class or, depending on the design, many classes. We don't interact with namespaces but with classes; there is no way to call a namespace, instantiate it, etc. It is just a box for classes.

So, the topic logic is approximately the same as that of namespaces. Partitions, on the other hand, are similar to classes. Under the hood, KAFKA uses partitions to physically store the final data.

Kafka uses partitions to achieve parallelism and scalability. This means that multiple producers and consumers can work on the same topic at the same time, and the data is evenly distributed across the brokers in the cluster.
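To make the parallelism concrete, here is a hedged consumer sketch (same assumed broker and topic; the group id orders-processors is made up). Run two copies of this program with the same group id, and Kafka will split the topic's partitions between them automatically:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrdersConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("group.id", "orders-processors");        // shared, made-up group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // subscribe by topic, not by partition
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each member of the group owns a disjoint subset of partitions.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```

Note that a group can have at most as many active consumers as the topic has partitions; extra members simply sit idle.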

So, why do we need the concept of partitions if we have topics?

Well, using partitions, KAFKA achieves distributed data storage and the ISR (in-sync replica) concept. Partitions help us distribute the topic and achieve fault-tolerant systems.

Every partition is identified by its ID, and every topic can have as many partitions as you want / as your business requires. In production, it is very important to define the partition count when creating a topic; otherwise, the system will fall back to the default configuration for the partition count (the broker-side num.partitions setting, which defaults to 1). The right partition count depends on business requirements: one topic can have 40 partitions, while another may require 200, etc.
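For example, here is a sketch of creating a topic with an explicit partition count via the Java AdminClient (the topic name and the counts are purely illustrative):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3 -- illustrative numbers only.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get(); // block until created
        }
    }
}
```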

You can think of a partition as an append-only log. Every partition is like an array, and its indexes are called offsets.

A partition has a dynamic offset count; there is no fixed size for it. It is dynamically extendable, and partition sizes can vary within the same topic.

Every unit of information in a partition is called a message. Consumers read messages in offset order: first in, first out (a queue rather than a stack).
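To see the “array with offsets” picture directly, a consumer can be pinned to a single partition and rewound to the beginning (a sketch with the same assumed broker and topic):

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PartitionZeroReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("orders", 0); // partition with ID 0
            consumer.assign(List.of(p0));          // manual assignment, no consumer group
            consumer.seekToBeginning(List.of(p0)); // jump back to the earliest offset
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println("offset " + r.offset() + ": " + r.value()));
        }
    }
}
```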

Partition distribution across KAFKA brokers

Kafka partitions are spread across Kafka brokers using a round-robin algorithm. This means that each broker in the cluster is assigned an equal number of partitions, as far as possible.

But the process of distributing partitions across Kafka brokers also depends on the following factors:

  1. Number of Partitions: When you create a Kafka topic, you should specify the number of partitions it will have. This number caps how many consumers in a group can work on the topic in parallel. The number of partitions should be chosen based on the expected workload and the level of parallelism required.
  2. Broker Assignment: The assignment is typically done in a balanced manner to ensure an even distribution of partitions across brokers, but it can be influenced by partition assignment strategies.
  3. Partition Assignment Strategies: Kafka provides different strategies for partition assignment, mainly controlled by the consumer group coordinator.  We'll have a detailed article about it.
  4. Replication Factor: Kafka provides fault tolerance through data replication. Each partition has a specified replication factor, which determines how many copies of the data are maintained (the sketch after this list shows how to inspect the resulting placement).
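With the AdminClient you can also inspect how the partitions of a topic actually landed on the brokers: which broker leads each partition, where the replicas live, and which replicas are currently in sync (the ISR). A sketch, again for the assumed orders topic:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.List;
import java.util.Properties;

public class DescribeOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                                         .allTopicNames().get() // Kafka 3.1+; older clients use .all()
                                         .get("orders");
            // For each partition: the leader broker, the replica set, and the ISR.
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```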

In short, here is why we need partitions in KAFKA:

  1. It is a core unit of parallelism and distribution
  2. It helps KAFKA to horizontally scale and distribute data
  3. Enables high throughput and fault tolerance
  4. Acts as an internal storage mechanism

It makes sense to note that once a topic is created with a certain number of partitions, changing that number later is not straightforward: Kafka only lets you increase the partition count, never decrease it, and increasing it changes the key-to-partition mapping, which breaks per-key ordering for existing data. Because of that, the cleaner approach is often to create a new topic with the required number of partitions and migrate the data if needed.
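If you do decide to grow an existing topic, the AdminClient exposes this as an increase-only operation (a sketch; names and numbers are made up):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class GrowOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Raise the total partition count to 12. Shrinking is not supported,
            // and existing keys may now map to different partitions.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```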

Conclusion

It is important to understand the key benefits of using Kafka topics and partitions:

  • Parallelism
  • Scalability
  • Fault tolerance
  • Durability

In our next article, we will talk about the real practice of creating/manipulating topics and partitions.

