Apache Kafka Basics And Architecture

Kafka is a distributed streaming platform. Distributed streaming simply means having many applications (web or mobile) running and continuously writing logs or emitting data, and we need to store and process all of that data in an efficient manner.

Kafka uses a publish-subscribe messaging model that lets you send messages between processes, apps, and servers. It stores streams of records in a fault-tolerant, durable way and processes streams of records as they occur. Kafka has built-in support for replication, partitioning, fault tolerance, and high throughput.

Below are the reasons why we should use Kafka. 

  • Scalability
Kafka supports high-performance sequential writes and separates topics into partitions to facilitate highly scalable reads and writes. Kafka enables multiple producers and consumers to read and write at the same time.
     
  • Low Latency
    Kafka can provide high throughput with low latency and high availability.
     
  • High Throughput
Kafka is capable of handling massive volumes of incoming messages at high velocity; even a single broker can comfortably handle on the order of 10k messages per second or more.
     
  • High Performance
    Kafka can deliver messages at high speeds and high volumes.
     
  • Durability
Kafka messages are highly durable because Kafka stores messages on disk, not in memory, and retention is configurable (see the settings sketched after this list).
     
  • Highly Reliable
Kafka can replicate data and handle many subscribers. In Apache Kafka, messages remain durable even after they have been consumed. This enables the Kafka producer and Kafka consumer to be available at different times and increases resilience and fault tolerance.
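
As a concrete sketch of this durability, the retention settings in server.properties control how long messages stay on disk. The values below are Kafka's shipped defaults and are illustrative; tune them per deployment.

# keep messages for 7 days before they are eligible for deletion
log.retention.hours=168
# roll to a new segment file once the current one reaches 1 GB
log.segment.bytes=1073741824
# how often the broker checks whether old segments can be deleted
log.retention.check.interval.ms=300000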

Kafka differs from traditional messaging queues in several ways. Kafka retains messages even after they are consumed, while a messaging queue deletes each message once it is consumed. Kafka consumers pull messages from the broker, while a messaging queue pushes messages to consumers. Kafka can be scaled horizontally, while a messaging queue is typically scaled vertically.

Apache Kafka Architecture

Kafka's architecture contains the components below.

Topic

A topic defines a channel for the transmission of data. Producers publish messages to topics, and consumers read messages from them. A unique name identifies each topic within a cluster.
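
For example, once a broker is running (the setup steps appear later in this article), a topic can be created explicitly with the kafka-topics tool. The topic name, partition count, and replication factor below are illustrative:

C:\kafka\kafka_2.13-3.3.1> .\bin\windows\kafka-topics.bat --create --topic my-topic --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092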

Producer

A producer serves as the data source for one or more Kafka topics and is responsible for writing, optimizing, and publishing messages to those topics. A producer connects to the cluster through the broker addresses supplied as bootstrap servers, as seen in the commands later in this article.
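
As a minimal sketch in Java (the language of Kafka's reference client), assuming a broker on localhost:9092, a topic named quickstart-events, and the org.apache.kafka:kafka-clients dependency on the classpath, a producer looks like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // brokers the client first contacts to discover the rest of the cluster
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // the key is optional; records with the same key land on the same partition
            producer.send(new ProducerRecord<>("quickstart-events", "key-1", "hello kafka"));
            producer.flush();
        }
    }
}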

Consumer

A consumer consumes data by reading messages from the topics it has subscribed to. Each consumer belongs to a consumer group, and all consumers in the same group share a common task: the topic's partitions are divided among them, so each message is processed by only one consumer in the group.
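
A matching minimal consumer sketch under the same assumptions (broker on localhost:9092, topic quickstart-events, kafka-clients dependency); the group id is illustrative:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // consumers with the same group.id split the topic's partitions among themselves
        props.put("group.id", "quickstart-group");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // read from the start of the partition when no committed offset exists
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("quickstart-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}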

ZooKeeper

ZooKeeper manages and coordinates the Kafka brokers in a cluster. It also notifies producers and consumers in a cluster of the arrival of new brokers or the failure of existing brokers.

Broker

A broker acts as a middleman between producers and consumers, hosting topics and enabling messages to be sent and received between them. Producers and consumers do not communicate directly; instead, they communicate through brokers. If an individual producer or consumer goes down, the communication pipeline continues to function.

Cluster

A cluster comprises one or more Kafka brokers. Each broker in the cluster hosts its own set of partitions. Since brokers are stateless with respect to cluster coordination, ZooKeeper is used to maintain the cluster state.
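
As a sketch of how a multi-broker cluster can be run on a single machine, each broker gets its own copy of server.properties with a unique id, listener port, and log directory; the file names and values below are illustrative:

# config\server-1.properties
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=D:/kafka-logs-1

# config\server-2.properties
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=D:/kafka-logs-2

Each copy is then started with its own invocation, e.g. .\bin\windows\kafka-server-start.bat config\server-1.properties.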

When Apache Kafka is not the right alternative

Kafka is not a good choice if you need messages to be processed in a strict global order. Kafka only guarantees ordering within a single partition, so strict global ordering requires one consumer and one partition, which defeats the purpose of Kafka's multiple consumers and multiple partitions. Kafka is also not the right choice if you only need to process a few messages per day.
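
That said, ordering per key (rather than globally) is often enough, and Kafka provides it without sacrificing parallelism: records with the same key hash to the same partition and arrive in production order. A small sketch, reusing the hypothetical SimpleProducer from earlier with an illustrative "orders" topic:

// both records carry the key "customer-42", so they hash to the same
// partition; their relative order is preserved for the consumer
producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
producer.send(new ProducerRecord<>("orders", "customer-42", "order shipped"));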

Setting up Apache Kafka

1. Make sure you have a JRE installed on your machine, then download the binary version of Kafka from the Apache Kafka downloads page (https://kafka.apache.org/downloads).

2. Copy the downloaded archive to C:\kafka and execute the commands below.

tar -xzf kafka_2.13-3.3.1.tgz
cd kafka_2.13-3.3.1

3. Now update the log.dirs value in server.properties (in C:\kafka\kafka_2.13-3.3.1\config) to point to a local folder.

# A comma separated list of directories under which to store log files
log.dirs=D:/kafka-logs

4. Also update zookeeper.properties in the same directory.

dataDir=D:/kafka/zookeeper-data

5. Now open a command prompt, navigate to the Kafka directory, and execute the command below to start ZooKeeper.

C:\kafka\kafka_2.13-3.3.1> .\bin\windows\zookeeper-server-start.bat config\zookeeper.properties

6. Open another command prompt and invoke kafka-server-start.bat to start the Kafka broker.

C:\kafka\kafka_2.13-3.3.1> .\bin\windows\kafka-server-start.bat config\server.properties

7. Open another command prompt to start the Kafka console producer. The topic used here is named quickstart-events, and the consumer must use the same topic to consume the events this producer sends. (With default broker settings, the topic is auto-created on first use.)

C:\kafka\kafka_2.13-3.3.1\bin\windows> .\kafka-console-producer.bat --topic quickstart-events --bootstrap-server localhost:9092

8. Now, in another command prompt, we need to invoke the Kafka consumer to consume the events produced by the producer.

C:\kafka\kafka_2.13-3.3.1\bin\windows> .\kafka-console-consumer.bat --topic quickstart-events --from-beginning --bootstrap-server localhost:9092

9. Now switch back to the command prompt of the Kafka producer and type a few messages.
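
The console producer shows a > prompt for each message; the message text is arbitrary, for example:

>Hello Kafka
>This is my first event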

10. Now switch back to the command prompt of the Kafka consumer to validate the received messages. Because the consumer was started with --from-beginning, you will also see messages that the producer sent earlier. This demonstrates that Kafka persists messages on disk rather than discarding them after delivery.
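
For the example messages above, the console consumer prints the message values as they arrive:

Hello Kafka
This is my first event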

We can also use Confluent Cloud instead of setting Kafka up locally. We will use this setup in the next article, where we implement Kafka in an ASP.NET Core application.