What is Apache Kafka?
Apache Kafka is a distributed streaming platform that uses the publish/subscribe messaging model: producers publish a stream of data, and consumers subscribe to and consume that stream. This largely decouples systems from one another.
Consider this example to understand how Kafka decouples system dependencies. Suppose you have m source systems (publishers) and n target systems (subscribers). Integrating them directly means writing m * n integrations; for example, 4 sources and 5 targets already require 20. Each of these integrations can bring its own complexity or a problem unique to that integration. In this mess of integrations, your system becomes tightly coupled and complicated, and every new source system only piles onto the existing problem.
This is where Apache Kafka comes in. A source system no longer publishes messages directly to a target system but rather to Kafka, and the target systems in turn consume from Kafka. With m sources and n targets, you now need only m + n connections instead of m * n. Kafka holds the data streams, which are known as 'Topics', and we can use them in any way we want.
History of Apache Kafka
Apache Kafka is an open-source distributed stream-processing platform that was originally developed at LinkedIn. It was later donated to the Apache Software Foundation.
What are the Use Cases of Apache Kafka?
Some of the major applications of Kafka are:
- Messaging system
- Log aggregation
- Activity/event tracking
- Stream processing
- Decoupling systems
- Integration with big data processing systems
What are Topics in Kafka?
As mentioned before, topics contain the streams of data. A topic in Kafka is similar to a table in a database, and we can have as many topics as we need. A topic is identified by its name. A producer publishes data to a topic, and consumers read from it.
Partitions in Topics
A topic is split into multiple partitions, and the partitions are numbered: if there are m partitions in a topic, they are numbered from 0 up to m-1. You need to specify the number of partitions when creating a topic. Each message stored in a partition gets an incremental id, known as its offset.
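To make this concrete, here is a minimal sketch of creating a topic with a chosen number of partitions using Kafka's Java AdminClient. The topic name "orders" and the broker address are placeholders, and the replication factor is explained in a later section.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Address of any broker in the cluster (placeholder).
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical "orders" topic: 3 partitions, replication factor 1.
            NewTopic topic = new NewTopic("orders", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```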
Let's say we have 3 partitions in a topic. When a message arrives on this topic and is written to, say, partition 0, it gets offset 0; the next message written to partition 0 gets offset 1, and so on, the offset increasing by one each time. Similarly, the first message that arrives at any other partition, say partition 2, gets offset 0 there, and that value has no relation to the messages at offset 0 in partition 0 or partition 1.
All these partitions are independent of each other, which means ordering is only guaranteed inside a partition, not across partitions. We can say that the message at offset k-1 inside partition m was written before the message at offset k inside the same partition m, but we cannot say whether the message at offset k-1 inside partition n (n != m) was written before or after the message at offset k inside partition m.
There is no upper limit on the value of an offset; it simply keeps growing. The data written to a partition cannot be changed (it is immutable), and it is only kept for a limited, configurable retention period (one week by default). A message is assigned to a particular partition deterministically only if you give it a key; otherwise, partitions are assigned in a round-robin fashion. All messages with the same key go to the same partition, which lets us track the data tied to a particular entity chronologically.
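The key-to-partition mapping can be pictured with a simplified sketch like the one below. This is not Kafka's actual partitioner (the default one hashes the serialized key bytes with murmur2), but the principle is the same: the same key always maps to the same partition.

```java
// Simplified sketch of key-based partition assignment; Kafka's default
// partitioner really uses a murmur2 hash of the serialized key bytes.
static int partitionFor(String key, int numPartitions) {
    // Mask the sign bit so the result is non-negative even for
    // keys whose hashCode() is negative.
    return (key.hashCode() & 0x7fffffff) % numPartitions;
}

// partitionFor("user-42", 3) always returns the same partition, so all
// messages keyed "user-42" stay in order relative to each other.
```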
What are Brokers in Kafka?
A broker is a Kafka server, and brokers hold the partitions. From this we know what a Kafka cluster is: after all, a cluster is a group of connected servers, so a Kafka cluster is composed of multiple brokers. We identify each broker by its ID.
Each broker contains only some of the partitions, not the entire topic: when we create a topic with n partitions, Kafka distributes those partitions across the brokers. When you connect to any one broker, you are connected to the entire cluster.
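For example, a client only needs the address of one broker to discover the whole cluster. A minimal sketch with the Java AdminClient (the broker address is a placeholder):

```java
import org.apache.kafka.clients.admin.AdminClient;

import java.util.Properties;

public class DescribeCluster {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // One broker address is enough; the client discovers the rest.
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Print every broker in the cluster, identified by its ID.
            admin.describeCluster().nodes().get().forEach(node ->
                System.out.println("Broker " + node.id() + " at " + node.host()));
        }
    }
}
```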
Topic Replication
Since Kafka is a distributed system, topics are replicated, so that when one machine goes down, the whole system does not break. We choose the replication factor when creating a topic. Say we have 3 brokers and a topic with 1 partition and a replication factor of 2. When this topic is created, its partition 0 might be assigned to broker 0, and because the replication factor is 2, one copy is also placed on another broker, say broker 1.
If one of our brokers goes down, we still have the replicated partition, and the system continues to function. This is where the concept of a leader comes in: at any time, exactly one broker is the leader for a partition. As long as the leader is up, it is the one that serves data for that partition, while the brokers holding the replicas only synchronize and maintain the data. If the leader fails, one of the replicas is elected as the new leader.
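You can see which broker leads each partition and where the replicas live by describing the topic. A sketch, again assuming the hypothetical "orders" topic from earlier:

```java
import org.apache.kafka.clients.admin.AdminClient;

import java.util.Collections;
import java.util.Properties;

public class ShowLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            admin.describeTopics(Collections.singletonList("orders"))
                 .all().get()
                 .forEach((name, description) ->
                     description.partitions().forEach(p ->
                         System.out.println("Partition " + p.partition()
                             + " leader=" + p.leader().id()
                             + " replicas=" + p.replicas())));
        }
    }
}
```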
What is Zookeeper in Kafka?
Zookeeper keeps the list of brokers and manages them. It is also responsible for leader election and helps coordinate partition synchronization. It detects changes of state in the system, for example when a broker dies or recovers, or when a topic is created or deleted. Like the brokers, Zookeeper itself has a concept of a leader: a Zookeeper ensemble consists of one leader and several followers.
How to write data to Kafka?
As mentioned before, it is the role of producers to write data to topics. We do not need to explicitly tell the producer which broker to write to; all of this is handled automatically.
We mentioned earlier that if you do not define a key, producers write to the partitions (and hence brokers) in a round-robin manner, which also contributes to load balancing. If a key is sent, all messages for that key will always go to the same partition, which allows us to track the data tied to a particular entity chronologically.
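Here is a minimal producer sketch showing both cases. The topic name and broker address are the same placeholders as in the earlier examples:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // No key: messages are spread across partitions for load balancing.
            producer.send(new ProducerRecord<>("orders", "order-created"));

            // With a key: every message keyed "user-42" lands on the same
            // partition, so events for that user stay in order.
            producer.send(new ProducerRecord<>("orders", "user-42", "order-created"));
        }
    }
}
```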
When the producer writes data, it can request an acknowledgment confirming that the data was written. There are two acknowledgment modes:
- When the producer writes data to a particular partition on a broker, only the leader broker for that partition sends an acknowledgment.
- In the second mode, not only the leader but all in-sync replicas have to acknowledge the write.
Besides these, the producer can also opt not to receive any acknowledgment at all. In the Java client, these three modes map to the acks setting, as sketched below.
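The fragment below would extend the producer configuration shown earlier; only one of the three settings applies at a time:

```java
// acks=1   : only the partition leader acknowledges the write (first mode above).
props.put("acks", "1");

// acks=all : the leader and all in-sync replicas must acknowledge (second mode).
// props.put("acks", "all");

// acks=0   : fire-and-forget, no acknowledgment at all (fastest, least safe).
// props.put("acks", "0");
```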
How is Data Consumed from Kafka?
It is the role of consumers to read data from topics. Consumers automatically know which broker to read from. Within a partition, a consumer reads messages in offset order; across partitions, no order is guaranteed: a consumer may read from partition 1 first and then from partition 0.
Consumers read data as part of a consumer group. Each consumer in the group reads from an exclusive set of partitions. For example, with 3 partitions and a consumer group of 2 consumers, the first consumer might read from the first partition while the second reads from the other two; the second would never read from the first partition, which is exclusive to the first consumer in the group.
If you have more consumers than partitions, some consumers will be inactive. Kafka stores the offset up to which each consumer group has read in an internal topic named __consumer_offsets, so that if a consumer fails and comes back up, it can resume reading from where it left off.
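A minimal consumer sketch tying these ideas together; the group id "order-processors" is a placeholder, and the topic and broker address are the same hypothetical values as before:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Consumers sharing a group.id split the topic's partitions between them.
        props.put("group.id", "order-processors");
        // Read from the beginning when the group has no committed offset yet.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Offsets are committed to __consumer_offsets automatically
                // (enable.auto.commit is true by default), so a restarted
                // consumer resumes from where it left off.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```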