@Soumyarian98
Created November 5, 2024 13:18
Kafka Architecture Overview

  1. Kafka offers high throughput, but it is not a long-term store by default. The retention period can be configured per use case, so data can be kept for days, weeks, or indefinitely if there is enough storage.
  2. Kafka has multiple producers and multiple consumers.
  3. Inside Kafka, we have "topics".
  4. Each topic is divided into partitions.
  5. The producer can apply different rules to decide which partition each message is written to.
  6. For example, a location topic could have two partitions, one for the North Pole and one for the South Pole.
  7. Consumers subscribe to a topic, and Kafka assigns the topic's partitions among them.
  8. Partitions are zero-indexed: with 3 partitions, they are numbered 0, 1, and 2.
  9. Let's say we have one producer for a topic. Inside the topic, there are 4 partitions.
    • One Consumer: All the partition data will be transferred to the one and only consumer.
    • Two Consumers: Each consumer will receive data from 2 partitions.
    • Three Consumers: One consumer will receive data from 2 partitions, and the other two consumers will receive data from one partition each.
    • Four Consumers: Each consumer will receive data from one partition.
    • Five Consumers: Only 4 consumers will receive data from one partition each, and the 5th consumer will not receive any data.
  10. The rule is: one consumer can track multiple partitions, but each partition can be tracked by only one consumer within a group.
  11. Now, we can have consumer groups.
  12. A consumer cannot exist outside a group; every consumer belongs to exactly one consumer group.
  13. Let's say we have two groups, each with 2 consumers. With 4 partitions, each consumer in each group receives data from 2 partitions. The rules from point 9 apply independently within each group.
  14. By having consumer groups, Kafka can act as a Queue or Pub/Sub system.
    • If we have 4 partitions and 1 consumer group with 4 consumers inside, Kafka acts like a Queue (each consumer gets a unique partition).
    • In the same scenario, if we create multiple consumer groups, it acts like a Pub/Sub (each group gets its own copy of the data).
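
The assignment rules in points 9-14 can be sketched in plain Python. This is not the broker's actual assignment algorithm (Kafka uses configurable strategies such as range and round-robin, handled by the group coordinator); it is a round-robin-style simulation with made-up consumer names, just to make the counting concrete:

```python
def assign_partitions(partitions, consumers):
    """Deal partitions out round-robin: each partition goes to exactly
    one consumer in the group; a consumer may own several partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# One consumer: it receives all 4 partitions.
print(assign_partitions(partitions, ["c1"]))
# Two consumers: 2 partitions each.
print(assign_partitions(partitions, ["c1", "c2"]))
# Three consumers: one gets 2 partitions, the other two get 1 each.
print(assign_partitions(partitions, ["c1", "c2", "c3"]))
# Five consumers: the 5th is idle because there are only 4 partitions.
print(assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5"]))

# Queue vs Pub/Sub: each GROUP gets its own independent assignment,
# so every group sees a full copy of the topic's data.
for group, members in {"g1": ["a1", "a2"], "g2": ["b1", "b2"]}.items():
    print(group, assign_partitions(partitions, members))
```

Within one group the topic behaves like a queue (each partition consumed once); across groups it behaves like pub/sub (each group consumes everything).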

Additional Points

  1. Each partition has a replication factor for fault tolerance. If a broker fails, a replica becomes the new leader to keep the data available.
  2. When consumers join or leave a group, Kafka rebalances the partitions among the active consumers. This ensures load is balanced but can cause temporary disruption.
  3. Consumers track their position in partitions using offsets. These offsets can be stored by Kafka, allowing consumers to pick up from where they left off in case of failure.
  4. Data inside a single partition is always ordered. Across partitions, data order is not guaranteed.
  5. Consumer lag is the difference between the latest offset and the current consumer's offset. It's important to monitor lag to ensure real-time processing.
  6. Producers can send messages with keys. The key determines which partition the message goes to, so all messages with the same key go to the same partition, which helps maintain order for specific data (e.g., all data for one user or location).
  7. Kafka uses different partition assignment strategies like range and round-robin to distribute partitions among consumers in a group.
  8. Adding more partitions can help scale the topic across more consumers, but more partitions add complexity and overhead. Since the partition count caps consumer parallelism within a group, it should be at least the expected number of consumers.
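
Point 6 (key-based partitioning) can be sketched as follows. Kafka's default partitioner hashes the key bytes with murmur2; the stable stdlib hash below is a stand-in so the sketch has no dependencies, and the key names are hypothetical:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Hash the key and take it modulo the partition count.
    # The same key always maps to the same partition, which is what
    # preserves per-key ordering (e.g. all events for one user).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All messages keyed "user-42" land in the same partition.
print(partition_for("user-42", 4))
print(partition_for("user-42", 4))  # same partition as above
print(partition_for("user-99", 4))  # may differ
```

Note the modulo: changing the number of partitions remaps keys, which is one reason repartitioning an existing topic breaks per-key ordering guarantees.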
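
Consumer lag (point 5) is computed per partition as the log-end offset minus the consumer's committed offset. A minimal sketch with hypothetical offset numbers:

```python
# Latest offset written to each partition (hypothetical values).
log_end_offsets = {0: 1500, 1: 1480, 2: 1502, 3: 1499}
# Offsets this consumer group has committed (hypothetical values).
committed_offsets = {0: 1500, 1: 1450, 2: 1400, 3: 1499}

# Lag per partition: how many messages are written but not yet consumed.
lag = {p: log_end_offsets[p] - committed_offsets[p] for p in log_end_offsets}
total_lag = sum(lag.values())

print(lag)        # {0: 0, 1: 30, 2: 102, 3: 0}
print(total_lag)  # 132
```

A growing total lag means consumers are falling behind producers, which is why lag is the standard health metric for real-time pipelines.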