Sharding vs. Partitioning: The Devil's in the Data Distribution

Kafka’s sharding, more formally known as partitioning, is how it achieves horizontal scalability, but the real magic isn’t just splitting data; it’s how that split data can be processed independently and in parallel by consumers.

Let’s watch it in action. Imagine a Kafka topic named user_events with a simple producer sending messages like this:

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

producer.send('user_events', key=b'user_123', value={'event': 'login', 'timestamp': 1678886400})
producer.send('user_events', key=b'user_456', value={'event': 'click', 'timestamp': 1678886405})
producer.send('user_events', key=b'user_123', value={'event': 'logout', 'timestamp': 1678886410})

Now, if user_events is partitioned, Kafka uses the key of the message (user_123, user_456) to determine which partition it lands in. All messages with the same key will always go to the same partition. This is the bedrock of ordered processing for a given key.

Here’s how you’d configure a topic with 4 partitions:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic user_events --partitions 4 --replication-factor 1

When a consumer group (my_consumer_group) reads from this topic, Kafka assigns partitions to consumers within that group. If you have one consumer, it gets all 4 partitions. If you have four consumers, each gets one partition. If you have five consumers, one consumer will be idle, and the other four will each get one partition.

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'user_events',
    bootstrap_servers='localhost:9092',
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my_consumer_group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

for message in consumer:
    print(f"Partition: {message.partition}, Key: {message.key.decode()}, Value: {message.value}")

The problem this solves is straightforward: as your data volume grows, a single Kafka broker and a single consumer instance become bottlenecks. Partitioning allows you to:

Distribute Storage: Data for a topic is spread across multiple brokers if you have multiple brokers and replication.
Enable Parallel Consumption: Multiple consumers in the same group can read from different partitions concurrently, processing data much faster.

The mental model is that a topic is a collection of ordered logs, and each log (partition) is an independent unit. Kafka guarantees ordering within a partition, but not across partitions. The key is your mechanism for ensuring related messages (like all events for a specific user) land on the same partition, allowing for stateful processing or ordered analysis for that key.

The number of partitions is a critical decision. You can only increase the number of partitions for a topic, never decrease it. If you have 100 partitions and only 10 consumers, you’re leaving a lot of processing power on the table. Conversely, if you have 100 partitions and 1000 consumers, you’re creating overhead with too many small assignments and potential network chatter for partition leader elections. The number of partitions dictates the maximum parallelism for a consumer group.

What most people don’t realize is that Kafka’s partitioning strategy is fundamentally tied to its consumer group rebalancing mechanism. When a consumer joins or leaves a group, Kafka reassigns partitions to maintain an even distribution. This rebalancing process can briefly pause consumption for affected partitions, so a very high number of partitions can lead to more frequent and longer rebalancing events, impacting overall throughput.

The next concept you’ll grapple with is how to choose the right partitioning key to ensure even data distribution and predictable message ordering.