Kafka’s log compaction can look, at first glance, like a simple TTL for your messages, but it works differently: instead of expiring messages by age, it prunes older messages that share a key with a newer one.

Let’s see it in action. Imagine we have a Kafka topic named user-profiles with three partitions, and we’re sending updates for user alice.

# Produce messages with keys
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user-profiles --property parse.key=true --property key.separator=,
> alice,{"name": "Alice", "email": "alice@example.com", "status": "active"}
> bob,{"name": "Bob", "email": "bob@example.com", "status": "active"}
> alice,{"name": "Alice Smith", "email": "alice.smith@example.com", "status": "active"}
> alice,{"name": "Alice Smith", "email": "alice.smith@example.com", "status": "inactive"}
> bob,{"name": "Bob", "email": "bob@example.com", "status": "inactive"}
> alice,{"name": "Alice Smith", "email": "alice.smith@example.com", "status": "active"}

Now, if we were to consume from this topic without compaction, we’d see all of these messages, in order within each partition. With compaction enabled, Kafka’s behavior is more nuanced. It doesn’t delete messages by age; for each key it guarantees that at least the most recent message is retained, and it eventually discards the older messages for that key.

Here’s the key configuration for enabling compaction on a topic:

# Enable log compaction for the 'user-profiles' topic
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name user-profiles --alter --add-config cleanup.policy=compact
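
If the topic does not exist yet, the same policy can be set at creation time instead of altering it afterwards. This is a sketch that assumes a single-broker test cluster at localhost:9092 (hence the replication factor of 1) and the three partitions mentioned earlier.

```shell
# Create 'user-profiles' with log compaction enabled from the start.
# --replication-factor 1 is an assumption for a single-broker test setup.
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic user-profiles \
  --partitions 3 --replication-factor 1 \
  --config cleanup.policy=compact
```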

Note that delete.retention.ms does not control whether or when compaction runs. It dictates how long tombstones (delete markers, which we’ll cover below) stay in the log after compaction so that consumers have a chance to see them. The default is 24 hours (86400000 ms); for testing, a low value such as delete.retention.ms=60000 lets you watch tombstones disappear quickly.

# Set delete retention to 1 minute (for testing compaction of tombstones)
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name user-profiles --alter --add-config delete.retention.ms=60000

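You’ll also want to set segment.ms and segment.bytes to values that encourage segment rolling. The cleaner never touches the active segment (the one currently being written), so messages become eligible for compaction only after their segment has been rolled. For example: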

# Set segment properties to encourage segment rolling for testing
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name user-profiles --alter --add-config segment.ms=10000
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name user-profiles --alter --add-config segment.bytes=1048576 # 1MB

The core idea behind compaction is to maintain a view of the latest state for each key. Nothing special happens at write time: a new message is simply appended to the log, even if an older message with the same key already exists. Later, the background cleaner scans the log, builds a map from each key to its latest offset, and rewrites the older segments so that they contain only the most recent message for each key. Messages that survive keep their original offsets.

Consider the alice messages from our example. After compaction, a consumer reading from the beginning of the topic (or after a specific offset) would effectively only see the last message produced for alice:

alice,{"name": "Alice Smith", "email": "alice.smith@example.com", "status": "active"}

The intermediate updates are gone. This is incredibly useful for scenarios like user profiles, configuration settings, or any data where only the most recent value matters.
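
The effect on the example data can be reproduced offline with a toy model. The sketch below is not Kafka code: it assumes all six messages sit in one partition, and the function name `compact` is made up for illustration. It simply keeps the last line seen for each key.

```shell
#!/usr/bin/env bash
# Toy model of log compaction: keep only the last record seen for each key.
# Input: one "key,value" record per line, in offset order (a single partition).
compact() {
  awk '{
         key = substr($0, 1, index($0, ",") - 1)   # text before the first comma
         last[key] = NR                            # newest line number for this key
         rec[NR] = $0
       }
       END {
         for (i = 1; i <= NR; i++) {
           key = substr(rec[i], 1, index(rec[i], ",") - 1)
           if (last[key] == i) print rec[i]        # survivors keep their original order
         }
       }'
}

compact <<'EOF'
alice,{"name": "Alice", "email": "alice@example.com", "status": "active"}
bob,{"name": "Bob", "email": "bob@example.com", "status": "active"}
alice,{"name": "Alice Smith", "email": "alice.smith@example.com", "status": "active"}
alice,{"name": "Alice Smith", "email": "alice.smith@example.com", "status": "inactive"}
bob,{"name": "Bob", "email": "bob@example.com", "status": "inactive"}
alice,{"name": "Alice Smith", "email": "alice.smith@example.com", "status": "active"}
EOF
```

Running it prints two lines: the last message for bob (status inactive) followed by the last message for alice (status active).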

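What about deleting data? This is where "tombstones" come in. A tombstone is simply a message with a key and a null value. When compaction encounters one, it treats it as a deletion instruction for that key. Note that an empty string is not null: typing bob, followed by nothing in the console producer sends an empty value, which compaction treats as an ordinary update.

Let’s produce a tombstone for bob:

# Produce a tombstone message for bob
# (null.marker requires Kafka 3.2 or newer; the marker text NULL is an arbitrary choice)
kafka-console-producer.sh --bootstrap-server localhost:9092 --topic user-profiles --property parse.key=true --property key.separator=, --property null.marker=NULL
> bob,NULL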

After compaction runs, all earlier messages for bob are removed and only the tombstone remains. The delete.retention.ms configuration determines how long the tombstone itself persists before it, too, is purged from the log. This allows consumers to see the deletion event and clean up their own state, but then the data is gone.
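
The two stages of a delete can be sketched with a toy model. This is not Kafka code: the literal value `NULL` stands in for a null value, and the function name and its `keep`/`purge` argument are hypothetical, chosen only to show the log before and after delete.retention.ms has elapsed.

```shell
#!/usr/bin/env bash
# Toy model of tombstone handling during compaction (not Kafka code).
# Input: "key,value" lines in offset order; the literal value NULL stands in
# for a record whose value is null (a tombstone).
# $1 = "keep"  -> tombstones still inside delete.retention.ms are retained
# $1 = "purge" -> tombstones older than delete.retention.ms are dropped too
compact_with_tombstones() {
  awk -v mode="$1" '{
         c = index($0, ",")
         key[NR] = substr($0, 1, c - 1)
         val[NR] = substr($0, c + 1)
         rec[NR] = $0
         last[key[NR]] = NR
       }
       END {
         for (i = 1; i <= NR; i++) {
           if (last[key[i]] != i) continue                    # superseded by a newer record
           if (val[i] == "NULL" && mode == "purge") continue  # expired tombstone
           print rec[i]
         }
       }'
}

records='alice,v1
bob,v1
alice,v2
bob,NULL'

echo "within delete.retention.ms:"
printf '%s\n' "$records" | compact_with_tombstones keep
echo "after delete.retention.ms:"
printf '%s\n' "$records" | compact_with_tombstones purge
```

In the first stage the output still contains `bob,NULL`, which is what lets a consumer notice the delete; in the second stage only `alice,v2` is left.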

Compaction is performed by the log cleaner, which is enabled by the log.cleaner.enable broker setting (true by default) and runs in the background. Cleaner threads periodically pick a partition log with enough uncompacted data, read its closed segments, identify the latest message for each key, and write new, compact segments containing only those latest messages. The cleaned segments are then swapped in and the originals are deleted.
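
Two consequences of this process are easy to miss: messages in the segment that is still being written are not compacted, and surviving messages keep their offsets. The sketch below models both for a single partition. It is a simplification, not the cleaner's actual algorithm, and the `active_start` parameter (the line number where the active segment begins) is a hypothetical stand-in for a real segment boundary.

```shell
#!/usr/bin/env bash
# Toy model: the cleaner only rewrites closed segments; the active segment
# (here: every record from line number $1 onward) is left untouched.
# Output is "offset<TAB>record" so the gaps left by compaction are visible.
compact_closed_segments() {
  awk -v active_start="$1" '{
         rec[NR] = $0
         k[NR] = substr($0, 1, index($0, ",") - 1)
         if (NR < active_start) last[k[NR]] = NR   # map covers closed segments only
       }
       END {
         for (i = 1; i <= NR; i++)
           if (i >= active_start || last[k[i]] == i)
             printf "%d\t%s\n", i - 1, rec[i]      # offsets are 0-based and never renumbered
       }'
}

# Six records; assume the last two still sit in the active segment.
printf '%s\n' 'alice,v1' 'bob,v1' 'alice,v2' 'alice,v3' 'bob,v2' 'alice,v4' |
  compact_closed_segments 5
```

The output keeps offsets 1, 3, 4 and 5: `bob,v1` and `alice,v3` survive in the closed part even though newer values exist in the active segment, which is why a consumer can still see more than one message per key shortly after a write.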

The cleanup.policy setting can be set to delete (the default, which removes whole segments based on time or size retention), compact, or compact,delete (which applies both policies). When both are applied, the log is compacted as usual, and segments that exceed retention.ms or retention.bytes are also deleted, so even the latest value for a key is eventually removed once it ages out.
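
Setting the combined policy from the command line needs one extra piece of syntax, because the value itself contains a comma. A sketch, assuming the same local broker; the 7-day retention.ms is an arbitrary illustration, not a recommendation:

```shell
# Compact the topic AND delete segments older than 7 days (604800000 ms).
# Square brackets group the comma-separated list so it is read as one value;
# the single quotes keep the shell from interpreting the brackets.
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name user-profiles \
  --alter --add-config 'cleanup.policy=[compact,delete],retention.ms=604800000'
```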

The min.cleanable.dirty.ratio topic configuration (log.cleaner.min.cleanable.ratio at the broker level) is also important. It specifies the minimum ratio of "dirty" bytes (data written since the last cleaning) to total bytes in a partition’s log before the cleaner will consider compacting it. This prevents the cleaner from constantly rewriting logs that are already mostly compacted. The default is 0.5 (50%).
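
The arithmetic behind this threshold is simple enough to check by hand. The helper below is a simplified model of the eligibility check, not the cleaner's real code; the function name and the byte counts are made up for illustration.

```shell
#!/usr/bin/env bash
# Simplified model of the cleaner's eligibility check for one partition log.
# $1 = bytes already cleaned, $2 = bytes written since the last cleaning,
# $3 = threshold (the min.cleanable.dirty.ratio value)
is_cleanable() {
  awk -v clean="$1" -v dirty="$2" -v min_ratio="$3" 'BEGIN {
    total = clean + dirty
    ratio = (total > 0) ? dirty / total : 0
    printf "dirty ratio %.2f -> %s\n", ratio, (ratio > min_ratio ? "cleanable" : "skip")
  }'
}

is_cleanable 800 200 0.5   # 200 / 1000 = 0.20, below the threshold
is_cleanable 400 600 0.5   # 600 / 1000 = 0.60, above the threshold
```

Lowering the threshold makes the cleaner run more often on less garbage, trading extra I/O for a tighter log.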

When compaction runs, it’s not just about keeping the latest value. It’s about creating a new, smaller log segment that contains only the unique keys that should be retained. Any message not present in this new "cleaned" segment is considered deleted. This process is independent of the consumer’s offset. Compaction happens on the broker, regardless of what consumers have read.

The log.cleaner.threads broker configuration controls how many threads are dedicated to running the cleaning process across all compacted topics. Increasing this can help if you have many compacted topics and compaction isn’t keeping up.

The "surprise" about Kafka log compaction is that it’s not just a simple "delete old messages" feature; it’s fundamentally about maintaining the latest state for a given key. It effectively turns a topic into a durable changelog for a key-value table: Kafka itself offers no lookup by key, but any consumer can rebuild the current state by reading the topic from the beginning and keeping the last value seen for each key. This is a powerful abstraction that many users miss, thinking it’s just for cleaning up old logs.
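
Reading "the current state of a key" out of a compacted topic means folding the message stream on the consumer side. A minimal sketch of that fold, assuming key,value lines on standard input and the literal value `NULL` as a stand-in for a tombstone (both are conventions of this example, and `latest_value` is a hypothetical helper):

```shell
#!/usr/bin/env bash
# Fold a stream of "key,value" lines into the current value for one key.
# A later record replaces an earlier one; the literal value NULL (a stand-in
# for a tombstone in this sketch) deletes the key.
latest_value() {
  awk -v want="$1" '{
         c = index($0, ",")
         if (substr($0, 1, c - 1) != want) next
         v = substr($0, c + 1)
         if (v == "NULL") { found = 0 } else { found = 1; current = v }
       }
       END { if (found) print current }'
}

printf '%s\n' 'alice,v1' 'bob,v1' 'alice,v2' 'bob,NULL' | latest_value alice
```

The fold gives the same answer whether or not compaction has run yet; compaction only shortens the stream that has to be read. In a real deployment this is typically done by a consumer application that reads the topic from the beginning and keeps a map in memory.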

The next hurdle you’ll likely encounter is understanding how to effectively manage the delete.retention.ms for tombstone messages, especially when dealing with consumers that might lag behind.
