Kafka topics can keep messages forever, but sometimes you just need the latest version of a key.
Let’s say you’re building a user profile service. You don’t need every old username change or address update; you just need the current state of each user. Kafka’s log compaction feature is designed for exactly this scenario. Instead of deleting old messages after a time or size limit is reached (the delete policy), compaction guarantees that for each unique message key, at least the most recent value is retained. Older messages with the same key become eligible for removal, which shrinks the topic while preserving the latest state.
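To make that guarantee concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Kafka’s implementation; the Record type and the compact function are hypothetical names introduced purely for illustration.

```python
from typing import NamedTuple


class Record(NamedTuple):
    offset: int
    key: str
    value: str


def compact(log: list[Record]) -> list[Record]:
    """Keep only the highest-offset record for each key, in offset order."""
    latest: dict[str, Record] = {}
    for record in log:
        latest[record.key] = record  # a later record replaces an earlier one
    return sorted(latest.values(), key=lambda r: r.offset)


log = [
    Record(0, "user-1", "name=Ann"),
    Record(1, "user-2", "name=Ben"),
    Record(2, "user-1", "name=Ann B."),
]
compacted = compact(log)
print(compacted)
```

Note that the surviving records keep their original offsets. Compaction leaves gaps in the offset sequence rather than renumbering anything.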
Here’s how you’d configure a topic for compaction. We’ll use the kafka-topics.sh command-line tool, assuming you have a running Kafka cluster.
First, let’s create a new topic named user-profiles and immediately configure it for compaction.
kafka-topics.sh --create --topic user-profiles --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1 --config cleanup.policy=compact --config segment.ms=1000 --config delete.retention.ms=1000
Let’s break down these settings:
- cleanup.policy=compact: This is the core setting that enables compaction. Without it, Kafka uses the default delete policy.
- segment.ms=1000: This sets the maximum time before Kafka rolls the active segment and starts a new one. The cleaner never touches the active segment, so records only become candidates for compaction once their segment has been closed. A small value here (1000 milliseconds, or 1 second) means segments roll frequently, giving the cleaner more to work with. That is convenient for a demo, but in production it would produce a large number of tiny segment files.
- delete.retention.ms=1000: This is a bit of a gotcha. It does not control how long superseded messages are kept. It controls how long tombstones (records with a null value, which mark a key as deleted) are retained after compaction, so that consumers still have a chance to see the delete. The default is 86400000 (24 hours). Setting it low here means tombstones disappear quickly.
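One detail worth making concrete is that delete.retention.ms governs tombstones, not ordinary superseded records. The sketch below is a simplified Python model under stated assumptions: the Record type and compact function are hypothetical, and tombstone age is measured from the record’s own timestamp, whereas a real broker tracks this relative to when the segment was cleaned.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    offset: int
    key: str
    value: Optional[str]  # None models a tombstone (delete marker)
    timestamp_ms: int


def compact(log, now_ms, delete_retention_ms):
    """Keep the latest record per key; drop tombstones past their retention."""
    latest = {}
    for record in log:
        latest[record.key] = record
    survivors = []
    for record in sorted(latest.values(), key=lambda r: r.offset):
        is_tombstone = record.value is None
        expired = now_ms - record.timestamp_ms > delete_retention_ms
        if is_tombstone and expired:
            continue  # the key vanishes from the log entirely
        survivors.append(record)
    return survivors


log = [
    Record(0, "123", '{"name": "Alice"}', 0),
    Record(1, "456", '{"name": "Bob"}', 100),
    Record(2, "123", None, 200),  # tombstone for key 123
]
early = compact(log, now_ms=500, delete_retention_ms=1000)
late = compact(log, now_ms=5000, delete_retention_ms=1000)
print([r.offset for r in early], [r.offset for r in late])
```

In the early pass the tombstone is still inside its retention window, so a consumer would still see the delete for key 123. In the late pass it is gone and only key 456 remains.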
Now, let’s see this in action. We’ll produce some messages to our user-profiles topic.
# Start a console producer
kafka-console-producer.sh --topic user-profiles --bootstrap-server localhost:9092 --property "parse.key=true" --property "key.separator=:"
# Produce messages for users 123 and 456 (one record per line, key before the colon)
123:{"name": "Alice", "email": "alice@example.com"}
123:{"name": "Alice Smith", "email": "alice.smith@example.com"}
456:{"name": "Bob", "email": "bob@example.com"}
123:{"name": "Alice S.", "email": "alice.s@example.com"}
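Since the JSON values themselves contain colons, it helps to see how a line is divided into key and value. The sketch below assumes the console producer splits each line at the first occurrence of the separator, which is my understanding of its behavior rather than something stated above; parse_line is a hypothetical helper, not part of any Kafka tool.

```python
def parse_line(line: str, separator: str = ":") -> tuple[str, str]:
    """Split an input line into (key, value) at the first separator only."""
    key, value = line.split(separator, 1)
    return key, value


lines = [
    '123:{"name": "Alice", "email": "alice@example.com"}',
    '123:{"name": "Alice Smith", "email": "alice.smith@example.com"}',
    '456:{"name": "Bob", "email": "bob@example.com"}',
    '123:{"name": "Alice S.", "email": "alice.s@example.com"}',
]
parsed = [parse_line(line) for line in lines]
keys = [key for key, _ in parsed]
print(keys)
```

Under that assumption, key 123 receives three values and key 456 receives one, so a fully compacted log would end up holding two records.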
If you were to consume these messages immediately, you’d see all of them. Kafka’s compaction process runs in the background. A cleaner thread periodically scans the closed (non-active) log segments, builds a map of the latest offset seen for each key, and rewrites those segments so that only the most recent record for each key survives. The old segment files are then swapped out for the smaller, cleaned ones. The active segment is never cleaned, so recent duplicates stay in place until it rolls. Tombstones are the one exception to immediate removal: they are kept for a further delete.retention.ms before being dropped.
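The effect of the active segment on what a consumer sees can be sketched with a small Python model. This is an illustration only: the clean function is hypothetical, and the split of our four records into one closed segment and one active segment is an assumption made for the example.

```python
def clean(segments):
    """Compact every closed segment; leave the last (active) one untouched."""
    *closed, active = segments
    # The offset map is built from closed segments only.
    latest_offset = {}
    for segment in closed:
        for offset, key, _value in segment:
            latest_offset[key] = offset
    cleaned = [
        [rec for rec in segment if latest_offset[rec[1]] == rec[0]]
        for segment in closed
    ]
    return cleaned + [active]


segments = [
    # closed segment (assumed to have rolled already)
    [(0, "123", "Alice"), (1, "123", "Alice Smith"), (2, "456", "Bob")],
    # active segment, still being written to
    [(3, "123", "Alice S.")],
]
after = clean(segments)
visible = [rec for segment in after for rec in segment]
print(visible)
```

Key 123 still appears twice afterwards, at offsets 1 and 3, because its newest record sits in the active segment. This is why the guarantee is "at least the latest value per key" rather than "exactly one record per key".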
To verify compaction, you can look at the topic’s log directory on the Kafka broker. After the cleaner has run, older segment files are replaced by smaller, cleaned ones. You can also inspect individual segment files, but this is usually done for debugging. A simpler check is to consume the topic from the beginning and confirm that superseded records for a key no longer appear.
You can also confirm that the topic is configured for compaction by describing it.
kafka-topics.sh --describe --topic user-profiles --bootstrap-server localhost:9092
This command will show you the topic’s configuration, including cleanup.policy=compact. You won’t directly see the "compacted" state here, but rather the configuration that enables it.
You can also tune min.cleanable.dirty.ratio. This is a topic-level setting; the broker-wide default is controlled by log.cleaner.min.cleanable.ratio. It determines what fraction of a log must be "dirty" (written since the last cleaning pass and therefore possibly containing superseded records) before Kafka’s cleaner thread will consider cleaning it. A lower ratio makes the cleaner more aggressive, at the cost of more I/O. The default is 0.5 (50%).
To change the default for every topic on a broker (a per-topic override of min.cleanable.dirty.ratio can instead be applied with kafka-configs.sh and needs no restart):
- Edit server.properties on your Kafka broker.
- Add or modify the line: log.cleaner.min.cleanable.ratio=0.2
- Restart the Kafka broker.
This tells the cleaner to consider a log for compaction once at least 20% of it, measured in bytes, is dirty.
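The threshold check can be sketched in a few lines of Python. This is a simplified model, assuming the ratio is dirty bytes divided by total (clean plus dirty) bytes; the function names are hypothetical and the byte counts are made-up illustration values.

```python
def dirty_ratio(clean_bytes: int, dirty_bytes: int) -> float:
    """Fraction of the log, by size, written since the last cleaning pass."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def should_clean(clean_bytes: int, dirty_bytes: int, min_ratio: float = 0.5) -> bool:
    """True when the dirty fraction exceeds the configured threshold."""
    return dirty_ratio(clean_bytes, dirty_bytes) > min_ratio


# A log with 700 bytes already cleaned and 300 bytes of new writes is 30% dirty.
ratio = dirty_ratio(700, 300)
with_default = should_clean(700, 300)                # threshold 0.5
with_lowered = should_clean(700, 300, min_ratio=0.2)  # threshold 0.2
print(ratio, with_default, with_lowered)
```

With the default threshold this log would be left alone; lowering the threshold to 0.2 makes it a candidate for cleaning.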
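Another important setting for compaction is max.compaction.lag.ms. It puts an upper bound on how long a message can remain ineligible for compaction: once a record is older than this, the log becomes a candidate for cleaning even if the dirty ratio threshold has not been reached. This is useful on low-traffic topics where stale values must not linger indefinitely. For example, max.compaction.lag.ms=86400000 (24 hours) means a superseded record becomes eligible for compaction within about a day of being written. Its counterpart, min.compaction.lag.ms, does the opposite: it guarantees a minimum time during which a message will not be compacted, and it is the setting to reach for if you do want older versions of a key to stay readable for a while.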
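Let’s add max.compaction.lag.ms to our existing topic to demonstrate. Configuration changes on an existing topic go through kafka-configs.sh rather than kafka-topics.sh:
kafka-configs.sh --alter --entity-type topics --entity-name user-profiles --bootstrap-server localhost:9092 --add-config max.compaction.lag.ms=60000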
This alters the existing topic so that a record becomes eligible for compaction no later than 60 seconds after it is written, regardless of the dirty ratio. This is an important nuance: max.compaction.lag.ms is a ceiling on how long stale records can linger, not a floor on how long history is kept. If you explicitly want older versions to survive for a specific duration, set min.compaction.lag.ms instead.
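How the two lag settings interact with the dirty ratio can be sketched as follows. This is a deliberately simplified Python model, not the broker’s scheduling logic; both functions and their parameter names are hypothetical.

```python
from typing import Optional


def is_protected(record_age_ms: int, min_compaction_lag_ms: int = 0) -> bool:
    """A record younger than min.compaction.lag.ms may not be compacted yet."""
    return record_age_ms < min_compaction_lag_ms


def cleaning_decision(
    oldest_dirty_age_ms: int,
    dirty_ratio: float,
    min_ratio: float = 0.5,
    max_compaction_lag_ms: Optional[int] = None,
) -> str:
    """Decide whether a log is a candidate for cleaning, and why."""
    if max_compaction_lag_ms is not None and oldest_dirty_age_ms > max_compaction_lag_ms:
        return "clean: max.compaction.lag.ms exceeded"
    if dirty_ratio > min_ratio:
        return "clean: dirty ratio exceeded"
    return "wait"


# A quiet topic: only 10% dirty, but its oldest uncompacted record is 90 s old.
without_max = cleaning_decision(90_000, 0.1)
with_max = cleaning_decision(90_000, 0.1, max_compaction_lag_ms=60_000)
protected = is_protected(30_000, min_compaction_lag_ms=60_000)
print(without_max, "|", with_max, "|", protected)
```

Without the ceiling, the quiet topic would wait until half of its log is dirty. With a 60-second ceiling it becomes a cleaning candidate as soon as its oldest uncompacted record passes that age.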
The critical insight most people miss is that compaction isn’t a real-time process. It’s a background batch job that operates on closed log segments. Immediately after you produce a new message for a key, the old messages for that key are still physically present in the log and still visible to any consumer reading from the beginning. They disappear only after the segment holding them has rolled, the log has accumulated enough dirty data (or has hit max.compaction.lag.ms), and the cleaner thread has rewritten that segment. This lag can be significant, especially if you have a low volume of writes or a high segment.ms setting.
After you’ve successfully configured compaction and verified your topic is shrinking as expected, the next thing you’ll likely encounter is understanding how to tune the cleaner thread’s performance and concurrency.