The Kafka broker’s storage is full because log segment files are not being cleaned up as expected, which causes the broker to reject new produce requests.
Common Causes and Fixes:
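- Incorrect `log.retention.hours` or `log.retention.bytes` Configuration:
  - Diagnosis: Check the broker’s `server.properties` file for `log.retention.hours` and `log.retention.bytes`. If `log.retention.hours` is set to a very high value and `log.retention.bytes` is left at its default of `-1` (no size limit), Kafka keeps log segments until the disk is exhausted. If both are commented out, the defaults apply: 168 hours of retention and no size limit. Also check `log.retention.ms` and `log.retention.minutes`, which take precedence over `log.retention.hours` when set.
  - Fix: Set `log.retention.hours` to a reasonable value (e.g., `168` for 7 days) or `log.retention.bytes` to a size limit (e.g., `1073741824` for 1 GB). Note that `log.retention.bytes` applies to each partition, not to the broker as a whole. Kafka then deletes old log segments automatically.
  - Why it works: These properties define the maximum age or size a partition’s log may reach before its oldest closed segments become eligible for deletion. For topics with `cleanup.policy=delete`, the broker’s retention check enforces them every `log.retention.check.interval.ms`.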
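To make the diagnosis above concrete, here is a minimal Python sketch that works out which retention settings a broker would end up with from the text of a `server.properties` file. The helper names (`parse_properties`, `effective_retention`) and the sample file contents are hypothetical; the precedence order and defaults are the ones described in the Kafka broker configuration reference, so check them against the version you run.

```python
def parse_properties(text):
    """Parse Java-style key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def effective_retention(props):
    """Return (retention_ms, retention_bytes) after applying precedence and defaults.

    Precedence: log.retention.ms > log.retention.minutes > log.retention.hours.
    Defaults: 168 hours, and -1 (no size limit) for log.retention.bytes.
    """
    if "log.retention.ms" in props:
        retention_ms = int(props["log.retention.ms"])
    elif "log.retention.minutes" in props:
        retention_ms = int(props["log.retention.minutes"]) * 60 * 1000
    else:
        retention_ms = int(props.get("log.retention.hours", "168")) * 60 * 60 * 1000
    retention_bytes = int(props.get("log.retention.bytes", "-1"))
    return retention_ms, retention_bytes


# Hypothetical file contents: the hours setting is commented out, so the default applies.
sample = """
# log.retention.hours=720
log.retention.bytes=1073741824
"""
retention_ms, retention_bytes = effective_retention(parse_properties(sample))
print(f"{retention_ms // 3_600_000} hours; {retention_bytes} bytes per partition")
```

Topic-level overrides (`retention.ms`, `retention.bytes`) are not covered by this sketch; they would have to be read from the topic configuration separately.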
- `delete.retention.ms` Too High:
  - Diagnosis: Examine `server.properties` for `log.cleaner.delete.retention.ms` (the broker-level default) and the topic configuration for `delete.retention.ms`. If this value is extremely large, Kafka keeps "delete tombstone" records on compacted topics for an extended period, so keys that have already been deleted continue to occupy disk.
  - Fix: Set `delete.retention.ms` to a value that aligns with your retention policy, typically a few hours or days (e.g., `86400000` for 1 day, which is also the default).
  - Why it works: This setting dictates how long Kafka retains delete markers for keys that have been logically deleted (i.e., by producing a record with a null value). Until these markers expire, the cleaner cannot physically remove them from log segments. The setting only affects topics with `cleanup.policy=compact`.
- Log Cleaner Not Running or Configured Incorrectly:
  - Diagnosis: Check broker logs for messages indicating the log cleaner is struggling, has stopped, or is disabled. Verify `log.cleaner.enable=true` in `server.properties` (this is the default). Also, check `log.cleaner.threads` (e.g., `1` or `2`) and `log.cleaner.min.cleanable.ratio` (e.g., `0.5`). If `min.cleanable.ratio` is too high, a log might never be considered "dirty" enough to clean.
  - Fix: Ensure `log.cleaner.enable` is `true`. Adjust `log.cleaner.threads` to a small number like `1` or `2` if CPU is a concern, or increase it if cleaning is lagging. Lower `log.cleaner.min.cleanable.ratio` to `0.3` or `0.4` if logs aren’t being cleaned.
  - Why it works: The log cleaner is a pool of background threads inside the broker that reclaims disk space by compacting the logs of topics with `cleanup.policy=compact`. If it’s off, stalled, or misconfigured, compacted topics grow without bound. Time- and size-based deletion for topics with `cleanup.policy=delete` is handled by a separate retention check and does not depend on the cleaner.
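The effect of `log.cleaner.min.cleanable.ratio` is easier to see with numbers. The sketch below is a simplified model, not the broker’s implementation: it treats the dirty ratio as the uncompacted bytes divided by the total bytes in the log and considers the log cleanable once that ratio exceeds the threshold. The function names are hypothetical, and other settings that influence scheduling (such as compaction lag limits) are ignored.

```python
GIB = 2**30


def dirty_ratio(clean_bytes, dirty_bytes):
    """Fraction of the log that has not been compacted yet."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def is_cleanable(clean_bytes, dirty_bytes, min_cleanable_ratio=0.5):
    """Simplified eligibility check: dirty ratio must exceed the threshold."""
    return dirty_ratio(clean_bytes, dirty_bytes) > min_cleanable_ratio


# A log with 6 GiB already compacted and 4 GiB of new, uncompacted data.
clean, dirty = 6 * GIB, 4 * GIB
print(dirty_ratio(clean, dirty))        # 0.4
print(is_cleanable(clean, dirty, 0.5))  # False: below the 0.5 threshold
print(is_cleanable(clean, dirty, 0.3))  # True: eligible once the threshold is lowered
```

In this model, a log that receives a steady trickle of updates can sit just under the threshold for a long time, which is why lowering the ratio frees space sooner at the cost of more cleaner I/O.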
- Unclean Leader Election (`unclean.leader.election.enable=true`) with Data Loss:
  - Diagnosis: This is not directly a cause of disk exhaustion. If `unclean.leader.election.enable=true` and a partition leader fails, a replica without the latest data can be elected leader, and the partition loses messages. Replicas whose logs diverge from the new leader then have to truncate and re-fetch, so disk usage on the affected brokers may not match what your retention settings predict. Check broker logs for unclean leader election events.
  - Fix: Set `unclean.leader.election.enable=false` in `server.properties` to prevent this. After fixing disk space, you may need to re-sync partitions or re-assign them to ensure data integrity.
  - Why it works: Disabling unclean leader election ensures that only in-sync replicas can become leaders, preventing data loss and the associated disk space anomalies.
- Topics with `cleanup.policy=compact` and No `segment.ms` or `segment.bytes`:
  - Diagnosis: The cleaner never compacts a partition’s active segment, so data only becomes eligible for cleaning after the segment rolls, which `segment.ms` and `segment.bytes` control. If these are not tuned at the topic level and the broker defaults are too large for the topic’s throughput, old values sit in the active segment for a long time. In addition, a topic with `cleanup.policy=compact` has no time-based retention: the latest value for every key is retained indefinitely until the key is explicitly deleted via a tombstone record, and the tombstone itself is kept for `delete.retention.ms`. Check topic configurations using `kafka-topics.sh --describe --topic <topic_name> --bootstrap-server <broker_list>`.
  - Fix: For compacted topics, ensure `delete.retention.ms` is set at the topic level (or `log.cleaner.delete.retention.ms` at the broker level) to a reasonable value. To control when segments roll and become available to the cleaner, set `segment.bytes` and `segment.ms` at the topic level (e.g., `segment.bytes=1073741824` for 1 GB).
  - Why it works: In compaction mode, Kafka retains the latest value for each key and removes older values for that key when the cleaner processes a closed segment. A key disappears entirely only after a tombstone record is written for it, and the tombstone is dropped once `delete.retention.ms` has passed. If that value is very large, tombstones accumulate.
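The compaction behavior described above can be illustrated with a small in-memory model. This is a deliberately simplified sketch, not how the broker implements cleaning: it ignores segments entirely and measures tombstone age from the record timestamp. The `compact` function and the record layout `(offset, key, value, timestamp_ms)` are hypothetical, with `None` standing in for a tombstone’s null value.

```python
def compact(records, now_ms, delete_retention_ms):
    """Keep the latest record per key; drop tombstones older than delete_retention_ms.

    records: iterable of (offset, key, value, timestamp_ms) in offset order.
    A value of None marks a tombstone.
    """
    latest = {}
    for record in records:
        _offset, key, _value, _ts = record
        latest[key] = record  # later offsets overwrite earlier ones

    survivors = []
    for record in latest.values():
        _offset, _key, value, ts = record
        expired_tombstone = value is None and now_ms - ts > delete_retention_ms
        if not expired_tombstone:
            survivors.append(record)
    return sorted(survivors, key=lambda r: r[0])


log = [
    (0, "a", "1", 0),
    (1, "b", "1", 0),
    (2, "a", "2", 1_000),   # supersedes offset 0
    (3, "b", None, 2_000),  # tombstone for key "b"
]

one_day_ms = 86_400_000
# Shortly after the tombstone is written, it is still retained.
print(compact(log, now_ms=3_000, delete_retention_ms=one_day_ms))
# Once delete.retention.ms has passed, the tombstone is dropped as well.
print(compact(log, now_ms=2_000 + one_day_ms + 1, delete_retention_ms=one_day_ms))
```

The first call keeps the newest value for `a` plus the tombstone for `b`; the second keeps only the newest value for `a`. Keys that never receive a tombstone stay in the log no matter how much time passes.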
- Under-provisioned Disk or Excessive Partition Count for High Throughput:
  - Diagnosis: Monitor disk usage over time. If disk usage consistently grows faster than your retention policies can prune it, your disk is simply too small for the volume of data being produced, or a topic has an excessive number of partitions. Because `log.retention.bytes` applies per partition, the space a topic may occupy scales with its partition count, and many partitions also mean many small log files that are slower to clean. Use `df -h` on the broker host and Kafka monitoring tools.
  - Fix: Increase the disk size on the broker or add more brokers to the cluster. For specific topics, consider whether the partition count is higher than your parallelism requirements justify. Kafka cannot reduce the partition count of an existing topic, so lowering it means creating a new topic with fewer partitions and migrating producers and consumers to it. `num.partitions` in `server.properties` only sets the default for newly created topics.
  - Why it works: More disk space provides a larger buffer. Fewer partitions reduce the total number of log files and the management overhead, allowing retention and cleaning to keep pace.
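To judge whether a disk is simply too small, a back-of-the-envelope estimate is often enough. The sketch below assumes time-based retention, evenly distributed partitions, and a steady ingest rate; it ignores index files, compression, and the extra space taken by segments that have not yet rolled. The function name, its parameters, and the 30% headroom are illustrative assumptions, not recommendations from the Kafka documentation.

```python
def required_disk_bytes_per_broker(
    ingest_bytes_per_sec,
    retention_hours,
    replication_factor,
    broker_count,
    headroom=0.3,
):
    """Rough per-broker disk need for time-based retention, with spare headroom."""
    retained = ingest_bytes_per_sec * retention_hours * 3600 * replication_factor
    return retained / broker_count * (1 + headroom)


# Example: 10 MB/s of produced data, 7 days of retention, replication factor 3, 6 brokers.
needed = required_disk_bytes_per_broker(
    ingest_bytes_per_sec=10_000_000,
    retention_hours=168,
    replication_factor=3,
    broker_count=6,
)
print(f"{needed / 1e12:.2f} TB per broker")
```

If the figure this produces is close to or above the disk actually attached to each broker, tuning the cleaner or retention checks will not help; the options are shorter retention, more disk, or more brokers.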
- Manual Log Deletion or Incorrect `log.dirs`:
  - Diagnosis: Check if any manual `rm -rf` commands were executed on the Kafka log directories (`log.dirs` in `server.properties`). Deleting files underneath a running broker can corrupt Kafka’s internal state, and the space is not actually released while the broker still holds the deleted files open. Also, verify that `log.dirs` points to the correct, intended directories.
  - Fix: If manual deletion occurred, you’ll likely need to restart the broker and potentially re-create topics or even the cluster if state is irrecoverably corrupted. Ensure `log.dirs` is correctly configured.
  - Why it works: Kafka manages its segment files meticulously. Manual deletion bypasses these mechanisms, leaving the broker in an inconsistent state where it believes space is occupied but cannot manage it.
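Before touching anything under `log.dirs` by hand, it helps to see which partitions actually hold the space. The following read-only sketch sums file sizes per partition directory; it assumes the usual layout in which each partition lives in a `<topic>-<partition>` subdirectory of a log directory. The function names are hypothetical, and the path in the usage comment is a placeholder for whatever your `log.dirs` contains.

```python
import os


def partition_sizes(log_dir):
    """Return {partition_directory_name: total_bytes} for one Kafka log directory."""
    sizes = {}
    for entry in os.scandir(log_dir):
        if not entry.is_dir():
            continue  # skip checkpoint and metadata files at the top level
        total = 0
        for root, _dirs, files in os.walk(entry.path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # the broker may delete a segment while we are scanning
        sizes[entry.name] = total
    return sizes


def largest(sizes, n=10):
    """Return the n biggest partitions as (name, bytes), largest first."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Usage (placeholder path; substitute a directory from your log.dirs):
# for name, size in largest(partition_sizes("/path/to/kafka-logs")):
#     print(f"{size / 2**30:8.2f} GiB  {name}")
```

Because the script only reads file metadata, it is safe to run against a live broker, and its output points to the topics whose retention or compaction settings deserve a closer look.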
After resolving the disk space issue, you will likely encounter `LEADER_NOT_AVAILABLE` errors if any partitions were on the affected broker and their leadership was lost during the outage, or if the disk-full condition corrupted partition leadership metadata.