The Kafka broker’s disk is full because log segment files are not being cleaned up as expected, and the broker can no longer accept new produce requests.

Common Causes and Fixes:

  1. Incorrect log.retention.hours or log.retention.bytes Configuration:

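    • Diagnosis: Check the broker’s server.properties file for log.retention.hours and log.retention.bytes (log.retention.ms and log.retention.minutes, if set, take precedence over log.retention.hours). If they are commented out, the defaults apply: log.retention.hours=168 (7 days) and log.retention.bytes=-1 (no size limit). A very long time limit combined with no size limit lets logs grow until the disk is exhausted. Also check for topic-level overrides (retention.ms, retention.bytes), which take precedence over the broker settings.
    • Fix: Set log.retention.hours to a reasonable value (e.g., 168 for 7 days) or log.retention.bytes to a size limit (e.g., 1073741824 for 1GB). Note that log.retention.bytes applies per partition, not per topic or per broker. This tells Kafka to automatically delete old log segments.
    • Why it works: These properties define the maximum age or size of a partition’s log. Segments beyond the limit are deleted by the broker’s periodic retention check, which runs every log.retention.check.interval.ms (5 minutes by default). Kafka deletes whole segments, not individual records, so space is freed one segment at a time.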
  2. delete.retention.ms Too High:

    • Diagnosis: This setting only affects compacted topics (cleanup.policy=compact). Check the topic-level delete.retention.ms, or the broker-level default log.cleaner.delete.retention.ms in server.properties. If this value is extremely large, Kafka keeps tombstone records for an extended period, so keys that were deleted long ago continue to occupy disk space.
    • Fix: Set delete.retention.ms to a value that aligns with your retention policy, typically a few hours or days (e.g., 86400000 for 1 day, which is also the default).
    • Why it works: This setting dictates how long Kafka retains tombstones (records with a null value that mark a key as deleted) so that consumers have time to observe the deletion. Until that period expires, the tombstones themselves remain in the log segments.
  3. Log Cleaner Not Running or Configured Incorrectly:

    • Diagnosis: Check broker logs for messages indicating the log cleaner is struggling, has stopped after an error, or is disabled (in the default logging configuration the cleaner writes to log-cleaner.log). Verify log.cleaner.enable=true in server.properties. Also, check log.cleaner.threads (default 1) and log.cleaner.min.cleanable.ratio (default 0.5; the topic-level equivalent is min.cleanable.dirty.ratio). If the ratio is too high, a log might not be considered "dirty" enough to clean.
    • Fix: Ensure log.cleaner.enable is true. Keep log.cleaner.threads at a small number like 1 or 2 if CPU is a concern, or increase it if cleaning is lagging. Lower log.cleaner.min.cleanable.ratio to 0.3 or 0.4 if logs aren’t being cleaned.
    • Why it works: The log cleaner is a pool of background threads inside the broker that reclaims disk space by compacting topics with cleanup.policy=compact, including the internal __consumer_offsets topic. If it’s off or stalled, compacted topics grow without bound. Time- and size-based deletion for topics with cleanup.policy=delete is handled separately by the retention check and does not depend on the cleaner.
  4. Unclean Leader Election (unclean.leader.election.enable=true) with Data Loss:

    • Diagnosis: This is not a direct cause of disk exhaustion, but it tends to surface in the same incident. If unclean.leader.election.enable=true and a partition leader fails (for example, because its disk filled up), a replica without the latest data can be elected leader, and the messages it never received are lost. Check broker logs for unclean leader election events.
    • Fix: Set unclean.leader.election.enable=false in server.properties to prevent this (false has been the default since Kafka 0.11). After fixing disk space, verify that replicas have rejoined the in-sync replica set, and re-assign partitions if needed to ensure data integrity.
    • Why it works: Disabling unclean leader election ensures that only in-sync replicas can become leaders, which prevents data loss at the cost of availability while no in-sync replica is online.
  5. Compacted Topics (cleanup.policy=compact) That Grow Without Bound:

    • Diagnosis: A compacted topic has no time- or size-based retention. Kafka keeps the latest record for every key indefinitely, so the topic grows with the number of distinct keys unless keys are explicitly deleted via tombstone records. In addition, the cleaner never compacts the active segment, so with a large segment.ms or segment.bytes on a low-volume topic, superseded records can sit uncompacted for a long time. Check topic configurations using kafka-topics.sh --describe --topic <topic_name> --bootstrap-server <broker_list>.
    • Fix: Write tombstones for keys that are no longer needed, and keep delete.retention.ms at a reasonable value at the topic level (or log.cleaner.delete.retention.ms at the broker level). To make segments eligible for cleaning sooner, lower segment.ms or segment.bytes at the topic level (the defaults are 7 days and 1GB, i.e. segment.bytes=1073741824). If old data can be discarded regardless of key, set cleanup.policy=compact,delete so that retention.ms and retention.bytes also apply.
    • Why it works: In compaction mode, Kafka retains the latest value for each key. The cleaner removes older versions of a key once a newer record for that key exists in a segment it can clean. A tombstone removes the key entirely, and the tombstone itself is removed after delete.retention.ms has passed. Rolling segments sooner gives the cleaner more to work on, and adding the delete policy puts a hard bound on the topic’s size or age.
  6. Under-provisioned Disk or Too Many Partitions for the Workload:

    • Diagnosis: Monitor disk usage over time. If disk usage consistently grows faster than your retention policies can prune it, your disk is simply too small for the volume of data being produced over the retention window. Partition count matters too: log.retention.bytes applies per partition, and every partition keeps at least its active segment plus index files, so a large number of partitions multiplies the retained footprint. Use df -h on the broker host and Kafka monitoring tools.
    • Fix: Increase the disk size on the broker, or add more brokers to the cluster and reassign partitions to them. Kafka does not support reducing the partition count of an existing topic, so if a topic has far more partitions than its parallelism requires, the only option is to recreate it with fewer partitions and migrate the data. Also review num.partitions, which sets the default for automatically created topics.
    • Why it works: More disk space provides a larger buffer. Fewer partitions mean fewer log and index files and a smaller worst-case footprint under per-partition retention limits.
  7. Manual Log Deletion or Incorrect log.dirs:

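    • Diagnosis: Check whether any manual rm -rf commands were executed on the Kafka log directories (log.dirs in server.properties). If the broker was running at the time, it still holds the deleted files open, so the operating system does not release their space until the broker closes them, and the broker’s view of its logs no longer matches what is on disk. Also, verify that log.dirs points to the correct, intended directories rather than a small root or /tmp volume (the default log.dir is /tmp/kafka-logs).
    • Fix: If manual deletion occurred, you’ll likely need to restart the broker so that it releases the open file handles and reloads its logs from disk. With replication, the broker can typically re-fetch missing data from the current partition leaders; partitions with no surviving replica lose their data, and those topics may need to be re-created. Ensure log.dirs is correctly configured.
    • Why it works: Kafka manages its segment files meticulously. Manual deletion bypasses these mechanisms, leaving the broker in an inconsistent state where it believes segments exist that it can no longer manage. To free space safely, lower the retention settings and let Kafka delete the segments itself.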
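
Several of the causes above come down to a handful of broker settings, so a quick scripted check of server.properties can save time. The following is a minimal sketch in Python, not an official tool: the function names, the 168-hour threshold, and the /tmp check are assumptions made for illustration, and it ignores log.retention.ms, log.retention.minutes, and topic-level overrides.

```python
def parse_properties(text):
    """Parse Java-style key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def audit_broker_config(props, max_hours=168):
    """Return a list of findings for the retention-related settings.

    max_hours is a hypothetical threshold; pick one that fits your workload.
    Missing keys fall back to the documented broker defaults.
    """
    findings = []
    hours = int(props.get("log.retention.hours", "168"))
    size = int(props.get("log.retention.bytes", "-1"))
    if hours > max_hours:
        findings.append(f"log.retention.hours={hours} is above {max_hours}")
    if size < 0:
        findings.append("log.retention.bytes is unset or -1: no per-partition size cap")
    if props.get("log.cleaner.enable", "true").lower() != "true":
        findings.append("log.cleaner.enable is not true: compacted topics are not cleaned")
    # log.dirs falls back to log.dir, whose default is /tmp/kafka-logs.
    dirs = props.get("log.dirs", props.get("log.dir", "/tmp/kafka-logs"))
    for d in dirs.split(","):
        if d.strip().startswith("/tmp"):
            findings.append(f"log directory {d.strip()} is under /tmp")
    return findings


if __name__ == "__main__":
    example = """
    # hypothetical broker settings
    log.retention.hours=720
    log.cleaner.enable=false
    log.dirs=/tmp/kafka-logs
    """
    for finding in audit_broker_config(parse_properties(example)):
        print(finding)
```

A finding is only a prompt to look closer: for example, having no size cap is the default and is fine when the time-based limit is short enough for the disk.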

After resolving the disk space issue, you may encounter LEADER_NOT_AVAILABLE errors for partitions hosted on the affected broker if their leadership was lost during the outage and has not yet been re-established.
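
Once space has been reclaimed, it can also help to confirm that the retention settings now fit the disk. The sketch below estimates a rough worst case for size-based retention on one broker, assuming each hosted partition replica can hold up to log.retention.bytes plus roughly one more segment. The function and parameter names are hypothetical, and the estimate ignores index files and data written between retention checks. It only applies when log.retention.bytes is positive; with -1 there is no size bound to compute.

```python
import shutil

GIB = 1024 ** 3


def worst_case_bytes(partition_replicas, retention_bytes, segment_bytes=GIB):
    """Rough upper bound on retained log data for one broker.

    partition_replicas: number of partition replicas hosted on this broker.
    retention_bytes:    log.retention.bytes (per partition, must be positive).
    segment_bytes:      log.segment.bytes (1 GiB by default).
    """
    return partition_replicas * (retention_bytes + segment_bytes)


def headroom_bytes(path, partition_replicas, retention_bytes, segment_bytes=GIB):
    """Total capacity of the volume holding `path` minus the worst-case estimate.

    A negative result suggests the retention settings cannot fit on the disk.
    """
    total = shutil.disk_usage(path).total
    return total - worst_case_bytes(partition_replicas, retention_bytes, segment_bytes)


if __name__ == "__main__":
    # Hypothetical broker: 200 partition replicas, 1 GiB retention each.
    needed = worst_case_bytes(200, GIB)
    print(f"worst case: {needed / GIB:.0f} GiB")
    print(f"headroom on current volume: {headroom_bytes('.', 200, GIB) / GIB:.1f} GiB")
```

In practice you would pass a directory from log.dirs as the path and leave a safety margin rather than sizing the disk to exactly the worst case.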
