The Kafka broker failed to find a log segment file that a follower replica requested, indicating a data inconsistency between the leader and follower.
Common Causes and Fixes
1. Log Segment File Physically Missing on Follower Broker
- Diagnosis: On the follower broker, check the Kafka logs (`server.log`) for `ERROR` messages like `Found unexpected end of file` or `Log segment ... not found`. Then navigate to the topic’s log directory on the follower and verify that the specific segment file is absent. The path will be something like `/path/to/kafka/logs/<topic_name>-<partition_id>/<segment_file_name>`. The segment file name is the segment's base offset written as a 20-digit, zero-padded decimal number with a `.log` extension (for example, `00000000000000000000.log`).
- Fix:
- Stop the follower broker.
- Identify the leader broker for the affected partition.
- Copy the missing segment file(s) from the leader broker’s log directory to the follower broker’s corresponding directory. Ensure file permissions are identical.
- Start the follower broker.
- Why it works: This directly replaces the missing data, allowing the follower to catch up by reading the replicated log.
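To work out which files are actually missing, it can help to compare directory listings from the leader and the follower. The following is a minimal sketch, not a Kafka tool: the function names are hypothetical, and the only Kafka-specific assumption is the naming convention of a 20-digit, zero-padded decimal base offset followed by `.log`.

```python
def segment_filename(base_offset: int) -> str:
    """Return the .log file name for a segment that starts at base_offset."""
    return f"{base_offset:020d}.log"


def missing_segments(leader_files, follower_files):
    """Return .log segment names present on the leader but absent on the follower.

    Names are zero-padded, so a plain string sort is also a sort by base offset.
    """
    leader = {name for name in leader_files if name.endswith(".log")}
    follower = {name for name in follower_files if name.endswith(".log")}
    return sorted(leader - follower)
```

The two listings could come from `os.listdir()` run against the same partition directory on each broker; index files (`.index`, `.timeindex`) are deliberately ignored here.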
2. Corrupted or Incomplete Log Segment File on Follower Broker
- Diagnosis: As in #1, look for `ERROR` messages in `server.log` that point to specific segment files. You might see `Corrupt file` errors or `java.io.IOException: Premature EOF`. On disk, the file size might be unexpectedly small or zero.
- Fix:
- Stop the follower broker.
- Delete the corrupted segment file(s) from the follower’s log directory.
- Restart the follower broker. Kafka will request these segments from the leader as part of its recovery process.
- Why it works: By removing the bad data, the follower is forced to re-replicate it from the leader, ensuring data integrity.
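The "unexpectedly small or zero" check can be scripted. This is a rough sketch under stated assumptions: `suspect_segments` and its `min_bytes` threshold are hypothetical names, and file size alone cannot prove corruption. The segment with the highest base offset is usually the active one and may legitimately be small.

```python
import os


def suspect_segments(partition_dir: str, min_bytes: int = 1):
    """Return (file name, size) pairs for .log segments smaller than min_bytes."""
    suspects = []
    for name in sorted(os.listdir(partition_dir)):
        if not name.endswith(".log"):
            continue  # skip index files and other non-segment files
        size = os.path.getsize(os.path.join(partition_dir, name))
        if size < min_bytes:
            suspects.append((name, size))
    return suspects
```

Treat the result as a list of candidates to inspect alongside the errors in `server.log`, not as a list of files that are safe to delete.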
3. Leader Broker Experiencing Disk I/O Issues or Network Latency
- Diagnosis: Check the leader broker’s `server.log` for I/O errors, disk-full warnings, or network connection issues (`Connection refused`, `SocketTimeoutException`). Monitor disk I/O metrics (e.g., with `iostat`) and network latency.
- Fix:
- Address underlying disk issues: Free up disk space, replace failing drives, or optimize I/O.
- Address network issues: Improve network connectivity between brokers, reduce packet loss, or increase network bandwidth.
- Restart the affected broker(s) if the issues are severe and cannot be immediately resolved.
- Why it works: Ensures the leader can reliably serve read requests and write data to its own disks, allowing it to fulfill follower requests.
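A quick first pass over the leader can be sketched as below. Both helper names are hypothetical, the search patterns are simply the two strings mentioned above, and the usage figure is for whatever filesystem holds the given path.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # guard against pseudo-filesystems that report no capacity
    return 100.0 * usage.used / usage.total


def find_network_errors(log_lines,
                        patterns=("Connection refused", "SocketTimeoutException")):
    """Return the log lines that contain any of the given substrings."""
    return [line.rstrip("\n") for line in log_lines
            if any(p in line for p in patterns)]
```

For example, `find_network_errors(open("server.log"))` would scan a log file line by line; pointing `disk_usage_percent` at each directory listed in `log.dirs` shows how close each volume is to full.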
4. Incorrect log.segment.bytes or log.roll.ms Configuration
- Diagnosis: If segments roll too aggressively or not aggressively enough, segments may be expected before they exist or may be cleaned up at the wrong time. Check `server.properties` for `log.segment.bytes` (default 1 GiB) and the time-based roll setting, `log.roll.ms` or `log.roll.hours` (default 168 hours, i.e. 7 days). The topic-level equivalents are `segment.bytes` and `segment.ms`; there is no broker-level `log.segment.ms`.
- Fix:
- Adjust `log.segment.bytes` and `log.roll.ms` to values appropriate for your retention policies and disk usage patterns. For example, with very high throughput you might need smaller segments to avoid very large files; with low throughput, larger segments might be more efficient.
- Restart all brokers after changing these configurations.
- Why it works: Proper segment sizing and rollover timing prevent premature cleanup or unexpected gaps in the log.
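To see which values a broker would actually use, a small script can read `server.properties` and fall back to the defaults. This is a sketch with hypothetical function names and a deliberately simple `key=value` parser (no line continuations or escapes). It assumes the broker-level time-based settings `log.roll.ms` and `log.roll.hours`, with `log.roll.ms` taking precedence when both are set.

```python
DEFAULT_SEGMENT_BYTES = 1024 ** 3  # 1 GiB
DEFAULT_ROLL_HOURS = 168           # 7 days


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def segment_settings(props: dict):
    """Return (segment size in bytes, roll interval in ms) with defaults applied."""
    segment_bytes = int(props.get("log.segment.bytes", DEFAULT_SEGMENT_BYTES))
    if "log.roll.ms" in props:
        roll_ms = int(props["log.roll.ms"])
    else:
        roll_ms = int(props.get("log.roll.hours", DEFAULT_ROLL_HOURS)) * 3_600_000
    return segment_bytes, roll_ms
```

Running this against each broker's file makes it easy to spot a broker whose segment settings differ from the rest of the cluster.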
5. Incorrect log.dirs Configuration on a Broker
- Diagnosis: A typo or incorrect path in `log.dirs` in `server.properties` can cause a broker to look for log segments in the wrong location, leading to "not found" errors even if the data exists elsewhere on disk.
- Fix:
- Verify `log.dirs` in `server.properties` on all brokers.
- Correct any incorrect paths.
- Restart the affected broker(s).
- Why it works: Ensures each broker is pointing to the correct physical location where its partition data is stored.
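The verification step can be sketched as a small check run on each broker host. `check_log_dirs` is a hypothetical helper; it takes the raw comma-separated value of `log.dirs` and tests each path as the current user, so it should be run as the account the broker runs under.

```python
import os


def check_log_dirs(log_dirs_value: str) -> dict:
    """Map each path in a comma-separated log.dirs value to a short status string."""
    results = {}
    for path in (p.strip() for p in log_dirs_value.split(",")):
        if not path:
            continue  # tolerate stray commas
        if not os.path.isdir(path):
            results[path] = "missing or not a directory"
        elif not os.access(path, os.R_OK | os.W_OK):
            results[path] = "not readable/writable by this user"
        else:
            results[path] = "ok"
    return results
```

Any status other than `ok` points at a path worth comparing against where the partition directories really live on disk.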
6. ZooKeeper Session Expiration or Network Partition
- Diagnosis: While not a direct log segment error, ZooKeeper issues can cause brokers to lose track of partition leadership and replica states. Check `server.log` for `ZooKeeper session expired` or `ConnectionLossException` errors. Verify ZooKeeper cluster health.
- Fix:
- Ensure ZooKeeper is healthy and accessible from all Kafka brokers.
- Restart Kafka brokers if they were disconnected from ZooKeeper.
- If ZooKeeper is overloaded, scale it up or optimize its configuration.
- Why it works: A stable ZooKeeper connection is crucial for Kafka to maintain accurate metadata about partitions and replicas, preventing inconsistencies.
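One basic reachability probe from a broker host is ZooKeeper's `ruok` four-letter command, which a healthy server answers with `imok`. The sketch below uses a hypothetical helper name and assumes the command is permitted on the server; newer ZooKeeper versions only answer four-letter commands listed in `4lw.commands.whitelist`, so a `False` result can also mean the command is disabled.

```python
import socket


def zk_ruok(host: str, port: int = 2181, timeout: float = 3.0) -> bool:
    """Send 'ruok' to a ZooKeeper server and return True if it replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # The server closes the connection after replying, which ends the loop.
            while chunk := sock.recv(64):
                reply += chunk
    except OSError:
        return False  # refused, unreachable, or timed out
    return reply.strip() == b"imok"
```

Running this from every broker against every ZooKeeper node helps separate a ZooKeeper outage from a network partition that affects only some brokers.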
7. Log Cleaner Thread Issues or Misconfiguration
- Diagnosis: If log cleaning is enabled (`log.cleanup.policy` set to `compact` or `delete`) and the cleaner thread is malfunctioning or misconfigured, it might delete segments prematurely or incorrectly. Check `server.log` for errors related to the log cleaner.
- Fix:
- Review `log.cleanup.policy`, `log.cleaner.threads`, `log.cleaner.min.compaction.lag.ms`, and `log.cleaner.max.compaction.lag.ms` in `server.properties`.
- If the `delete` policy is used and retention is too short, increase `log.retention.hours` or `log.retention.bytes`.
- If the `compact` policy is used, ensure `log.cleaner.threads` is sufficient and the dirty ratio (`log.cleaner.min.cleanable.ratio` at the broker level, `min.cleanable.dirty.ratio` per topic) is appropriately set.
- Restart brokers after configuration changes.
- Why it works: Ensures the log cleaner operates correctly according to retention and compaction policies, preventing accidental deletion of needed segments.
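When judging whether retention is "too short", keep in mind that the broker has three time-based retention settings and the most specific one wins: `log.retention.ms`, then `log.retention.minutes`, then `log.retention.hours` (default 168). The helper below is a hypothetical sketch that applies that precedence to an already-parsed dictionary of broker properties; topic-level overrides such as `retention.ms` are not considered.

```python
def effective_retention_ms(props: dict) -> int:
    """Return the broker's time-based retention in milliseconds.

    A log.retention.ms value of -1 means no time limit and is returned as-is.
    """
    if "log.retention.ms" in props:
        return int(props["log.retention.ms"])
    if "log.retention.minutes" in props:
        return int(props["log.retention.minutes"]) * 60_000
    return int(props.get("log.retention.hours", 168)) * 3_600_000
```

If a forgotten `log.retention.ms` entry is present, raising `log.retention.hours` has no effect, which this check makes visible.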
The next error you’ll likely encounter is a LEADER_NOT_AVAILABLE error for the affected partition, as the broker might not be able to establish leadership due to ongoing replication issues.