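Kafka reports out-of-order sequence number errors when sequence numbers do not strictly increase, meaning that messages which should have been handled in order were seen out of sequence. In the Java client this surfaces on an idempotent producer as `OutOfOrderSequenceException`, raised when the broker receives a sequence number other than the one it expected. A consumer that tracks its own application-level sequence numbers can report the same symptom when records are re-delivered or read across partitions.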
Common Causes and Fixes:
- Producer Retries with `enable.idempotence=true`:
  - Diagnosis: Check producer logs for messages like "Idempotent producer retried" or "Producer sent message with sequence number X, but expected Y." In the Java client, the error itself is reported to the producer as `OutOfOrderSequenceException`.
  - Cause: An idempotent producer tags every batch with its producer ID and a per-partition sequence number, and a retry reuses the same sequence number so that the broker can discard duplicates. The error appears when the broker sees a gap instead: for example, a batch whose acknowledgment was lost is abandoned after its retries time out while later batches with higher sequence numbers are already in flight.
  - Fix: Increase `delivery.timeout.ms` on the producer to a value significantly larger than the sum of `request.timeout.ms` and `linger.ms` (the producer rejects values below that sum). For example, if `request.timeout.ms=30000` and `linger.ms=0`, set `delivery.timeout.ms=60000`. This gives the producer more time to get a batch acknowledged before giving up on it.
  - Why it works: A longer `delivery.timeout.ms` leaves room for more retries. If the acknowledgment arrives within the extended timeout, the batch is confirmed and the sequence seen by the broker stays contiguous; only if it never arrives is the send reported as failed. This is a trade-off between latency and reliability.
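
The timeout relationship described above can be sketched as a small helper that builds the producer settings and refuses a `delivery.timeout.ms` below `linger.ms + request.timeout.ms`. The class name `IdempotentProducerConfig` and its `build` method are hypothetical; only the configuration keys are Kafka's own. The sketch uses `java.util.Properties` alone so that it compiles without the Kafka client library.

```java
import java.util.Properties;

/** Sketch: producer settings for idempotent delivery with a generous delivery timeout. */
final class IdempotentProducerConfig {

    /**
     * Builds producer properties (hypothetical helper). Throws if delivery.timeout.ms
     * is smaller than linger.ms + request.timeout.ms, a combination the producer rejects.
     */
    static Properties build(String bootstrapServers, int requestTimeoutMs,
                            int lingerMs, int deliveryTimeoutMs) {
        if (deliveryTimeoutMs < (long) lingerMs + requestTimeoutMs) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms must be >= linger.ms + request.timeout.ms");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("enable.idempotence", "true");
        props.setProperty("acks", "all"); // idempotence requires acks=all
        props.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        props.setProperty("linger.ms", Integer.toString(lingerMs));
        props.setProperty("delivery.timeout.ms", Integer.toString(deliveryTimeoutMs));
        return props;
    }
}
```

With the values from the example, `IdempotentProducerConfig.build("localhost:9092", 30000, 0, 60000)` succeeds. Serializer settings are omitted here; `key.serializer` and `value.serializer` would need to be added before passing the result to `new KafkaProducer<>(props)`.
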
- Consumer Rebalancing and Message Re-delivery:
  - Diagnosis: Look for consumer group rebalance events in consumer logs (e.g., "Rebalancing," "New partition assignment"). The out-of-order errors will often start immediately after a rebalance.
  - Cause: During a consumer rebalance, partitions are reassigned to different consumers in the group. If a consumer crashes or is shut down uncleanly after fetching messages but before committing offsets, those messages are re-delivered to the consumer that takes over the partition. If the original consumer had already processed some of them without committing, the application sees those records a second time, after later ones, which looks like out-of-order delivery.
  - Fix: Ensure consumers are shut down gracefully: commit the offsets of processed records with `commitSync()` and wait for `consumer.close()` to complete before the process exits. For critical applications, consider using `isolation.level=read_committed` on the consumer and ensuring producers are configured with `enable.idempotence=true` and `max.in.flight.requests.per.connection=1`. Note that `read_committed` only changes what is visible for records written by transactional producers, and that idempotence preserves ordering with up to 5 in-flight requests, so `1` is the most conservative setting rather than a requirement.
  - Why it works: Graceful shutdown ensures offsets are committed before partitions are revoked. Idempotence on the producer prevents retries from writing duplicates, and `read_committed` ensures consumers only see committed transactional messages, even if rebalances occur.
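
A minimal sketch of the graceful-shutdown pattern follows. The interface `CommittingConsumer` is a hypothetical stand-in so the example compiles without the Kafka client library; it mirrors the `poll`, `commitSync()` and `close()` operations that the real `KafkaConsumer` provides. The class `GracefulConsumerLoop` and the string record type are likewise assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the parts of a Kafka consumer used in this sketch. */
interface CommittingConsumer {
    List<String> poll();   // fetch the next batch (may be empty)
    void commitSync();     // synchronously commit offsets of the records polled so far
    void close();          // leave the group and release resources
}

final class GracefulConsumerLoop {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final CommittingConsumer consumer;

    GracefulConsumerLoop(CommittingConsumer consumer) {
        this.consumer = consumer;
    }

    /** Asks the loop to stop once the batch currently being handled is committed. */
    void requestShutdown() {
        running.set(false);
    }

    /** Polls, processes and commits batch by batch; always closes the consumer on exit. */
    List<String> run() {
        List<String> processed = new ArrayList<>();
        try {
            while (running.get()) {
                List<String> batch = consumer.poll();
                processed.addAll(batch);      // stand-in for real record processing
                if (!batch.isEmpty()) {
                    consumer.commitSync();    // commit before fetching anything else
                }
            }
        } finally {
            consumer.close();                 // runs even if processing throws
        }
        return processed;
    }
}
```

Because each batch is committed before the next `poll`, a shutdown request never leaves processed-but-uncommitted records behind. With the real client, `requestShutdown()` would typically be called from a JVM shutdown hook together with `consumer.wakeup()` to interrupt a blocking `poll`.
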
- Message Ordering Guarantees with Partitioning:
  - Diagnosis: Verify how messages are being partitioned. If messages that must be ordered are being sent to different partitions, this error can occur if the consumer's processing logic assumes order across partitions.
  - Cause: Kafka guarantees message order only within a partition. If you have multiple consumers processing different partitions of the same topic, or if messages with the same key are not consistently sent to the same partition (e.g., due to producer configuration changes or bugs), order can be lost from the application's perspective.
  - Fix: Ensure that messages requiring strict ordering share the same Kafka message key. Configure the producer to use a partitioning strategy that consistently assigns messages with the same key to the same partition. For example, `producer.send(record, callback)` where `record.key()` is consistently set.
  - Why it works: With a consistent key, Kafka's default partitioner (which hashes the serialized key) or a custom `Partitioner` sends every message with that key to the same partition, preserving order for that logical stream of data. This holds only while the topic's partition count stays the same; `RoundRobinPartitioner`, by contrast, spreads records across partitions regardless of key.
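
To illustrate why a stable key yields a stable partition, here is a simplified sketch of hash-based partition selection. It is not Kafka's implementation: the default partitioner hashes the serialized key with murmur2, whereas this example uses `Arrays.hashCode` on the UTF-8 bytes as a stand-in. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch of key-based partition selection (simplified; not Kafka's murmur2 hashing). */
final class KeyPartitioningSketch {

    /** Maps a key to a partition in the range [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even for negative hash values
        return Math.floorMod(hash, numPartitions);
    }
}
```

Calling `partitionFor("order-42", 6)` repeatedly always returns the same partition, so all records for that key stay in order. Calling it with a different partition count may return a different partition, which is why adding partitions to a topic can break per-key ordering for existing keys.
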
- Broker Configuration `message.downconversion.enable`:
  - Diagnosis: Check broker configuration. This is less common but can manifest as subtle ordering issues.
  - Cause: If `message.downconversion.enable=true` (the default), brokers down-convert messages from newer formats to older ones for consumers whose fetch requests only support an older message format. This process can sometimes lead to minor inconsistencies or delays that might, in rare edge cases, contribute to perceived out-of-order delivery if not handled perfectly by the client.
  - Fix: Set `message.downconversion.enable=false` on all Kafka brokers. Be aware that older consumers which depend on down-conversion will then receive an `UNSUPPORTED_VERSION` error, so upgrade those clients first.
  - Why it works: Disabling down-conversion ensures that brokers pass messages along in their original format, eliminating any potential for introduced inconsistencies during format conversion.
- Network Latency and Jitter Between Broker and Consumer:
  - Diagnosis: Monitor network performance between your Kafka brokers and your consumer instances. High latency or significant packet loss can cause fetch requests to be delayed or to time out.
  - Cause: Consumers fetch messages in batches, sequentially by offset, over a TCP connection, so the network cannot by itself reorder records within a partition. What poor network conditions do cause is slow fetches, request timeouts and missed heartbeats, which can trigger rebalances and re-delivery that the application then perceives as out-of-order messages.
  - Fix: Improve network connectivity. This could involve ensuring consumers are in the same availability zone/region as brokers, increasing network bandwidth, or troubleshooting any network devices in between. Consider increasing `fetch.max.wait.ms` on the consumer to allow brokers to batch more messages before responding, potentially smoothing out minor network hiccups.
  - Why it works: Better network stability reduces the likelihood of delayed fetches and timeouts, and with them spurious rebalances. A higher `fetch.max.wait.ms` lets the broker wait longer for `fetch.min.bytes` to accumulate, producing fewer and larger fetch responses.
- Consumer `max.poll.records` Too High:
  - Diagnosis: Examine the consumer's `poll()` loop. If `max.poll.records` is set very high (e.g., 1000 or more) and the consumer takes a long time to process these records before calling `poll()` again, it can lead to out-of-order issues, especially during rebalances.
  - Cause: A large batch of records fetched by `poll()` might take a significant amount of time to process. If the gap between two `poll()` calls exceeds `max.poll.interval.ms`, the consumer is considered failed and the group rebalances, assigning its partitions to another consumer. That consumer resumes from the last committed offset and re-reads records the original consumer is still working through, leading to duplicates and perceived out-of-order delivery.
  - Fix: Reduce `max.poll.records` to a smaller value, such as 100 or 500. Ensure that the time taken to process the records returned by one `poll()` is comfortably less than the `max.poll.interval.ms` configured for the consumer.
  - Why it works: A smaller `max.poll.records` means the consumer processes and commits offsets for smaller batches, making it more responsive to rebalances and reducing the window where it holds uncommitted data. If a rebalance does occur, the new consumer is less likely to receive duplicate or out-of-order messages caused by the previous consumer's long processing time.
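
The sizing rule above can be expressed as simple arithmetic. The sketch below derives a `max.poll.records` value from an assumed per-record processing time, so that a full batch fits within a fraction of `max.poll.interval.ms`. The class `PollBudget`, its methods and the 50% headroom are hypothetical choices made for illustration; only the two configuration keys are Kafka's.

```java
import java.util.Properties;

/** Sketch: size max.poll.records so one batch fits inside max.poll.interval.ms. */
final class PollBudget {

    /**
     * Returns the largest batch size whose processing time stays within
     * headroom * maxPollIntervalMs, but never less than 1.
     */
    static int maxPollRecords(long perRecordMillis, long maxPollIntervalMs, double headroom) {
        if (perRecordMillis <= 0 || maxPollIntervalMs <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("arguments must be positive and headroom <= 1");
        }
        long budgetMs = (long) (maxPollIntervalMs * headroom);
        long records = budgetMs / perRecordMillis;
        return (int) Math.max(1, Math.min(records, Integer.MAX_VALUE));
    }

    /** Builds the two related consumer settings, using half the interval as budget. */
    static Properties consumerProps(long perRecordMillis, long maxPollIntervalMs) {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", Long.toString(maxPollIntervalMs));
        props.setProperty("max.poll.records",
            Integer.toString(maxPollRecords(perRecordMillis, maxPollIntervalMs, 0.5)));
        return props;
    }
}
```

For example, with 200 ms of processing per record and a `max.poll.interval.ms` of 300000, half the interval is 150000 ms, which allows at most 750 records per poll. Measured processing times vary, so the per-record figure should come from observation rather than a guess.
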
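The next error you'll likely encounter after fixing these sequence number issues is a `CommitFailedException`. It is typically raised when a consumer tries to commit offsets after the group has already rebalanced and it no longer owns the partition, for example because processing took longer than `max.poll.interval.ms`.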