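Kafka throws OutOfOrderSequenceException when a broker receives a batch from an idempotent or transactional producer whose sequence number does not directly follow the last one the broker recorded for that producer ID and partition. The exception is raised on the producer side, not the consumer side, and it signals that records between producer and broker may have been lost or reordered. Duplicate or reordered records seen by a consumer are a related but separate symptom; several of the items below deal with that case.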

Common Causes and Fixes:

  1. Producer Retries with enable.idempotence=true:

    • Diagnosis: Check producer logs for org.apache.kafka.common.errors.OutOfOrderSequenceException, usually surfaced through the send() callback or the returned Future. The broker rejects the batch before writing it, so this exception never reaches consumers. RecordTooLargeException is an unrelated size error and not a symptom of this problem. The broker log may also record the incoming and the expected sequence number for the affected producer ID and partition.
    • Cause: With idempotence enabled, the producer tags each batch with its producer ID, epoch, and a per-partition sequence number. When an acknowledgment is lost and the producer retries, it resends the batch with the same sequence number, and the broker recognizes the duplicate and discards it. Ordinary retries therefore do not cause this error. It appears when the broker sees a gap: an earlier batch expired on the producer (delivery.timeout.ms elapsed) while later batches were still sent, or the broker no longer holds the producer’s state, for example after an unclean leader election or after retention removed the producer’s last records.
    • Fix: Keep delivery.timeout.ms at or above request.timeout.ms + linger.ms (the client rejects smaller values) and large enough to cover a broker failover. For example, if request.timeout.ms=30000 and linger.ms=0, delivery.timeout.ms=60000 is a workable minimum, and the default of 120000 leaves more room for retries. Recent client versions recover from many of these cases on their own; on older clients the exception is fatal for that producer instance, so close it and create a new one.
    • Why it works: A longer delivery.timeout.ms gives in-flight batches more time to be acknowledged, or retried with their original sequence numbers, before the producer gives up on them. Fewer expired batches means fewer gaps in the sequence the broker sees. The trade-off is that a send that truly fails takes longer to surface as an error.
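
To make the sequence check concrete, here is a minimal Python sketch of the bookkeeping a broker does for idempotent producers. It is a toy model under simplifying assumptions, not Kafka's implementation: one integer sequence per batch, no limit on how far back duplicates are remembered, and the names PartitionSequenceTracker and OutOfOrderSequence are hypothetical.

```python
from dataclasses import dataclass, field


class OutOfOrderSequence(Exception):
    """Toy stand-in for Kafka's OutOfOrderSequenceException."""


@dataclass
class PartitionSequenceTracker:
    """Simplified per-partition state for idempotent producers (illustrative only)."""

    # (producer_id, epoch) -> last sequence number accepted
    last_seq: dict = field(default_factory=dict)
    # Payloads actually written to the partition, in order
    log: list = field(default_factory=list)

    def append(self, producer_id: int, epoch: int, seq: int, payload: str) -> str:
        key = (producer_id, epoch)
        last = self.last_seq.get(key, -1)
        if seq <= last:
            # Retry of a batch that was already written: acknowledge it, do not write it again.
            return "duplicate"
        if seq != last + 1:
            # Gap: an earlier batch never arrived, so ordering can no longer be guaranteed.
            raise OutOfOrderSequence(f"expected sequence {last + 1}, got {seq}")
        self.last_seq[key] = seq
        self.log.append(payload)
        return "appended"


if __name__ == "__main__":
    tracker = PartitionSequenceTracker()
    print(tracker.append(7, 0, 0, "a"))  # first batch is written
    print(tracker.append(7, 0, 0, "a"))  # retry after a lost ack is deduplicated
    print(tracker.append(7, 0, 1, "b"))  # next batch in sequence is written
    try:
        tracker.append(7, 0, 3, "d")     # sequence 2 never arrived
    except OutOfOrderSequence as err:
        print("rejected:", err)
```

The retry with the same sequence number is harmless in this model; only the skipped sequence number is rejected, which mirrors the gap case described above.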
  2. Consumer Rebalancing and Message Re-delivery:

    • Diagnosis: Look for consumer group rebalance events in consumer logs (e.g., "Revoking previously assigned partitions" or "(Re-)joining group"). Duplicate or apparently out-of-order processing on the consumer side often starts immediately after a rebalance. This is a consumer-side symptom and does not raise OutOfOrderSequenceException.
    • Cause: During a consumer rebalance, partitions are reassigned to different consumers in the group. If a consumer crashes or is shut down uncleanly after fetching messages but before committing offsets, the new owner of the partition resumes from the last committed offset. Any messages the original consumer had processed but not yet committed are delivered again, so the application sees them a second time, after later messages it has already handled.
    • Fix: Ensure consumers are shut down gracefully: commit the offsets of fully processed records with commitSync(), then wait for consumer.close() to complete before the process exits. Commit in ConsumerRebalanceListener.onPartitionsRevoked() as well, so that offsets are saved before a partition moves. For critical applications that use transactional producers, set isolation.level=read_committed on the consumer. On the producer, enable.idempotence=true preserves ordering with max.in.flight.requests.per.connection up to 5, so setting it to 1 is optional.
    • Why it works: Graceful shutdown ensures offsets are committed before partitions are revoked, which shrinks the set of records that can be re-delivered. Idempotence prevents producer retries from writing duplicates, and read_committed ensures consumers only see records from committed transactions, even if rebalances occur. Neither setting removes consumer-side re-delivery entirely, so processing should still tolerate duplicates.
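
The shutdown order described above (process, commit, then close) can be sketched as follows. This is a hedged illustration in plain Python, not real client code: FakeConsumer and its poll(), commit_sync(), and close() methods are hypothetical stand-ins that only record calls, so the loop can be run without a broker.

```python
import threading


class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer client; it only records calls."""

    def __init__(self, batches):
        self._batches = list(batches)
        self.calls = []

    def poll(self):
        self.calls.append("poll")
        return self._batches.pop(0) if self._batches else []

    def commit_sync(self):
        self.calls.append("commit")

    def close(self):
        self.calls.append("close")


def run(consumer, stop: threading.Event, handle) -> None:
    """Poll until `stop` is set, committing after each fully processed batch."""
    try:
        while not stop.is_set():
            records = consumer.poll()
            for record in records:
                handle(record)
            if records:
                # Commit only after every record in the batch has been handled.
                consumer.commit_sync()
    finally:
        # Runs on normal exit and on exceptions, so the consumer is always closed.
        consumer.close()


if __name__ == "__main__":
    stop = threading.Event()

    def handle(record):
        print("processed", record)
        if record == "c":
            stop.set()  # in a real service a SIGTERM handler would set this flag

    consumer = FakeConsumer([["a", "b"], ["c"], ["d"]])
    run(consumer, stop, handle)
    print(consumer.calls)
```

The stop flag is checked only between batches, so a batch that has started is always finished and committed before close() runs; the uncommitted window is at most one batch.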
  3. Message Ordering Guarantees with Partitioning:

    • Diagnosis: Verify how messages are being partitioned. If messages that must be ordered are being sent to different partitions, the application will see them out of order whenever the consumer’s processing logic assumes order across partitions.
    • Cause: Kafka guarantees message order only within a partition. If you have multiple consumers processing different partitions of the same topic, or if messages with the same key are not consistently sent to the same partition (e.g., due to a change in partition count, producer configuration changes, or bugs), order can be lost from the application’s perspective.
    • Fix: Ensure that messages requiring strict ordering share the same Kafka message key. Configure the producer to use a partitioning strategy that consistently assigns messages with the same key to the same partition. For example, producer.send(record, callback) where record.key() is consistently set. Avoid adding partitions to a topic whose consumers depend on per-key ordering, because the key-to-partition mapping changes when the partition count changes.
    • Why it works: For records with a non-null key, Kafka’s default partitioner hashes the key and takes the result modulo the number of partitions, so messages with that key always go to the same partition as long as the partition count is unchanged. A custom Partitioner can give the same guarantee. RoundRobinPartitioner, by contrast, ignores keys and spreads records across partitions, so it does not preserve per-key order.
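
As an illustration of key-based routing, the sketch below hashes each key to pick a partition. It assumes zlib.crc32 as a stand-in hash (Kafka's Java client uses murmur2, so the actual partition numbers will differ), and partition_for and route are hypothetical helper names.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing. crc32 is a stand-in, not Kafka's murmur2."""
    return zlib.crc32(key) % num_partitions


def route(events, num_partitions):
    """Group (key, value) events by partition, keeping arrival order within each partition."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


if __name__ == "__main__":
    events = [
        (b"order-1", "created"),
        (b"order-2", "created"),
        (b"order-1", "paid"),
        (b"order-1", "shipped"),
    ]
    for partition, records in route(events, 6).items():
        if records:
            print(partition, records)
```

Every event for order-1 lands in one partition in the order it was sent. Calling route with a different num_partitions can move a key to another partition, which is why changing the partition count of a live topic can break per-key ordering.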
  4. Broker Configuration: message.downconversion.enable:

    • Diagnosis: Check broker configuration and the client versions of your consumers. This is an uncommon cause and is worth checking only after the items above have been ruled out.
    • Cause: If message.downconversion.enable=true (the default), brokers convert messages to an older message format when a consumer running an older client version requests it. The conversion adds CPU and memory load on the broker and delays fetch responses. It does not change offsets or sequence numbers, so at most it contributes indirectly, through slower fetches and the timeouts that follow, to perceived ordering problems.
    • Fix: Upgrade old consumers so that no down-conversion is needed. You can then set message.downconversion.enable=false on all Kafka brokers. Note that with this setting, consumers that still require an older format receive an unsupported-version error instead of data.
    • Why it works: Disabling down-conversion ensures that brokers pass messages along in their original format, removing the conversion step and its overhead from the fetch path.
  5. Network Latency and Jitter Between Broker and Consumer:

    • Diagnosis: Monitor network performance between your Kafka brokers and your consumer instances. High latency or significant packet loss can delay fetch requests and heartbeats.
    • Cause: A consumer fetches each partition sequentially by offset over a TCP connection, so network jitter by itself does not reorder records within a partition. What poor network conditions do cause is timeouts: missed heartbeats trigger rebalances, and the re-delivery that follows (see item 2) looks like out-of-order delivery to the application. The same conditions between producer and broker cause request timeouts and retries (see item 1).
    • Fix: Improve network connectivity. This could involve ensuring consumers are in the same availability zone/region as brokers, increasing network bandwidth, or troubleshooting any network devices in between. If brief network hiccups are unavoidable, consider a moderately higher session.timeout.ms so that they do not trigger rebalances. fetch.max.wait.ms only controls how long the broker waits to fill fetch.min.bytes; it affects latency and throughput, not ordering.
    • Why it works: Better network stability reduces the number of timeouts, and with them the rebalances and retries that lead to duplicate or apparently reordered records.
  6. Consumer max.poll.records Too High:

    • Diagnosis: Examine the consumer’s poll() loop. If max.poll.records is set very high (e.g., 1000 or more; the default is 500) and the consumer takes a long time to process these records before calling poll() again, it can lead to duplicate and apparently out-of-order processing, especially during rebalances.
    • Cause: A large batch of records fetched by poll() might take a significant amount of time to process. If the gap between two poll() calls exceeds max.poll.interval.ms, the consumer is considered failed and its partitions are assigned to another consumer, which resumes from the last committed offset. The original consumer keeps working through its batch, so both consumers process the same records, and its eventual commit is rejected.
    • Fix: Reduce max.poll.records to a smaller value, such as 100. Ensure that the time taken to process the records returned by one poll() stays well below max.poll.interval.ms (default 300000), or raise max.poll.interval.ms if individual records are slow to process. session.timeout.ms governs heartbeats, which are sent from a background thread, and does not bound processing time.
    • Why it works: A smaller max.poll.records means the consumer processes and commits offsets for smaller batches, making it more responsive to rebalances and reducing the window where it holds uncommitted data. This ensures that if a rebalance occurs, the new consumer is less likely to receive duplicate or out-of-order messages due to the previous consumer’s long processing time.
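
A quick way to size max.poll.records is to divide a fraction of max.poll.interval.ms by the measured per-record processing time. The sketch below shows that arithmetic; the function name and the 50% safety factor are assumptions for illustration, not Kafka defaults.

```python
def max_safe_poll_records(
    max_poll_interval_ms: int,
    per_record_ms: float,
    safety_factor: float = 0.5,
) -> int:
    """Largest batch expected to finish within a fraction of max.poll.interval.ms."""
    if per_record_ms <= 0:
        raise ValueError("per_record_ms must be positive")
    # Spend only part of the interval so slow records do not push the loop over the limit.
    budget_ms = max_poll_interval_ms * safety_factor
    return max(1, int(budget_ms // per_record_ms))


if __name__ == "__main__":
    # With max.poll.interval.ms=300000 and about 1 second of work per record:
    print(max_safe_poll_records(300_000, 1_000))
    # With faster records (about 200 ms each) a larger batch fits:
    print(max_safe_poll_records(300_000, 200))
```

Use the slowest realistic per-record time rather than the average, because a single slow batch is enough to exceed the interval and trigger a rebalance.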

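The next error you’ll likely encounter after fixing these sequence number issues is a CommitFailedException. It is typically raised when a consumer tries to commit offsets after it has been removed from the group, most often because processing between two poll() calls exceeded max.poll.interval.ms and a rebalance reassigned its partitions.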
