A RequestTimedOutException in Kafka means a broker did not respond to a request from a client (producer, consumer, or another broker) within the configured timeout period. It usually points to network issues or overloaded brokers rather than to a Kafka configuration problem.
Common Causes and Fixes:
1. Network Latency/Packet Loss Between Client and Broker
- Diagnosis: Use ping and traceroute (or mtr) from the client machine to the broker’s IP address. Look for high latency (consistently above 50 ms), jitter, or packet loss.

      ping <broker_ip>
      mtr <broker_ip>

- Fix: Address underlying network infrastructure problems. This might involve:
  - Tuning TCP Keepalive Timers: On the client and broker OS, set tcp_keepalive_time and tcp_keepalive_intvl so that idle connections are probed regularly. The Linux default for tcp_keepalive_time is 7200 seconds, so the value below makes probing start sooner. Keepalive probes let the OS detect dead peers and stop stateful firewalls or NAT devices from dropping idle connections. They do not re-establish a broken connection; that is left to the Kafka client’s reconnect logic.

        # On Linux:
        sudo sysctl -w net.ipv4.tcp_keepalive_time=1800
        sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60

  - Optimizing Network Hardware: Work with your network team to ensure switches, routers, and firewalls are not overloaded or misconfigured.
- Why it works: Kafka relies on stable TCP connections. High latency or packet loss can cause the broker to appear unresponsive, triggering the timeout. Adjusting keepalives helps maintain connection integrity over less-than-perfect networks.
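The latency check above can also be run from the JVM that hosts the client, which helps when ICMP is blocked and ping gives no answer. The following is a minimal sketch using only the Java standard library; the class name BrokerConnectProbe, the method name, and the default host and port are illustrative assumptions, not part of Kafka.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: times how long a plain TCP connect to a broker takes.
final class BrokerConnectProbe {

    /** Returns the milliseconds taken to open (and then close) a TCP connection. */
    static long connectMillis(String host, int port, int timeoutMs) throws IOException {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            // Throws SocketTimeoutException if no connection is made within timeoutMs.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Assumed defaults; pass the real broker host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9092;
        for (int i = 0; i < 5; i++) {
            try {
                long ms = connectMillis(host, port, 3000);
                System.out.printf("connect to %s:%d took %d ms%n", host, port, ms);
            } catch (IOException e) {
                System.out.printf("connect to %s:%d failed: %s%n", host, port, e);
            }
        }
    }
}
```

This only measures the TCP handshake, not a Kafka request round trip, so treat the numbers as a lower bound on what the client sees.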
2. Broker Under-Replication or Leader Not Available
- Diagnosis: Check broker health and topic replication status.

      # From the Kafka bin directory
      ./kafka-topics.sh --bootstrap-server <broker_address> --describe --topic <your_topic_name>

  Look for partitions with an "Isr" count lower than the "Replicas" count, or a leader marked as None.
- Fix:
  - Increase Replication Factor: If topics have too few replicas to keep a leader available when a broker fails, increase their replication factor. For an existing topic this is done by reassigning partitions.

        # Example for adding a replica to an existing topic.
        # First, create a JSON file for partition reassignment:
        # {
        #   "version": 1,
        #   "partitions": [
        #     {"topic": "your_topic_name", "partition": 0, "replicas": [0, 1, 2]},
        #     {"topic": "your_topic_name", "partition": 1, "replicas": [0, 1, 2]}
        #   ]
        # }
        # Then run:
        ./kafka-reassign-partitions.sh --bootstrap-server <broker_address> --execute --reassignment-json-file <path_to_json_file>

  - Address Broker Failures: Ensure all brokers are running and healthy. If a broker is down, investigate why and restart it.
- Why it works: If a partition leader is unavailable or its in-sync replicas (ISRs) are lagging, Kafka cannot guarantee consistency. Producers waiting for acknowledgments (acks=all) or consumers trying to fetch data from a non-existent leader will time out.
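Scanning the describe output by eye gets tedious on topics with many partitions. As a rough sketch, the check described above (ISR count lower than replica count, or no leader) could be automated as follows. The class name DescribeOutputCheck is hypothetical, and the line layout it parses ("Topic: ... Partition: ... Leader: ... Replicas: ... Isr: ...") is an assumption about the tool’s output that may differ between Kafka versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flags leaderless or under-replicated partitions
// in text captured from `kafka-topics.sh --describe`.
final class DescribeOutputCheck {

    // Assumed per-partition line shape:
    //   Topic: t  Partition: 0  Leader: 1  Replicas: 0,1,2  Isr: 1,2
    private static final Pattern PARTITION_LINE = Pattern.compile(
            "Topic:\\s*(\\S+)\\s+Partition:\\s*(\\d+)\\s+Leader:\\s*(\\S+)"
            + "\\s+Replicas:\\s*([\\d,]+)\\s+Isr:\\s*([\\d,]*)");

    static List<String> findProblems(String describeOutput) {
        List<String> problems = new ArrayList<>();
        for (String line : describeOutput.split("\\R")) {
            Matcher m = PARTITION_LINE.matcher(line);
            if (!m.find()) {
                continue; // topic summary line or unrelated output
            }
            String partition = m.group(1) + "-" + m.group(2);
            String leader = m.group(3);
            int replicas = m.group(4).split(",").length;
            int isr = m.group(5).isEmpty() ? 0 : m.group(5).split(",").length;
            // Leaderless partitions are assumed to print "none" or -1.
            if (leader.equalsIgnoreCase("none") || leader.equals("-1")) {
                problems.add(partition + ": no leader");
            }
            if (isr < replicas) {
                problems.add(partition + ": ISR " + isr + " of " + replicas + " replicas");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Made-up sample text for illustration, not real cluster output.
        String sample =
                "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
                + "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2\n";
        findProblems(sample).forEach(System.out::println);
    }
}
```

For a quick check without any parsing, kafka-topics.sh also accepts --under-replicated-partitions together with --describe.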
3. Broker Overload (High CPU, Memory, Disk I/O)
- Diagnosis: Monitor broker resource utilization.

      # On the broker machine:
      top -n 1 -b      # Check CPU and memory
      iostat -xz 1 5   # Check disk I/O

  Kafka’s JVM heap usage can also be monitored. Look for high GC activity or sustained high CPU.
- Fix:
  - Scale Broker Resources: Increase CPU, RAM, or disk speed (e.g., use SSDs).
  - Tune Kafka Configurations:
    - num.io.threads: Increase if disk I/O is the bottleneck.
    - num.network.threads: Increase if network is saturated.
    - message.max.bytes: Reduce if very large messages are causing excessive processing.
    - replica.fetch.max.bytes: Reduce if follower fetches are too large.
  - Add More Brokers: Distribute the load across more machines.
- Why it works: An overloaded broker cannot process incoming requests fast enough. Requests wait in the broker’s request queue for longer than the client’s timeout, so the client gives up before a response arrives.
4. Insufficient request.timeout.ms on Client
- Diagnosis: Review the client (producer/consumer) configuration.

      // Example producer config
      import java.util.Properties;

      Properties props = new Properties();
      props.put("bootstrap.servers", "localhost:9092");
      props.put("request.timeout.ms", "30000"); // Default is 30000 (30 seconds)
      props.put("acks", "1");

  If request.timeout.ms is set too low (e.g., 5 seconds) and network conditions or broker load cause slightly longer delays, timeouts will occur.
- Fix: Increase request.timeout.ms in your producer or consumer configuration.

      props.put("request.timeout.ms", "60000"); // e.g., 60 seconds

- Why it works: This directly increases the window of time the client will wait for a response from the broker before declaring the request failed. It doesn’t fix the underlying issue but provides more tolerance.
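When raising request.timeout.ms on a producer, keep in mind that the producer client also expects delivery.timeout.ms to be at least linger.ms + request.timeout.ms and rejects configurations that violate this. The sketch below checks that relationship on a plain Properties object before it is handed to a producer. The class name ProducerTimeoutCheck is hypothetical, and the default values in it are assumptions that should be verified against the documentation for your client version.

```java
import java.util.Properties;

// Hypothetical pre-flight check for producer timeout settings.
final class ProducerTimeoutCheck {

    // Assumed defaults; they can differ between Kafka client versions.
    static final long DEFAULT_REQUEST_TIMEOUT_MS = 30_000;
    static final long DEFAULT_DELIVERY_TIMEOUT_MS = 120_000;
    static final long DEFAULT_LINGER_MS = 0;

    private static long get(Properties props, String key, long defaultValue) {
        return Long.parseLong(props.getProperty(key, Long.toString(defaultValue)));
    }

    /** True if delivery.timeout.ms >= linger.ms + request.timeout.ms. */
    static boolean isConsistent(Properties props) {
        long request = get(props, "request.timeout.ms", DEFAULT_REQUEST_TIMEOUT_MS);
        long delivery = get(props, "delivery.timeout.ms", DEFAULT_DELIVERY_TIMEOUT_MS);
        long linger = get(props, "linger.ms", DEFAULT_LINGER_MS);
        return delivery >= linger + request;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("request.timeout.ms", "60000");
        System.out.println("timeouts consistent: " + isConsistent(props));
    }
}
```

If the check fails, raise delivery.timeout.ms alongside request.timeout.ms rather than only one of the two.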
5. Broker Firewall Blocking or Restricting Traffic
- Diagnosis: Check firewall rules on both client and broker machines, and any network firewalls in between. Ensure the Kafka ports are open: 9092 for clients and, if ZooKeeper is used, 2181 for ZooKeeper client connections from the brokers and 2888/3888 between ZooKeeper nodes.

      # On the broker machine, check iptables:
      sudo iptables -L -n | grep <client_ip>

- Fix: Open the necessary ports in the firewall.

      # Example to allow traffic on port 9092 from a specific IP.
      # -A appends to the end of the chain; use -I instead if an earlier rule already drops this traffic.
      sudo iptables -A INPUT -p tcp --dport 9092 -s <client_ip> -j ACCEPT

- Why it works: Firewalls can silently drop packets or reject connections. A silent drop manifests as a timeout on the client side because no response is ever received.
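The difference between a silent drop and an active rejection can be observed from the client side. The following sketch, using only the Java standard library, classifies a connection attempt as open, refused, or timed out. The class and enum names are hypothetical, and the mapping from exception type to firewall behaviour is a heuristic rather than a guarantee.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper: classifies the outcome of a TCP connect attempt.
final class PortCheck {

    enum Result { OPEN, REFUSED, TIMED_OUT, OTHER_ERROR }

    static Result check(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return Result.OPEN;
        } catch (SocketTimeoutException e) {
            // No answer at all within timeoutMs: typical of a firewall DROP rule.
            return Result.TIMED_OUT;
        } catch (ConnectException e) {
            // Usually an active rejection: closed port or a firewall REJECT rule.
            return Result.REFUSED;
        } catch (IOException e) {
            // For example an unresolvable host name or no route to the host.
            return Result.OTHER_ERROR;
        }
    }

    public static void main(String[] args) {
        // Assumed defaults; pass the real broker host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9092;
        System.out.println(host + ":" + port + " -> " + check(host, port, 3000));
    }
}
```

A TIMED_OUT result from the client machine combined with OPEN from the broker machine itself is a strong hint that something in between is dropping traffic.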
6. ZooKeeper Issues (if applicable)
- Diagnosis: If your Kafka cluster uses ZooKeeper, check ZooKeeper ensemble health.

      # On a ZooKeeper node:
      echo "stat" | nc localhost 2181   # Check if ZooKeeper is running
      # On ZooKeeper 3.5+, stat must be enabled via 4lw.commands.whitelist; srvr is allowed by default.
      # Check ZooKeeper logs for errors.

  Look for high latency in ZooKeeper operations or ZooKeeper nodes being down.
- Fix: Ensure the ZooKeeper ensemble is healthy, has sufficient resources, and is properly configured. Restart ZooKeeper nodes if necessary.
- Why it works: Kafka brokers rely on ZooKeeper for metadata management (broker registration, topic configuration, leader election). If ZooKeeper is slow or unavailable, brokers may fail to elect leaders or respond to requests, indirectly causing client timeouts.
If you’ve addressed all these, the next error you might see is a LeaderNotAvailableException if a partition leader is truly gone, or a NetworkException if the underlying TCP connection is fundamentally broken.