Fix Kafka RequestTimedOutException Errors (2026)

A RequestTimedOutException in Kafka means a client (producer, consumer, or another broker) did not receive a broker's response to a request within the configured timeout period. It usually points to network issues or overloaded brokers rather than a Kafka configuration problem. (In the Java client the same condition typically surfaces as org.apache.kafka.common.errors.TimeoutException; the protocol-level error is REQUEST_TIMED_OUT.)

Common Causes and Fixes:

1. Network Latency/Packet Loss Between Client and Broker

Diagnosis: Use ping and traceroute (or mtr) from the client machine to the broker’s IP address. Look for high latency (>50ms consistently), jitter, or packet loss.
```
ping <broker_ip>
mtr <broker_ip>
```
Fix: Address underlying network infrastructure problems. This might involve:
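- Tuning TCP Keepalive Timers: On the client/server OS, set tcp_keepalive_time and tcp_keepalive_intvl so that idle connections are probed more often than the Linux defaults (7200 s before the first probe, 75 s between probes).
```
# On Linux (not persistent across reboots; add to /etc/sysctl.conf to keep):
sudo sysctl -w net.ipv4.tcp_keepalive_time=1800
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60
```
  The OS then probes idle connections periodically. The probe traffic keeps NAT devices and stateful firewalls from expiring the connection for inactivity, and a dead peer is detected sooner so the client can reconnect. Keepalive does not repair a broken connection by itself.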
- Optimizing Network Hardware: Work with your network team to ensure switches, routers, and firewalls are not overloaded or misconfigured.
Why it works: Kafka relies on stable, long-lived TCP connections. High latency or packet loss can make a healthy broker appear unresponsive and trigger the timeout. Keepalive probes stop idle connections from being silently dropped by middleboxes and surface dead ones earlier on less-than-perfect networks.
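ping and mtr measure ICMP, which some networks filter or deprioritize, so it can help to time a TCP connect to the broker port itself. The following is a minimal sketch using only the JDK; the class name BrokerConnectProbe, the default host and port, and the sample count are illustrative assumptions, not part of Kafka.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class BrokerConnectProbe {

    /** Returns the TCP connect time in milliseconds, or -1 if the connect failed or timed out. */
    static long connectMillis(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            return -1;
        }
    }

    /** Takes several samples and returns {min, max, failures}; min is -1 if every attempt failed. */
    static long[] sample(String host, int port, int timeoutMs, int attempts) {
        long min = Long.MAX_VALUE;
        long max = -1;
        long failures = 0;
        for (int i = 0; i < attempts; i++) {
            long ms = connectMillis(host, port, timeoutMs);
            if (ms < 0) {
                failures++;
                continue;
            }
            min = Math.min(min, ms);
            max = Math.max(max, ms);
        }
        if (failures == attempts) {
            min = -1;
        }
        return new long[] {min, max, failures};
    }

    public static void main(String[] args) {
        // Hypothetical defaults; pass the real broker host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9092;
        long[] r = sample(host, port, 3000, 5);
        System.out.printf("min=%dms max=%dms failures=%d/5%n", r[0], r[1], r[2]);
    }
}
```

A large gap between min and max suggests jitter, and any failures against a broker that is known to be up suggest packet loss or filtering on the path.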

2. Broker Under-Replication or Leader Not Available

Diagnosis: Check broker health and topic replication status.
```
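# From the Kafka bin directory
./kafka-topics.sh --bootstrap-server <broker_address> --describe --topic <your_topic_name>
# Or list only under-replicated partitions across all topics:
./kafka-topics.sh --bootstrap-server <broker_address> --describe --under-replicated-partitions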
```
Look for partitions whose "Isr" list is shorter than their "Replicas" list, or whose "Leader" is shown as none.
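On a cluster with many partitions, the same check can be scripted. The sketch below scans captured --describe output for partitions with no leader or a short ISR. It assumes each partition line carries Topic, Partition, Leader, Replicas, and Isr fields in that order, which may differ between Kafka versions; the class name DescribeOutputCheck and the sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DescribeOutputCheck {

    // Assumed layout of one partition line; extra trailing fields are ignored.
    private static final Pattern PARTITION_LINE = Pattern.compile(
        "Topic:\\s*(\\S+)\\s+Partition:\\s*(\\d+)\\s+Leader:\\s*(\\S+)"
            + "\\s+Replicas:\\s*([\\d,]*)\\s+Isr:\\s*([\\d,]*)");

    /** Counts the broker ids in a comma-separated list; an empty list counts as 0. */
    static int count(String csv) {
        return csv.isEmpty() ? 0 : csv.split(",").length;
    }

    /** Returns one message per problem found; lines that are not partition lines are skipped. */
    static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (String line : lines) {
            Matcher m = PARTITION_LINE.matcher(line);
            if (!m.find()) {
                continue; // topic summary line or unrelated output
            }
            String partition = m.group(1) + "-" + m.group(2);
            String leader = m.group(3);
            if (leader.equalsIgnoreCase("none") || leader.equals("-1")) {
                problems.add(partition + ": no leader");
            }
            if (count(m.group(5)) < count(m.group(4))) {
                problems.add(partition + ": under-replicated");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines standing in for real kafka-topics.sh output.
        List<String> sample = List.of(
            "\tTopic: orders\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3",
            "\tTopic: orders\tPartition: 1\tLeader: 2\tReplicas: 2,3,1\tIsr: 2");
        findProblems(sample).forEach(System.out::println);
    }
}
```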

Fix:
- Increase Replication Factor: If a topic has too few replicas to tolerate a broker failure, raise its replication factor. An existing topic has no replication.factor setting to edit; the change is made by reassigning its partitions to a longer replica list.
```
# Step 1: write the target assignment to a JSON file (broker ids 0, 1, 2 are examples):
# {
#   "version": 1,
#   "partitions": [
#     {"topic": "your_topic_name", "partition": 0, "replicas": [0, 1, 2]},
#     {"topic": "your_topic_name", "partition": 1, "replicas": [0, 1, 2]}
#   ]
# }
# Step 2: apply it:
./kafka-reassign-partitions.sh --bootstrap-server <broker_address> --execute --reassignment-json-file <path_to_json_file>
# Step 3: confirm that the reassignment has completed:
./kafka-reassign-partitions.sh --bootstrap-server <broker_address> --verify --reassignment-json-file <path_to_json_file>
```
- Address Broker Failures: Ensure all brokers are running and healthy. If a broker is down, investigate why and restart it.

Why it works: With acks=all, the leader holds a produce request until every in-sync replica (ISR) has the data, so a follower that is lagging but still in the ISR can delay the acknowledgment past the timeout. A partition with no leader cannot serve requests at all, so producers and consumers addressing it will time out or fail until a leader is elected.

3. Broker Overload (High CPU, Memory, Disk I/O)

Diagnosis: Monitor broker resource utilization.
```
# On the broker machine:
top -n 1 -b # Check CPU and Memory
iostat -xz 1 5 # Check Disk I/O
```
Kafka’s JVM heap usage can also be monitored. Look for high GC activity or sustained high CPU.
Fix:
- Scale Broker Resources: Increase CPU, RAM, or disk speed (e.g., use SSDs).
- Tune Kafka Configurations:
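  - num.io.threads: Increase if the request handler threads are the bottleneck (they process requests, including disk reads and writes). More threads do not help a disk that is already saturated.
  - num.network.threads: Increase if the network threads are saturated, for example when the NetworkProcessorAvgIdlePercent metric stays close to 0.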
  - message.max.bytes: Reduce if very large messages are causing excessive processing.
  - replica.fetch.max.bytes: Reduce if follower fetches are too large.
- Add More Brokers: Distribute the load across more machines.
Why it works: An overloaded broker cannot process incoming requests fast enough. Requests wait in the broker's request queue for longer than the client is willing to wait, which the client reports as a timeout.

4. Insufficient request.timeout.ms on Client

Diagnosis: Review the client (producer/consumer) configuration.

```
// Example Producer Config
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("request.timeout.ms", "30000"); // Default is 30000 (30 seconds)
props.put("acks", "1");
```

If request.timeout.ms is set too low (e.g., 5 seconds) and network conditions or broker load cause slightly longer delays, timeouts will occur.

Fix: Increase the request.timeout.ms in your producer or consumer configuration.
```
props.put("request.timeout.ms", "60000"); // e.g., 60 seconds
```
Why it works: This directly increases the window of time the client will wait for a response from the broker before declaring the request failed. It doesn’t fix the underlying issue but provides more tolerance.
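On a producer, request.timeout.ms interacts with delivery.timeout.ms: the Java client requires delivery.timeout.ms to be at least linger.ms plus request.timeout.ms, so raising one may mean raising the other. A minimal sketch of building the configuration with that check up front, assuming nothing beyond java.util.Properties; the helper name ProducerTimeouts and the values in main are illustrative.

```java
import java.util.Properties;

class ProducerTimeouts {

    /**
     * Builds producer properties and rejects combinations that violate
     * delivery.timeout.ms >= linger.ms + request.timeout.ms.
     */
    static Properties build(String bootstrapServers, int requestTimeoutMs,
                            int lingerMs, int deliveryTimeoutMs) {
        long minimumDelivery = (long) lingerMs + requestTimeoutMs;
        if (deliveryTimeoutMs < minimumDelivery) {
            throw new IllegalArgumentException(
                "delivery.timeout.ms (" + deliveryTimeoutMs + ") must be >= linger.ms + "
                    + "request.timeout.ms (" + minimumDelivery + ")");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        props.setProperty("linger.ms", Integer.toString(lingerMs));
        props.setProperty("delivery.timeout.ms", Integer.toString(deliveryTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Example values only: 60 s per request, 120 s overall delivery budget.
        Properties props = build("localhost:9092", 60_000, 5, 120_000);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting Properties object can be passed to a producer constructor in the usual way; failing early here avoids a configuration error at client startup.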

5. Broker Firewall Blocking or Restricting Traffic

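Diagnosis: Check firewall rules on both client and broker machines, and any network firewalls in between. Ensure the Kafka listener port (9092 by default) is open to clients and, if ZooKeeper is used, that brokers can reach its client port (2181) and ZooKeeper nodes can reach each other on the quorum and election ports (2888/3888).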
```
# On broker machine, check iptables:
sudo iptables -L -n | grep <client_ip>
```

Fix: Open the necessary ports in the firewall.

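```
# Example to allow traffic on port 9092 from a specific IP.
# -I inserts at the top of the chain; a rule appended with -A would be ignored if an earlier rule already drops the traffic.
sudo iptables -I INPUT -p tcp --dport 9092 -s <client_ip> -j ACCEPT
```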

Why it works: Firewalls can silently drop packets or reject connections, which manifests as a timeout on the client side because no response is ever received.
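The difference between a dropped and a rejected connection is visible from the client. A minimal sketch, assuming only the JDK, that classifies a TCP connect attempt to the broker port; the class name PortCheck and the defaults in main are hypothetical, and the mapping from result to cause is a rule of thumb rather than a guarantee.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

class PortCheck {

    enum Result { OPEN, REFUSED, TIMED_OUT, OTHER_ERROR }

    static Result check(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return Result.OPEN;
        } catch (SocketTimeoutException e) {
            // No reply within timeoutMs: typical of packets being silently dropped.
            return Result.TIMED_OUT;
        } catch (ConnectException e) {
            // Usually an active rejection: nothing is listening, or a firewall rejects the connection.
            return Result.REFUSED;
        } catch (IOException e) {
            // For example an unresolvable host name or no route to the host.
            return Result.OTHER_ERROR;
        }
    }

    public static void main(String[] args) {
        // Hypothetical defaults; pass the real broker host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9092;
        System.out.println(host + ":" + port + " -> " + check(host, port, 3000));
    }
}
```

TIMED_OUT against a broker that is known to be running is the case most worth raising with whoever manages the firewalls on the path.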

6. ZooKeeper Issues (if applicable)

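Diagnosis: If your Kafka cluster uses ZooKeeper, check ZooKeeper ensemble health. Clusters running in KRaft mode, which includes every cluster on Kafka 4.0 or later, have no ZooKeeper and this section does not apply to them.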
```
# On a ZooKeeper node:
echo "srvr" | nc localhost 2181 # Check if ZooKeeper is responding ("stat" works only if listed in 4lw.commands.whitelist on ZooKeeper 3.5+)
# Check ZooKeeper logs for errors.
```
Look for high latency in ZooKeeper operations or ZooKeeper nodes being down.
Fix: Ensure the ZooKeeper ensemble is healthy, has sufficient resources, and is properly configured. Restart ZooKeeper nodes if necessary.
Why it works: Kafka brokers rely on ZooKeeper for metadata management (broker registration, topic configuration, leader election). If ZooKeeper is slow or unavailable, brokers may fail to elect leaders or respond to requests, indirectly causing client timeouts.
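The nc check above can also be done from Java, which is convenient inside an existing health-check job. A minimal sketch, assuming the srvr four-letter command is enabled and that its response contains a line starting with "Mode:"; the class name ZkFourLetter and the connection defaults are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class ZkFourLetter {

    /** Sends a four-letter command and returns the raw text response. */
    static String send(String host, int port, String command, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            socket.getOutputStream().write(command.getBytes(StandardCharsets.US_ASCII));
            socket.getOutputStream().flush();
            socket.shutdownOutput(); // signal that the command is complete
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Extracts the value of the "Mode:" line, or "unknown" if the response has none. */
    static String mode(String srvrResponse) {
        for (String line : srvrResponse.split("\\R")) {
            if (line.startsWith("Mode:")) {
                return line.substring("Mode:".length()).trim();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        try {
            // Hypothetical defaults: a local ZooKeeper node on the standard client port.
            String response = send("localhost", 2181, "srvr", 3000);
            System.out.println("mode=" + mode(response));
        } catch (IOException e) {
            System.out.println("ZooKeeper did not respond: " + e.getMessage());
        }
    }
}
```

A result of "unknown" usually means the node answered with something other than server statistics, for example a message that the command is not whitelisted.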

If you have addressed all of these, the next error you might see is a LeaderNotAvailableException if a partition leader is truly gone, or a NetworkException if the underlying TCP connection is fundamentally broken.