Kafka brokers need to talk to clients and other brokers, and they do this using threads. The broker manages two main types of threads for this communication: network threads and I/O threads. Tuning these can dramatically improve throughput and reduce latency.
Let’s see a broker in action, handling requests. Imagine a client sending a ProduceRequest. The broker’s network threads pick this up, deserialize it, and hand it off to a request queue. Then, I/O threads take the request from the queue, perform the actual disk write or data retrieval, and send the response back via the network threads.
The problem Kafka solves here is efficiently managing concurrent network connections and disk operations. Without dedicated threads, a single slow disk operation could block all network activity, bringing the broker to a standstill. By separating network I/O from disk I/O, Kafka can keep accepting new requests while existing ones are being processed.
The configuration parameters are num.network.threads and num.io.threads. The former controls how many threads listen on the network ports, accepting incoming connections and dispatching requests. The latter controls how many threads actually perform the heavy lifting of reading from or writing to disk.
Here’s a typical setup in server.properties:
num.network.threads=3
num.io.threads=8
The surprising truth is that num.io.threads should almost always be higher than num.network.threads. This is because disk operations are inherently slower than network operations. You want enough I/O threads to keep the disks busy without starving the network threads that are waiting to receive new requests and send back responses.
Consider a scenario with high client traffic. Many clients are sending messages. The network threads (num.network.threads) are responsible for accepting all these incoming connections and requests. If there aren’t enough network threads, they become a bottleneck. Requests might queue up at the network layer, leading to increased latency even if the disks are free.
Once a request is accepted by a network thread, it’s placed into a request queue. The I/O threads (num.io.threads) pick up these requests from the queue. For a ProduceRequest, an I/O thread will write the data to the relevant log segments on disk. For a FetchRequest, it will read data from disk and prepare it to be sent back. If there aren’t enough I/O threads, these threads become the bottleneck, and disk operations will back up, again increasing latency and potentially causing network threads to wait for I/O threads to free up.
A common recommendation is to set num.network.threads to 3 for most workloads. This is usually sufficient to handle concurrent connections and basic request parsing. For num.io.threads, the optimal number is more workload-dependent. A good starting point is often 2 to 4 times the number of CPU cores on the broker, especially if you have fast disks (SSDs). For example, on a server with 16 CPU cores, you might start with num.io.threads=32. This allows Kafka to saturate your disk I/O capabilities.
If you are seeing high latency on produce or fetch requests, and your broker’s CPU is not maxed out, it’s a strong indicator that your I/O threads are the bottleneck. You can monitor this by looking at the KafkaRequestMetrics for Produce and Fetch requests, specifically the RequestQueueTimeMs and LocalTimeMs. High RequestQueueTimeMs suggests a network thread bottleneck, while high LocalTimeMs points to an I/O thread bottleneck.
The ratio of num.io.threads to num.network.threads is critical. If num.network.threads is too high relative to num.io.threads, you’ll have network threads waiting for I/O threads, and your system won’t be able to process requests as fast as they arrive. Conversely, if num.network.threads is too low, requests will back up at the network layer even if your I/O threads are idle.
The actual disk throughput of your broker is the ultimate limit. If you have very slow disks, increasing num.io.threads beyond a certain point will yield diminishing returns and might even cause contention on the disk. Conversely, with very fast NVMe SSDs, you might be able to effectively utilize a large number of I/O threads.
When you adjust these settings, remember to restart the Kafka broker for them to take effect. It’s best to tune these parameters incrementally and monitor performance metrics after each change. A common mistake is to set num.io.threads too low, leading to I/O starvation.
The next thing you’ll likely tune after network and I/O threads is the number of Kafka controller threads, which manage cluster metadata and leader elections.