Your application is crashing because the kernel can’t allocate enough memory for its TCP network buffers.

Common Causes and Fixes for TCP Out of Memory Errors

This error usually shows up in dmesg as TCP: out of memory -- consider tuning tcp_mem, and applications may log it as System error: 12 (Cannot allocate memory), which is errno ENOMEM. When the shortage is system-wide, dmesg may also contain OOM killer lines beginning Out of memory: Kill process .... In every case the kernel could not find memory to create or expand socket buffers. This isn’t just about your application’s heap; it’s about the whole system’s ability to buffer data for its network connections.
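
The kernel counts TCP buffer memory in pages and compares it with the three thresholds in net.ipv4.tcp_mem (low, pressure, high). As a minimal sketch, assuming a Linux host that exposes /proc/net/sockstat and /proc/sys/net/ipv4/tcp_mem, the Python below reports where current usage sits relative to those thresholds. The function names are illustrative and not part of any library.

```python
import re


def parse_tcp_mem_pages(sockstat_text):
    """Return the 'mem' value (in pages) from the TCP line of /proc/net/sockstat."""
    match = re.search(r"^TCP:.*\bmem (\d+)", sockstat_text, re.MULTILINE)
    if match is None:
        raise ValueError("no TCP mem field found")
    return int(match.group(1))


def parse_tcp_mem_limits(tcp_mem_text):
    """Return the (low, pressure, high) page thresholds from net.ipv4.tcp_mem."""
    low, pressure, high = (int(value) for value in tcp_mem_text.split())
    return low, pressure, high


def tcp_memory_status(used_pages, limits):
    """Classify current usage against the tcp_mem thresholds."""
    _low, pressure, high = limits
    if used_pages >= high:
        return "over hard limit"
    if used_pages >= pressure:
        return "under pressure"
    return "ok"


if __name__ == "__main__":
    try:
        with open("/proc/net/sockstat") as f:
            used = parse_tcp_mem_pages(f.read())
        with open("/proc/sys/net/ipv4/tcp_mem") as f:
            limits = parse_tcp_mem_limits(f.read())
        print(f"TCP pages in use: {used}, tcp_mem: {limits}, "
              f"status: {tcp_memory_status(used, limits)}")
    except (OSError, ValueError) as exc:
        # Not a Linux host, or the files have an unexpected layout.
        print(f"could not read TCP memory counters: {exc}")
```

If the status is regularly "under pressure" or worse, the sections below are worth working through in order.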

Here are the most common culprits and how to address them:

  1. Excessive Socket Buffer Usage (SO_RCVBUF / SO_SNDBUF)

    • Diagnosis: Check the current limits with sysctl net.core.rmem_max and sysctl net.core.wmem_max. Then inspect individual sockets. For a running process (a Java service, for example), netstat -tnp | grep <pid> | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -n 10 lists the remote IPv4 addresses holding the most connections to it. That is a connection count, not a memory figure, so follow up with ss -tm, which prints a skmem:(...) block per TCP socket: r and t are the bytes currently allocated for receiving and sending, and rb and tb are the receive and send buffer limits. If you see very large values across many connections, this is a prime suspect.
    • Fix: Increase the system-wide maximum receive and send buffer sizes. Edit /etc/sysctl.conf and add or modify these lines:
      net.core.rmem_max = 16777216
      net.core.wmem_max = 16777216
      net.ipv4.tcp_rmem = 4096 87380 16777216
      net.ipv4.tcp_wmem = 4096 65536 16777216
      
      Apply the changes with sysctl -p.
    • Why it works: net.core.rmem_max and net.core.wmem_max cap the buffer size an application can request with SO_RCVBUF and SO_SNDBUF. tcp_rmem and tcp_wmem are min, default, max triples (in bytes) that bound the kernel’s automatic per-socket buffer tuning. Raising these limits lets sockets on fast or high-latency links buffer more data. Keep in mind that they are per-socket limits: a higher maximum also raises the worst-case total across many connections, and that total is capped separately by net.ipv4.tcp_mem, which is measured in pages.
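
To turn the per-socket figures from ss -tm into a system-wide picture, the values can be summed. The sketch below assumes the skmem:(r…,rb…,t…,tb…,…) layout printed by current iproute2 versions of ss; the helper names are made up for illustration.

```python
import re
import subprocess
from collections import namedtuple

SocketMem = namedtuple("SocketMem", "rmem_alloc rcv_buf wmem_alloc snd_buf")

# Matches the first four skmem fields: r (receive bytes allocated),
# rb (receive buffer limit), t (send bytes allocated), tb (send buffer limit).
SKMEM_RE = re.compile(r"skmem:\(r(\d+),rb(\d+),t(\d+),tb(\d+)")


def parse_skmem(ss_output):
    """Extract one SocketMem tuple per skmem block in `ss -tm` output."""
    return [SocketMem(*map(int, m.groups())) for m in SKMEM_RE.finditer(ss_output)]


def summarize(sockets):
    """Total the allocated bytes and the buffer limits across all sockets."""
    return {
        "sockets": len(sockets),
        "allocated_bytes": sum(s.rmem_alloc + s.wmem_alloc for s in sockets),
        "buffer_limit_bytes": sum(s.rcv_buf + s.snd_buf for s in sockets),
    }


if __name__ == "__main__":
    try:
        result = subprocess.run(["ss", "-tm"], capture_output=True,
                                text=True, check=True)
        print(summarize(parse_skmem(result.stdout)))
    except (OSError, subprocess.CalledProcessError) as exc:
        # ss is missing or failed; nothing to report.
        print(f"could not run ss: {exc}")
```

A large gap between allocated_bytes and buffer_limit_bytes suggests the limits are generous but mostly unused. Values close together mean the sockets are actually filling their buffers.
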
  2. Too Many Open Connections (TCP TIME_WAIT State)

    • Diagnosis: Each connection in TIME_WAIT holds a small kernel structure, so a very large number of them adds up. Check with ss -s (look at the timewait count on the TCP line) or netstat -an | grep TIME_WAIT | wc -l. If this number is in the hundreds of thousands or millions, it’s a problem.
    • Fix: Let the kernel reuse TIME_WAIT sockets for new outgoing connections, and shorten how long orphaned sockets linger in FIN-WAIT-2. Add or modify these lines in /etc/sysctl.conf:
      net.ipv4.tcp_tw_reuse = 1
      net.ipv4.tcp_fin_timeout = 30
      
      Apply with sysctl -p.
    • Why it works: tcp_tw_reuse lets the kernel reuse a socket in TIME_WAIT for a new outgoing connection when TCP timestamps show that it is safe. tcp_fin_timeout reduces the time an orphaned socket stays in FIN-WAIT-2; it does not change the TIME_WAIT duration, which is fixed at 60 seconds on Linux. Older guides also set net.ipv4.tcp_tw_recycle = 1, but that option breaks clients behind NAT and was removed in Linux 4.12, so leave it out.
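
To track the TIME_WAIT count over time without shelling out to ss or netstat, the state column of /proc/net/tcp can be tallied directly. This is a rough sketch that assumes the usual /proc/net/tcp layout (state as a two-digit hex code in the fourth column); the function name is hypothetical.

```python
from collections import Counter

# Hex state codes used in /proc/net/tcp and /proc/net/tcp6.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # Fall back to the raw code for states not listed above.
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


if __name__ == "__main__":
    totals = Counter()
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                totals += count_tcp_states(f.read())
        except OSError:
            # File missing (non-Linux host or IPv6 disabled); skip it.
            continue
    print(dict(totals))
```
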
  3. Kernel Memory Leaks or Fragmentation

    • Diagnosis: While less common, a kernel bug or a poorly behaved module could be leaking memory. Monitor system memory usage over time using free -m or top/htop. If you see a steady, unexplained increase in "used" memory that never decreases, and cached memory isn’t growing proportionally, it might indicate a leak. Check dmesg for any unusual kernel messages related to memory allocation failures or OOM killer activity.
    • Fix: In most cases, the fix is to reboot the system. For specific kernel bugs, you’ll need to update the kernel to a patched version. If a specific module is suspected, it may need to be disabled or updated.
    • Why it works: A reboot clears all kernel memory and resets network buffers. Kernel updates fix underlying code defects causing the leak.
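
Watching for a steady increase is easier with numbers sampled at intervals than with occasional glances at free -m. The sketch below parses /proc/meminfo and flags a series that only ever grows. Using SUnreclaim (unreclaimable kernel slab memory) as the leak indicator is an assumption made for this example, and the growth threshold is arbitrary.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> integer value.

    Most values are in kB; a few (such as HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def grows_steadily(samples, min_growth_kb):
    """True if the samples never decrease and total growth is at least min_growth_kb."""
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] - samples[0] >= min_growth_kb


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        # One snapshot; a real monitor would record this periodically
        # and pass the collected series to grows_steadily().
        print("SUnreclaim kB:", info.get("SUnreclaim"),
              "MemAvailable kB:", info.get("MemAvailable"))
    except OSError as exc:
        print(f"could not read /proc/meminfo: {exc}")
```
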
  4. Insufficient System RAM

    • Diagnosis: If free -m consistently shows almost no available memory, with little held in buff/cache that the kernel could reclaim, you simply don’t have enough physical memory for the workload, network buffers included. Repeated OOM killer activity in dmesg is a clear indicator.
    • Fix: Increase the physical RAM on the server or distribute the workload across more servers.
    • Why it works: More RAM provides more memory for the kernel to allocate for all its needs, including network buffers.
  5. Application-Level Memory Issues

    • Diagnosis: While the OOM error points to the kernel, a misbehaving application can indirectly cause it. An application that rapidly opens and closes many connections, or one that has a memory leak within its own processes, can exhaust system resources, forcing the kernel to struggle for memory for its own operations. Use top -o %MEM or htop to identify memory-hungry processes. If a specific application is using a large percentage of RAM and its connection count is very high, it’s a suspect.
    • Fix: Optimize the application to reduce its memory footprint, manage connections more efficiently (e.g., connection pooling), or fix internal memory leaks. This might involve profiling the application with tools like valgrind, jemalloc’s heap profiler, or language-specific profilers.
    • Why it works: By reducing the application’s demand for memory, you free up resources for the kernel to manage its network stack, preventing the OOM condition.
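
Connection pooling, mentioned in the fix above, can be sketched in a few lines. This is a deliberately small illustration and not a production pool: it keeps up to size idle connections for reuse and closes any extras on release, but it does not cap the total number of connections or check that a pooled connection is still alive. The class and its factory argument are hypothetical.

```python
import queue


class ConnectionPool:
    """Keep a bounded set of idle connections so they can be reused."""

    def __init__(self, factory, size):
        self._factory = factory                    # callable returning a new connection
        self._idle = queue.LifoQueue(maxsize=size)  # most recently used first

    def acquire(self):
        """Return an idle connection if one exists, else open a new one."""
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            return self._factory()

    def release(self, conn):
        """Return a connection to the pool, closing it if the pool is full."""
        try:
            self._idle.put_nowait(conn)
        except queue.Full:
            conn.close()
```

A caller would pass something like lambda: socket.create_connection((host, port)) as the factory, then wrap each request in acquire() and release(). Reusing connections this way avoids the churn of short-lived sockets that feeds the TIME_WAIT buildup described earlier.
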
  6. Network Driver Issues

    • Diagnosis: Though rare, a faulty network driver can sometimes lead to excessive memory allocation or leaks within the kernel’s network subsystem. Check dmesg for any network-related error messages or warnings about driver behavior.
    • Fix: Update your network interface card (NIC) drivers to the latest stable version.
    • Why it works: Updated drivers often contain bug fixes that resolve memory management issues.

After addressing these, you might still see Broken pipe errors when an application writes to a socket whose peer has already closed the connection, for example because the OOM killer terminated the process on the other end.
