Your application is crashing because the kernel can’t allocate enough memory for its TCP network buffers.
Common Causes and Fixes for TCP Out of Memory Errors
This error, often seen as System error: 12 (Cannot allocate memory) in application logs or Out of memory: Kill process ... (memory used: X.XGB, process memory: X.XGB, process oom_score_adj: X) in dmesg, means the kernel could not find enough RAM to create or expand network buffers. Error 12 is ENOMEM, the kernel’s generic "cannot allocate memory" code. This isn’t just about your application’s heap: socket buffers live in kernel memory, so the state of the whole system decides whether network connections can be served.
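To check whether TCP buffer memory specifically is near its ceiling, one option is to compare the page count the kernel reports on the TCP line of /proc/net/sockstat with the three limits (low, pressure, high, all in pages) in /proc/sys/net/ipv4/tcp_mem. The sketch below assumes a Linux system with those files; the function name `tcp_mem_usage` and its optional file arguments are illustrative, not part of any standard tool.

```shell
# Sketch: report TCP buffer pages in use against the tcp_mem limits.
# tcp_mem_usage is a hypothetical helper name for this example.
#   $1: sockstat-format file (default: /proc/net/sockstat)
#   $2: file with the three tcp_mem values (default: /proc/sys/net/ipv4/tcp_mem)
tcp_mem_usage() {
    sockstat=${1:-/proc/net/sockstat}
    limits=${2:-/proc/sys/net/ipv4/tcp_mem}
    awk '
        # First file: "low pressure high", all counted in pages.
        NR == FNR { low = $1; pressure = $2; high = $3; next }
        # Second file: the TCP line ends with "mem <pages>".
        $1 == "TCP:" {
            for (i = 2; i < NF; i++)
                if ($i == "mem") used = $(i + 1)
        }
        END {
            printf "used=%d pressure=%d high=%d pages\n", used, pressure, high
            if (used >= pressure)
                print "WARNING: TCP is at or above its memory pressure threshold"
        }
    ' "$limits" "$sockstat"
}
```

Calling `tcp_mem_usage` with no arguments reads the live system files. A `used` value that stays close to `high` suggests that TCP buffers, rather than general RAM exhaustion, are the limiting factor.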
Here are the most common culprits and how to address them:
- Excessive Socket Buffer Usage (SO_RCVBUF / SO_SNDBUF)
  - Diagnosis: Check the current limits with `sysctl net.core.rmem_max` and `sysctl net.core.wmem_max`. Then inspect individual application socket usage. For a running Java process, you might use `netstat -tunp | grep <pid> | awk '{print $4}' | awk -F: '{print $NF}' | sort | uniq -c | sort -nr | head -n 10` to see which local ports carry the most connections, and then `ss -tm` to get per-socket memory stats, looking at the `rb` (receive buffer) and `tb` (send buffer) values inside the `skmem:(...)` field. If you see very large values for many connections, this is a prime suspect.
  - Fix: Raise the system-wide maximum receive and send buffer sizes. Edit `/etc/sysctl.conf` and add or modify these lines:

        net.core.rmem_max = 16777216
        net.core.wmem_max = 16777216
        net.ipv4.tcp_rmem = 4096 87380 16777216
        net.ipv4.tcp_wmem = 4096 65536 16777216

    Apply the changes with `sysctl -p`.
  - Why it works: `net.core.rmem_max` and `net.core.wmem_max` cap the buffer size an application can request through SO_RCVBUF and SO_SNDBUF. `tcp_rmem` and `tcp_wmem` are min, default, max triples (in bytes) that let TCP dynamically adjust buffer sizes within these bounds. Raising the limits gives individual connections more room to buffer data. Keep in mind that larger per-socket buffers also raise total demand; the overall ceiling for TCP buffer memory is `net.ipv4.tcp_mem`, which is measured in pages.
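Reading `ss -tm` output by eye gets tedious with many connections. As a rough aid, the per-socket buffer sizes can be totalled with a small awk filter. This is a sketch that assumes the usual `skmem:(r..,rb..,t..,tb..,...)` layout printed by iproute2's `ss`; the helper name `sum_skmem` is made up for this example.

```shell
# Sketch: total the rb (receive buffer) and tb (send buffer) sizes that
# `ss -tm` prints in its skmem:(...) field. Reads ss output on stdin.
# sum_skmem is a hypothetical helper name.
sum_skmem() {
    awk '
        {
            start = index($0, "skmem:(")
            if (start == 0) next                    # line has no memory info
            n = split(substr($0, start + 7), field, ",")
            for (i = 1; i <= n; i++) {
                if (field[i] ~ /^rb[0-9]/)      rb += substr(field[i], 3)
                else if (field[i] ~ /^tb[0-9]/) tb += substr(field[i], 3)
            }
            sockets++
        }
        END { printf "sockets=%d rb_bytes=%.0f tb_bytes=%.0f\n", sockets, rb, tb }
    '
}
```

Typical use would be `ss -tm | sum_skmem`. The totals are the configured buffer sizes, an upper bound on what those sockets may hold, not the memory currently in use.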
- Too Many Open Connections (TCP TIME_WAIT State)
  - Diagnosis: A high number of connections stuck in `TIME_WAIT` can consume memory for their associated kernel structures. Check with `ss -s` (look for the `timewait` count) or `netstat -an | grep TIME_WAIT | wc -l`. If this number is in the hundreds of thousands or millions, it’s a problem.
  - Fix: Tune TCP to reuse `TIME_WAIT` sockets more aggressively and to release orphaned `FIN-WAIT-2` sockets sooner. Add or modify these lines in `/etc/sysctl.conf`:

        net.ipv4.tcp_tw_reuse = 1
        net.ipv4.tcp_fin_timeout = 30
        # Kernels older than 4.12 only, and never behind NAT:
        # net.ipv4.tcp_tw_recycle = 1

    Apply with `sysctl -p`.
  - Why it works: `tcp_tw_reuse` lets the kernel reuse a socket in `TIME_WAIT` for a new outgoing connection when TCP timestamps show that doing so is safe. `tcp_tw_recycle` sped up `TIME_WAIT` expiry, but it broke clients behind NAT and was removed in Linux 4.12, so leave it out on current kernels. `tcp_fin_timeout` reduces the time an orphaned socket stays in `FIN-WAIT-2`; it does not change the length of `TIME_WAIT` itself.
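A per-state breakdown makes it easier to see whether `TIME-WAIT` really dominates before changing any sysctl. The following is a minimal sketch that assumes the default column layout of `ss -tan` (state in the first column, one header row); `count_tcp_states` is a hypothetical helper name.

```shell
# Sketch: count TCP sockets per state from `ss -tan` output on stdin.
# count_tcp_states is a hypothetical helper name.
count_tcp_states() {
    awk '
        NR > 1 { count[$1]++ }      # skip the header row; column 1 is the state
        END { for (s in count) print count[s], s }
    ' | sort -rn
}
```

Run it as `ss -tan | count_tcp_states`; the most common state is listed first.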
- Kernel Memory Leaks or Fragmentation
  - Diagnosis: While less common, a kernel bug or a poorly behaved module could be leaking memory. Monitor system memory usage over time using `free -m` or `top`/`htop`. If you see a steady, unexplained increase in "used" memory that never decreases, and `cached` memory isn’t growing proportionally, it might indicate a leak. Check `dmesg` for any unusual kernel messages related to memory allocation failures or OOM killer activity.
  - Fix: In most cases, the fix is to reboot the system. For specific kernel bugs, you’ll need to update the kernel to a patched version. If a specific module is suspected, it may need to be disabled or updated.
  - Why it works: A reboot clears all kernel memory and resets network buffers. Kernel updates fix underlying code defects causing the leak.
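`free` does not separate kernel allocations from everything else. One way to watch the kernel's own memory is to track the slab counters in /proc/meminfo over time. The sketch below assumes the standard `Slab`, `SReclaimable` and `SUnreclaim` fields; `slab_usage` is an illustrative name.

```shell
# Sketch: print kernel slab usage from a meminfo-format file.
# slab_usage is a hypothetical helper; $1 defaults to /proc/meminfo.
slab_usage() {
    awk '
        $1 == "Slab:" || $1 == "SReclaimable:" || $1 == "SUnreclaim:" {
            printf "%s %d kB\n", $1, $2
        }
    ' "${1:-/proc/meminfo}"
}
```

Logging the output every minute or so (for example from a loop with `sleep 60`) and comparing samples is usually enough. An `SUnreclaim` value that keeps climbing while the workload is steady is one hint that kernel-side allocations are not being freed, though it is not proof of a leak on its own.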
- Insufficient System RAM
  - Diagnosis: If your system’s memory use (`free -m`) is consistently close to the total, and `cached` memory is also low, you simply don’t have enough physical memory to handle the workload, including network buffers. Repeated OOM killer activity in `dmesg` is a clear indicator.
  - Fix: Increase the physical RAM on the server or distribute the workload across more servers.
  - Why it works: More RAM provides more memory for the kernel to allocate for all its needs, including network buffers.
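For a single number to alert on, the share of memory the kernel considers available can be computed from /proc/meminfo. This sketch assumes the `MemAvailable` field is present (it is on reasonably recent Linux kernels); `mem_available_pct` is a hypothetical helper name.

```shell
# Sketch: print MemAvailable as a percentage of MemTotal.
# mem_available_pct is a hypothetical helper; $1 defaults to /proc/meminfo.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            # Print nothing if MemTotal was not found.
            if (total > 0) printf "%.1f\n", 100 * avail / total
        }
    ' "${1:-/proc/meminfo}"
}
```

A value that stays in the low single digits under normal load points toward a capacity problem rather than a tuning problem.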
- Application-Level Memory Issues
  - Diagnosis: While the OOM error points to the kernel, a misbehaving application can indirectly cause it. An application that rapidly opens and closes many connections, or one that has a memory leak within its own processes, can exhaust system resources, forcing the kernel to struggle for memory for its own operations. Use `top -o %MEM` or `htop` to identify memory-hungry processes. If a specific application is using a large percentage of RAM and its connection count is very high, it’s a suspect.
  - Fix: Optimize the application to reduce its memory footprint, manage connections more efficiently (e.g., connection pooling), or fix internal memory leaks. This might involve profiling the application with tools like `valgrind`, `jemalloc`, or language-specific profilers.
  - Why it works: By reducing the application’s demand for memory, you free up resources for the kernel to manage its network stack, preventing the OOM condition.
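To pair memory usage with connection counts, the sockets owned by each process can be tallied from `ss -tanp`. The sketch below assumes the `users:(("name",pid=N,fd=M))` annotation that `ss -p` prints, and it attributes each socket to the first PID listed on the line; `conns_per_pid` is a made-up helper name.

```shell
# Sketch: count TCP sockets per owning PID from `ss -tanp` output on stdin.
# conns_per_pid is a hypothetical helper name. Sockets shared by several
# processes are counted once, under the first PID on the line.
conns_per_pid() {
    awk '
        {
            start = index($0, "pid=")
            if (start == 0) next            # header, or owner not visible
            pid = substr($0, start + 4)
            sub(/[^0-9].*/, "", pid)        # keep only the leading digits
            count[pid]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -rn
}
```

Run it as `ss -tanp | conns_per_pid`, typically as root, since `ss` only shows owner information for processes the caller is allowed to inspect. Each output line is a socket count followed by a PID.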
- Network Driver Issues
  - Diagnosis: Though rare, a faulty network driver can sometimes lead to excessive memory allocation or leaks within the kernel’s network subsystem. Check `dmesg` for any network-related error messages or warnings about driver behavior.
  - Fix: Update your network interface card (NIC) drivers to the latest stable version.
  - Why it works: Updated drivers often contain bug fixes that resolve memory management issues.
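Before updating a driver it helps to record which one is loaded and at what version. Assuming `ethtool` is installed, `ethtool -i <interface>` prints that as `key: value` lines, and a small filter can reduce it to the fields of interest. The helper name `nic_driver_info` is illustrative, and `eth0` in the usage note is a placeholder for your interface name.

```shell
# Sketch: extract driver name and versions from `ethtool -i` output on stdin.
# nic_driver_info is a hypothetical helper name.
nic_driver_info() {
    awk -F': ' '
        $1 == "driver" || $1 == "version" || $1 == "firmware-version" {
            printf "%s=%s\n", $1, $2
        }
    '
}
```

For example, `ethtool -i eth0 | nic_driver_info` gives a compact record to compare against the vendor's release notes.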
After addressing these, you might still encounter Broken pipe errors when applications write to sockets whose peer process was terminated by the OOM killer.