A "connection reset by peer" error means the remote end of your TCP connection abruptly terminated it, and your system received an RST (reset) packet. This isn’t a graceful shutdown; it’s like someone slamming the phone down.
Here’s why that happens, from most to least common:
1. Network Intermediaries Dropping Connections
Firewalls, load balancers, NAT devices, and even overloaded routers between you and the peer are often configured to time out idle connections or to drop connections once their state tables fill up. The peer may still think the connection is alive, but the intermediary has forgotten about it. When your system next sends data, the intermediary either discards it silently, which surfaces later as a timeout, or rejects it with an RST that appears to come from the peer.
- Diagnosis: Check firewall/load balancer logs for connection timeouts or state-table-full errors. Use `tcpdump` on both client and server to see whether packets are reaching the intermediary but not the other side, or whether RST packets are being generated by an intermediary IP.

  ```bash
  # On the client: capture traffic to server IP 192.168.1.100 on port 80
  sudo tcpdump -i eth0 host 192.168.1.100 and port 80 -w client_capture.pcap

  # On the server: capture traffic from client IP 192.168.1.50 on port 80
  sudo tcpdump -i eth0 host 192.168.1.50 and port 80 -w server_capture.pcap
  ```

- Fix: Configure idle timeouts on network devices to be longer than your application’s expected inactivity period. For example, on a Cisco ASA firewall, you might adjust the `timeout conn` setting (exact syntax depends on the device and software version).

  ```
  ! Show the currently configured timeouts
  show running-config timeout
  ! Example: set the connection idle timeout to 1 hour
  timeout conn 1:00:00
  ```

  Or, if a load balancer’s state table is full, you might need to increase its capacity or implement connection draining.
- Why it works: By extending the idle timeout, network devices keep track of connections for longer, preventing them from being prematurely terminated due to inactivity.
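To see where resets come from, the captures above can be summarized by sender. The following is a minimal sketch, assuming the capture is read back with `tcpdump -nn` (numeric addresses and ports) and contains IPv4 traffic; `rst_by_source` is a hypothetical helper name, not part of tcpdump.

```shell
# Count RST segments per sending IP from `tcpdump -nn` text output on stdin.
# Assumes lines of the form:
#   12:00:01.000000 IP 192.168.1.100.80 > 192.168.1.50.51234: Flags [R.], ...
rst_by_source() {
  awk '
    $2 == "IP" && /Flags \[[A-Z.]*R/ {
      src = $3
      sub(/\.[0-9]+$/, "", src)   # strip the trailing ".port"
      count[src]++
    }
    END { for (ip in count) print count[ip], ip }
  ' | sort -rn
}

# Usage (reads the capture written earlier):
#   sudo tcpdump -nn -r client_capture.pcap 'tcp[tcpflags] & tcp-rst != 0' | rst_by_source
```

An intermediary often forges the peer’s address, so the sender IP alone is not conclusive. If the client capture shows resets from the server’s address that the server capture never recorded sending, an intermediary most likely generated them.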
2. Application on the Peer Crashing or Restarting
The application on the remote server might have crashed, been killed by the OOM (Out Of Memory) killer, or been deliberately restarted. When the process dies, the kernel on that peer closes all of its sockets. A connection with unread data pending is reset immediately; otherwise the client receives an RST as soon as it sends more data on the now-closed connection.
- Diagnosis: Check the system logs (`/var/log/syslog`, `/var/log/messages`, `journalctl`) on the peer server for application crashes, segmentation faults, or OOM killer messages. If you have access, check application-specific logs.

  ```bash
  # On the peer server, look for OOM killer messages
  sudo journalctl -k | grep -i "killed process"

  # Or check general system logs for application errors
  sudo grep -i "error\|fail\|crash" /var/log/syslog
  ```

- Fix: Debug and fix the application on the peer server. This might involve fixing memory leaks, increasing available memory, or resolving unhandled exceptions. If it’s a controlled restart, the application should ideally perform a graceful shutdown (FIN packets) instead of abrupt termination.
- Why it works: A stable, running application doesn’t abruptly close connections. Fixing the root cause of the crash or restart ensures the application remains available and manages connections properly.
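For controlled restarts, one common pattern is to send SIGTERM first so the application can close its sockets cleanly, and to fall back to SIGKILL only after a grace period. The following is a minimal sketch, assuming the application handles SIGTERM by shutting down gracefully; `graceful_stop` and the 10-second default are illustrative choices, not taken from any particular service manager.

```shell
# Stop a process politely: SIGTERM, wait up to <grace> seconds, then SIGKILL.
# Returns 0 if the process exited within the grace period, 1 if it had to be killed.
# Intended for a service PID that is not a child of the calling shell.
graceful_stop() {
  pid=$1
  grace=${2:-10}   # grace period in seconds (assumed default)

  kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone

  waited=0
  while [ "$waited" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited on its own
    sleep 1
    waited=$((waited + 1))
  done

  kill -KILL "$pid" 2>/dev/null               # last resort: abrupt termination
  return 1
}

# Usage: graceful_stop 12345 30
```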
3. Peer Server Rebooting or Shutting Down
The entire server the peer application is running on might have been rebooted or shut down unexpectedly. This would cause all its network connections to be terminated by the OS.
- Diagnosis: Check the system uptime (`uptime`) and the system logs (`/var/log/syslog`, `journalctl`) on the peer server for recent reboots or shutdown events.

  ```bash
  # Check uptime on the peer server
  uptime

  # Check the previous boot's log for shutdown messages
  sudo journalctl -b -1 | grep -i "shutdown\|reboot"
  ```

- Fix: Ensure scheduled reboots are communicated and that unplanned shutdowns are investigated. For critical services, implement high availability and failover mechanisms.
- Why it works: A continuously running server won’t abruptly terminate connections due to an OS-level shutdown.
4. Application-Level Keep-Alive Mismatch or Failure
Many applications implement their own keep-alive mechanisms (e.g., PING/PONG messages) to detect dead connections. If the peer application stops sending its keep-alives, or if your client stops responding to them, the peer application might decide the connection is dead and close it. Conversely, if your client sends a keep-alive and the peer doesn’t respond (because it’s hung or crashed), your client might close the connection.
- Diagnosis: Examine application logs on both sides for messages related to keep-alives, heartbeats, or detected connection inactivity. Use `tcpdump` to see if keep-alive packets are being sent and acknowledged.
- Fix: Adjust the keep-alive interval and timeout settings in the application configuration on both ends to be consistent and appropriate for the expected network latency and application behavior.
- Why it works: Synchronized and correctly functioning keep-alives ensure both ends agree on the connection’s liveness, preventing premature termination due to perceived inactivity.
5. TCP Keep-Alive Failing at the OS Level
Beyond application-level keep-alives, the operating system has its own TCP keep-alives: probes the kernel sends on sockets with `SO_KEEPALIVE` enabled once a connection has been idle for a long time. If the peer fails to acknowledge these probes (e.g., because its network stack is hung or a firewall is silently dropping the probe segments), the OS will eventually declare the connection dead and send an RST.
- Diagnosis: Check OS-level TCP keep-alive settings on both client and server. If these are very short, they might be aggressively closing connections.

  ```bash
  # On Linux, view current settings (seconds; the last one is a probe count)
  sysctl net.ipv4.tcp_keepalive_time
  sysctl net.ipv4.tcp_keepalive_intvl
  sysctl net.ipv4.tcp_keepalive_probes
  ```

- Fix: Increase the `tcp_keepalive_time` (initial idle time before probes start), `tcp_keepalive_intvl` (interval between probes), and/or `tcp_keepalive_probes` (number of probes before giving up) on the peer server.

  ```bash
  # Example: set idle time to 2 hours (7200 s, which is also the Linux default)
  sudo sysctl -w net.ipv4.tcp_keepalive_time=7200
  # Make permanent by editing /etc/sysctl.conf
  ```

- Why it works: Increasing the OS keep-alive timeouts gives the connection more time to recover from transient network issues or allows more time for the peer application to become responsive before the OS tears down the connection.
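The three settings combine into a single worst-case figure: an idle connection to an unresponsive peer is declared dead after roughly `tcp_keepalive_time + tcp_keepalive_intvl × tcp_keepalive_probes` seconds. The sketch below works through that arithmetic; the function names are hypothetical helpers, and the 3600-second idle timeout in the usage comment is only an example value for an intermediary.

```shell
# Seconds until the kernel gives up on an idle, unresponsive peer.
# Arguments: keepalive_time keepalive_intvl keepalive_probes
keepalive_detection_secs() {
  echo $(( $1 + $2 * $3 ))
}

# Succeeds if probes start before an intermediary's idle timeout expires.
# Arguments: keepalive_time idle_timeout (both in seconds)
keepalive_precedes_timeout() {
  [ "$1" -lt "$2" ]
}

# With the Linux defaults (7200 s, 75 s, 9 probes):
#   keepalive_detection_secs 7200 75 9      # prints 7875
#
# Reading the live values instead:
#   keepalive_detection_secs \
#     "$(sysctl -n net.ipv4.tcp_keepalive_time)" \
#     "$(sysctl -n net.ipv4.tcp_keepalive_intvl)" \
#     "$(sysctl -n net.ipv4.tcp_keepalive_probes)"
#
# Checking against a firewall that forgets idle flows after 3600 s:
#   keepalive_precedes_timeout 7200 3600 || echo "probes start too late"
```

The second check matters because keep-alive probes only refresh an intermediary’s state entry if they begin before that entry has expired.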
6. Resource Exhaustion on the Peer Server
The peer server might be running out of critical resources like memory, file descriptors, or ephemeral ports. When this happens, the OS might struggle to maintain active connections, leading to them being dropped. For instance, if the server runs out of available ephemeral ports for outgoing connections, new connections might fail, and existing ones could become unstable.
- Diagnosis: Monitor resource utilization on the peer server: memory (`free -h`, `top`), file descriptors (`ulimit -n`, `lsof | wc -l`), and socket and ephemeral port usage (`ss -s`, `sysctl net.ipv4.ip_local_port_range`).
- Fix: Address the resource bottleneck on the peer server. This could involve optimizing the application to use fewer resources, increasing server capacity, or tuning OS limits (e.g., `ulimit -n`, `net.ipv4.ip_local_port_range`).
- Why it works: A server with ample resources can reliably manage its network connections without the OS being forced to terminate them due to scarcity.
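To judge whether the ephemeral port range is a plausible bottleneck, it helps to know how many ports it actually contains. A minimal sketch of that calculation follows; `ephemeral_port_count` is a hypothetical helper, and the range in the example is the usual Linux default, which your system may override.

```shell
# Number of ports in an ephemeral range, inclusive of both ends.
# Arguments: low_port high_port
ephemeral_port_count() {
  echo $(( $2 - $1 + 1 ))
}

# With the usual Linux default range:
#   ephemeral_port_count 32768 60999    # prints 28232
#
# Reading the live range on Linux:
#   read -r low high < /proc/sys/net/ipv4/ip_local_port_range
#   ephemeral_port_count "$low" "$high"
```

This count is roughly the ceiling on simultaneous outgoing connections from one source address to a single destination IP and port, so comparing it with the connection totals from `ss -s` gives a rough sense of the remaining headroom.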
The next error you’ll likely encounter is Broken pipe, which is what your side gets when it keeps writing to a connection after the peer has already closed it or its RST has already arrived.