Jenkins executors are hanging because the Jenkins master is unable to reclaim agent connections after builds on those agents have already finished.
Common Causes and Fixes
1. Agent remoting.jar Mismatch
- Diagnosis: Check the `remoting.jar` version on the Jenkins master (`$JENKINS_HOME/remoting/remoting.jar`) and compare it to the version used by an affected agent. You can find the agent’s `remoting.jar` in its working directory, often something like `/home/jenkins/.jenkins/agent/remoting.jar` or `/usr/local/jenkins/agent/remoting.jar`. Exact paths vary by installation; the master also serves its current copy at `<JENKINS_URL>/jnlpJars/agent.jar`. A simple `diff` on the two files will reveal whether they differ.
- Cause: Incompatible `remoting.jar` versions between the master and agents can cause communication failures that prevent the master from learning that an agent has become free. This is especially common after a Jenkins master upgrade when the agents’ `remoting.jar` files aren’t updated consistently.
- Fix: Ensure the `remoting.jar` on every agent matches the one on the master. The easiest way is to restart the agent process. Agents that the master launches itself (for example over SSH) typically receive the correct `remoting.jar` from the master at startup. If this doesn’t work, manually copy the master’s `remoting.jar` to the agent’s `remoting` directory and restart the agent.
- Why it works: `remoting.jar` is the core communication library. A mismatch means the master and agent speak different dialects of the protocol, which leads to misread messages and dropped connections.
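The comparison described above can also be done by checksum, which is easier to script across many agents than `diff`. The following is a minimal sketch, not a Jenkins tool: the function names `sha256_of` and `jars_match` are made up for illustration, and it assumes both jar files have already been copied to paths readable from one machine.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def jars_match(master_jar, agent_jar):
    """True if the two jar files are byte-for-byte identical (hypothetical helper)."""
    return sha256_of(master_jar) == sha256_of(agent_jar)
```

For example, `jars_match("/tmp/master-remoting.jar", "/tmp/agent-remoting.jar")` returns `False` when the agent is running a different build of the library. Both paths here are placeholders.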
2. Network Issues / Firewalls
- Diagnosis: Use `ping` and `traceroute` from the Jenkins master to the agent’s IP address and vice versa. Note that `ping` only proves ICMP reachability, not that the agent port is open. Check firewall logs on both the master and agent machines, as well as any network devices in between, for dropped packets or blocked connections on the Jenkins agent port (commonly 50000).
- Cause: Intermittent network connectivity, firewalls blocking the agent port, or aggressive idle timeouts on network devices can cause the master to lose its connection to an agent without registering the agent as free.
- Fix: Allow the Jenkins agent port (e.g., TCP 50000) through all firewalls between the master and agents. If network instability is suspected, investigate routing and switch configurations, or move the agents to a more reliable network path.
- Why it works: A stable, open network path is fundamental to the persistent TCP or WebSocket connection the agent maintains with the master. (Agents connecting over WebSocket use the master’s HTTP(S) port rather than the dedicated agent port.)
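Because `ping` does not exercise the agent port itself, a direct TCP connection attempt is a useful complement. Below is a minimal sketch using only the Python standard library. The helper name `port_is_open` and the host name in the usage note are assumptions for illustration, and port 50000 should be replaced with whatever port your master actually listens on.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Try to open a TCP connection; return True on success, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from an agent machine, `port_is_open("jenkins.example.com", 50000)` returning `False` points to a firewall or listener problem rather than a Jenkins-level one.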
3. Agent Process Crash or OOM
- Diagnosis: Check the agent’s system logs (e.g., `/var/log/syslog`, `journalctl`) for errors related to the Jenkins agent process (`agent.jar` or the `jenkins-agent` service). Look for Out-Of-Memory (OOM) killer messages or segmentation faults.
- Cause: The agent process may have crashed or been killed due to resource exhaustion (memory, CPU), especially if the builds running on it were resource-intensive.
- Fix: Restart the Jenkins agent process. If OOM is suspected, increase the memory allocated to the agent process or to the agent machine itself. If the agent is launched with an explicit `java` command, you may also need to tune its JVM options. For systemd services, the command looks like `systemctl restart jenkins-agent@<agent-name>.service`, though the unit name depends on how the agent was installed.
- Why it works: A crashed agent can’t report its status, which leaves the master uncertain about the agent’s availability. Restarting the agent re-establishes the connection and lets it report its readiness.
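The log check above can be automated with a small filter. This is a sketch under assumptions: the regular expression targets the common Linux kernel wording ("Out of memory: Killed process ..." on newer kernels, "Kill process ..." on older ones), which may differ on your distribution, and `find_oom_kills` is a hypothetical helper name.

```python
import re

# Matches both "Killed process" (newer kernels) and "Kill process" (older ones).
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return (pid, process_name) tuples for every OOM-killer line found."""
    hits = []
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Passing an open log file (any iterable of lines) works directly. A hit whose process name is `java` around the time the executor hung is a strong hint that the agent JVM was killed.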
4. Stale Lock Files on Agent
- Diagnosis: On the agent machine, navigate to the Jenkins agent’s workspace directory (or the agent’s working directory, e.g., `/home/jenkins/.jenkins/agent/`) and look for any `.lock` files belonging to builds that have already completed.
- Cause: Temporary files that track build status can be left behind on the agent, especially if a build terminated abruptly. These can confuse the agent’s reporting mechanism.
- Fix: After confirming that no running build owns them, manually delete any stale `.lock` files or other temporary build-related files in the agent’s directories. A full workspace cleanup may be necessary if the issue persists.
- Why it works: These lock files can prevent the agent from correctly signaling that its resources are free, as Jenkins might still believe a process associated with that lock is active.
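To review candidates before deleting anything, you can list lock files by age. The sketch below only reports files and never removes them. The `*.lock` pattern, the six-hour default threshold, and the helper name `find_stale_locks` are all assumptions to adjust for your setup.

```python
import time
from pathlib import Path


def find_stale_locks(root, max_age_seconds=6 * 3600, now=None):
    """Return sorted paths of *.lock files under root older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for path in Path(root).rglob("*.lock"):
        try:
            age = now - path.stat().st_mtime
        except OSError:
            continue  # File vanished or is unreadable; skip it.
        if path.is_file() and age > max_age_seconds:
            stale.append(path)
    return sorted(stale)
```

Keeping the listing separate from the deletion leaves room to cross-check each file against the builds currently running on that agent.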
5. Jenkins Master JVM Issues
- Diagnosis: Monitor the Jenkins master JVM’s heap usage and garbage collection activity. High memory usage or excessive GC pauses can make the master unresponsive and unable to process agent heartbeats or status updates. Use `jstat -gcutil <pid> 1000` or JConsole/VisualVM.
- Cause: The Jenkins master JVM may be overloaded, experiencing frequent long garbage collection pauses, or running out of heap memory. This prevents it from processing incoming agent status messages in a timely manner.
- Fix: Increase the JVM heap size for the Jenkins master. On older package installs, edit `/etc/default/jenkins` (Debian/Ubuntu) and set `JAVA_ARGS="-Xms2g -Xmx4g"`, or edit `/etc/sysconfig/jenkins` (RHEL family) and set `JENKINS_JAVA_OPTIONS` to the same flags, adjusting 2g and 4g as needed. Newer packages that run Jenkins as a native systemd unit no longer read these files; there, use `systemctl edit jenkins` and set `Environment="JAVA_OPTS=-Xms2g -Xmx4g"`. Restart the Jenkins master service afterwards.
- Why it works: A healthy, responsive master JVM is crucial for managing all agent connections and build states.
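Output from `jstat -gcutil` is easier to judge when parsed into numbers. The following sketch assumes the usual whitespace-separated layout with a header row (columns such as `O` for old-generation occupancy in percent and `FGC` for the full-GC count). The exact column set varies by JDK version, so the parser keys on the header rather than on positions. The function names and the 90% threshold are illustrative assumptions.

```python
def _to_float(value):
    """Convert a jstat cell to float; return None for non-numeric cells like '-'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gcutil(text):
    """Parse `jstat -gcutil` output into a list of {column: value} dicts."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    parsed = []
    for row in rows:
        if row == header:
            continue  # jstat can repeat the header periodically (-h option).
        parsed.append({name: _to_float(val) for name, val in zip(header, row)})
    return parsed


def old_gen_above(rows, threshold=90.0):
    """Return the samples whose old-generation occupancy is at or above threshold."""
    return [r for r in rows if r.get("O") is not None and r["O"] >= threshold]


def full_gc_delta(rows):
    """Number of full GCs that occurred between the first and last sample."""
    return rows[-1]["FGC"] - rows[0]["FGC"]
```

Old-generation occupancy that stays near the threshold while the full-GC count keeps climbing suggests the heap is too small for the master’s workload.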
6. Agent Timeout Configuration
- Diagnosis: In Jenkins, navigate to "Manage Jenkins" -> "Configure System" and look for settings related to "Agent connection timeout" or similar. Most of these values are not directly configurable via the UI; the master-to-agent ping, for instance, is controlled by system properties such as `hudson.slaves.ChannelPinger.pingIntervalSeconds` and `hudson.slaves.ChannelPinger.pingTimeoutSeconds`.
- Cause: The timeout settings for agent connections or heartbeats might be set too low, causing Jenkins to prematurely declare an agent disconnected even if it’s just experiencing a brief network blip or a slow build.
- Fix: While direct UI configuration for these specific timeouts is limited, ensure your network environment is stable. For advanced tuning, you might need to explore Jenkins system properties or plugins that offer more granular control over agent communication. In many cases, increasing master JVM heap and ensuring network stability addresses the underlying cause for these timeouts.
- Why it works: Appropriate timeouts prevent premature disconnections, allowing agents to recover from transient issues and report their status correctly.
If none of these fixes resolves the problem, the next symptom you’ll likely see is a java.net.SocketException: Connection reset or java.io.IOException: Channel closed in the agent’s logs, or the executor simply remaining stuck indefinitely.