Your GitLab CI jobs are failing because the environment they’re trying to run in is being shut down prematurely. This isn’t a CI-specific bug; it’s a symptom of the underlying infrastructure that your CI jobs depend on.

Here are the most common reasons why your environment might be stopping, and how to fix them:

1. Auto-scaling Group Scaling Down Too Aggressively

Diagnosis: Your cloud provider’s auto-scaling group (e.g., AWS EC2 Auto Scaling, GCP Managed Instance Groups) is configured with a low desired capacity or a very sensitive scale-down policy. This causes instances to be terminated even when they are still needed for active CI jobs.

Check:

  • AWS: Navigate to the EC2 Auto Scaling Groups console. Select your CI runner’s auto-scaling group and review its "Activity history" and its scaling policies. Look for recent scale-in events that coincide with job failures. Check the "Desired capacity" and "Minimum capacity" settings.
  • GCP: Go to Compute Engine > Instance groups. Select your managed instance group. Under "Autoscaling," check the "Minimum number of instances" and the "Scale-in controls" or "Cool down period."

Fix:

  • AWS:
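    • Increase the "Minimum capacity" of your auto-scaling group so that it never shrinks below the number of runners your baseline load needs. Raising "Desired capacity" alone is not enough, because scaling policies adjust that value automatically between the minimum and the maximum.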
    • Adjust the "Scale-in policies" to be less aggressive. For example, instead of scaling in based on a low average CPU utilization, you might scale in based on a longer period of inactivity or a higher threshold.
    • A common fix is to set "Min size" to a reasonable number (e.g., 2 or 3) and "Max size" to accommodate peak load. Then, configure a scaling policy that scales out when CPU utilization exceeds, say, 70% for 5 minutes, and scales in when it drops below 30% for 15 minutes.
  • GCP:
    • Increase the "Minimum number of instances" in your autoscaling configuration.
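    • Configure the "Scale-in controls" to limit how many instances the autoscaler may remove within a trailing time window (for example, at most one instance every 10 minutes). This gives running jobs more time to complete. The cool-down (initialization) period is a separate setting: it tells the autoscaler how long a new instance needs before its metrics are trusted (for example, 120s or more), and it does not by itself protect running jobs.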
    • Consider modifying the autoscaling metric. If you’re scaling based on CPU, a brief dip might trigger a scale-in. Scaling based on the number of running jobs (if your runner registration mechanism supports it) or a longer CPU average could be more robust.

Why it works: By increasing the minimum number of instances or making the scale-down policies less sensitive, you ensure that there are always enough runners available to pick up jobs and that running jobs are not terminated before they can finish.
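
The threshold-and-duration policy described above can be sketched as a small decision function. This is a simplified model for reasoning about your settings, not the algorithm AWS or GCP actually runs; the name `scaling_decision` and its parameters are hypothetical, and the defaults are simply the example values from the fix (70% for 5 minutes, 30% for 15 minutes).

```python
from typing import Sequence


def scaling_decision(
    cpu_per_minute: Sequence[float],
    scale_out_threshold: float = 70.0,
    scale_out_minutes: int = 5,
    scale_in_threshold: float = 30.0,
    scale_in_minutes: int = 15,
) -> str:
    """Return "scale_out", "scale_in", or "hold" for a CPU history.

    cpu_per_minute holds one average CPU percentage per minute, oldest
    first. Hypothetical model for illustration only.
    """
    # Scale out only if every sample in the recent window is above the threshold.
    recent_out = cpu_per_minute[-scale_out_minutes:]
    if len(recent_out) == scale_out_minutes and all(
        sample > scale_out_threshold for sample in recent_out
    ):
        return "scale_out"

    # Scale in only after a full, uninterrupted window below the threshold.
    recent_in = cpu_per_minute[-scale_in_minutes:]
    if len(recent_in) == scale_in_minutes and all(
        sample < scale_in_threshold for sample in recent_in
    ):
        return "scale_in"

    return "hold"
```

With these defaults, a CPU dip that lasts less than 15 minutes returns "hold", which is the behavior you want while a job is waiting on I/O or a slow dependency.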

2. Runner Registration Timeout or Health Check Failure

Diagnosis: The GitLab Runner polls the GitLab API for new jobs, and each successful request also tells GitLab that the runner is alive. If these requests fail, the runner receives no jobs and GitLab eventually shows it as offline. This can also happen if the underlying instance hosting the runner becomes unresponsive.

Check:

  • GitLab Runner Logs: SSH into one of the affected runner instances and check the runner logs. On systemd hosts, view them with sudo journalctl -u gitlab-runner; on other systems the runner service typically writes to the system log (for example, /var/log/syslog or /var/log/messages). Look for messages indicating a failure to reach the GitLab API (for example, POST /api/v4/jobs/request, which the runner uses to poll for jobs, or POST /api/v4/runners/verify), or frequent disconnections.
  • GitLab Admin UI: Go to your GitLab instance’s Admin Area > CI/CD > Runners. Check the "Last contact" time for your runners. If it’s old, the runner is not communicating.

Fix:

  • Network Connectivity: Ensure the runner instances have stable network access to your GitLab instance’s FQDN. Check firewall rules, security groups, and proxy configurations.
  • Runner Configuration (config.toml):
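    • Review the global check_interval in your config.toml file (usually located at /etc/gitlab-runner/config.toml), which controls how often the runner polls GitLab for new jobs. A very large value delays job pickup and makes the runner’s last contact time look stale. No config.toml setting can make an unreachable API reachable, so persistent connection errors need a network-level fix.
    • Note that listen_address only configures the runner’s own HTTP server for Prometheus metrics and debugging. It does not affect outbound connections to GitLab, but enabling it lets you monitor runner health.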
  • Instance Health: If the runner process is running but the instance is unhealthy (e.g., out of memory, high I/O wait), the runner won’t be able to communicate. Check system resource utilization (top, htop, vmstat).
  • GitLab API Rate Limiting: If you have a very large number of runners or very frequent job starts/stops, you might be hitting API rate limits on your GitLab instance. Check your GitLab server logs for rate limiting messages.

Why it works: A healthy runner needs to contact GitLab regularly. Removing whatever blocks that contact (network faults, an overloaded host, rate limiting) ensures GitLab sees the runner as online and the runner keeps receiving jobs.
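
The "Last contact" check can be automated once you have each runner’s last contact time from wherever you collect it. The sketch below is a minimal example: the function name `stale_runners`, the input mapping, and the 10-minute threshold are all assumptions, not GitLab defaults.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_runners(
    last_contact: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=10),  # assumed threshold
) -> List[str]:
    """Return the names of runners whose last contact is older than max_age."""
    return sorted(
        name for name, seen in last_contact.items() if now - seen > max_age
    )


if __name__ == "__main__":
    # Hypothetical runner names and timestamps, for illustration only.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    contacts = {
        "runner-a": now - timedelta(minutes=1),
        "runner-b": now - timedelta(hours=2),
    }
    print(stale_runners(contacts, now))
```

A runner that appears in this list while its instance is still up points to a network or runner-process problem rather than a terminated host.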

3. Docker/Kubernetes Executor Issues

Diagnosis: If you’re using Docker or Kubernetes executors, the problem might lie with the container runtime or orchestration layer rather than the runner itself. Pods or containers might be crashing, failing to start, or being terminated by Kubernetes due to resource constraints or misconfigurations.

Check:

  • Kubernetes Pod Logs: If using the Kubernetes executor, run kubectl logs <pod-name> for the pod that was supposed to run your job. Look for errors during container creation or within the job’s execution.
  • Kubernetes Events: Run kubectl get events --sort-by='.metadata.creationTimestamp' in the namespace where your runners operate. Look for events related to pod scheduling failures or image pull errors. To confirm an out-of-memory kill, run kubectl describe pod <pod-name> and check whether a container’s last termination reason is OOMKilled.
  • Docker Daemon Logs: On the runner host (if using Docker executor directly), check the Docker daemon logs (sudo journalctl -u docker or /var/log/docker.log).

Fix:

  • Kubernetes Executor:
    • Ensure your Kubernetes cluster has sufficient resources (CPU, memory) for the jobs.
    • Check your config.toml for the Kubernetes executor: privileged = true might be needed for certain operations, or adjust cpu_limit and memory_limit to be more generous.
    • If pods are OOMKilled, increase the memory limits in your config.toml or your Kubernetes pod templates.
    • Ensure the service account used by the runner has the necessary RBAC permissions in Kubernetes.
  • Docker Executor:
    • Ensure the Docker daemon on the runner host has enough disk space and memory available.
    • Check config.toml for [runners.docker] section: memory and cpus limits might be too low.
    • Ensure the Docker images your jobs use are valid and accessible.

Why it works: The runner’s job is to orchestrate containers. If the underlying container platform (Docker or Kubernetes) cannot reliably create, run, and manage these containers, jobs will fail because the execution environment is unstable or unavailable.
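
To check many job pods for out-of-memory kills at once, you can parse the JSON that kubectl emits. The sketch below assumes its input is the output of kubectl get pods -o json and reads the termination reason from each container status; the function name `oom_killed_containers` is hypothetical.

```python
import json
from typing import List


def oom_killed_containers(pods_json: str) -> List[str]:
    """Return "pod/container" names that were terminated with reason OOMKilled."""
    found = []
    for pod in json.loads(pods_json).get("items", []):
        pod_name = pod.get("metadata", {}).get("name", "<unknown>")
        for status in pod.get("status", {}).get("containerStatuses", []):
            # Check both the current and the previous container state.
            for state_key in ("state", "lastState"):
                terminated = status.get(state_key, {}).get("terminated") or {}
                if terminated.get("reason") == "OOMKilled":
                    found.append(f"{pod_name}/{status.get('name', '<unknown>')}")
                    break
    return found
```

Pipe the output of kubectl get pods -n <namespace> -o json into a script that calls this function. Job pods are usually removed once the job ends, so this only catches pods that still exist when you run it.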

4. Instance Termination Policies (Manual or Scheduled)

Diagnosis: The underlying virtual machines or physical servers hosting your runners are being shut down manually, or via a scheduled shutdown policy that doesn’t account for active CI jobs.

Check:

  • Cloud Provider Console:
    • AWS: Check "Instances" for manual terminations. Look at "Scheduled Events" and "EC2 Auto Scaling Group" activity history for automated terminations.
    • GCP: Check "VM instances" for manual stops. Look at "Operations" or "Activity log" for scheduled maintenance or automated shutdowns.
    • Azure: Check "Virtual machines" for manual stops. Look at "Activity log" for scheduled events.
  • On-Premise: Check server management tools, scheduled task logs, or physical presence for manual shutdowns.

Fix:

  • Update Schedules: If using scheduled shutdowns (e.g., to save costs overnight), ensure these schedules are adjusted to only shut down instances when no CI jobs are expected to be running, or ensure a grace period for jobs to complete.
  • Instance Protection:
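    • AWS: Enable "Termination Protection" on critical instances to prevent accidental manual termination. Termination protection does not stop an Auto Scaling group from terminating an instance; for instances in a group, use instance scale-in protection instead.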
    • GCP/Azure: Use instance tags or labels to identify and exclude runners from automated shutdown scripts.
  • Graceful Shutdown Scripts: Implement or configure scripts that allow running jobs to finish before an instance is shut down. This often involves the runner detecting a shutdown signal and entering a "drain" mode.

Why it works: Instances are the physical or virtual hosts for your runners. If they are turned off, the runners on them become unavailable, and any jobs running on them will be terminated.
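
A graceful shutdown script usually follows a drain, wait, then power-off sequence. The sketch below shows that control flow with the actual operations injected as callables, so every name here (`drain_then_shutdown`, `request_drain`, `running_jobs`, `shutdown`) is hypothetical. In practice, `request_drain` could send SIGQUIT to the gitlab-runner process, which GitLab Runner treats as a request to stop accepting new jobs and exit once current jobs finish; how you count running jobs is left as an assumption.

```python
import time
from typing import Callable


def drain_then_shutdown(
    request_drain: Callable[[], None],
    running_jobs: Callable[[], int],
    shutdown: Callable[[], None],
    grace_seconds: float = 3600.0,
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Drain the runner, wait for jobs to finish, then shut down.

    Returns True if all jobs finished within the grace period, or
    False if the grace period expired with jobs still running.
    """
    request_drain()  # e.g. send SIGQUIT to the gitlab-runner process
    deadline = clock() + grace_seconds
    drained = running_jobs() == 0
    while not drained and clock() < deadline:
        sleep(poll_seconds)
        drained = running_jobs() == 0
    shutdown()  # runs either way; the return value says whether jobs were cut off
    return drained
```

The grace period should be at least as long as your longest expected job. A False return value is worth logging, because it means a job was interrupted.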

5. GitLab Runner Service Crashing

Diagnosis: The GitLab Runner process itself is crashing unexpectedly on the host machine. This could be due to bugs in the runner, resource exhaustion on the host, or external factors causing the process to terminate.

Check:

  • Runner Logs: As in point 2, check the runner logs with journalctl -u gitlab-runner (or the system log on hosts without systemd). Look for panic messages, segmentation faults, or fatal errors.
  • System Logs: Check /var/log/syslog, /var/log/messages, or dmesg for kernel-level issues or OOM killer events that might have terminated the runner process.
  • Runner Version: Verify you are running a recent, stable version of GitLab Runner. Older versions might have known stability issues.

Fix:

  • Update GitLab Runner: Ensure you are on the latest stable release of GitLab Runner. Update using your package manager (sudo apt-get update && sudo apt-get install gitlab-runner on Debian/Ubuntu) or by replacing the binary.
  • Resource Allocation: If the runner host is consistently running out of memory or CPU, increase the resources allocated to the instance.
  • Runner Configuration: Review config.toml for any unusual settings that might stress the runner process.
  • Reinstall/Restart: A simple restart (sudo systemctl restart gitlab-runner) or reinstallation can sometimes resolve transient issues.

Why it works: The GitLab Runner process is the agent that communicates with GitLab, picks up jobs, and orchestrates their execution. If this process dies, the runner becomes non-functional.
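
Scanning logs for crash signatures is easy to script. The sketch below flags lines that look like a Go panic, a Go fatal error, a segmentation fault, or the kernel OOM killer targeting gitlab-runner. The patterns are assumptions about typical wording and may need adjusting for your kernel and runner version; `find_crash_lines` is a hypothetical name.

```python
import re
from typing import Iterable, List, Tuple

# Assumed signatures; adjust them to match what your logs actually contain.
CRASH_PATTERNS = [
    ("oom_killer", re.compile(r"Out of memory: Kill(?:ed)? process \d+ \(gitlab-runner\)")),
    ("go_panic", re.compile(r"\bpanic:")),
    ("go_fatal", re.compile(r"\bfatal error:")),
    ("segfault", re.compile(r"\bsegfault\b|SIGSEGV")),
]


def find_crash_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (label, line) pairs for log lines matching a crash signature."""
    hits = []
    for line in lines:
        for label, pattern in CRASH_PATTERNS:
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # report each line once, under the first matching label
    return hits
```

Feed it the lines from journalctl -u gitlab-runner together with the kernel log, so that OOM kills and runner panics show up in one place.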

6. Network Latency or Intermittent Connectivity to GitLab

Diagnosis: High network latency or intermittent packet loss between your runner hosts and your GitLab instance is causing the runner to miss its health check-in intervals or fail to communicate job status updates, leading GitLab to believe the runner is offline.

Check:

  • ping and traceroute: From a runner host, ping your.gitlab.instance.com and traceroute your.gitlab.instance.com to identify latency and potential network hops with issues.
  • mtr (My Traceroute): A more advanced tool that combines ping and traceroute to give a continuous view of network path quality.
  • GitLab Runner Logs: Look for recurring "connection refused," "timeout," or "EOF" errors when communicating with the GitLab API.

Fix:

  • Improve Network Path: Work with your network team to optimize the route between your runners and GitLab. This might involve peering, direct connections, or resolving BGP issues.
  • Use a Closer GitLab Instance: If your runners are in a different geographic region than your GitLab instance, consider deploying a dedicated GitLab instance closer to your runners or using GitLab’s Geo feature.
  • CDN/Proxy Configuration: Ensure any intermediate proxies or CDNs are not causing bottlenecks or dropping connections.
  • GitLab Instance Tuning: If your GitLab instance is on-premises or self-hosted, ensure its network interfaces and web server (Nginx/Apache) are not overloaded.

Why it works: Reliable communication is paramount. If the runner cannot consistently talk to GitLab, GitLab will eventually stop sending it work, assuming it’s unavailable.
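
If ICMP is blocked and ping is not an option, timing TCP connections to the GitLab host gives a comparable picture. The sketch below uses only the standard library; the function names `tcp_connect_ms` and `summarize`, the port, and the timeout are assumptions, and your.gitlab.instance.com is the placeholder hostname used above.

```python
import socket
import statistics
import time
from typing import Dict, List, Optional


def tcp_connect_ms(host: str, port: int = 443, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:  # includes timeouts, refused connections, DNS errors
        return None


def summarize(samples: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarize connect attempts; None entries count as failures."""
    ok = [s for s in samples if s is not None]
    return {
        "attempts": len(samples),
        "failure_rate": (len(samples) - len(ok)) / len(samples) if samples else 0.0,
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


if __name__ == "__main__":
    results = [tcp_connect_ms("your.gitlab.instance.com") for _ in range(20)]
    print(summarize(results))
```

A non-zero failure rate, or a maximum far above the median, suggests intermittent connectivity rather than uniformly high latency.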

Once the environment is stable, the next failure you are likely to hit is a job timeout (for example, "ERROR: Job failed: execution took longer than 1h0m0s seconds"). Your jobs now have a stable environment to run in, but they may still be too slow or complex for the configured timeout.
