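GitHub Actions does not retry flaky tests on its own: a failed step fails its job, and the run stays failed until something re-runs it.

This is a common problem in CI/CD: tests that sometimes pass and sometimes fail make it hard to tell whether a new change actually broke something or whether it was just a transient blip. The documented workflow syntax has no native retry key for jobs or steps, so retry behavior has to come from a wrapper script, a community action, or a re-run of the failed jobs.

The core idea is the same in every case: re-run a specific job or step if it fails. This isn’t a global setting; you apply it to the individual jobs or steps that need it.

The listing below sketches the settings a retry needs. The retry block in it is illustrative, not a key GitHub will accept, so translate it into whichever mechanism you adopt: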

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false # Keep the other matrix jobs running when one fails
      matrix:
        node-version: [16.x, 18.x]
    steps:
      - uses: actions/checkout@v3

      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}

      - run: npm ci
      - run: npm test
        # Illustrative only: `retry` is not part of the documented workflow
        # syntax, and GitHub rejects workflow files with unknown step keys.
        retry:
          # Number of retries after the initial failure.
          # Default: 0 (no retries)
          max: 3
          # Delay between retries in seconds.
          # Default: 30
          delay: 60
          # Which failures trigger a retry.
          # Default: any non-zero exit code
          when: failure() # Or specific codes like: exit-code: [1, 3]

Let’s break down the retry block:

  • max: The number of retries allowed after the initial failure. With max: 3 the tests run once and, if they fail, are retried up to 3 more times, for a maximum of 4 runs in total.
  • delay: The time, in seconds, to wait before starting the next retry. A common value is 60 seconds, which gives a struggling service or a transient network issue a minute to recover.
  • when: Which failures qualify for a retry. failure() here stands for any non-zero exit code. You can narrow this to specific exit codes if only certain failures are flaky, for example 127 and 137. By shell convention 127 means "command not found", which can point to a transient environment issue, and 137 means the process was killed with SIGKILL (128 + 9), which often happens when the runner runs out of memory.

The strategy.fail-fast: false setting within the job is also important. If fail-fast is true (which is the default), GitHub cancels all in-progress and queued jobs in the matrix as soon as any one of them fails. A matrix job that is cancelled this way never gets to finish, whether or not it would have passed. Setting fail-fast to false lets each matrix job run to completion independently, so one flaky permutation can fail and be retried without taking the others down with it.

Why this works:

A step that exits with a non-zero status code is marked as failed, and by default its job fails with it. GitHub Actions does not consult any retry configuration at that point, so the retry mechanism has to do the bookkeeping itself: check that the failure matches the when condition, compare the attempt counter against max, wait for delay, and run again. The loop ends when a run succeeds or the retries are used up. When the retry takes the form of a job re-run, the job is re-executed in full, starting from its first step.
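A job-level re-run can be scripted with the GitHub CLI from a second workflow that reacts to the first one finishing. This is a sketch under several assumptions: the file name and the workflow name "CI" are hypothetical, the limit of 3 attempts is arbitrary, and the repository must allow the default token to be granted actions: write.

```yaml
# .github/workflows/rerun-failed.yml (hypothetical file name)
name: Rerun failed CI jobs
on:
  workflow_run:
    workflows: ["CI"]        # hypothetical name of the workflow that runs the tests
    types: [completed]

jobs:
  rerun:
    # Only react to failed runs, and allow at most 3 attempts in total
    # (1 initial run + 2 re-runs).
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      github.event.workflow_run.run_attempt < 3
    runs-on: ubuntu-latest
    permissions:
      actions: write         # needed to request a re-run
    steps:
      - name: Re-run the failed jobs of the triggering run
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
        run: gh run rerun "${{ github.event.workflow_run.id }}" --failed
```

Each re-run restarts the failed jobs from their first step. The workflow_run trigger only fires for a workflow file that exists on the default branch, so this file has to be merged there before it has any effect.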

A subtle but important point: where the retry is applied determines what gets re-run. Re-running a failed job starts it again from the first step, so any steps that ran successfully before the failing step, such as checkout and npm ci, also run again. This is often acceptable for test suites, as re-running setup steps is usually harmless. If you only want to repeat the failing command, apply the retry at the step level instead: implement the logic within your test runner or script itself, or use a community action designed for that specific purpose.
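For step-level retries, one community option is the nick-fields/retry action. The sketch below assumes that action and the input names from its README (max_attempts, retry_wait_seconds, timeout_minutes, command); check the README of the version you pin before relying on them. The workflow name is hypothetical.

```yaml
name: tests-with-step-retry   # hypothetical workflow name
on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18.x
      - run: npm ci                  # runs once; not repeated on retry
      - name: Run tests, retrying only this step
        uses: nick-fields/retry@v3   # third-party action, not maintained by GitHub
        with:
          max_attempts: 4            # total attempts: 1 initial + 3 retries
          retry_wait_seconds: 60     # pause between attempts
          timeout_minutes: 10        # per-attempt time limit
          command: npm test
```

Note that max_attempts counts every attempt, so a retry budget of 3 corresponds to max_attempts: 4.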

Consider the case where your tests are slow and sometimes time out due to load on the runner or network congestion. A retry with a delay of 60-120 seconds gives that load time to subside or network conditions time to improve, so the test can pass on a subsequent attempt. Similarly, if your tests interact with external services that might be temporarily unavailable, a retry strategy can overcome these transient issues without human intervention.

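failure() is one of GitHub Actions’ status check functions, which are normally used in if: conditionals on steps and jobs. It is true when a previous step has failed, which is why it reads naturally as a retry condition. success() is true only when no previous step has failed, so it would never trigger a retry. always() is true regardless of the outcome, including cancellation, so using it as a retry condition would re-run passing tests as well, which is rarely what you want.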
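Conditions like these can drive a single retry using only built-in syntax. The sketch below, with a hypothetical workflow name and step id, marks the first attempt as continue-on-error and runs a second attempt only if the first one failed.

```yaml
name: tests-with-one-retry   # hypothetical workflow name
on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18.x
      - run: npm ci
      - name: Run tests (first attempt)
        id: first_try               # hypothetical step id
        continue-on-error: true     # keep the job going so the next step can react
        run: npm test
      - name: Run tests (single retry)
        # outcome is the step result before continue-on-error is applied
        if: steps.first_try.outcome == 'failure'
        run: npm test
```

If the second attempt also fails, the job fails as usual. This pattern does not scale well beyond one or two retries, since every extra attempt needs its own step.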

If you have a matrix of jobs, and one specific combination fails, the retry configuration will apply to that specific permutation of the matrix. For instance, if your matrix is os: [ubuntu, windows] and node-version: [16, 18], and the ubuntu-16 job fails, it will retry. The windows-18 job, if it’s running, will continue independently unless fail-fast is enabled.

After all retries are exhausted and the tests still fail, the job fails and the workflow run is marked as failed. What remains is the original failure output from your test runner, but now you know that automated recovery has already been attempted for that specific run.
