GitLab CI’s cache can speed up pipelines dramatically, but many configurations key it poorly, so jobs either restore stale dependencies or reinstall everything on every run.
Let’s see how this looks in action. Imagine a simple Ruby project. We want to cache the vendor/bundle directory.
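# .gitlab-ci.yml
cache:
  paths:
    - vendor/bundle/
build_job:
  image: ruby:3.1
  script:
    # bundle install --path is deprecated in Bundler 2; set the path via config instead
    - bundle config set --local path vendor/bundle
    - bundle install
    - echo "This is a dummy build step"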
When this pipeline runs, build_job first tries to download an existing cache, and after the job succeeds GitLab archives vendor/bundle/ and uploads it. Because no key is set, the archive is stored under the key default. On subsequent runs, the job restores that archive before its script starts. This sounds great, but there’s a catch.
The problem is that a cache without a dependency-aware key is never invalidated: vendor/bundle/ keeps accumulating gems from old lock files, and every job sharing the key restores the same aging archive. The real win comes from intelligently defining what goes into the cache and when it’s invalidated.
The core idea is to make the cache key reflect the actual dependencies, not just a generic job name or a fixed string. This ensures that the cache is only rebuilt when the dependencies actually change.
Here’s how we can do it for Ruby:
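# .gitlab-ci.yml
cache:
  key:
    files:
      - Gemfile.lock
  paths:
    - vendor/bundle/
build_job:
  image: ruby:3.1
  script:
    # bundle install --path is deprecated in Bundler 2; set the path via config instead
    - bundle config set --local path vendor/bundle
    - bundle install
    - echo "This is a dummy build step"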
In this improved version, cache:key uses files: with Gemfile.lock. GitLab computes the key as a SHA from the most recent commits that changed each listed file, so the key follows the lock file. If Gemfile.lock doesn’t change, the key stays the same and GitLab restores the existing cache. If Gemfile.lock does change (dependencies were added, removed, or updated), a new key is generated, no cache exists for it yet, and bundle install runs from scratch before the result is uploaded under the new key.
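Once the key tracks the lock file, jobs that only consume the dependencies do not need to re-upload them. The following is a minimal sketch, assuming a two-stage pipeline; the stage and job names (install, test, install_gems, run_tests) are hypothetical and not part of the original example. It uses the cache:policy keyword so that only one job uploads the cache.

```yaml
# .gitlab-ci.yml (sketch; stage and job names are illustrative assumptions)
stages:
  - install
  - test

install_gems:
  stage: install
  image: ruby:3.1
  cache:
    key:
      files:
        - Gemfile.lock
    paths:
      - vendor/bundle/
    policy: pull-push   # the default: download before the job, upload after it succeeds
  script:
    - bundle config set --local path vendor/bundle
    - bundle install

run_tests:
  stage: test
  image: ruby:3.1
  cache:
    key:
      files:
        - Gemfile.lock
    paths:
      - vendor/bundle/
    policy: pull        # download only; skip the upload step at the end of the job
  script:
    - bundle config set --local path vendor/bundle
    - bundle install    # normally a no-op check when the cache was restored
    - echo "Run tests here"
```

The cache is best effort, so run_tests still calls bundle install in case the archive is missing on the runner that picks up the job.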
This applies to many other languages and package managers. For Node.js projects using npm, key the cache on package-lock.json. Because npm ci deletes any existing node_modules directory before it installs, cache npm’s download cache (.npm/) rather than node_modules:
cache:
  key:
    files:
      - package-lock.json
  paths:
    - .npm/
build_node_job:
  image: node:18
  script:
    - npm ci --cache .npm --prefer-offline
    - echo "Node build step"
Notice the npm ci command. This is crucial. npm ci is designed for continuous integration environments. It installs dependencies exactly as specified in package-lock.json, fails if package.json and the lock file disagree, and is generally faster and more reliable than npm install for CI. The --cache .npm flag points npm at a cache directory inside the project, which is what lets GitLab cache it, and --prefer-offline tells npm to use already-downloaded packages where possible instead of revalidating them against the registry. package-lock.json remains the driver for cache invalidation.
For Python projects using pip and requirements.txt (or poetry.lock / Pipfile.lock):
cache:
  key:
    files:
      - requirements.txt
  paths:
    - .venv/
build_python_job:
  image: python:3.10
  script:
    - python -m venv .venv
    - source .venv/bin/activate
    - pip install -r requirements.txt
    - echo "Python build step"
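Here, the cache is tied to requirements.txt. Note that requirements.txt only behaves like a lock file when versions are pinned; with unpinned entries, the cached .venv can hold an older release than a fresh install would. If you use a dependency manager like Poetry, you’d point to poetry.lock instead. The .venv directory is where your Python packages are installed.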
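For the Poetry variant, a minimal sketch might look like the following. It assumes the project has a pyproject.toml and a committed poetry.lock, installs Poetry with pip for brevity, and uses a hypothetical job name (build_poetry_job). The virtualenvs.in-project setting makes Poetry create the environment in .venv/ inside the project, which is what allows GitLab to cache it.

```yaml
# Sketch only: job name and the pip-based Poetry install are assumptions
build_poetry_job:
  image: python:3.10
  cache:
    key:
      files:
        - poetry.lock
    paths:
      - .venv/
  script:
    - pip install poetry
    # Create the virtual environment at ./.venv so it falls under the cached path
    - poetry config virtualenvs.in-project true
    - poetry install --no-interaction
    - echo "Python build step"
```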
The most surprising thing about effective caching is how granular it needs to be. It’s not enough to just cache a directory. You need to tie the cache’s existence to the specific artifact that defines your project’s dependencies. This means understanding your build tool and its lock file. For instance, with Yarn, you’d cache node_modules based on yarn.lock. With Composer for PHP, it would be composer.lock and the vendor/ directory.
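The Yarn and Composer cases mentioned above can be sketched the same way. This is an illustration under assumptions: the job names are hypothetical, the Yarn job assumes Yarn 1 (classic) as shipped in the official node image, and the Composer job assumes the official composer:2 image.

```yaml
# Sketch only: job names are illustrative, and each job carries its own cache
build_yarn_job:
  image: node:18
  cache:
    key:
      files:
        - yarn.lock
    paths:
      - node_modules/
  script:
    # --frozen-lockfile (Yarn 1) fails instead of rewriting yarn.lock
    - yarn install --frozen-lockfile
    - echo "Yarn build step"

build_php_job:
  image: composer:2
  cache:
    key:
      files:
        - composer.lock
    paths:
      - vendor/
  script:
    - composer install --no-interaction --prefer-dist
    - echo "PHP build step"
```

Defining the cache at the job level, rather than globally, keeps each job tied to its own lock file and directory.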
The real magic happens when you realize that the cache key generation is a powerful tool. GitLab computes the key from the most recent commits that changed the listed files, so any committed change to Gemfile.lock, package-lock.json, requirements.txt, or their equivalents, even a single character, produces a new key, and GitLab treats it as a new cache. This keeps stale dependencies from leaking into builds, where they can cause unexpected failures or keep vulnerable versions around. The cache itself is only an optimization: the install command in the script is what ensures the pipeline uses the exact set of dependencies defined in your lock file.
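The computed key can also be combined with a fixed prefix through cache:key:prefix. One plausible use, sketched below, is to keep caches built under different runtime versions apart, since gems with native extensions are compiled against a specific Ruby. The prefix value ruby-3.1 is a hypothetical label chosen for this example, not something GitLab derives from the image.

```yaml
# Sketch only: the prefix string is an assumption chosen to match the image tag
build_job:
  image: ruby:3.1
  cache:
    key:
      files:
        - Gemfile.lock
      prefix: ruby-3.1   # resulting key has the form ruby-3.1-<computed SHA>
    paths:
      - vendor/bundle/
  script:
    - bundle config set --local path vendor/bundle
    - bundle install
    - echo "This is a dummy build step"
```

With this layout, bumping the image to another Ruby version means updating the prefix by hand, which starts a fresh cache without touching Gemfile.lock.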
The next step in optimizing your pipelines is often exploring distributed caching or advanced cache strategies for monorepos.