GitHub Actions can leverage GPU-accelerated hardware for your CI/CD workflows, but only on specific runner types and with careful configuration.
Here’s how to get those powerful GPUs working for you:
Setting Up GPU Runners
First, select a runner that has GPU hardware attached. That is either a self-hosted runner on a machine with a GPU, or a GPU-powered GitHub-hosted larger runner, which is not available on every plan. The labels a runner accepts are listed in the GitHub Actions documentation and under the runner settings of your organization or repository.
To target a GPU runner in your workflow, you’ll specify the runs-on field in your job definition. For example:
jobs:
  gpu_job:
    runs-on: [self-hosted, gpu]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Run GPU workload
        run: |
          # Your GPU-accelerated commands here
          echo "Running on a GPU runner!"
          # Example: nvidia-smi to check GPU status
          nvidia-smi
The [self-hosted, gpu] syntax tells Actions to pick a self-hosted runner that also carries the gpu label; a runner must match every label in the list. If you’re using GitHub-hosted runners with GPU capabilities, you might use a label like runs-on: [ubuntu-latest-gpu], depending on how the runner is named and what GitHub offers at the time. Always check the official GitHub Actions documentation for current runner availability.
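A job that lands on a mislabeled runner can otherwise run for a long time before failing. The following is a minimal sketch of a fail-fast check for the start of a run step. The function names gpu_count and require_gpu and the NVIDIA_SMI override variable are hypothetical helpers introduced here, not part of GitHub Actions or the NVIDIA tooling; the sketch assumes an NVIDIA GPU and relies on nvidia-smi's query mode.

```shell
#!/bin/sh
# Fail-fast GPU check for the start of a `run:` step (sketch).
# NVIDIA_SMI is a hypothetical override so the check can be exercised on
# machines without a GPU; it defaults to the real nvidia-smi binary.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Print the number of GPUs the driver reports, or 0 if the tool is
# missing or exits with an error.
gpu_count() {
  if ! command -v "$NVIDIA_SMI" >/dev/null 2>&1; then
    echo 0
    return 0
  fi
  if ! _out="$("$NVIDIA_SMI" --query-gpu=name --format=csv,noheader 2>/dev/null)"; then
    echo 0
    return 0
  fi
  # One non-empty line per GPU; grep exits non-zero on a count of 0.
  printf '%s\n' "$_out" | grep -c . || true
}

# Return non-zero, with an Actions error annotation, when no GPU is visible.
require_gpu() {
  _n="$(gpu_count)"
  if [ "$_n" -eq 0 ]; then
    echo "::error::No NVIDIA GPU is visible on this runner"
    return 1
  fi
  echo "Found ${_n} GPU(s)"
}
```

In a workflow, the step would define or source these functions and then call require_gpu || exit 1 before the real workload.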
Installing GPU Drivers and Libraries
The most common pitfall is assuming the GPU drivers and necessary libraries (like CUDA or cuDNN for NVIDIA GPUs) are pre-installed on the runner. Often they are not, so check first and install whatever is missing as part of your workflow.
For NVIDIA GPUs:
- Install NVIDIA Drivers: This is crucial. You can run the job in a pre-built CUDA Docker image on a host that already has drivers, or install the drivers directly on the runner.
  - Diagnosis: Before installing, check whether nvidia-smi is available. If the command is not found, the drivers are missing.
  - Fix (Example using a Docker image):

      jobs:
        gpu_job:
          runs-on: [self-hosted, gpu]
          container:
            image: nvidia/cuda:11.8.0-base-ubuntu22.04 # Example CUDA version
            options: --gpus all # This is key to exposing GPUs to the container
          steps:
            - name: Checkout code
              uses: actions/checkout@v4
            - name: Verify GPU access
              run: nvidia-smi

    The options: --gpus all entry in the container definition is essential for Docker to pass the host’s GPUs into the container. The host itself still needs a working NVIDIA driver and the NVIDIA Container Toolkit.
  - Fix (Example installing drivers on a bare runner): This is more complex and often involves downloading and running NVIDIA’s installer scripts. You’d typically do this in a run step. Caution: This can be brittle due to OS updates.

      jobs:
        gpu_job:
          runs-on: [self-hosted, gpu]
          steps:
            - name: Install NVIDIA Drivers
              run: |
                # Example: Download and run NVIDIA installer
                # This is a simplified example; actual commands depend on OS and driver version
                curl -O https://us.download.nvidia.com/tesla/525.60.13/NVIDIA-Linux-x86_64-525.60.13.run
                # The installer must run as root
                sudo sh NVIDIA-Linux-x86_64-525.60.13.run --silent --dkms
            - name: Install CUDA Toolkit
              run: |
                # Download and install CUDA toolkit (e.g., from NVIDIA's repo)
                # ... commands to add repo and install cuda-toolkit-11-8
            - name: Verify GPU access
              run: nvidia-smi

    This approach requires careful management of the runner’s environment.
- Install CUDA Toolkit and cuDNN: Once drivers are in place, you need the CUDA toolkit and the cuDNN library for deep learning frameworks.
  - Diagnosis: Your deep learning framework (TensorFlow, PyTorch) will likely throw errors like "Could not load dynamic library 'libcudart.so.11.0'" or "cuDNN not found."
  - Fix (within a Docker container): Use a CUDA-enabled base image like nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04. These images ship the CUDA and cuDNN libraries; the driver comes from the host, so the host needs it installed and nvidia-container-runtime set up.
  - Fix (on a bare runner): Download and install the specific CUDA toolkit and cuDNN versions required by your application. Ensure the PATH and LD_LIBRARY_PATH environment variables point to the CUDA binaries and libraries. An export inside one run step does not carry over to later steps, so write to GITHUB_PATH and GITHUB_ENV instead.

      jobs:
        gpu_job:
          runs-on: [self-hosted, gpu]
          steps:
            - name: Checkout code
              uses: actions/checkout@v4
            - name: Install CUDA Toolkit 11.8
              run: |
                # Commands to download and install CUDA from NVIDIA's repo
                # e.g., wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
                # sudo dpkg -i cuda-keyring_1.0-1_all.deb
                # sudo apt-get update
                # sudo apt-get -y install cuda-toolkit-11-8
                # Persist the paths for later steps (a plain export would be lost):
                # echo "/usr/local/cuda-11.8/bin" >> "$GITHUB_PATH"
                # echo "LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:$LD_LIBRARY_PATH" >> "$GITHUB_ENV"
            - name: Install cuDNN 8.6
              run: |
                # Download cuDNN tarball from NVIDIA, extract, and copy files
                # to CUDA installation directories.
                # ... commands to extract and copy ...
            - name: Verify CUDA/cuDNN
              run: |
                nvcc --version
                # Check for cuDNN by importing in Python (if installed)
                # python -c "import torch; print(torch.backends.cudnn.version())"
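Variables exported inside one run step are not visible in later steps, which is a common reason nvcc is "not found" right after a successful install. The sketch below shows one way to persist the CUDA locations by appending to the files that the runner exposes through GITHUB_PATH and GITHUB_ENV. The function name persist_cuda_env and the default prefix /usr/local/cuda-11.8 are assumptions chosen to match the examples in this article; adjust them to your install.

```shell
#!/bin/sh
# Persist CUDA locations for later steps of the same job (sketch).
# GITHUB_PATH and GITHUB_ENV are files provided by the Actions runner;
# lines appended to them take effect in subsequent steps.
# $1: CUDA install prefix (assumed default: /usr/local/cuda-11.8)
persist_cuda_env() {
  _cuda_home="${1:-/usr/local/cuda-11.8}"

  # Prepend the CUDA bin directory to PATH for later steps.
  echo "${_cuda_home}/bin" >> "$GITHUB_PATH"

  # Make the CUDA libraries discoverable by the dynamic loader,
  # keeping any value that is already set.
  if [ -n "${LD_LIBRARY_PATH:-}" ]; then
    echo "LD_LIBRARY_PATH=${_cuda_home}/lib64:${LD_LIBRARY_PATH}" >> "$GITHUB_ENV"
  else
    echo "LD_LIBRARY_PATH=${_cuda_home}/lib64" >> "$GITHUB_ENV"
  fi

  # Many build systems also look for CUDA_HOME.
  echo "CUDA_HOME=${_cuda_home}" >> "$GITHUB_ENV"
}
```

Calling persist_cuda_env at the end of the install step lets the following steps run nvcc and load the CUDA libraries without repeating the exports.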
Runner Configuration (Self-Hosted)
If you’re using self-hosted runners, the runner software itself needs to be aware of the GPU.
- GPU Label: When registering the self-hosted runner, you must assign it the gpu label.
  - Diagnosis: Jobs targeting runs-on: [self-hosted, gpu] stay queued and are never picked up.
  - Fix: Labels are set when the runner is configured, not when it is started. Pass the --labels flag to config.sh (the self-hosted label is added by default), then start the runner with run.sh:

      ./config.sh --url https://github.com/YOUR_ORG/YOUR_REPO --token YOUR_TOKEN --labels gpu
      ./run.sh

    Ensure that the machine running the self-hosted runner actually has a GPU and that the system can see it (e.g., nvidia-smi works on the host).
- Docker with --gpus all: If your workflow uses Docker, the runner process itself needs to be able to pass the --gpus all flag to Docker commands.
  - Diagnosis: Containers started by the Actions runner cannot see or use the GPUs.
  - Fix: Ensure Docker on the runner machine is configured for GPU access (for NVIDIA GPUs this typically means installing the NVIDIA Container Toolkit), and that containers are started with --gpus all. The container configuration in the workflow YAML handles this through its options field, as shown above, but if you’re running docker run commands directly in a run step, you’ll need to add the flag there.
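Before debugging a workflow, it helps to confirm on the runner host that Docker can hand a GPU to a container at all. The following is a small sketch of such a smoke test. The function name docker_gpu_check and the DOCKER and CUDA_IMAGE override variables are hypothetical; the image tag is the one used earlier in this article and should match your own setup.

```shell
#!/bin/sh
# Smoke test: can a container started on this host see a GPU? (sketch)
# DOCKER and CUDA_IMAGE are hypothetical overrides for testing and for
# choosing a different image.
DOCKER="${DOCKER:-docker}"
CUDA_IMAGE="${CUDA_IMAGE:-nvidia/cuda:11.8.0-base-ubuntu22.04}"

# Exit status: 0 = GPU visible in container, 1 = not visible, 2 = no docker.
docker_gpu_check() {
  if ! command -v "$DOCKER" >/dev/null 2>&1; then
    echo "docker not found on PATH"
    return 2
  fi
  # Run nvidia-smi inside a throwaway container with all GPUs requested.
  if "$DOCKER" run --rm --gpus all "$CUDA_IMAGE" nvidia-smi >/dev/null 2>&1; then
    echo "containers can see the GPU"
  else
    echo "container could not access a GPU; check the NVIDIA Container Toolkit on the host"
    return 1
  fi
}
```

Run it as the same user the runner service uses, since a failure for that user usually means workflow containers fail the same way.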
Caching Dependencies
GPU-dependent libraries and models can be large. Caching them significantly speeds up subsequent runs.
- Diagnosis: Jobs take a very long time to set up, re-downloading drivers, CUDA, cuDNN, or large model files on every run.
- Fix: Use the actions/cache action. Cache the directories where drivers, CUDA, cuDNN, or your model artifacts are installed or downloaded.

    jobs:
      gpu_job:
        runs-on: [self-hosted, gpu]
        steps:
          - name: Checkout code
            uses: actions/checkout@v4
          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: '3.10'
          - name: Cache CUDA/cuDNN
            uses: actions/cache@v4
            id: cuda-cache
            with:
              # Example paths; keep comments out of the block below,
              # where they would become part of the path
              path: |
                /usr/local/cuda-11.8
                /usr/local/cudnn-8.6
              key: ${{ runner.os }}-cuda-${{ hashFiles('**/cuda-keyring_1.0-1_all.deb') }} # Adjust key based on how you install
          - name: Install CUDA/cuDNN (if not cached)
            if: steps.cuda-cache.outputs.cache-hit != 'true'
            run: |
              # ... commands to install CUDA/cuDNN ...
          - name: Cache Python dependencies
            uses: actions/cache@v4
            id: python-cache
            with:
              path: |
                ~/.cache/pip
                venv
              key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          - name: Install Python dependencies (if not cached)
            if: steps.python-cache.outputs.cache-hit != 'true'
            run: |
              python -m venv venv
              source venv/bin/activate
              pip install -r requirements.txt
          - name: Run GPU workload
            run: |
              # Your GPU-accelerated commands here
              source venv/bin/activate
              python your_gpu_script.py

  The cache key needs to be carefully crafted so that you get cache hits when nothing has changed, but the cache is invalidated when the underlying installation commands change.
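One way to tie the cache key to the installation commands is to hash the script that performs the install, so that any edit to the script produces a new key. The sketch below assumes the install commands live in a file in the repository; the function name cache_key and the example prefix are hypothetical, and sha256sum is assumed to be available (it is part of GNU coreutils on Linux runners).

```shell
#!/bin/sh
# Derive a cache key from the contents of an install script (sketch).
# $1: key prefix, for example "Linux-cuda"
# $2: path to the script whose contents should invalidate the cache
cache_key() {
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  _digest="$(sha256sum "$2" | cut -d' ' -f1)"
  echo "$1-${_digest}"
}
```

A step could append key=$(cache_key ...) to the file named by GITHUB_OUTPUT, and the cache step could then read it through the steps context. When the script is checked into the repository, the built-in hashFiles expression achieves the same effect without a helper.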
The next hurdle you’ll likely face is optimizing memory usage on the GPU to avoid CUDA out of memory errors.