The NVIDIA GPU Operator is not just a driver installer; it’s a Kubernetes-native solution that automates the deployment and management of NVIDIA GPU resources within your cluster.
Imagine you’ve got a Kubernetes cluster and you want to leverage the power of NVIDIA GPUs for machine learning, scientific computing, or other demanding workloads. You could manually install drivers, CUDA toolkits, and other NVIDIA software on each node. That sounds like a nightmare to keep consistent and up-to-date across your entire fleet, especially as nodes join and leave the cluster. The GPU Operator tackles this by treating GPU management as a Kubernetes control plane problem.
Here’s a simplified view of how it works in action. Let’s say you have a Kubernetes cluster with two nodes, gpu-node-1 and gpu-node-2, both equipped with NVIDIA Tesla T4 GPUs.
First, you’ll apply the GPU Operator’s Custom Resource Definitions (CRDs) and its core components. These components, running as pods within your cluster, watch for specific Kubernetes resources.
    # Example: ClusterPolicy custom resource (abridged; a real policy carries more fields)
    apiVersion: nvidia.com/v1
    kind: ClusterPolicy
    metadata:
      name: gpu-cluster-policy
    spec:
      # This tells the operator to manage the driver installation
      driver:
        enabled: true
        # Specify the desired driver version
        version: "470.82.01"
      # This tells the operator to manage the NVIDIA Container Toolkit
      toolkit:
        enabled: true
      # This tells the operator to manage the device plugin
      devicePlugin:
        enabled: true
      # This tells the operator to manage the MIG manager (if applicable)
      migManager:
        enabled: false
When you apply this policy, the GPU Operator’s gpu-feature-discovery component runs on each GPU node. It scans the node for NVIDIA GPUs and their capabilities, then publishes what it finds as labels on the Kubernetes Node object. It does not create a separate custom resource for each GPU.

    # Example: node labels set by gpu-feature-discovery (excerpt of the Node object)
    apiVersion: v1
    kind: Node
    metadata:
      name: gpu-node-1
      labels:
        nvidia.com/gpu.product: NVIDIA-Tesla-T4
        nvidia.com/gpu.count: "1"
        # GPU memory in MiB
        nvidia.com/gpu.memory: "16384"
        # ... other GPU properties
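GPU Feature Discovery exposes nvidia.com/gpu.product as an ordinary label on the node, so a workload can use it to target a particular GPU model. The fragment below is a minimal sketch of the relevant part of a pod spec. The label value is the one used in this article’s example; the value on a real cluster may differ, so check the node’s labels before relying on it.

```yaml
# Fragment of a pod spec: schedule only onto nodes whose GPUs carry this product label.
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-Tesla-T4   # value from the example; verify on your nodes
```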
The driver component comes next. If the cluster policy has driver.enabled: true and specifies a driver.version, the operator ensures that the matching NVIDIA driver is installed on every GPU node. It does this by deploying a DaemonSet whose pod, on each node, installs the driver from a container image. Setting driver.enabled: false instead tells the operator to rely on a driver that is already installed on the host. For example, to install driver version 470.82.01 using the containerized approach, the DaemonSet pod might run a command like:
    # Illustrative only: roughly what the driver DaemonSet pod runs on a node.
    # The driver version is determined by the installer package, not by a flag.
    sh NVIDIA-Linux-x86_64-470.82.01.run --silent --no-opengl-files --utility-prefix=/usr/local/nvidia
This ensures that the host kernel modules and user-space libraries are present and correctly configured for that specific driver version.
Similarly, the container toolkit component, also deployed as a DaemonSet, installs the NVIDIA Container Toolkit and configures the node’s container runtime to use it. When it is enabled in the policy, the runtime injects the driver’s user-space libraries and GPU device files into containers that request GPUs. The CUDA toolkit itself is normally shipped inside the application’s container image rather than installed on the node.
Finally, the device-plugin component is responsible for advertising the available GPUs to Kubernetes. It runs as a DaemonSet and registers with the kubelet on each node, which then reports the GPUs as an extended resource in the node’s capacity. When a pod requests a GPU resource (e.g., nvidia.com/gpu: 1), the scheduler places it on a node with an unallocated GPU, and the device plugin tells the kubelet which device files and environment variables to set up so that the container can access the GPU.
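To make the request side concrete, here is a minimal sketch of a pod that asks for one GPU. The pod name and image are placeholders invented for illustration, and the command assumes nvidia-smi is available inside the container, which is the usual result of the runtime injection but is worth verifying with your own image.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                      # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: registry.example.com/cuda-app:latest   # placeholder image
      command: ["nvidia-smi"]               # assumes nvidia-smi is present in the container
      resources:
        limits:
          nvidia.com/gpu: 1                 # extended resource advertised by the device plugin
```

GPUs are requested under limits and are allocated as whole units, so a pod cannot ask for a fraction of nvidia.com/gpu.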
The core problem the GPU Operator solves is the operational overhead of managing GPU hardware and software in a dynamic Kubernetes environment. It abstracts away the complexities of driver installation, CUDA toolkit management, and device discovery, treating them as declarative Kubernetes resources. This allows you to simply request GPUs in your pod specifications, and the operator ensures they are provisioned and accessible.
One of the most surprising things about the GPU Operator is how it handles driver upgrades. When you change driver.version in your cluster policy, the operator doesn’t replace the driver on every node at once. Instead, it orchestrates a rolling update of the driver DaemonSet. For each node in turn, it cordons the node so that no new workloads land on it, evicts the GPU workloads already running there, updates the driver, and finally uncordons the node so that scheduling can resume. This confines the disruption to one node (or a small batch of nodes) at a time, so applications with replicas on other nodes keep running during driver maintenance.
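The rolling behavior is itself configured declaratively. The fragment below sketches what that could look like, assuming your operator version exposes an upgradePolicy block under driver. The field names reflect the upstream CRD as best understood here, and the values are illustrative, so check both against the CRD installed in your cluster.

```yaml
# Fragment of the cluster policy spec (assumed field names; verify against your CRD).
spec:
  driver:
    enabled: true
    version: "470.82.01"        # changing this value starts the rolling upgrade
    upgradePolicy:
      autoUpgrade: true         # let the operator orchestrate the upgrade
      maxParallelUpgrades: 1    # upgrade one node at a time
      drain:
        enable: true            # drain the node before the driver is replaced
        timeoutSeconds: 300     # illustrative timeout
```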
The next concept you’ll likely encounter is managing specific GPU features like MIG (Multi-Instance GPU) or NVLink, which require further configuration within the cluster policy and an understanding of how the operator exposes these advanced capabilities.