Kubernetes GPU Scheduling Deep Dive

Kubernetes doesn’t magically make your GPUs available to pods; it relies on a device plugin that advertises them to the Kubelet on each node, which in turn makes them visible to the scheduler.

Here’s how it works and how to get your ML workloads running:

The Core Problem: How Kubernetes Sees GPUs

Kubernetes itself doesn’t have native awareness of hardware like GPUs. It’s an orchestrator of containers. To make GPUs visible, you need a "device plugin." The NVIDIA device plugin is the standard for NVIDIA GPUs. It runs as a DaemonSet, so one plugin pod lands on each GPU node, discovers the GPUs there, and registers them with the Kubelet (the agent on each node). The Kubelet then reports them to the Kubernetes API server as an extended resource named nvidia.com/gpu in the node’s capacity and allocatable fields, which the scheduler uses to place pods that request it.
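
To make the scheduling step concrete, here is a toy model of the check the scheduler effectively performs for a GPU request. It is a simplified sketch, not scheduler code: the Node class, the in_use bookkeeping, and the node names are all made up for illustration.

```python
from dataclasses import dataclass

GPU_RESOURCE = "nvidia.com/gpu"


@dataclass
class Node:
    """Hypothetical stand-in for a node as the scheduler sees it."""
    name: str
    allocatable: dict  # resource name -> count advertised by the Kubelet
    in_use: dict       # resource name -> count already claimed by running pods


def free_gpus(node: Node) -> int:
    """GPUs the node advertises minus GPUs already claimed."""
    return node.allocatable.get(GPU_RESOURCE, 0) - node.in_use.get(GPU_RESOURCE, 0)


def feasible_nodes(nodes, gpus_requested: int) -> list:
    """Names of nodes that could host a pod asking for `gpus_requested` GPUs."""
    return [n.name for n in nodes if free_gpus(n) >= gpus_requested]


cluster = [
    Node("cpu-node", {"cpu": 8}, {}),                                    # no plugin, no GPUs
    Node("gpu-node-a", {"cpu": 8, GPU_RESOURCE: 1}, {GPU_RESOURCE: 1}),  # GPU already taken
    Node("gpu-node-b", {"cpu": 8, GPU_RESOURCE: 2}, {GPU_RESOURCE: 1}),  # one GPU still free
]
print(feasible_nodes(cluster, 1))  # ['gpu-node-b']
```

A node with no device plugin simply has no nvidia.com/gpu entry, so it can never satisfy a GPU request, which is the situation the rest of this article helps you avoid.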

Setting Up the NVIDIA Device Plugin

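A common way to deploy the NVIDIA device plugin is via the NVIDIA GPU Operator Helm chart, which manages the plugin together with the other NVIDIA components a GPU node needs.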

Add the NVIDIA Helm repository:

```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```

Install the GPU Operator (which includes the device plugin): The basic installation looks like this. The chart version is pinned here as an example; substitute the release you have validated for your cluster.
```
helm install gpu-operator nvidia/gpu-operator \
  --version v23.9.1 \
  --namespace gpu-operator \
  --create-namespace \
  --set devicePlugin.enabled=true
```
This Helm chart installs not just the device plugin but also the NVIDIA driver, the NVIDIA Container Toolkit, and other components that make GPU acceleration work. The device plugin is enabled by default, so devicePlugin.enabled=true only makes that choice explicit.
Verify the deployment: After installation, check that the operator’s pods are running in the gpu-operator namespace.
```
kubectl get pods -n gpu-operator
```
You should see pods like gpu-operator-xxxxxxxxxx-yyyyy, nvidia-container-toolkit-daemonset-xxxxx, and nvidia-device-plugin-daemonset-xxxxx.
Check for GPU resources: Once the device plugin is running, your nodes should report GPU resources.
```
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```
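This prints one number for each node that advertises the resource (e.g., 1 1 for two single-GPU nodes). Nodes with no nvidia.com/gpu entry contribute nothing to the output, so the numbers don’t line up with node names. If the output is empty, or 0 for every node, the plugin isn’t working correctly.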
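
Because the jsonpath query drops node names, a small script can be easier to read on larger clusters. The following is a minimal sketch that works on the JSON printed by kubectl get nodes -o json; the function name and the trimmed sample document are made up for illustration.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count, using 0 when the resource is absent."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # The API serializes resource quantities as strings, e.g. "1".
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


# Hypothetical, heavily trimmed sample of `kubectl get nodes -o json` output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "cpu-node"},   "status": {"allocatable": {"cpu": "8"}}},
  {"metadata": {"name": "gpu-node-a"}, "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}}
]}
""")

for name, count in gpus_per_node(SAMPLE).items():
    print(f"{name}\t{count}")
```

To run it against a real cluster, replace SAMPLE with json.load(sys.stdin) and pipe the kubectl output into the script.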

Requesting GPUs in Your Pods

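To use a GPU, your pod’s container specification needs to request it under resources.limits. You don’t need a matching entry under resources.requests: Kubernetes uses the limit as the request, and if you do set both, the two values must be equal.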

Here’s an example of a simple PyTorch training pod requesting one GPU:

```
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-gpu-job
spec:
  restartPolicy: Never # Don't restart the container once training exits
  containers:
  - name: training-container
    image: nvcr.io/nvidia/pytorch:23.09-py3
    command: ["python", "/app/train.py"]
    resources:
      limits:
        nvidia.com/gpu: 1 # Request one GPU
    volumeMounts:
    - name: app-volume
      mountPath: /app
  volumes:
  - name: app-volume
    hostPath:
      path: /path/to/your/local/code # Replace with the code path on the node
      type: Directory
```

Explanation:

nvidia.com/gpu: 1: This is the crucial line. It tells the Kubernetes scheduler that this container requires one GPU resource of the type nvidia.com/gpu. The scheduler will then only place this pod on nodes that have at least one GPU available and registered by the device plugin.
image: nvcr.io/nvidia/pytorch:23.09-py3: This is an example NVIDIA NGC image pre-configured with PyTorch and CUDA. You’ll likely use your own custom training image.
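
If you launch many similar jobs, generating the manifest can be less error-prone than copying YAML around. Here is a minimal sketch using only the standard library; the function name is hypothetical, the volume section is left out for brevity, and restartPolicy: Never is an assumption that suits run-to-completion training. It prints JSON, which kubectl apply -f accepts just like YAML.

```python
import json


def gpu_pod_manifest(name: str, image: str, command: list, gpus: int = 1) -> dict:
    """Build a single-container Pod manifest that requests `gpus` whole GPUs."""
    if not isinstance(gpus, int) or isinstance(gpus, bool) or gpus < 1:
        raise ValueError("gpus must be a positive whole number")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumption: a training run should not be restarted after it exits.
            "restartPolicy": "Never",
            "containers": [{
                "name": "training-container",
                "image": image,
                "command": command,
                # GPUs go under limits; Kubernetes uses the limit as the request.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


manifest = gpu_pod_manifest(
    "pytorch-gpu-job",
    "nvcr.io/nvidia/pytorch:23.09-py3",
    ["python", "/app/train.py"],
)
print(json.dumps(manifest, indent=2))
```

Save the output to a file such as pod.json and apply it with kubectl apply -f pod.json.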

Running Your ML Workload

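Build your container image: Ensure your Dockerfile installs the necessary ML libraries (TensorFlow, PyTorch, etc.) and includes your training script. The image must also provide the CUDA user-space libraries your framework was built against, either from a CUDA base image or bundled with the framework package. The NVIDIA driver itself does not belong in the image; it lives on the node and is exposed to the container by the NVIDIA container runtime.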
Push your image to a registry: Make it accessible to your Kubernetes cluster.

Apply the Pod definition:

```
kubectl apply -f your-pytorch-gpu-job.yaml
```

Monitor the pod:
```
kubectl get pods -o wide
kubectl logs pytorch-gpu-job
```
If the pod is stuck in Pending state, it’s likely a scheduling issue. kubectl describe pod pytorch-gpu-job will show events indicating why it can’t be scheduled (e.g., "0/3 nodes are available: 3 Insufficient nvidia.com/gpu").
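
The same information is available in machine-readable form on the pod’s PodScheduled condition, which is handy if you want to alert on stuck jobs. Below is a minimal sketch that inspects the JSON from kubectl get pod pytorch-gpu-job -o json; the function name and the trimmed sample are made up for illustration.

```python
def unschedulable_reason(pod: dict):
    """Return the scheduler's message if the pod is Pending and unschedulable, else None."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return None
    for cond in status.get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return cond.get("message", "")
    # Pending for another reason, e.g. the image is still being pulled.
    return None


# Hypothetical, heavily trimmed sample of `kubectl get pod ... -o json` output.
sample_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [{
            "type": "PodScheduled",
            "status": "False",
            "reason": "Unschedulable",
            "message": "0/3 nodes are available: 3 Insufficient nvidia.com/gpu",
        }],
    }
}
print(unschedulable_reason(sample_pod))
```

A None result for a Pending pod means the scheduler has already placed it and the delay is elsewhere, so look at the container status and node events instead.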

Common Pitfalls and Debugging

Node Not Reporting GPUs: The most common issue is the NVIDIA device plugin not running or not detecting GPUs. Check the logs of the nvidia-device-plugin-xxxxx pod in the gpu-operator namespace. Look for errors related to Kubelet communication or driver detection. Ensure your nodes have NVIDIA drivers installed and that the nvidia-container-runtime is configured. The gpu-operator Helm chart usually handles this if driver.enabled=true.
Incorrect Resource Request: Double-check nvidia.com/gpu: 1 in your container’s resources.limits. Typos here are frequent.
Multiple GPUs on a Node: GPUs are requested in whole numbers. Each unit of nvidia.com/gpu maps to one entire physical GPU, and a container can ask for more than one if the node has them. If you need to split a single GPU (e.g., for MIG - Multi-Instance GPU), that requires more advanced configuration of the device plugin and the NVIDIA driver. For standard workloads, requesting nvidia.com/gpu: 1 is sufficient.
Driver Mismatch: The CUDA version your application is built against must be supported by the NVIDIA driver installed on the node. The gpu-operator Helm chart aims to install compatible versions. If you’re managing drivers manually, this is a common source of runtime errors such as "CUDA driver version is insufficient for CUDA runtime version". CUDA_ERROR_NO_DEVICE, by contrast, usually means the container can’t see any GPU at all.
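Network Policies: These rarely affect GPU scheduling itself. The device plugin registers with the Kubelet over a Unix socket on the node rather than over the pod network, and it is the Kubelet, not the plugin, that reports GPU resources to the API server. Restrictive network policies in the gpu-operator namespace can still stop the operator’s own pods from reaching the API server, so rule them out if those components fail to start.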
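
Several of these pitfalls can be caught before a manifest ever reaches the cluster. Here is a rough sketch of a pre-flight check for the resource-request mistakes above. The function name is hypothetical, and the typo heuristic only catches case and trailing-"s" slips such as nvidia.com/gpus, so treat it as a starting point rather than a complete validator.

```python
GPU_RESOURCE = "nvidia.com/gpu"


def check_gpu_resources(container: dict) -> list:
    """Return a list of problems with a container's GPU request (empty if none found)."""
    problems = []
    resources = container.get("resources", {})
    limits = resources.get("limits", {})
    requests = resources.get("requests", {})

    # Flag near-miss keys such as "nvidia.com/gpus" or "nvidia.com/GPU".
    for section_name, section in (("limits", limits), ("requests", requests)):
        for key in section:
            if key != GPU_RESOURCE and key.lower().rstrip("s") == GPU_RESOURCE:
                problems.append(f"{section_name}: '{key}' looks like a typo of '{GPU_RESOURCE}'")

    if GPU_RESOURCE not in limits:
        problems.append(f"no '{GPU_RESOURCE}' entry under resources.limits")
        return problems

    value = str(limits[GPU_RESOURCE])
    if not value.isdigit() or int(value) < 1:
        problems.append(f"limit must be a positive whole number, got {value!r}")
    # If a request is set as well, Kubernetes requires it to equal the limit.
    if GPU_RESOURCE in requests and str(requests[GPU_RESOURCE]) != value:
        problems.append("requests and limits for the GPU resource must be equal")
    return problems


good = {"resources": {"limits": {"nvidia.com/gpu": 1}}}
typo = {"resources": {"limits": {"nvidia.com/gpus": 1}}}
print(check_gpu_resources(good))  # []
for problem in check_gpu_resources(typo):
    print(problem)
```

The check takes a single entry from spec.containers, so loop over that list to cover multi-container pods.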

The NVIDIA device plugin allows Kubernetes to see GPUs as a schedulable resource, enabling pods to request and utilize them for compute-intensive tasks.