MLOps GPU Scheduling: Maximize Cluster Utilization (2026)

The most surprising truth about GPU scheduling is that it’s not about assigning GPUs, but about denying them until the absolute last moment.

Imagine you’ve got a cluster of A100s, and a bunch of training jobs are piling up. Your goal is to keep those expensive GPUs purring, not sitting idle. This is where MLOps GPU scheduling comes in.

Here’s a typical scenario:

# Job 1: Large BERT training
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: bert-large-train
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: tensorflow
            image: nvcr.io/nvidia/tensorflow:21.11-tf2-py3
            resources:
              limits:
                nvidia.com/gpu: 1
---
# Job 2: Smaller image classification
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: resnet50-classify
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 8
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime
            resources:
              limits:
                nvidia.com/gpu: 1

You submit these jobs to Kubernetes, and a scheduler takes over: the default kube-scheduler, or a batch-oriented layer such as Volcano or Kueue. A utilization-aware scheduler doesn’t just grab the first available GPU and slap it onto a pod. That would be wasteful.

Instead, the scheduler is constantly evaluating the total demand against the total supply. It looks at the requested GPU resources (nvidia.com/gpu: 1 in this case), the current cluster utilization, and the priority of incoming jobs. It’s playing a game of Tetris, trying to fit the most blocks (each job’s GPU request, held for as long as the job runs) into the available space (the cluster’s GPUs over time) without leaving too many gaps.

The core problem this solves is the massive underutilization of expensive GPU hardware. A pod holds its GPUs for its entire lifetime, whether or not it is computing on them. If you just let pods grab GPUs as they start, a job that only needs 10 minutes of GPU work can hold onto one for hours while other jobs queue up. This is a direct hit to your ROI.

Here’s how it works internally, simplified:

Resource Discovery: The NVIDIA device plugin on each node reports that node’s GPUs to the kubelet, so each node advertises nvidia.com/gpu resources that the scheduler can count.
Pod Scheduling: When a pod requests GPUs, the Kubernetes scheduler (or an advanced scheduler) identifies nodes that can satisfy the request.
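Preemption/Prioritization: If there aren’t enough GPUs, higher-priority pods might preempt (kick out) lower-priority pods. The scheduler decides this from pod priority (set through PriorityClass objects), and batch schedulers may layer custom fairness policies on top. QoS classes play no part in scheduler preemption; they come into play when the kubelet evicts pods under node resource pressure.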
Bin Packing: The scheduler tries to pack pods onto nodes efficiently. It can place multiple smaller jobs on a single node as long as their combined requests don’t exceed the node’s GPU capacity, and with Multi-Instance GPU (MIG) devices, several pods can each take a slice of one physical GPU.
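
The preemption step can be sketched with two PriorityClass objects. The class names, numeric values, and descriptions below are illustrative assumptions, not something the jobs above define; only the PriorityClass fields themselves are standard Kubernetes.

```yaml
# Hypothetical priority tiers; names and numeric values are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-critical
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority   # the default: may evict lower-priority pods
description: "Production training runs that may preempt lower-priority work."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-best-effort
value: 1000
globalDefault: false
preemptionPolicy: Never                  # waits in the queue instead of evicting others
description: "Experiments that can wait, and can be preempted."
```

A pod opts in by setting priorityClassName in its pod spec. For the TFJob above, that would go under template.spec, next to containers.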

The "levers" you control are primarily:

Resource Requests: resources.limits.nvidia.com/gpu: N. This is the most critical lever. Request exactly the number of GPUs a job needs, no more, no less. GPUs are specified under limits; if you also set requests, the two values must match.
Node Affinity/Tolerations: Directing specific job types to nodes with particular GPU models or configurations.
Priority Classes: Defining PriorityClass objects in Kubernetes to tell the scheduler which jobs are more important.
Resource Quotas/Limit Ranges: Enforcing limits on how many GPUs a namespace or user can consume.
Advanced Schedulers: Integrating tools like Volcano for gang scheduling (ensuring all replicas of a distributed job start together) or custom schedulers for complex allocation policies.
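
Several of these levers can be combined in one pod spec plus a namespace quota. This is a minimal sketch under stated assumptions: the namespace, the pod and quota names, the node label value, the taint key, and the priority class name are all hypothetical and will differ per cluster.

```yaml
# Sketch only: namespace, names, label value, taint key, and class name are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-levers-demo
  namespace: team-vision
spec:
  priorityClassName: training-critical   # assumes a PriorityClass with this name exists
  nodeSelector:
    # Label of this form is typically published by NVIDIA GPU Feature Discovery;
    # the exact value depends on the hardware.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  tolerations:
  - key: nvidia.com/gpu                  # assumes GPU nodes carry a taint with this key
    operator: Exists
    effect: NoSchedule
  restartPolicy: Never
  containers:
  - name: pytorch
    image: pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime
    resources:
      limits:
        nvidia.com/gpu: 1                # request exactly what the job uses
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-vision
spec:
  hard:
    requests.nvidia.com/gpu: 8           # the namespace can hold at most 8 GPUs at once
```

For extended resources such as GPUs, quota is expressed with the requests. prefix only, which is why there is no matching limits. entry.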

Consider a node with 8 A100 GPUs. If you have 8 individual jobs, each requesting 1 GPU, they can all fit. But if one job requests 4 GPUs and another requests 5, they can’t both run on that node simultaneously: together they need 9 GPUs and the node has 8. GPUs are allocated as whole, exclusive units, so nothing can be oversubscribed. Whichever job lands first leaves 4 or 3 GPUs free, and those sit idle unless a job small enough to fit comes along. This is why careful resource requests are paramount.

What most people don’t realize is that the default Kubernetes scheduler is quite primitive when it comes to GPU bin-packing. It places one pod at a time and primarily asks "can this node satisfy this pod’s request?" rather than "how can I best pack all pending pods onto the available GPUs across the cluster?" Its default scoring strategy (LeastAllocated) even favors spreading pods across nodes, which tends to fragment free GPUs. This is where specialized MLOps schedulers or custom scheduling logic become essential for maximizing utilization. Some go further and factor in expected job duration, GPU type, or even power consumption.
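
As one way to push the stock scheduler toward packing, kube-scheduler’s NodeResourcesFit plugin can be switched to the MostAllocated scoring strategy. The sketch below assumes you can supply a configuration file to kube-scheduler (on managed clusters this usually means running a second scheduler); the profile name gpu-binpack and the weights are illustrative assumptions.

```yaml
# Sketch: a scheduler profile that prefers nodes whose GPUs are already mostly allocated.
# Profile name and weights are assumptions; tune them for your cluster.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: gpu-binpack
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated      # score fuller nodes higher (pack) instead of emptier ones (spread)
        resources:
        - name: nvidia.com/gpu
          weight: 5              # GPU fullness dominates the score
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
```

Pods select this profile by setting schedulerName: gpu-binpack in their pod spec. This still scores one pod at a time, so it reduces fragmentation without doing cluster-wide planning.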

The next challenge is managing distributed training jobs where all workers must start simultaneously, often called "gang scheduling."
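
For a sense of what that looks like, here is a minimal sketch of the BERT workers expressed as a Volcano Job. It assumes Volcano is installed in the cluster; the job and task names are hypothetical, and the container omits the training command just as the TFJob above does.

```yaml
# Sketch: all-or-nothing start for 4 workers, assuming Volcano is installed.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: bert-large-gang          # hypothetical name
spec:
  schedulerName: volcano
  minAvailable: 4                # no worker starts unless all 4 can be placed
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: tensorflow
          image: nvcr.io/nvidia/tensorflow:21.11-tf2-py3
          resources:
            limits:
              nvidia.com/gpu: 1
```

Setting minAvailable equal to replicas prevents the situation where two distributed jobs each grab half of their GPUs and then wait on each other indefinitely.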