NVIDIA Time Slicing lets multiple Kubernetes pods share a single physical GPU by taking turns on it, which raises utilization when individual workloads are too small to keep a whole GPU busy.

Let’s see it in action. Imagine you have a single NVIDIA A100 GPU and two pods that need it for inference. Without Time Slicing, Kubernetes allocates the GPU exclusively to one pod, and the second stays Pending until the first releases it. With Time Slicing, both can be scheduled onto the same GPU, each getting a share of its compute time.

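Here’s how it looks from the Kubernetes side. First, you need to run the NVIDIA device plugin with a time-slicing configuration. The plugin reads its sharing settings from a configuration file, passed with --config-file (or the CONFIG_FILE environment variable), which says how many replicas of each GPU to advertise. A stripped-down DaemonSet might look like this; the mount path and the ConfigMap name time-slicing-config are illustrative.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.12.0
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        args:
        - --config-file=/config/config.yaml # Points the plugin at the time-slicing config
        volumeMounts:
        - name: plugin-config
          mountPath: /config # Illustrative mount path
        # ... other args and volume mounts
      volumes:
      - name: plugin-config
        configMap:
          name: time-slicing-config # Illustrative ConfigMap holding config.yaml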

When the device plugin starts with a time-slicing configuration, it advertises each physical GPU to the Kubernetes scheduler as multiple schedulable replicas. For example, with replicas set to 8 in the plugin’s configuration file, a node with one A100 advertises nvidia.com/gpu: 8. You can inspect this using kubectl describe node <your-node-name>. You’ll see entries like:

Capacity:
  nvidia.com/gpu:     8
  ...
Allocatable:
  nvidia.com/gpu:     8
  ...
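
The arithmetic behind that number is simple: the advertised count is the number of physical GPUs multiplied by the per-GPU sharing factor. Here is a minimal Python sketch of that relationship. The helper names advertised_gpus and parse_gpu_capacity are hypothetical, and the sample text is just the excerpt above pasted into a string.

```python
import re


def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu devices a node advertises under time slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas >= 1")
    return physical_gpus * replicas


def parse_gpu_capacity(describe_output: str) -> int:
    """Read the first nvidia.com/gpu count from `kubectl describe node` text."""
    match = re.search(r"nvidia\.com/gpu:\s+(\d+)", describe_output)
    if match is None:
        raise ValueError("no nvidia.com/gpu entry found")
    return int(match.group(1))


sample = """Capacity:
  nvidia.com/gpu:     8
Allocatable:
  nvidia.com/gpu:     8
"""

# One A100 shared eight ways shows up as eight schedulable devices.
assert advertised_gpus(physical_gpus=1, replicas=8) == parse_gpu_capacity(sample)
```

A node with four physical GPUs and the same sharing factor would advertise 32 devices, which is why the capacity figure alone does not tell you how much hardware is present.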

Now, when you define your pod’s resource requests, you request nvidia.com/gpu: 1. Kubernetes, seeing that 8 time-sliced GPUs are available on the node, can schedule multiple pods requesting a single GPU onto that single physical GPU.

apiVersion: v1
kind: Pod
metadata:
  name: inference-pod-1
spec:
  containers:
  - name: inference-container
    image: your-inference-image
    resources:
      limits:
        nvidia.com/gpu: 1 # Requesting one time-sliced GPU
---
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod-2
  # ... other metadata
spec:
  containers:
  - name: inference-container
    image: your-inference-image
    resources:
      limits:
        nvidia.com/gpu: 1 # Requesting another time-sliced GPU
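
Writing near-identical pod specs by hand gets tedious once more than a couple of pods share a GPU. The sketch below generates them instead. It assumes the same placeholder image name as above and emits JSON, which kubectl accepts alongside YAML; the function names inference_pod and pod_list are hypothetical.

```python
import json


def inference_pod(index: int, image: str = "your-inference-image") -> dict:
    """Build one Pod manifest that requests a single time-sliced GPU."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"inference-pod-{index}"},
        "spec": {
            "containers": [
                {
                    "name": "inference-container",
                    "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                }
            ]
        },
    }


def pod_list(count: int) -> dict:
    """Wrap `count` pods in a v1 List so one file can hold all of them."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [inference_pod(i) for i in range(1, count + 1)],
    }


if __name__ == "__main__":
    # Print the manifests as JSON on stdout.
    print(json.dumps(pod_list(2), indent=2))
```

Saved as a script, the output could be piped straight into kubectl apply -f - to create both pods in one step.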

Internally, the NVIDIA driver handles the time slicing. Each pod’s process gets its own context on the GPU, and the GPU’s time-slice scheduler switches rapidly between those contexts so that each pod gets a share of the GPU’s processing time. Nothing is partitioned: unlike MIG, which carves a GPU into isolated GPU instances, time-sliced pods share the same GPU memory and have no memory or fault isolation from each other, so one pod can exhaust GPU memory for the rest. The number of replicas per GPU is set in the device plugin’s configuration file, which is usually supplied through a ConfigMap. A typical configuration might look like:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu # Resource to replicate
      replicas: 8 # Number of pods that can share each physical GPU
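
To build some intuition for what taking turns does to each pod, here is a toy round-robin model in Python. It is a deliberate simplification and not the driver’s actual scheduler: the fixed 2 ms slice is a made-up value, context-switch cost is ignored, and the function name round_robin_finish_times is hypothetical.

```python
def round_robin_finish_times(work_ms: list[float], slice_ms: float) -> list[float]:
    """Toy model: serve each pod's pending GPU work in fixed slices, in turn.

    Returns the wall-clock time (ms) at which each pod's work completes.
    Context-switch cost is ignored.
    """
    if slice_ms <= 0:
        raise ValueError("slice_ms must be positive")
    remaining = list(work_ms)
    finish = [0.0] * len(work_ms)
    clock = 0.0
    while any(r > 0 for r in remaining):
        for pod, left in enumerate(remaining):
            if left <= 0:
                continue  # this pod has nothing left to run
            run = min(slice_ms, left)
            clock += run
            remaining[pod] = left - run
            if remaining[pod] <= 0:
                finish[pod] = clock
    return finish


# Two pods, each needing 10 ms of GPU time, with a hypothetical 2 ms slice.
print(round_robin_finish_times([10.0, 10.0], slice_ms=2.0))  # [18.0, 20.0]
```

Alone on the GPU, either pod would finish in 10 ms. Sharing does not make the work disappear; it stretches each pod’s latency in exchange for letting both run without waiting for exclusive access.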

This allows a much higher density of inference workloads: many small, bursty services can take turns on a single GPU instead of each holding one exclusively while it sits mostly idle. For training workloads, Time Slicing is generally not recommended, since long-running jobs that already saturate the GPU gain little from sharing and pay for it in context-switching overhead and contention.

The key takeaway is that nvidia.com/gpu in your pod spec, when Time Slicing is enabled on the node, doesn’t mean a dedicated physical GPU. It means a "slice" of a physical GPU, managed by the NVIDIA driver to be shared among multiple clients.

The next thing you’ll likely run into is tuning the sharing factor (the replicas value in the device plugin configuration) to match your workload characteristics and acceptable latency.
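
As a starting point for that tuning, a back-of-the-envelope bound can be computed from two limits: GPU memory, which shared pods draw from a common pool, and the latency you can tolerate. The Python sketch below is a rough heuristic under a pessimistic assumption (every pod busy at once, so latency grows linearly with the number of sharers), not an NVIDIA-provided formula. The function name max_pods_per_gpu and the example numbers are hypothetical.

```python
import math


def max_pods_per_gpu(gpu_mem_gib: float, pod_mem_gib: float,
                     solo_latency_ms: float, latency_budget_ms: float) -> int:
    """Rough upper bound on pods per GPU from memory and latency limits.

    Assumes all pods are busy at once, so latency scales linearly with the
    number of sharers. Real workloads are usually burstier than that.
    """
    if min(gpu_mem_gib, pod_mem_gib, solo_latency_ms, latency_budget_ms) <= 0:
        raise ValueError("all inputs must be positive")
    by_memory = math.floor(gpu_mem_gib / pod_mem_gib)
    by_latency = math.floor(latency_budget_ms / solo_latency_ms)
    return max(0, min(by_memory, by_latency))


# Hypothetical numbers: a 40 GiB GPU, 4 GiB per pod, 20 ms per request when
# running alone, and a 100 ms latency budget.
print(max_pods_per_gpu(40, 4, 20, 100))  # 5
```

Here memory would allow ten pods but the latency budget allows only five, so latency is the binding constraint. Measuring under real traffic should replace these estimates before settling on a value.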
