Offload Model States to CPU and NVMe with DeepSpeed ZeRO (2026)

DeepSpeed’s ZeRO Stage 3 with CPU and NVMe offload doesn’t just save GPU memory; it changes where your model’s states live and how they are accessed. In some setups this can also improve throughput, because more of the model’s states stay within reach of each process without having to fit in GPU memory.

Let’s see this in action. Imagine you have a large model that won’t fit into your GPU’s VRAM. Without offloading, you’d get an Out-of-Memory (OOM) error.

import torch
from torch import nn
from deepspeed import initialize

# Dummy model that's too big for a single GPU
class LargeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1024, 1024)
        self.layer2 = nn.Linear(1024, 1024)
        # ... imagine many more layers or larger dimensions

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = LargeModel()

# DeepSpeed configuration for ZeRO Stage 3 with CPU and NVMe offload
# This config could also be loaded from a JSON file, e.g., ds_config.json
ds_config = {
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        # Each offload section picks its own target device: "cpu" or "nvme"
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/tmp/ds_nvme_offload",  # Ensure this directory exists on an NVMe drive
            "pin_memory": True
        }
    },
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 32,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-5
        }
    }
}

# Initialize DeepSpeed
# This would typically involve a distributed launch (the deepspeed launcher, torchrun, or similar)
# For demonstration, we assume a single GPU and skip the actual initialization.
# In a real scenario, you'd have:
# model_engine, optimizer, _, _ = initialize(model=model, model_parameters=model.parameters(), config=ds_config)

print("DeepSpeed ZeRO Stage 3 with CPU/NVMe offload configured.")
print("Model states (parameters, gradients, optimizer states) will be partitioned and offloaded.")
print("During training, needed states are moved to GPU on-demand, then potentially back to CPU/NVMe.")

The core problem ZeRO Stage 3 with offloading solves is the GPU memory bottleneck. Traditional data-parallel training requires a full copy of the model parameters, gradients, and optimizer states on every GPU. As models grow, this becomes impossible. ZeRO Stage 3 partitions these states across all available GPUs. Offloading to CPU and NVMe extends this further by keeping the partitioned states in CPU RAM or on NVMe and bringing them onto the GPU only when they are needed.
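To make the bottleneck concrete, here is a rough back-of-the-envelope estimator. It assumes mixed-precision training with Adam, where the commonly cited accounting is about 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights, momentum, and variance). The function name and the 7B-parameter example are illustrative assumptions, and activations and temporary buffers are ignored.

```python
def model_state_bytes(num_params: int, num_gpus: int = 1, zero_stage: int = 0) -> float:
    """Approximate per-GPU bytes for model states (mixed-precision Adam).

    Assumption: 2 + 2 + 12 bytes per parameter; activations are not counted.
    """
    fp16_params = 2 * num_params
    fp16_grads = 2 * num_params
    optimizer_states = 12 * num_params  # fp32 master weights + momentum + variance

    # Each ZeRO stage partitions one more kind of state across the GPUs.
    if zero_stage >= 1:
        optimizer_states /= num_gpus
    if zero_stage >= 2:
        fp16_grads /= num_gpus
    if zero_stage >= 3:
        fp16_params /= num_gpus
    return fp16_params + fp16_grads + optimizer_states


GIB = 1024 ** 3
hypothetical_params = 7_000_000_000  # illustrative model size, not from the article

for stage in (0, 3):
    per_gpu = model_state_bytes(hypothetical_params, num_gpus=8, zero_stage=stage)
    print(f"ZeRO stage {stage}, 8 GPUs: {per_gpu / GIB:.1f} GiB of model states per GPU")
```

Whatever still does not fit after partitioning is what CPU and NVMe offload is meant to absorb.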

Here’s how it works internally:

Parameter Partitioning: Model parameters are divided into shards, and each GPU permanently owns only its own shard. Just before a layer runs, the full parameters for that layer are gathered from the GPUs that own the shards, and the gathered copy is released once the layer has been computed.
Gradient Partitioning: Gradients are partitioned the same way. During the backward pass, each GPU computes the gradients of a layer on its own micro-batch. These are then reduced across GPUs so that each GPU keeps only the gradients for the parameter shard it owns.
Optimizer State Partitioning: Optimizer states (like momentum and variance for Adam) are also partitioned and distributed.
CPU Offload: When GPU memory is scarce, DeepSpeed can keep parameter shards and optimizer states in CPU RAM instead of GPU memory. The pin_memory: True setting speeds up transfers between CPU and GPU by using page-locked host memory, which the GPU’s DMA engine can access directly.
NVMe Offload: For even larger models or when CPU RAM is also a constraint, states can be further offloaded to NVMe SSDs. This is slower than CPU RAM but provides a much larger capacity.
On-Demand Fetching: Crucially, when a layer needs to be computed, its required parameters are fetched from wherever they currently reside (another GPU’s shard, CPU, or NVMe) onto the GPU. After the computation, the gathered copy is released so the memory can be reused for the next layer.
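The partitioning steps listed above can be simulated in a few lines of plain Python. This is a toy sketch, not DeepSpeed’s implementation: the function names are made up for illustration, lists stand in for tensors, and the "collectives" are ordinary loops rather than real inter-GPU communication.

```python
from typing import List


def partition(flat: List[float], world_size: int) -> List[List[float]]:
    """Pad a flat parameter list and split it into equal shards, one per rank."""
    shard_len = -(-len(flat) // world_size)  # ceiling division
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Rebuild the full parameter list from every rank's shard, dropping padding."""
    full = [value for shard in shards for value in shard]
    return full[:numel]


def reduce_scatter(per_rank_grads: List[List[float]], world_size: int) -> List[List[float]]:
    """Sum full gradients across ranks, then hand each rank only its own shard."""
    summed = [sum(values) for values in zip(*per_rank_grads)]
    return partition(summed, world_size)


params = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = partition(params, world_size=2)        # each rank stores one shard
full = all_gather(shards, numel=len(params))    # gathered just before compute

# Each rank computes a full gradient on its own micro-batch (dummy values here).
grads_rank0 = [1.0] * len(params)
grads_rank1 = [2.0] * len(params)
grad_shards = reduce_scatter([grads_rank0, grads_rank1], world_size=2)

print(shards)
print(full)
print(grad_shards)
```

With offloading enabled, the only conceptual change is where each rank’s shard is stored between uses: CPU RAM or an NVMe file instead of GPU memory.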

The magic happens in the offload_param and offload_optimizer sections of the DeepSpeed config.

"device": "cpu": This tells DeepSpeed to keep the offloaded state in CPU RAM. Each of the two sections has its own device setting, which can be "cpu" or "nvme".
"pin_memory": True: This is a performance optimization. When set to True, the CPU memory used for offloading is "pinned," meaning it’s allocated in a way that prevents the operating system from paging it out. This allows for faster, asynchronous data transfers between CPU and GPU.
"device": "nvme": This sends the state of that section to NVMe instead. There are no separate NVMe flags; NVMe offload is enabled per section through device.
"nvme_path": "/tmp/ds_nvme_offload": This goes inside a section whose device is "nvme" and specifies the directory on your NVMe drive where the states will be stored.

When you run training with this configuration, DeepSpeed hooks into the forward and backward passes. It keeps track of which parameter shards, gradients, and optimizer states are on the GPU, in CPU RAM, or on NVMe, fetches what the current computation needs, and releases it afterwards so the GPU memory can be reused. All of this is handled transparently by the engine that DeepSpeed’s initialize function returns, together with its internal communication collectives.
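The fetch, compute, release cycle can be illustrated with a toy store. This is only a sketch of the idea: ToyOffloadStore and forward are hypothetical names, a dictionary stands in for CPU RAM or NVMe, and nothing here touches a real GPU or DeepSpeed’s actual hooks.

```python
from typing import Dict, List


class ToyOffloadStore:
    """Keeps every layer's weights 'off-device' and tracks what is resident."""

    def __init__(self, weights_by_layer: Dict[str, List[float]]):
        self.host = dict(weights_by_layer)     # stands in for CPU RAM / NVMe
        self.gpu: Dict[str, List[float]] = {}  # stands in for GPU memory
        self.peak_gpu_elems = 0

    def fetch(self, layer: str) -> List[float]:
        """Copy one layer's weights onto the 'GPU' and record peak residency."""
        self.gpu[layer] = list(self.host[layer])
        resident = sum(len(values) for values in self.gpu.values())
        self.peak_gpu_elems = max(self.peak_gpu_elems, resident)
        return self.gpu[layer]

    def release(self, layer: str) -> None:
        """Drop the 'GPU' copy; the offloaded copy remains the source of truth."""
        del self.gpu[layer]


def forward(store: ToyOffloadStore, layer_order: List[str], x: float) -> float:
    for name in layer_order:
        weights = store.fetch(name)      # bring weights in just before use
        x = sum(w * x for w in weights)  # toy computation for this layer
        store.release(name)              # free the memory for the next layer
    return x


store = ToyOffloadStore({"layer1": [1.0, 2.0], "layer2": [0.5, 0.5]})
out = forward(store, ["layer1", "layer2"], x=1.0)
total = sum(len(values) for values in store.host.values())
print(f"output={out}, peak resident={store.peak_gpu_elems} of {total} weights")
```

The point of the sketch is the peak: only one layer’s weights are resident at a time, even though the whole model is larger.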

The surprising part is that this offloading strategy can sometimes speed up training, not just enable it, although this is not guaranteed and has to be measured. Offloading frees GPU VRAM, and that memory can go to larger micro-batches or let the same model run on fewer GPUs, both of which reduce the inter-GPU communication paid per training sample. Fetching from CPU RAM or NVMe is always slower than reading GPU VRAM; whether it competes with fetching a shard from another GPU depends on the interconnect, the PCIe and NVMe bandwidth, and how well the transfers overlap with computation. You’re trading GPU VRAM for CPU RAM and NVMe storage, and potentially also easing the pressure on inter-GPU bandwidth.
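A quick way to reason about this trade-off is to estimate how long one shard takes to move over each link. The sketch below is pure arithmetic. The bandwidth figures are placeholder assumptions, not measurements or vendor numbers, and should be replaced with values measured on your own hardware.

```python
def transfer_seconds(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: size divided by sustained bandwidth (GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)


# ASSUMED placeholder bandwidths in GB/s; substitute measured values.
assumed_links = {
    "gpu_interconnect": 100.0,
    "pcie_host_to_gpu": 25.0,
    "nvme_read": 5.0,
}

shard_bytes = 2 * 500_000_000  # hypothetical shard: 500M fp16 parameters

for name, bandwidth in assumed_links.items():
    millis = transfer_seconds(shard_bytes, bandwidth) * 1000
    print(f"{name}: {millis:.1f} ms per shard")
```

If the slowest link in the path cannot be hidden behind computation, offloading will cost throughput rather than gain it.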

The next hurdle you’ll likely encounter is managing the NVMe disk I/O, which can become a bottleneck if your NVMe drive is slow or if the offloading/fetching pattern leads to excessive reads/writes.
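Before blaming DeepSpeed for slow steps, it can help to get a rough idea of what the offload directory can sustain. The following is a minimal sketch using only the standard library. The function name is made up for this example, it measures sequential writes only, and operating system caching means the result is a coarse indicator rather than a proper benchmark.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory: str, size_mb: int = 64, chunk_mb: int = 4) -> float:
    """Write size_mb of random data into `directory` and return MB/s, including fsync."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(size_mb // chunk_mb):
                handle.write(chunk)
            handle.flush()
            os.fsync(handle.fileno())  # force the data out of the page cache
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the scratch file
    return size_mb / elapsed


# Demo against a throwaway directory; point this at your real offload path instead.
with tempfile.TemporaryDirectory() as scratch_dir:
    rate = write_throughput_mb_s(scratch_dir, size_mb=8)
    print(f"approx. sequential write: {rate:.0f} MB/s")
```

Running it against the directory you intend to use for NVMe offload gives a first hint of whether the drive is likely to keep up.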