Implement Model Parallelism Across GPUs with Pipeline Stages (2026)

Harnessing multiple GPUs for a single model isn’t just about throwing more hardware at it; it’s about breaking the model’s computation into a linear sequence of operations that can be processed by different GPUs in a pipeline.

Imagine you have a giant neural network that won’t fit into a single GPU’s memory, or you want higher throughput from a model that is spread over several devices. Pipeline parallelism, one form of model parallelism, addresses both. Instead of replicating the entire model on each GPU (data parallelism), you split the model itself across GPUs.

Let’s say you have a transformer model. You can split it layer-wise. GPU 0 might handle the first 8 layers, GPU 1 the next 8, and so on. When you feed a batch of data, GPU 0 processes its layers, then passes the intermediate activations to GPU 1, which processes its layers, and so forth. This creates a pipeline.

Here’s a simplified view of what happens during a forward pass with two GPUs and a model split into two stages:

// Assume batch_size = 16, num_gpus = 2
// Model: Stage 1 (GPU 0) -> Stage 2 (GPU 1)

// --- Time step 1 ---
// GPU 0: Processes micro-batch 1 through Stage 1
// GPU 1: Idle (waiting for Stage 1 output from GPU 0)

// --- Time step 2 ---
// GPU 0: Processes micro-batch 2 through Stage 1
// GPU 1: Receives Stage 1 output from micro-batch 1, processes it through Stage 2

// --- Time step 3 ---
// GPU 0: Processes micro-batch 3 through Stage 1
// GPU 1: Receives Stage 1 output from micro-batch 2, processes it through Stage 2

// ... and so on.

The key to efficiency is the concept of micro-batches. Instead of sending the entire batch through one stage at a time, you divide the batch into smaller chunks (micro-batches). This allows the GPUs to overlap their computations. While GPU 1 is busy with micro-batch 1, GPU 0 can already start working on micro-batch 2. This "pipeline" effect keeps all GPUs busy for a larger portion of the time, significantly reducing idle time.
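The overlap described above can be tabulated with a few lines of plain Python. This is a minimal sketch, assuming every stage takes exactly one time step per micro-batch; the function name forward_timeline is made up for this illustration and does not come from any library.

```python
def forward_timeline(num_stages, num_micro_batches):
    """List, per time step, which micro-batch each stage works on (None = idle)."""
    total_steps = num_micro_batches + num_stages - 1
    timeline = []
    for t in range(total_steps):
        row = []
        for stage in range(num_stages):
            # Stage s can only start micro-batch k once stages 0..s-1 have seen it.
            mb = t - stage
            row.append(mb if 0 <= mb < num_micro_batches else None)
        timeline.append(row)
    return timeline


for t, row in enumerate(forward_timeline(num_stages=2, num_micro_batches=4)):
    cells = [
        f"GPU {s}: " + ("idle" if mb is None else f"MB {mb + 1}")
        for s, mb in enumerate(row)
    ]
    print(f"t={t + 1}  " + " | ".join(cells))
```

With two stages, GPU 1 is idle only in the first step and GPU 0 only in the last one; every step in between keeps both devices busy.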

The total batch size is batch_size, and you decide on the number of micro-batches, num_micro_batches. It should be at least the number of pipeline stages, and is usually set several times higher so that the fill and drain phases take up a smaller share of the run. For instance, if batch_size = 64 and num_micro_batches = 8, each micro-batch will have 64 / 8 = 8 samples.
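The arithmetic is simple when the batch divides evenly, but it helps to be explicit about what happens when it does not. The helper below is a hypothetical illustration (micro_batch_sizes is not a library function); it spreads any remainder over the first few micro-batches. Some frameworks may instead require an even split, so treat this as one possible policy.

```python
def micro_batch_sizes(batch_size, num_micro_batches):
    """Split batch_size into num_micro_batches sizes that differ by at most one."""
    if not 1 <= num_micro_batches <= batch_size:
        raise ValueError("num_micro_batches must be between 1 and batch_size")
    base, extra = divmod(batch_size, num_micro_batches)
    # The first `extra` micro-batches each take one additional sample.
    return [base + 1] * extra + [base] * (num_micro_batches - extra)


print(micro_batch_sizes(64, 8))  # [8, 8, 8, 8, 8, 8, 8, 8]
print(micro_batch_sizes(16, 3))  # [6, 5, 5]
```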

The "stages" are simply contiguous blocks of your model’s layers. For a PyTorch nn.Module, you might define your stages like this:

import torch
import torch.nn as nn

class Stage1(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1024, 1024)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(1024, 512)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

class Stage2(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer3 = nn.Linear(512, 256)
        self.relu = nn.ReLU()
        self.layer4 = nn.Linear(256, 128)

    def forward(self, x):
        x = self.layer3(x)
        x = self.relu(x)
        x = self.layer4(x)
        return x

# In your main script (requires at least two visible CUDA devices):
stage1_model = Stage1().to('cuda:0')
stage2_model = Stage2().to('cuda:1')

The actual implementation involves careful management of tensors moving between GPUs. Libraries like Megatron-LM or DeepSpeed abstract much of this complexity. When using these, you typically define your model and then use their utilities to partition it into stages. Note that Megatron-LM’s ColumnParallelLinear and RowParallelLinear belong to a different technique, tensor parallelism, which splits individual weight matrices across GPUs. It is often combined with pipeline parallelism but is not the same thing.
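To make the tensor movement concrete before handing it to a library, here is a hand-written sketch of a micro-batched forward pass. It assumes the same layer shapes as Stage1 and Stage2 above, written as nn.Sequential for brevity, and it falls back to the CPU when two GPUs are not available so that it runs anywhere. The function pipeline_forward is a name chosen for this example, not a PyTorch API.

```python
import torch
import torch.nn as nn

# Assumption for illustration: use two GPUs if present, otherwise run on CPU.
two_gpus = torch.cuda.is_available() and torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 512)).to(dev0)
stage2 = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128)).to(dev1)


def pipeline_forward(batch, num_micro_batches):
    """Send each micro-batch through stage1, copy activations, then run stage2."""
    outputs = []
    for micro_batch in torch.chunk(batch, num_micro_batches, dim=0):
        hidden = stage1(micro_batch.to(dev0))
        # Explicit device-to-device copy of the intermediate activations.
        hidden = hidden.to(dev1)
        outputs.append(stage2(hidden))
    return torch.cat(outputs, dim=0)


batch = torch.randn(16, 1024)
with torch.no_grad():
    out = pipeline_forward(batch, num_micro_batches=4)
print(out.shape)
```

This loop issues work from a single Python thread, so any overlap between the two devices relies on CUDA kernels being launched asynchronously. A real pipeline engine schedules stages explicitly and also handles the backward pass.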

The goal is to achieve a balance: enough micro-batches to keep the pipeline full, but not so many that the overhead of managing them becomes prohibitive. The ideal number of micro-batches is often found through empirical testing.

The critical constraint you’re often trying to overcome is memory. If your model’s weights, activations, and optimizer states for a single forward/backward pass don’t fit on one GPU, plain data parallelism won’t help, because every replica needs the full model. Pipeline parallelism is one way to distribute that state across multiple devices.

The backward pass is pipelined as well. After the forward pass, the backward pass starts from the last stage. Gradients flow backward through the pipeline, and each GPU computes gradients only for the parameters it holds. In a synchronous schedule, each GPU accumulates gradients across all micro-batches and applies a single weight update once the last micro-batch has finished its backward pass, so the update corresponds to the full batch.
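One detail worth verifying is that gradients from successive micro-batches add up in each parameter’s .grad, so a single update after the last micro-batch can stand in for a full-batch update. The sketch below checks this on one device with a tiny nn.Linear standing in for a pipeline stage. It assumes a loss with reduction="sum" so that the micro-batch losses add up exactly; with a mean-reduced loss you would scale each micro-batch loss accordingly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)  # stands in for the parameters held by one stage
batch = torch.randn(16, 8)
target = torch.randn(16, 1)
loss_fn = nn.MSELoss(reduction="sum")

# Reference: gradient of the full batch in a single pass.
model.zero_grad()
loss_fn(model(batch), target).backward()
full_grad = model.weight.grad.clone()

# Micro-batched: each backward() call adds into .grad instead of replacing it.
model.zero_grad()
for x, y in zip(torch.chunk(batch, 4), torch.chunk(target, 4)):
    loss_fn(model(x), y).backward()
accumulated_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accumulated_grad, rtol=1e-4, atol=1e-4)
print(grads_match)
```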

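The efficiency of a pipeline depends mainly on its "pipeline parallelism degree" (number of stages) and the number of micro-batches. A crucial metric is the "bubble": the time GPUs spend idle while the pipeline fills and drains. With p stages, m micro-batches, and roughly equal time per stage, the idle fraction of a simple synchronous schedule is (p - 1) / (m + p - 1). The goal is to minimize this bubble.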
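A rough estimate of the bubble is easy to compute. The sketch below assumes a simple synchronous schedule in which every stage takes the same time per micro-batch, so that with p stages and m micro-batches each GPU is idle for p - 1 of the m + p - 1 time slots. Real pipelines with uneven stages or communication delays will deviate from this, and bubble_fraction is a name invented for this example.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Idle share of each GPU's time under the equal-stage-time assumption."""
    p, m = num_stages, num_micro_batches
    return (p - 1) / (m + p - 1)


# With 4 stages, more micro-batches steadily shrink the idle share.
for m in (4, 8, 16, 32):
    print(m, round(bubble_fraction(4, m), 3))
```

For four stages this prints roughly 0.429, 0.273, 0.158, and 0.086, which is why the number of micro-batches is usually chosen well above the number of stages.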

You can think of this as an assembly line. Each GPU is a station. Data comes in, gets processed, and moves to the next station. If you have too few items (micro-batches) on the line, stations at the end will be waiting. If you have too many, the line might get clogged or require too much buffer space.

The most surprising thing about pipeline parallelism is how effectively it can mask latency. By keeping all GPUs busy with different micro-batches, the total time to process a large batch can be significantly reduced, even though each individual micro-batch still traverses the entire model sequentially. With 4 stages, 16 micro-batches, and one time unit per stage, the forward pass finishes in 16 + 4 - 1 = 19 units, whereas running the same 64 stage-steps one after another would take 64.

Here’s a simplified look at a micro-batch flow during a full forward-backward pass with 2 stages and 2 micro-batches:

// Batch Size = 2, Num Micro-batches = 2, Num GPUs = 2
// Model: Stage 1 (GPU 0) -> Stage 2 (GPU 1)

// --- Forward Pass ---
// MB 1: GPU 0 processes Stage 1 -> sends to GPU 1
// MB 2: GPU 0 processes Stage 1 -> sends to GPU 1 (while GPU 1 processes MB 1)

// --- Backward Pass ---
// MB 1: GPU 1 computes gradients for Stage 2 -> sends gradients to GPU 0
// MB 2: GPU 1 computes gradients for Stage 2 -> sends gradients to GPU 0 (while GPU 0 computes gradients for MB 1)

// --- Optimizer Step (happens per GPU, after both micro-batches finish) ---
// GPU 0: Updates Stage 1 weights using its gradients accumulated over MB 1 and MB 2
// GPU 1: Updates Stage 2 weights using its gradients accumulated over MB 1 and MB 2

The tricky part is the dependency structure: on any given stage, the backward pass for micro-batch k needs the activations saved from that stage’s forward pass of micro-batch k, plus the gradient arriving from the backward pass of the same micro-batch on the next stage. Each stage therefore has to keep activations for every in-flight micro-batch and accumulate gradients rather than overwrite them. Libraries like DeepSpeed’s PipelineEngine handle this intricate scheduling.
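To see how such a schedule fits together, the sketch below builds a simple "all forwards, then all backwards" timetable and checks its ordering constraints. It is only an illustration under strong assumptions: every forward and backward task takes one time step, and the function names are invented here rather than taken from DeepSpeed or any other library. The checks encode one common formulation of the dependencies: a backward task on a stage needs that stage’s forward result for the same micro-batch and the backward result from the following stage.

```python
def gpipe_schedule(num_stages, num_micro_batches):
    """Map (kind, stage, micro_batch) to the time step at which the task runs."""
    p, m = num_stages, num_micro_batches
    start = {}
    for s in range(p):
        for k in range(m):
            start[("F", s, k)] = s + k
    backward_begin = p + m - 1  # first step after the last forward task
    for s in range(p):
        for k in range(m):
            # Backward work enters at the last stage and flows toward stage 0.
            start[("B", s, k)] = backward_begin + (p - 1 - s) + k
    return start


def check_dependencies(start, num_stages):
    """Assert that every task runs strictly after the tasks it depends on."""
    for (kind, s, k), t in start.items():
        if kind == "F" and s > 0:
            assert start[("F", s - 1, k)] < t
        if kind == "B":
            assert start[("F", s, k)] < t
            if s < num_stages - 1:
                assert start[("B", s + 1, k)] < t
    return True


schedule = gpipe_schedule(num_stages=2, num_micro_batches=2)
check_dependencies(schedule, num_stages=2)
for task, t in sorted(schedule.items(), key=lambda item: (item[1], item[0])):
    print(t, task)
```

Production schedulers typically interleave forward and backward tasks to cap the number of stored activations, which this simplified timetable does not attempt.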

The next hurdle you’ll face is managing the communication overhead. As the number of stages increases, the inter-GPU communication for passing activations and gradients can become a bottleneck, potentially negating the benefits of parallelism.