ML Pipeline Parallelism: Beyond Simple Layer Splitting

Pipeline parallelism lets you train models that are too big to fit on a single GPU by splitting the model across multiple GPUs.

Here’s a small two-layer network standing in for a large model, split into two stages running on two GPUs. In a real transformer, the first GPU would hold the embedding and the first few transformer layers, and the second GPU would hold the remaining layers and the final output.

import torch
import torch.nn as nn

# Define a simple model that can be split
class LargeModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1024, 2048)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(2048, 1024)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

# Instantiate the model
model = LargeModel()

# Split the model into two stages
# Stage 1: first linear layer and its activation
# Stage 2: second linear layer, which produces the output
model_stage1 = nn.Sequential(
    model.layer1,
    model.relu
).to('cuda:0') # Move stage 1 to GPU 0

model_stage2 = nn.Sequential(
    model.layer2
).to('cuda:1') # Move stage 2 to GPU 1

# Dummy input
input_data = torch.randn(32, 1024).to('cuda:0') # Input on GPU 0

# Forward pass
# Data flows from GPU 0 to GPU 1
output_stage1 = model_stage1(input_data)
output_stage2 = model_stage2(output_stage1.to('cuda:1')) # Move intermediate output to GPU 1

print(output_stage2.shape)  # torch.Size([32, 1024]), and the tensor lives on cuda:1

This simple example shows how data moves sequentially from one GPU to the next. In a real training loop, you’d manage this flow, including gradient calculation and backpropagation, across the stages. The key idea is that no single GPU needs to hold the entire model’s parameters or activations simultaneously.
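To make the training side concrete, here is a minimal sketch of one optimization step over the same two stages. The loss function, optimizer, learning rate, and random targets are illustrative assumptions, not part of the example above, and the device fallback to CPU is only there so the sketch runs on machines without two GPUs.

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to CPU so the sketch still runs.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

stage1 = nn.Sequential(nn.Linear(1024, 2048), nn.ReLU()).to(dev0)
stage2 = nn.Sequential(nn.Linear(2048, 1024)).to(dev1)

# A single optimizer can own parameters that live on different devices.
params = list(stage1.parameters()) + list(stage2.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)  # assumed optimizer and learning rate
loss_fn = nn.MSELoss()                        # assumed loss

inputs = torch.randn(32, 1024, device=dev0)
targets = torch.randn(32, 1024, device=dev1)  # hypothetical regression targets

optimizer.zero_grad()
hidden = stage1(inputs)             # forward pass on stage 1
outputs = stage2(hidden.to(dev1))   # transfer activations, then forward on stage 2
loss = loss_fn(outputs, targets)    # the loss is computed on the last stage's device
loss.backward()                     # autograd sends gradients back across the transfer
optimizer.step()

print(f"loss: {loss.item():.4f}")
```

Because the device transfer is itself a differentiable operation, the backward pass retraces the forward path in reverse without extra code. This version still runs the stages strictly one after the other, so at any moment only one device is doing work.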

The problem pipeline parallelism solves is the memory limitation of individual GPUs. Modern deep learning models, especially in areas like natural language processing and computer vision, can have billions of parameters. Storing these parameters, together with their gradients, the optimizer state, and the intermediate activations kept from the forward pass for use in backpropagation, can easily exceed the memory capacity of even high-end GPUs. When this happens, training stops with an out-of-memory error (torch.cuda.OutOfMemoryError in PyTorch). Pipeline parallelism circumvents this by distributing the model’s layers across multiple devices. Each GPU only needs enough memory to hold its portion of the model and the activations for that portion.

Internally, pipeline parallelism works by partitioning the model’s layers into contiguous blocks, called "stages." Each stage is assigned to a different device (e.g., GPU). During the forward pass, the input data flows through the first stage on GPU 0, its output is then transferred to GPU 1 for the second stage, and so on, until the final output is produced by the last stage. Backpropagation follows the reverse path.

The crucial challenge in pipeline parallelism is keeping every device busy. If a whole batch moves through the pipeline as a single unit, only one stage is active at a time, and every other GPU sits idle waiting for its input. These idle periods are called "pipeline bubbles." To mitigate them, the input batch is split into smaller micro-batches that are fed into the pipeline in a staggered fashion, so that different stages process different micro-batches concurrently. For example, GPU 0 processes micro-batch 1, sends the resulting activations to GPU 1, and immediately starts on micro-batch 2. While GPU 1 is processing micro-batch 1, GPU 0 is already working on micro-batch 2. This overlap keeps the pipeline "full" and improves throughput, although some idle time remains while the pipeline fills at the start and drains at the end of each batch.
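The staggering can be visualized with a small simulation. This is a sketch under simplifying assumptions: only the forward pass is modeled, every stage takes exactly one time step per micro-batch, and transfers are free. The function names are made up for this illustration.

```python
def forward_schedule(num_stages, num_microbatches):
    """Return rows[stage][tick]: the micro-batch index processed, or None if idle."""
    num_ticks = num_stages + num_microbatches - 1
    rows = []
    for stage in range(num_stages):
        row = []
        for tick in range(num_ticks):
            # Stage s can only start micro-batch m once the s earlier stages are done with it.
            mb = tick - stage
            row.append(mb if 0 <= mb < num_microbatches else None)
        rows.append(row)
    return rows


def idle_fraction(rows):
    """Fraction of (stage, tick) slots in which a device has nothing to do."""
    slots = [slot for row in rows for slot in row]
    return sum(slot is None for slot in slots) / len(slots)


for m in (1, 4):
    rows = forward_schedule(num_stages=2, num_microbatches=m)
    for stage, row in enumerate(rows):
        cells = " ".join("." if mb is None else str(mb + 1) for mb in row)
        print(f"m={m} GPU {stage}: {cells}")
    print(f"m={m} idle fraction: {idle_fraction(rows):.2f}")
```

With a single micro-batch, each GPU is idle half the time. With four micro-batches on two GPUs, the idle slots shrink to the first tick on GPU 1 and the last tick on GPU 0.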

The exact levers you control are:

Number of Stages/GPUs: How many devices you split your model across. More stages mean each stage holds fewer layers, but they also add inter-stage transfers and lengthen the fill and drain phases during which some GPUs sit idle.
Stage Partitioning: How you divide the model’s layers into stages. This is often done automatically by libraries like Megatron-LM or DeepSpeed, but can be manually tuned for optimal balance of computation and memory across stages.
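Micro-batch Size: The size of the smaller batches fed into the pipeline. For a fixed global batch size, smaller micro-batches mean more of them in flight, which shrinks the pipeline bubbles, but each one uses the GPU less efficiently. Larger micro-batches run more efficiently on each device but leave larger bubbles and require more activation memory per GPU.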
Inter-stage Communication Strategy: How activations and gradients are transferred between GPUs. This typically relies on point-to-point sends and receives issued asynchronously (for example, on separate CUDA streams) so that communication overlaps with computation.
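To get a feel for how the first and third levers interact, the sketch below evaluates the standard idle-time estimate for a synchronous schedule in which all stages take equal time: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The global batch size of 64 and the grid of values are arbitrary choices for illustration.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of an equal-cost synchronous pipeline: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


GLOBAL_BATCH = 64  # hypothetical global batch size, held fixed across the sweep

for num_stages in (2, 4, 8):
    for micro_batch_size in (1, 4, 16):
        # With a fixed global batch, larger micro-batches mean fewer of them.
        num_microbatches = GLOBAL_BATCH // micro_batch_size
        frac = bubble_fraction(num_stages, num_microbatches)
        print(
            f"stages={num_stages} micro_batch_size={micro_batch_size:>2} "
            f"micro_batches={num_microbatches:>2} bubble={frac:.3f}"
        )
```

The estimate ignores communication cost and per-kernel efficiency, so it shows only one side of the trade-off: adding stages or enlarging micro-batches increases the bubble, while the costs of very small micro-batches are not modeled.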

A common point of confusion is that pipeline parallelism doesn’t inherently solve the problem of a single layer being too large for a GPU. If a specific layer’s parameters or its intermediate activations during a forward/backward pass exceed a single GPU’s memory, you’d need tensor parallelism (also known as intra-layer model parallelism) in addition to pipeline parallelism. Tensor parallelism splits the weights of an individual layer across multiple GPUs, whereas pipeline parallelism keeps each layer whole and assigns different layers to different GPUs.
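The difference is easier to see on a toy matrix multiply. The sketch below uses plain Python lists rather than GPU tensors, and splits one layer's weight matrix column-wise into two shards, each of which would live on its own device. This is one common way to shard a linear layer, shown here only as an illustration; the helper names and values are invented.

```python
def matmul(x, w):
    """Multiply x (rows x k) by w (k x cols), both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split weight matrix w column-wise into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]                       # batch of 2 inputs, 2 features each
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]   # one layer's weights, shape 2 x 4

shards = split_columns(w, parts=2)                 # each hypothetical device holds a 2 x 2 shard
partials = [matmul(x, shard) for shard in shards]  # every device sees the full input

# Concatenating the partial outputs along the feature axis recovers the full result.
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]
print(combined == matmul(x, w))
```

Each shard holds only half of the layer's weights, which is what lets a layer that is too large for one device fit across two. Pipeline parallelism, by contrast, would place the entire `w` on a single device.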

The next concept to grapple with is how to partition the model’s layers so that the stages are balanced. If one stage is significantly heavier in compute or memory than the others, it becomes the bottleneck and the rest of the pipeline waits on it.
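As a preview, balancing can be framed as splitting a list of per-layer costs into contiguous groups so that the most expensive group is as cheap as possible. The sketch below solves that with a small dynamic program. The cost numbers are invented, and this is not how any particular library partitions models.

```python
from functools import lru_cache


def balanced_partition(costs, num_stages):
    """Split layers into contiguous stages, minimizing the cost of the heaviest stage.

    Returns (bottleneck_cost, stages), where stages is a list of layer-index lists.
    """
    n = len(costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need between 1 and len(costs) stages")

    # prefix[i] is the total cost of layers 0..i-1, so any range sum is one subtraction.
    prefix = [0]
    for cost in costs:
        prefix.append(prefix[-1] + cost)

    @lru_cache(maxsize=None)
    def best(start, stages_left):
        """Best split of layers[start:] into stages_left stages."""
        if stages_left == 1:
            return prefix[n] - prefix[start], (n,)
        result = None
        # Leave at least one layer for each of the remaining stages.
        for end in range(start + 1, n - stages_left + 2):
            first = prefix[end] - prefix[start]
            rest_cost, rest_ends = best(end, stages_left - 1)
            candidate = (max(first, rest_cost), (end,) + rest_ends)
            if result is None or candidate[0] < result[0]:
                result = candidate
        return result

    bottleneck, ends = best(0, num_stages)
    starts = (0,) + ends[:-1]
    return bottleneck, [list(range(s, e)) for s, e in zip(starts, ends)]


layer_costs = [1, 1, 1, 1, 8, 4]  # hypothetical per-layer costs (time or memory)
bottleneck, stages = balanced_partition(layer_costs, num_stages=3)
print(bottleneck, stages)
```

Splitting these six layers evenly by count, two per stage, would put the two expensive layers together for a bottleneck cost of 12. Balancing by cost instead groups the four cheap layers into one stage and gives each expensive layer its own.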