Data, model, and tensor parallelism are the three main strategies for distributing deep learning training across multiple GPUs.

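Let’s look at how this plays out in practice. Imagine you want to train a large image classification model, say ResNet-152, on ImageNet using four A100 GPUs. The script below uses a small CNN and random tensors as stand-ins so it runs on any machine; it trains on a single device and marks in comments where the DistributedDataParallel (DDP) pieces would go.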

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# --- Dummy Dataset ---
class DummyImageDataset(Dataset):
    def __init__(self, num_samples=1000, img_size=(3, 224, 224), num_classes=10):
        self.num_samples = num_samples
        self.img_size = img_size
        self.num_classes = num_classes
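        # Only the labels are stored; images are generated on demand so that
        # 10,000 samples do not require ~6 GB of RAM up front.
        self.targets = torch.randint(0, num_classes, (num_samples,))

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Seed by index so the same idx always yields the same random image.
        generator = torch.Generator().manual_seed(idx)
        image = torch.randn(*self.img_size, generator=generator)
        return image, self.targets[idx]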

# --- Model Definition ---
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(32 * 56 * 56, num_classes) # Adjusted for 224x224 input after two maxpools

    def forward(self, x):
        x = self.maxpool(self.relu(self.conv1(x)))
        x = self.maxpool(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1) # Flatten
        x = self.fc(x)
        return x

# --- Training Setup (Single GPU for demonstration, but scalable) ---
def train_model(model, dataloader, criterion, optimizer, device, epochs=1):
    model.to(device)
    model.train()
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(dataloader):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 10 == 9: # Print every 10 mini-batches
                print(f'Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(dataloader)}], Loss: {running_loss/10:.4f}')
                running_loss = 0.0
    print('Finished Training')

if __name__ == "__main__":
    # Configuration
    num_gpus = torch.cuda.device_count()
    print(f"Detected {num_gpus} GPU(s).")
    batch_size = 64
    learning_rate = 0.001
    num_epochs = 2
    dataset_size = 10000
    image_size = (3, 224, 224)
    num_classes = 10

    # Data Loading
    train_dataset = DummyImageDataset(num_samples=dataset_size, img_size=image_size, num_classes=num_classes)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)

    # Model and Optimizer
    model = SimpleCNN(num_classes=num_classes)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # --- Strategy Selection (Conceptual - Actual implementation requires libraries like PyTorch DDP, DeepSpeed, Megatron-LM) ---
    # For this example, we'll show how data parallelism is the most straightforward to think about.
    # Real multi-GPU training would involve wrapping the model and using distributed samplers.

    if num_gpus > 1:
        print("\n--- Demonstrating Data Parallelism (Conceptual) ---")
        # In a real scenario with PyTorch DDP:
        # 1. Initialize distributed environment: torch.distributed.init_process_group(...)
        # 2. Wrap model: model = DDP(model, device_ids=[...])
        # 3. Use DistributedSampler for DataLoader: sampler = DistributedSampler(train_dataset)
        #    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler, ...)
        #    (drop shuffle=True when passing a sampler; call sampler.set_epoch(epoch) each epoch)
        # 4. Each process (GPU) runs the same training script, but on a different slice of data.

        # For simplicity, we'll just pick the first GPU to demonstrate the core idea of the model existing.
        # The actual distribution logic is handled by the DDP wrapper.
        device = torch.device("cuda:0")
        print(f"Training on device: {device} (conceptually distributed)")
        train_model(model, train_loader, criterion, optimizer, device, epochs=num_epochs)
    else:
        print("\n--- Training on a single device ---")
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        train_model(model, train_loader, criterion, optimizer, device, epochs=num_epochs)

    print("\nModel trained. Next steps involve exploring Model and Tensor Parallelism for larger models.")

The core problem these strategies solve is fitting larger models and/or larger batch sizes into the memory limits of individual GPUs and speeding up training by distributing the computational load.
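
To make the memory limit concrete, here is a rough back-of-the-envelope estimate. It assumes plain fp32 training with Adam (4 bytes each for weights and gradients, 8 bytes for Adam's two moment buffers) and ignores activations, which often dominate in practice. The function name and the 80 GiB budget are illustrative assumptions, not figures from a specific setup.

```python
def training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Approximate memory for weights + gradients + Adam state, in GiB.

    Assumes fp32 everywhere: 4 B weights + 4 B gradients + 8 B Adam moments.
    Activations, buffers, and allocator overhead are NOT included.
    """
    return num_params * bytes_per_param / 2**30


BUDGET_GIB = 80  # hypothetical per-GPU memory budget

for name, n in [("60M (ResNet-152 scale)", 60_000_000),
                ("1B", 1_000_000_000),
                ("10B", 10_000_000_000)]:
    gib = training_state_gib(n)
    verdict = "fits" if gib < BUDGET_GIB else "does not fit"
    print(f"{name:>24}: {gib:8.1f} GiB of training state, {verdict} in {BUDGET_GIB} GiB")
```

Under these assumptions a ResNet-152-sized model is nowhere near the limit, which is why data parallelism is enough for it, while a 10B-parameter model cannot hold even its training state on one such GPU.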

Data Parallelism is the simplest and most common. Imagine you have four GPUs. You take your entire model and copy it onto each of the four GPUs. Then, you split your mini-batch of data across these GPUs. Each GPU processes its slice of the data with a full copy of the model, computes its local gradients, and then these gradients are aggregated across all GPUs to update the model parameters. Since each GPU has a full model copy, the main bottleneck is the communication overhead for gradient aggregation. It’s best suited when the model itself fits comfortably into a single GPU’s memory, but you want to increase the effective batch size or speed up training.
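
The claim that aggregating per-GPU gradients reproduces a large-batch update can be checked on one device. The sketch below is a simulation, not a distributed run: it assumes equal-sized shards and a mean-reduced loss, and uses a tiny linear model as a stand-in. It compares the gradient of the full batch with the average of the gradients computed on four shards.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def flat_grads(model, inputs, targets):
    """Run one forward/backward pass and return all gradients as one vector."""
    model.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)  # mean over the batch
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])


model = nn.Linear(8, 4).double()  # float64 keeps the comparison tight
inputs = torch.randn(16, 8, dtype=torch.float64)
targets = torch.randn(16, 4, dtype=torch.float64)

# "Single GPU": gradient of the mean loss over the full batch of 16.
full_grad = flat_grads(model, inputs, targets)

# "Four GPUs": each replica sees 4 samples; the gradients are then averaged,
# which is the role the all-reduce step plays in data-parallel training.
num_replicas = 4
shard_grads = [
    flat_grads(model, x, y)
    for x, y in zip(inputs.chunk(num_replicas), targets.chunk(num_replicas))
]
averaged_grad = torch.stack(shard_grads).mean(dim=0)

print(torch.allclose(full_grad, averaged_grad))
```

The two gradients agree up to floating-point error, so four GPUs with 4 samples each behave like one GPU with a batch of 16.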

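  • Diagnosis: Check that the model and a reasonable batch fit on one GPU (for example with nvidia-smi or torch.cuda.max_memory_allocated()). If a single GPU is already saturated and training is still too slow, data parallelism is the natural next step. Once you are running on several GPUs, sub-linear scaling combined with low compute utilization usually points to gradient communication or data loading as the bottleneck.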
  • Fix: Use PyTorch’s DistributedDataParallel (DDP) or TensorFlow’s MirroredStrategy. For DDP, you’d typically initialize the process group (torch.distributed.init_process_group(backend='nccl')), wrap your model (model = DDP(model, device_ids=[local_rank])), and use a DistributedSampler for your DataLoader.
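  • Why it works: DDP averages gradients with an all-reduce and overlaps that communication with the backward pass by reducing gradients in buckets as they become ready. This avoids the per-step scatter and gather through a single process that the older nn.DataParallel performs. DistributedSampler gives each process a distinct shard of the dataset every epoch (padding with a few repeated samples when the dataset size is not divisible by the number of processes); call sampler.set_epoch(epoch) so the shuffle changes between epochs.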
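
The sharding behaviour of DistributedSampler can be inspected without launching a distributed job. This is a small sketch, assuming four processes and an eight-element toy dataset; passing num_replicas and rank explicitly stands in for the values a real DDP run would read from the process group.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
world_size = 4  # assumed number of processes (one per GPU)


def shard_indices(rank, epoch, shuffle):
    # Explicit num_replicas/rank means no initialized process group is needed.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                 shuffle=shuffle, seed=0)
    sampler.set_epoch(epoch)  # changes the shuffle order from epoch to epoch
    return list(sampler)


# Without shuffling, rank r takes indices r, r + world_size, ...
unshuffled = [shard_indices(r, epoch=0, shuffle=False) for r in range(world_size)]
print(unshuffled)

# With shuffling, the shards are still disjoint and together cover the dataset.
for epoch in range(2):
    shards = [shard_indices(r, epoch, shuffle=True) for r in range(world_size)]
    print(epoch, shards)
```

In an actual DDP script each process would build only its own sampler and hand it to the DataLoader via sampler=..., leaving shuffle unset.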

Model Parallelism, in the layer-wise form described here, is often called Pipeline Parallelism once micro-batching is added. It is for when your model is too large to fit into a single GPU’s memory. Instead of copying the model to each GPU, you split the model’s layers across the GPUs. For example, layers 1-10 might be on GPU 0, layers 11-20 on GPU 1, and so on. Data flows sequentially through these GPUs: a mini-batch is sent to GPU 0, processed, its output is sent to GPU 1, processed, and so on. Run naively, only one GPU is busy at a time. The idle time on GPUs waiting for activations from the previous stage, or for gradients to flow back, is known as a "pipeline bubble". Micro-batching (splitting the mini-batch into smaller chunks that enter the pipeline one after another) lets different stages work on different chunks at once and shrinks the bubble.
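
How much micro-batching helps can be estimated with simple arithmetic. The sketch below assumes an idealized schedule in which every stage takes one time slot per micro-batch and the pipeline is drained before the next optimizer step; real schedulers and uneven stages will deviate from it. The function name is illustrative.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of an idealized pipeline.

    With p stages and m micro-batches, each stage is busy for m slots out of
    the m + p - 1 slots the whole pass takes, so (p - 1) slots are idle.
    """
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

With four stages, going from one micro-batch to 64 takes the idle share from 75% down to a few percent, which is why pipeline implementations rely on many small micro-batches.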

  • Diagnosis: Observe GPU memory usage. If the model hits out-of-memory errors on a single GPU even at a batch size of 1, data parallelism alone will not help and the model has to be split. After splitting, watch per-GPU utilization: long periods where some GPUs sit idle while others are busy indicate pipeline bubbles or an unbalanced partition.
  • Fix: Manually partition your model layers across devices or use frameworks like DeepSpeed or Megatron-LM that automate this. For manual partitioning in PyTorch: layer = layer.to('cuda:0'), then output = layer(input), then output = output.to('cuda:1'), and so on for subsequent layers.
  • Why it works: By distributing layers, you reduce the memory footprint on each individual GPU, allowing larger models to be trained. Micro-batching fills the pipeline, reducing idle time.
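
The manual partitioning described above can be sketched as follows. TwoStageMLP is a hypothetical toy model, and the script falls back to CPU for both stages when fewer than two GPUs are visible, so the device placement is only meaningful on a multi-GPU machine. The micro-batch loop here accumulates gradients sequentially; it does not overlap stages the way a real pipeline scheduler (for example in DeepSpeed or Megatron-LM) would.

```python
import torch
import torch.nn as nn

# Use two GPUs when present; otherwise fall back to CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")


class TwoStageMLP(nn.Module):
    """Hypothetical model: first half lives on dev0, second half on dev1."""

    def __init__(self):
        super().__init__()
        self.stage0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to(dev0)
        self.stage1 = nn.Sequential(nn.Linear(64, 10)).to(dev1)

    def forward(self, x):
        h = self.stage0(x.to(dev0))
        return self.stage1(h.to(dev1))  # activations cross the device boundary


torch.manual_seed(0)
model = TwoStageMLP()
batch = torch.randn(16, 32)
labels = torch.randint(0, 10, (16,))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Micro-batching: accumulate gradients over 4 chunks, then step once.
num_microbatches = 4
optimizer.zero_grad()
for x, y in zip(batch.chunk(num_microbatches), labels.chunk(num_microbatches)):
    loss = criterion(model(x), y.to(dev1)) / num_microbatches
    loss.backward()  # autograd routes gradients back across devices
optimizer.step()

print(f"stage0 on {dev0}, stage1 on {dev1}, last micro-batch loss {loss.item():.4f}")
```

Each GPU holds only its own stage's parameters and optimizer state, which is where the memory saving comes from.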

Tensor Parallelism is the most fine-grained. Instead of splitting layers or data batches, you split the tensors (weights and activations) within a single layer across multiple GPUs. For instance, a large matrix multiplication Y = XA can be split. If A is split column-wise (A = [A1, A2]), then Y = [XA1, XA2]. Each GPU computes a part of the output tensor. This is particularly effective for transformer models with large linear layers (like feed-forward networks and attention mechanisms). It requires significant communication between GPUs during the forward and backward passes of a single layer, as partial results need to be exchanged to complete the computation. Libraries like Megatron-LM are built around this.
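
The column-wise split Y = [XA1, XA2] can be verified numerically. This is a single-device simulation with made-up shapes: the two shards would sit on different GPUs in practice, and the final concatenation stands in for an all-gather.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 6, dtype=torch.float64)  # activations: batch 4, hidden 6
A = torch.randn(6, 8, dtype=torch.float64)  # full weight matrix

# Split A by columns; in a real setup each shard lives on its own GPU.
A1, A2 = A.chunk(2, dim=1)

# Each "GPU" needs all of X but only its own slice of A.
Y1 = X @ A1
Y2 = X @ A2

# Concatenating the partial outputs (an all-gather across GPUs) rebuilds Y.
Y = torch.cat([Y1, Y2], dim=1)
print(torch.allclose(Y, X @ A))
```

Each shard stores half of A and produces half of the output columns, so both the weight memory and the matmul work are divided between the two devices.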

  • Diagnosis: If individual layers (especially the large linear layers in transformers) are still too big for a single GPU even after splitting the model by layer, or if you have many GPUs and want to scale further, tensor parallelism is the next step. Because it adds communication inside every parallelized layer, it is usually kept to GPUs with a fast interconnect such as NVLink.
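  • Fix: Utilize libraries like Megatron-LM which implement specialized tensor-parallelized layers (e.g., ColumnParallelLinear, RowParallelLinear). These layers handle the splitting, the communication (e.g., an all-gather to reassemble the output of a column-wise split when the full tensor is needed, and an all-reduce to sum the partial outputs of a row-wise split, or a reduce-scatter when sequence parallelism is enabled), and the reassembly of tensor operations.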
  • Why it works: It breaks down the largest computational and memory bottlenecks within individual layers, allowing for scaling beyond what data or standard model parallelism can achieve for massive models.
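
A column-parallel layer followed by a row-parallel layer is the usual pairing for a transformer feed-forward block, because it needs only one reduction at the end. The sketch below simulates that pairing on a single device with made-up shapes; it is not Megatron-LM's code, and the final sum stands in for the all-reduce a real implementation would issue.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tp = 2  # assumed tensor-parallel degree
X = torch.randn(4, 6, dtype=torch.float64)
A = torch.randn(6, 8, dtype=torch.float64)  # first linear layer (expands)
B = torch.randn(8, 6, dtype=torch.float64)  # second linear layer (projects back)

# Reference: the unsplit feed-forward block.
reference = F.gelu(X @ A) @ B

A_shards = A.chunk(tp, dim=1)  # column-wise split of the first layer
B_shards = B.chunk(tp, dim=0)  # row-wise split of the second layer

# GeLU is element-wise, so each rank applies it to its own columns with no
# communication, then multiplies by its matching rows of B.
partials = [F.gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# Summing the partial outputs is what the all-reduce does across GPUs.
Z = torch.stack(partials).sum(dim=0)
print(torch.allclose(Z, reference))
```

Splitting the first layer by rows instead would force a reduction before the nonlinearity, since GeLU of a sum is not the sum of GeLUs.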

Often, you’ll combine these strategies. A common approach for very large models is to use tensor parallelism within a node (across GPUs on the same machine), pipeline-style model parallelism across nodes (if you have multiple machines), and data parallelism across the resulting replicas of the model. This layered approach, sometimes called 3D parallelism, is how models with hundreds of billions to trillions of parameters are trained.
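
One way to picture the combination is as a three-dimensional grid of ranks. The mapping below is an illustrative assumption, not the layout of any particular framework: tensor-parallel index varies fastest so that each tensor-parallel group occupies consecutive ranks, which typically share a node. The degrees 8, 4, and 2 are hypothetical.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple:
    """Map a global rank to (dp_index, pp_index, tp_index), tp varying fastest."""
    assert 0 <= rank < tp * pp * dp, "rank outside the tp * pp * dp grid"
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


tp, pp, dp = 8, 4, 2  # hypothetical: 8 GPUs per node, 64 GPUs in total
world_size = tp * pp * dp

for rank in (0, 7, 8, 31, 32, 63):
    print(rank, rank_to_coords(rank, tp, pp, dp))
```

Here ranks 0 to 7 form one tensor-parallel group, ranks 0 to 31 hold one complete copy of the model spread over four pipeline stages, and ranks 32 to 63 hold the second data-parallel replica.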

The choice depends on your model size, GPU memory, and the number of GPUs available. Data parallelism is the first step for speed-ups when the model fits. Model parallelism is for when the model itself is too big. Tensor parallelism is for pushing the boundaries of model size even further by splitting within layers.
