Scaling training across multiple GPUs usually means writing a fair amount of distributed boilerplate, and Hugging Face Accelerate is a widely used library for removing most of it.
Let’s see Accelerate in action. Imagine you have a script train.py that trains a simple BERT model for sequence classification.
# train.py
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizer
from accelerate import Accelerator
from datasets import load_dataset
# 1. Initialize Accelerator
accelerator = Accelerator()
# 2. Load data and model
dataset = load_dataset("glue", "sst2")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=1e-5)
# Preprocess data
def preprocess_function(examples):
    # Tokenize and pad every sentence to a fixed length of 128 tokens
    return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)
tokenized_datasets = dataset.map(preprocess_function, batched=True)
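train_dataset = tokenized_datasets["train"].shuffle(seed=42).remove_columns(["sentence", "idx"])
# The model's forward() takes "labels", while the GLUE column is named "label"
train_dataset = train_dataset.rename_column("label", "labels")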
train_dataset.set_format("torch")
# 3. Prepare everything with Accelerator
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True)
)
# Training loop
for epoch in range(3):
    model.train()
    for batch in train_loader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    # Log the last batch's loss once per epoch, from the main process only
    if accelerator.is_main_process:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
if accelerator.is_main_process:
    print("Training finished!")
To run this script on a single GPU, you’d normally just do python train.py. But to scale it, you use the accelerate command:
accelerate launch train.py
This single command orchestrates the distributed training. Accelerate handles the complexities of data parallelism, gradient synchronization, and device placement under the hood. You don’t need to manually wrap your model with DistributedDataParallel or manage communication primitives.
The core of Accelerate’s magic lies in the Accelerator object. When you initialize it, it detects your hardware setup (GPUs, TPUs, CPUs) and configures itself accordingly. The accelerator.prepare() method is where the heavy lifting happens. It takes your PyTorch model, optimizer, and data loaders and wraps them in a way that’s compatible with distributed training. For data parallelism, it automatically wraps your model with DistributedDataParallel (or the equivalent for TPUs) and handles moving batches to different devices.
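To make the data side of this concrete, here is a minimal, library-free sketch of how batches can be divided among processes. It assumes a simple round-robin scheme in which each process takes every N-th batch. The helper names (`batches`, `shard_batches`) are hypothetical, and the sketch ignores details such as how an uneven final batch is handled, so treat it as an illustration of the idea rather than Accelerate's actual implementation.

```python
from typing import Iterator, List


def batches(indices: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive batches of sample indices."""
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


def shard_batches(
    indices: List[int], batch_size: int, num_processes: int, process_index: int
) -> List[List[int]]:
    """Return the batches one process sees under round-robin sharding."""
    return [
        batch
        for i, batch in enumerate(batches(indices, batch_size))
        if i % num_processes == process_index
    ]


if __name__ == "__main__":
    samples = list(range(16))
    for rank in range(2):
        # Each rank receives a disjoint subset of the batches
        print(rank, shard_batches(samples, batch_size=4, num_processes=2, process_index=rank))
```

Each process works on different samples in the same step, which is why the gradients must be combined before the optimizer updates the weights.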
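The accelerator.backward(loss) call replaces the standard loss.backward(). It runs the backward pass in a way that matches the active configuration, for example applying gradient scaling when fp16 mixed precision is enabled. In a multi-GPU run, the DistributedDataParallel wrapper added by prepare() synchronizes gradients during this backward pass, so each GPU’s gradients are averaged before the optimizer step. This is crucial for maintaining consistent model updates across all replicas.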
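Why averaging is the right way to combine gradients can be checked with a small worked example. The sketch below uses a one-parameter linear model with a mean-squared-error loss (both chosen purely for illustration) and simulates two replicas in plain Python. It assumes every replica gets a shard of equal size, in which case the average of the per-replica gradients equals the gradient computed over the combined batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Each simulated replica sees half of the global batch
grad_replica_0 = grad_mse(w, xs[:2], ys[:2])
grad_replica_1 = grad_mse(w, xs[2:], ys[2:])

# Averaging step: combine the per-replica gradients
averaged = (grad_replica_0 + grad_replica_1) / 2

# Reference: gradient over the whole batch on a single device
full_batch = grad_mse(w, xs, ys)

print(averaged, full_batch)
```

Because both values agree, every replica applies the same update and the model copies stay identical from step to step.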
The accelerator.is_main_process check prevents side effects such as printing to the console or saving checkpoints from being repeated by every process. Only the main process (usually rank 0) performs these actions.
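The effect of this guard can be pictured with a toy simulation. Every process runs the same script, so an unguarded print would appear once per process. The sketch below models the main-process check as `rank == 0`, which is an assumption made for illustration; `log_on_main` and `sink` are hypothetical names, not part of Accelerate.

```python
def log_on_main(rank: int, message: str, sink: list) -> None:
    """Record the message only when called from the (simulated) main process."""
    if rank == 0:
        sink.append(message)


sink = []
# Simulate four processes all executing the same logging line
for rank in range(4):
    log_on_main(rank, "Epoch 0 done", sink)

print(sink)
```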
The real power comes when you configure Accelerate for multi-GPU. After installing Accelerate (pip install accelerate), you run accelerate config. This interactive script asks you questions about your environment:
- Do you want to use mixed precision? (e.g., fp16)
- How many processes? (e.g., 2 for two GPUs)
- Which machine type? (multi-GPU or single-GPU)
- Which deep learning framework? (pytorch)
Once configured, Accelerate saves a default_config.yaml file. For example, a multi-GPU setup might look like this:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: 127.0.0.1
main_process_port: 29500
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
Now, accelerate launch train.py will automatically use this configuration, distributing the training across the specified number of GPUs with the chosen precision.
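One practical consequence of this configuration is worth spelling out with arithmetic. The batch_size=8 passed to the DataLoader in the script applies to each process, so with num_processes: 2 the model effectively sees 16 samples per optimizer step. The sketch below assumes Accelerate's default behavior of giving each process its own full-size batch rather than splitting one batch between processes. The dataset size of 1000 is a hypothetical figure for illustration, and the function names are not part of any library.

```python
import math


def global_batch_size(per_process_batch_size: int, num_processes: int) -> int:
    """Samples consumed per optimizer step across all processes."""
    return per_process_batch_size * num_processes


def steps_per_epoch(num_samples: int, per_process_batch_size: int, num_processes: int) -> int:
    """Approximate optimizer steps each process takes per epoch."""
    return math.ceil(num_samples / global_batch_size(per_process_batch_size, num_processes))


# Values from the script (batch_size=8) and the config above (num_processes: 2)
print(global_batch_size(8, 2))

# Hypothetical dataset of 1000 samples
print(steps_per_epoch(1000, 8, 2))
```

Since the effective batch size grows with the number of processes, hyperparameters tuned on a single GPU, the learning rate in particular, may need revisiting when you scale out.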
What most users don’t realize is that accelerator.prepare also handles device placement. You don’t need to manually call .to(accelerator.device) on your model or on the batches coming out of the prepared data loader. Accelerate moves them to the correct device for each process. Tensors you create yourself outside the prepared objects still need to be placed on accelerator.device. This significantly simplifies your training code, allowing you to write a single script that works out-of-the-box for single-GPU, multi-GPU, and even TPU training with minimal or no modifications.
The next step in scaling might involve exploring DeepSpeed or FSDP integration, which Accelerate also supports through its configuration.