Hugging Face’s accelerate library is your best friend here. It’s not just for distributed training; it handles inference too, sharding models across GPUs so you don’t have to split weights manually.

Let’s get a large model, say meta-llama/Llama-2-7b-hf, running across two GPUs.

First, ensure you have accelerate installed:

pip install accelerate transformers bitsandbytes

bitsandbytes provides the 8-bit and 4-bit quantization used in this walkthrough, which helps fit larger models into less VRAM.
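Before going further, it can help to confirm that the packages are actually visible to the Python interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED list are illustrative, not part of any Hugging Face API.

```python
import importlib.util

# Packages the script in this walkthrough imports or relies on.
REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that the packages can be found; it does not verify versions or that a CUDA-capable GPU is present.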

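Now, let’s set up accelerate. This step is optional for the script below, because device_map="auto" does not read the resulting file (accelerate launch does), but it is useful to have. Run this command in your terminal: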

accelerate config

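This will ask you a series of questions, and the exact prompts vary between accelerate versions. For the single-process sharding used below, the answers that matter are:

  • This machine, for the compute environment
  • No distributed training, for the machine type: the model is split across GPUs inside one process. multi-GPU is a different answer to the same question; it starts one process per GPU (you would enter 2 for the number of GPUs), and each process would load its own copy of the model.
  • 0,1 if you are asked which GPUs to use, assuming you have two GPUs and want the first two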

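The accelerate config command generates a default_config.yaml file in your ~/.cache/huggingface/accelerate/ directory. This file tells accelerate launch how to start your script (number of processes, which GPUs, mixed precision). It does not decide how device_map="auto" splits the model; that is worked out at load time from the GPUs the process can see.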
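Independently of the config file, the GPUs a Python process can see can also be limited with the CUDA_VISIBLE_DEVICES environment variable. Here is a minimal sketch, assuming you want the first two GPUs; the helper name restrict_visible_gpus is hypothetical.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU ids to CUDA libraries in this process.

    This has to happen before CUDA is initialized; the safest place is
    at the very top of the script, before importing torch.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Make only GPU 0 and GPU 1 visible to this process.
restrict_visible_gpus([0, 1])
# import torch  # import torch only after the variable is set
```

Setting the variable in the shell (for example, prefixing the command with CUDA_VISIBLE_DEVICES=0,1) has the same effect and avoids any ordering concerns inside the script.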

Here’s the Python code to load and run the model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Load tokenizer and model.
# device_map="auto" shards the layers across the visible GPUs (this needs accelerate
# installed); the quantization config loads the weights in 8-bit via bitsandbytes.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,  # dtype for the modules that stay unquantized
)

# The model is already placed on its GPUs at this point, so there is no
# accelerator.prepare() call; print(model.hf_device_map) shows the placement.

# Prepare input and move it to the device that holds the first layers
prompt = "The future of AI is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate text (no_grad disables gradient tracking during inference)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50, top_p=0.95)

# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

When a model is loaded with device_map="auto", transformers hands placement over to accelerate, which looks at the size of each block of layers and at the memory available on the GPUs visible to the process. It then assigns blocks to devices in order, filling one GPU before moving to the next. For instance, the initial layers might go on GPU 0, and subsequent layers on GPU 1. The resulting map is stored on the model as model.hf_device_map, which you can print to see where each block landed.
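To make the idea of filling one GPU and then the next concrete, here is a simplified sketch of sequential placement. It is an illustration only, not accelerate’s actual algorithm, and the function name, layer sizes, and budgets are all made up for the example.

```python
def assign_layers(layer_sizes_gib, gpu_budgets_gib):
    """Assign layers to GPUs in order, filling each GPU before moving on.

    Returns a list with one GPU index per layer. Raises ValueError if the
    layers do not fit. Simplified illustration, not accelerate's real logic.
    """
    placement = []
    gpu = 0
    used = 0.0
    for size in layer_sizes_gib:
        # Move to the next GPU while the current one cannot hold this layer.
        while gpu < len(gpu_budgets_gib) and used + size > gpu_budgets_gib[gpu]:
            gpu += 1
            used = 0.0
        if gpu == len(gpu_budgets_gib):
            raise ValueError("model does not fit in the given GPU budgets")
        placement.append(gpu)
        used += size
    return placement


# Hypothetical numbers: 8 equal layers of 1 GiB, two GPUs with 5 GiB each.
print(assign_layers([1.0] * 8, [5.0, 5.0]))  # → [0, 0, 0, 0, 0, 1, 1, 1]
```

Layers stay in execution order, so each GPU holds one contiguous slice of the network and activations cross a GPU boundary only once per boundary.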

The load_in_8bit=True flag uses bitsandbytes to quantize the model weights to 8-bit precision. Each quantized weight then takes one byte instead of the two bytes it needs in float16, which roughly halves the VRAM used by the weights and lets you fit larger models onto the same GPUs.

The torch_dtype=torch.float16 argument keeps the parts of the model that are not quantized in half precision instead of float32, which trims memory usage a little further.
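A back-of-the-envelope calculation shows what these settings buy you. The sketch below counts weight memory only and assumes roughly 7 billion parameters for a "7B" model; activations, the KV cache, and quantization overhead come on top.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gib(n_params, precision):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30


n_params = 7e9  # assumption: roughly 7 billion parameters
for precision in BYTES_PER_PARAM:
    print(f"{precision:>8}: {weight_memory_gib(n_params, precision):5.1f} GiB")
```

Under these assumptions the weights need about 13 GiB in float16 and about 6.5 GiB in 8-bit.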

Input tensors need to sit on the device that holds the first layers of the model, which with device_map="auto" is normally the first GPU. model.device reports that device, so calling .to(model.device) on the tokenizer output is a reliable way to place the inputs. accelerator.device usually resolves to the same GPU in a single-process run, but it tracks the process’s default device rather than where the model’s layers were actually placed.

The with torch.no_grad(): block is standard practice for inference to disable gradient calculations, saving memory and speeding up computation.

The model.generate call then performs the forward pass through the sharded model. accelerate seamlessly orchestrates the data flow between the GPUs as the activations are passed from one layer to the next, even if those layers reside on different GPUs.
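To see where activations actually cross from one GPU to another, you can scan a device map for the points where the device changes. The sketch below works on a plain dictionary in the style of model.hf_device_map; the example map is hypothetical (a 4-layer model), and the helper name device_hops is made up for illustration.

```python
def device_hops(device_map):
    """List (module, from_device, to_device) for each change of device.

    Assumes the dict is ordered the way the modules execute.
    """
    hops = []
    previous = None
    for module, device in device_map.items():
        if previous is not None and device != previous:
            hops.append((module, previous, device))
        previous = device
    return hops


# Hypothetical map in the style of model.hf_device_map for a 4-layer model.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 1,
    "model.layers.3": 1,
    "model.norm": 1,
    "lm_head": 1,
}
print(device_hops(example_map))  # → [('model.layers.2', 0, 1)]
```

Each entry marks a module whose input has to be copied from the previous GPU, so a short list means little inter-GPU traffic per forward pass.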

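Note that accelerator.prepare() is not needed for this setup. Loading with device_map="auto" already puts each part of the model on its designated GPU and installs the hooks that move tensors between GPUs during the forward pass. prepare() is aimed at training and multi-process launches; for a model dispatched this way it adds nothing in a single-process run, and under a multi-process launch it can raise an error for a quantized model spread over several GPUs.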

You might encounter an error related to CUDA memory if your model is still too large for the combined VRAM of your GPUs, even with 8-bit quantization. In that case, you’d need to either use a smaller model, a more aggressive quantization method (like 4-bit, which is also supported by bitsandbytes), or more GPUs.
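The decision described above can be roughed out before you download anything. The sketch below picks the widest precision whose weights fit in the combined VRAM; the function name choose_bits and the 0.8 headroom factor are arbitrary assumptions, and the estimate ignores the fact that individual layers cannot be split across GPUs.

```python
def choose_bits(n_params, gpu_gib, headroom=0.8, candidates=(16, 8, 4)):
    """Return the widest weight precision (in bits) whose weights fit, or None.

    gpu_gib: VRAM per GPU in GiB.
    headroom: fraction of total VRAM allowed for weights; the rest is left
    for activations and the KV cache (0.8 is an arbitrary assumption).
    """
    budget_gib = sum(gpu_gib) * headroom
    for bits in candidates:
        weights_gib = n_params * bits / 8 / 2**30
        if weights_gib <= budget_gib:
            return bits
    return None  # nothing fits: use a smaller model or add GPUs


print(choose_bits(7e9, [8, 8]))     # 7B model on two 8 GiB GPUs
print(choose_bits(70e9, [24, 24]))  # 70B model on two 24 GiB GPUs
print(choose_bits(70e9, [16, 16]))  # 70B model on two 16 GiB GPUs
```

A result of 4 corresponds to loading with 4-bit quantization through bitsandbytes, and None means that neither 8-bit nor 4-bit is enough for that hardware.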

The next hurdle you’ll likely face is optimizing the speed of inference, which often involves techniques like quantization-aware fine-tuning, using optimized kernels (like FlashAttention), or exploring different batching strategies.
