Batch inference on Hugging Face models can be surprisingly tricky to optimize, and the most impactful gains often come from understanding how the model’s internal structure interacts with your hardware, not just from tweaking batch sizes.
Let’s see it in action. Imagine we’re running a BERT model for sequence classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import time
# Load model and tokenizer (the classification head is randomly initialized here, which is fine for timing)
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval() # Set model to evaluation mode
# Prepare some dummy data
texts = [
    "This is a great movie!",
    "I really disliked the ending.",
    "The acting was superb.",
    "A complete waste of time.",
    "I would recommend this to everyone.",
] * 1000  # Create a large list of texts
# Tokenize texts
# Setting padding='max_length' for simplicity here, but dynamic padding is better for perf.
# truncation=True is important to avoid extremely long sequences.
encoded_inputs = tokenizer(texts, padding='max_length', truncation=True, return_tensors='pt')
# Move tokenized inputs to the same device as the model
for k, v in encoded_inputs.items():
    encoded_inputs[k] = v.to(device)
# --- Benchmarking ---
batch_size = 32
num_batches = len(texts) // batch_size  # full batches only; any remainder is skipped
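# Warm-up runs on a single batch (a forward pass over the whole dataset at once can exhaust GPU memory)
warmup_inputs = {k: v[:batch_size] for k, v in encoded_inputs.items()}
for _ in range(5):
    with torch.no_grad():
        outputs = model(**warmup_inputs)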
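# Timed runs (CUDA kernels launch asynchronously, so synchronize before reading the clock)
if device.type == "cuda":
    torch.cuda.synchronize()
start_time = time.perf_counter()
for i in range(num_batches):
    batch_start = i * batch_size
    batch_end = batch_start + batch_size
    batch_inputs = {k: v[batch_start:batch_end] for k, v in encoded_inputs.items()}
    with torch.no_grad():
        outputs = model(**batch_inputs)
if device.type == "cuda":
    torch.cuda.synchronize()
end_time = time.perf_counter()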
avg_time_per_batch = (end_time - start_time) / num_batches
throughput_samples_per_sec = batch_size / avg_time_per_batch
print(f"Batch Size: {batch_size}")
print(f"Average time per batch: {avg_time_per_batch:.4f} seconds")
print(f"Throughput: {throughput_samples_per_sec:.2f} samples/sec")
# Example of dynamic padding (more efficient)
# from transformers import DataCollatorWithPadding
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# features = [tokenizer(t, truncation=True) for t in texts[:batch_size]]  # tokenized, not yet padded
# batch = data_collator(features)  # padded to the longest sequence in this batch
This code demonstrates a basic setup. We load a BERT model, tokenize a list of texts, and then iterate through batches to get predictions. The key is torch.no_grad(), which disables gradient tracking so that no autograd graph is built during the forward pass, saving memory and computation. Moving both the model and the data to the GPU (.to(device)) is crucial for performance.
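To see what torch.no_grad() changes, here is a minimal sketch using a small linear layer as a stand-in for the model (the layer and tensor sizes are illustrative assumptions, not part of the benchmark above):

```python
import torch

# A tiny layer stands in for the full model; its parameters require gradients by default.
layer = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

# Normal forward pass: autograd records the operation so it could backpropagate later.
tracked = layer(x)

# Inside no_grad, the same forward pass records nothing.
with torch.no_grad():
    untracked = layer(x)

print(tracked.requires_grad, untracked.requires_grad)  # True False
```

The values of the two outputs are identical; only the bookkeeping differs, which is why inference code can skip it safely.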
The problem this solves is the latency and cost associated with running inference on large datasets. Individual requests can be slow, and processing millions of items one by one is inefficient. Batching groups multiple requests together, allowing the hardware (especially GPUs) to process them in parallel, significantly increasing throughput and reducing per-request cost.
Internally, models like BERT are stacks of matrix multiplications and other tensor operations. When you feed a batch, the GPU’s parallel processing units apply these operations to all samples in the batch at once. For example, a large matrix multiplication within a transformer layer can be computed for every sequence in the batch together, exploiting the GPU’s data-parallel execution model (SIMD-style; NVIDIA calls its variant SIMT, Single Instruction, Multiple Threads).
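As a rough illustration of that point, the sketch below applies one weight matrix to a batch of sequences two ways: one sample at a time, and as a single batched operation. The tensor sizes are arbitrary assumptions chosen to keep it fast on any machine.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 32 sequences, 16 tokens each, hidden width 64.
batch, seq_len, hidden = 32, 16, 64
x = torch.randn(batch, seq_len, hidden)
weight = torch.randn(hidden, hidden)

# One matrix multiplication per sample, issued sequentially.
looped = torch.stack([x[i] @ weight for i in range(batch)])

# The same math as one batched multiplication; the weight is broadcast over the batch.
batched = x @ weight

print(batched.shape)
print(torch.allclose(looped, batched, atol=1e-4))
```

Both paths produce the same numbers (up to floating-point rounding). The batched form simply hands the hardware all the work at once instead of in 32 small pieces.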
The primary levers you control are:
- Batch Size: The number of samples processed in a single forward pass. Larger batch sizes generally lead to higher throughput up to a point, limited by GPU memory and model architecture.
- Input Sequence Length: Longer sequences require more computation and memory. Truncation and padding strategies are vital here.
- Model Architecture: Different models have different computational footprints. Smaller, distilled, or quantized models are faster.
- Hardware: GPUs are significantly faster than CPUs for these parallelizable tasks. The specific GPU model and its memory capacity are key.
- Data Preprocessing: Efficient tokenization and batching (especially dynamic padding) are critical.
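The first lever is the easiest to explore empirically. The following is a minimal sketch of a batch-size sweep, assuming a tiny, randomly initialized BERT built from a hand-written BertConfig so that it runs without downloading weights. The config values, the helper names sync and samples_per_sec, and the chosen batch sizes are all illustrative assumptions; real numbers depend entirely on your model and hardware.

```python
import time

import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny random-weight model (assumed sizes, far smaller than bert-base-uncased).
config = BertConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    max_position_embeddings=128,
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification(config).to(device).eval()


def sync():
    """Wait for queued GPU work so wall-clock timings are meaningful."""
    if device.type == "cuda":
        torch.cuda.synchronize()


def samples_per_sec(batch_size, seq_len=64, repeats=5):
    """Time `repeats` forward passes on random token IDs and return throughput."""
    input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len), device=device)
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():
        model(input_ids=input_ids, attention_mask=attention_mask)  # warm-up pass
        sync()
        start = time.perf_counter()
        for _ in range(repeats):
            model(input_ids=input_ids, attention_mask=attention_mask)
        sync()
        elapsed = time.perf_counter() - start
    return batch_size * repeats / elapsed


results = {bs: samples_per_sec(bs) for bs in (1, 8, 32)}
for bs, rate in results.items():
    print(f"batch_size={bs:>3}: {rate:,.0f} samples/sec")
```

On a real model you would extend the sweep until throughput flattens or memory runs out, and pick the largest batch size that still meets your latency budget.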
A common pitfall is using fixed-length padding (padding='max_length') for all inputs when your sequences have vastly different lengths. With no explicit max_length, every input is padded to the model’s maximum (512 tokens for bert-base-uncased), so most of the computation is spent on padding tokens. DataCollatorWithPadding instead pads each batch only to the longest sequence within that batch, which is much more efficient. To use it, tokenize without padding, pass a list of the tokenized examples to the data collator, and it returns a padded batch ready for the model.
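To put numbers on the difference, here is a back-of-the-envelope sketch in plain Python. It counts how many token slots the model has to process under each strategy. The token lengths are hypothetical stand-ins for short sentences, and the helper names are made up for this illustration.

```python
def slots_fixed(lengths, max_length):
    """Every sequence is padded to max_length."""
    return len(lengths) * max_length


def slots_dynamic(lengths, batch_size):
    """Each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for 5,000 short sentences (special tokens included).
lengths = [8, 9, 7, 8, 9] * 1000

real_tokens = sum(lengths)
fixed = slots_fixed(lengths, max_length=512)
dynamic = slots_dynamic(lengths, batch_size=32)

print(f"real tokens:     {real_tokens:>9,}")
print(f"fixed padding:   {fixed:>9,} slots")
print(f"dynamic padding: {dynamic:>9,} slots")
print(f"reduction:       {fixed / dynamic:.1f}x")
```

Under these assumed lengths, fixed padding processes roughly 57 times as many token slots as dynamic padding. The gap shrinks as your real sequences approach the maximum length, so measure on data that resembles production traffic.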
The next major hurdle after optimizing batch size and padding is understanding how to effectively leverage multiple GPUs or even multiple machines for inference.