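The most surprising truth about PyTorch GPU memory profiling is that torch.cuda.memory_allocated() and torch.cuda.max_memory_allocated() are often misleading for understanding actual GPU memory pressure: they count only the bytes occupied by live tensors, not the memory PyTorch’s allocator is holding on the device.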
Let’s see what’s actually happening on the GPU. Imagine you have a simple PyTorch script that trains a small model.
import torch
import torch.nn as nn
import torch.optim as optim

# Simple model
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(100, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNN().cuda()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
# Simulate some data
data = torch.randn(64, 100).cuda()
target = torch.randint(0, 10, (64,)).cuda()
# Let's track memory
print(f"Initial allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print(f"Initial cached: {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
# First forward pass
output = model(data)
loss = criterion(output, target)
print(f"After forward pass (allocated): {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print(f"After forward pass (cached): {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
# Backward pass and optimization
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"After backward and step (allocated): {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print(f"After backward and step (cached): {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
# Simulate another iteration to see caching
data2 = torch.randn(64, 100).cuda()
output2 = model(data2)
loss2 = criterion(output2, target)
loss2.backward()
optimizer.step()
optimizer.zero_grad()
print(f"After second iteration (allocated): {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print(f"After second iteration (cached): {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
# Let's clear cache to see what happens
torch.cuda.empty_cache()
print(f"After empty_cache (allocated): {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
print(f"After empty_cache (cached): {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
When you run this, you’ll typically notice that memory_allocated() rises and falls as tensors are created and freed, while memory_reserved() climbs and then barely decreases, even after zero_grad(). Calling empty_cache() lowers memory_reserved() only by the blocks that no live tensor occupies, and it never changes memory_allocated(). This is because PyTorch’s CUDA memory allocator is designed for speed, not immediate deallocation. It is a caching allocator. When you allocate memory, PyTorch asks the CUDA driver for a block. When a tensor is freed, PyTorch doesn’t return that block to the driver. Instead, it keeps the block in a pool, ready for the next allocation. This avoids the overhead of frequent CUDA driver calls, which can be slow.
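To make the pooling behavior concrete, here is a toy model of a caching allocator in plain Python. It is only a sketch of the idea described above, not PyTorch’s implementation: the class ToyCachingAllocator and its methods are hypothetical names, and it ignores details the real allocator handles, such as block splitting, size rounding, and streams.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator (hypothetical, sizes in bytes)."""

    def __init__(self):
        self.reserved = 0       # bytes obtained from the pretend driver
        self.allocated = 0      # bytes currently backing live "tensors"
        self.free_blocks = []   # sizes of cached blocks that are not in use
        self.driver_calls = 0   # how many times we had to ask the driver

    def malloc(self, size):
        # Reuse the smallest cached block that is large enough, if there is one.
        candidates = [b for b in self.free_blocks if b >= size]
        if candidates:
            block = min(candidates)
            self.free_blocks.remove(block)
        else:
            # Nothing suitable in the pool: get a new block from the driver.
            block = size
            self.reserved += block
            self.driver_calls += 1
        self.allocated += block
        return block

    def free(self, block):
        # The block goes back to the pool, not to the driver.
        self.allocated -= block
        self.free_blocks.append(block)

    def empty_cache(self):
        # Hand every unused cached block back to the driver.
        self.reserved -= sum(self.free_blocks)
        self.free_blocks.clear()


alloc = ToyCachingAllocator()
a = alloc.malloc(1024)
b = alloc.malloc(2048)

alloc.free(a)
after_free = (alloc.allocated, alloc.reserved)

c = alloc.malloc(512)  # served from the cached 1024-byte block
after_reuse = (alloc.allocated, alloc.reserved, alloc.driver_calls)

alloc.free(c)
alloc.empty_cache()
after_empty = (alloc.allocated, alloc.reserved)

print(after_free, after_reuse, after_empty)
```

In this model, freeing a block lowers allocated but leaves reserved unchanged, the later request is served from the pool without another driver call, and empty_cache() returns only the blocks that nothing is using.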
The key levers you control are primarily around how you manage tensors and the PyTorch caching allocator.
- Tensor Management:
  - torch.cuda.empty_cache(): This is a blunt instrument. It tells PyTorch to release all unoccupied cached memory back to the CUDA driver. It doesn’t free memory currently held by active tensors, and later allocations have to go back to the driver, so calling it inside a training loop usually costs speed.
  - tensor.detach(): If you no longer need a tensor’s gradient history, call .detach() and keep the result instead of the original. It returns a new tensor that shares the same storage but is cut off from the computational graph; it does not modify the original. Once nothing else references the original, the graph and the activations it saved can be freed sooner.
  - del tensor: Explicitly delete tensors when they are no longer needed, especially large ones. Once the last reference is gone, the tensor is freed and its block returns to PyTorch’s cache, not to the driver, so memory_allocated() drops while memory_reserved() does not.
- PyTorch Allocator Settings:
  - torch.backends.cudnn.enabled = True (default): Enables cuDNN, whose algorithms can request extra workspace memory. Disabling it can sometimes alter memory behavior but usually at a performance cost.
  - torch.cuda.set_per_process_memory_fraction(fraction, device=None): This is a more advanced way to limit the total memory PyTorch’s allocator can use on a specific device. For example, torch.cuda.set_per_process_memory_fraction(0.8, 0) tells PyTorch’s allocator not to use more than 80% of GPU 0’s memory. If an allocation would exceed this limit, the allocator raises an out-of-memory error (a RuntimeError). This is useful for keeping one process from taking the whole GPU when several processes share it.
  - torch.cuda.memory_stats(): This returns a dictionary of allocator counters, including current and peak allocated and reserved bytes, inactive split blocks (a sign of fragmentation), and counts of allocation retries and OOMs.
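The dictionary returned by torch.cuda.memory_stats() is long, so a small helper that pulls out a few counters can be handy. The sketch below assumes the key names "allocated_bytes.all.current", "reserved_bytes.all.current", "allocated_bytes.all.peak", "num_alloc_retries", and "num_ooms"; the function name summarize_memory_stats is hypothetical, and the sample dictionary holds hand-written values used only to show the arithmetic, not measurements.

```python
def summarize_memory_stats(stats):
    """Condense a torch.cuda.memory_stats()-style dict into a few figures.

    summarize_memory_stats is an illustrative helper, not part of PyTorch.
    Missing keys are treated as zero.
    """
    mib = 1024 ** 2
    allocated = stats.get("allocated_bytes.all.current", 0)
    reserved = stats.get("reserved_bytes.all.current", 0)
    return {
        "allocated_mib": allocated / mib,
        "reserved_mib": reserved / mib,
        # Reserved but not backing any tensor: what empty_cache() could release.
        "cached_unused_mib": (reserved - allocated) / mib,
        "peak_allocated_mib": stats.get("allocated_bytes.all.peak", 0) / mib,
        "alloc_retries": stats.get("num_alloc_retries", 0),
        "ooms": stats.get("num_ooms", 0),
    }


# Hand-written sample values, not real measurements.
sample_stats = {
    "allocated_bytes.all.current": 16 * 1024 ** 2,
    "allocated_bytes.all.peak": 18 * 1024 ** 2,
    "reserved_bytes.all.current": 20 * 1024 ** 2,
    "num_alloc_retries": 0,
    "num_ooms": 0,
}

summary = summarize_memory_stats(sample_stats)
for name, value in summary.items():
    print(f"{name}: {value}")
```

On a machine with a CUDA build of PyTorch, passing torch.cuda.memory_stats() instead of sample_stats should give the live figures for the current device.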
The one thing that often trips people up is the distinction between torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). memory_allocated() is the memory currently in use by tensors. memory_reserved() is the total memory that PyTorch’s allocator has obtained from the CUDA driver and is holding onto, whether it’s currently backing a tensor or sitting in the cache. After loss.backward(), the activations saved for the backward pass are freed, and optimizer.zero_grad() releases the gradients when it sets them to None (the default in recent PyTorch versions). memory_allocated() drops accordingly, but memory_reserved() might stay high because the allocator holds onto that memory for future use.
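The difference is easy to observe directly. The following is a minimal sketch, assuming a CUDA-capable build of PyTorch; measure_release and num_floats are illustrative names, and the exact numbers it prints will vary with GPU, driver, and PyTorch version. Without a GPU it reports that there is nothing to measure.

```python
import torch


def measure_release(num_floats=4 * 1024 * 1024):
    """Record (allocated, reserved) bytes at three points; None without CUDA.

    measure_release and num_floats are illustrative names, not PyTorch API.
    """
    if not torch.cuda.is_available():
        return None

    def snapshot():
        return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()

    readings = {}
    x = torch.empty(num_floats, device="cuda")  # float32, so about 16 MiB
    readings["tensor live"] = snapshot()
    del x  # last reference gone: the block returns to PyTorch's cache
    readings["after del"] = snapshot()
    torch.cuda.empty_cache()  # unused cached blocks go back to the driver
    readings["after empty_cache"] = snapshot()
    return readings


readings = measure_release()
if readings is None:
    print("CUDA not available; nothing to measure.")
else:
    for label, (allocated, reserved) in readings.items():
        print(f"{label}: allocated={allocated / 1024**2:.2f} MiB, "
              f"reserved={reserved / 1024**2:.2f} MiB")
```

The pattern to look for is that the allocated figure falls right after del, while the reserved figure falls only after empty_cache().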
The next common pitfall after mastering memory allocation is understanding and optimizing gradient accumulation.