NVIDIA’s H100 GPU is a beast for LLMs, and FP8 precision is the secret sauce to really unleash its potential for both training and inference.

Let’s see it in action. Imagine you’re training a Llama 2 70B model. Normally, you’d use FP16 or BF16, which are 16-bit floating-point formats. This means each number takes up 16 bits of memory.

# Example: FP16 tensor
import torch
tensor_fp16 = torch.randn(1024, 1024, dtype=torch.float16)
print(tensor_fp16.element_size() * tensor_fp16.nelement()) # Output: 2097152 bytes (2MB)

Now, with H100 and FP8, we can use an 8-bit floating-point format. This effectively halves the memory footprint per number.

# Example: FP8 tensor (needs a PyTorch build with float8 dtypes, 2.1 or newer)
# Create the tensor in a higher precision, then cast it down to the E4M3 FP8 dtype.
# This only illustrates storage size; real FP8 pipelines also keep a scaling factor per tensor.
import torch
tensor_fp8 = torch.randn(1024, 1024).to(torch.float8_e4m3fn)
print(tensor_fp8.element_size() * tensor_fp8.nelement()) # Output: 1048576 bytes (1MB)

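This memory reduction is huge. It means you can fit larger models into GPU memory, or use larger batch sizes, both of which can shorten training time and raise inference throughput. The full 2x saving applies most directly to inference weights; in training, master weights and optimizer state are typically kept in higher precision, so the overall saving is smaller.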
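To put numbers on this for the Llama 2 70B example, here is a rough weight-only estimate. It is a back-of-the-envelope sketch: the parameter count is rounded to 70 billion, the 80 GB figure assumes the 80 GB H100 variant, and activations, KV cache, and optimizer state are ignored.

```python
# Rough weight-only memory estimate; ignores activations, KV cache,
# optimizer state, and any higher-precision master weights.
PARAMS = 70e9          # Llama 2 70B, parameter count rounded (assumption)
H100_MEMORY_GB = 80    # assumed 80 GB H100 variant

def weight_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

fp16_gb = weight_gb(PARAMS, 2)  # FP16/BF16: 2 bytes per value
fp8_gb = weight_gb(PARAMS, 1)   # FP8: 1 byte per value

print(f"FP16/BF16 weights: {fp16_gb:.0f} GB")
print(f"FP8 weights:       {fp8_gb:.0f} GB")
print(f"FP8 weights fit in {H100_MEMORY_GB} GB: {fp8_gb <= H100_MEMORY_GB}")
```

Under these assumptions the 16-bit weights alone need two GPUs, while the FP8 weights fit on one, with little room left for anything else.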

The magic behind FP8 on H100 comes from its Tensor Cores working together with the "Transformer Engine," a hardware and software co-design that mixes FP8 with 16-bit precision during computation. Matrix multiplications in the forward and backward passes, which dominate the compute, run in FP8, with per-tensor scaling factors keeping values inside FP8's narrow range. Operations that are more sensitive to precision (typically things like softmax and normalization) stay in FP16/BF16 or FP32. This mixed-precision approach is key: it gives you most of the speed and memory benefits of FP8 while keeping accuracy close to a 16-bit baseline.

The goal of FP8 precision is to accelerate deep learning workloads by reducing the memory bandwidth and compute requirements. LLMs, with their massive parameter counts and numerous matrix multiplications, are prime candidates for this optimization. By using 8-bit floating-point numbers instead of 16-bit, you can:

  • Halve the memory per value: Fit larger models or larger batch sizes on the same hardware.
  • Increase compute throughput: Process more data per unit of time. NVIDIA quotes roughly twice the dense Tensor Core throughput for FP8 compared with FP16 on H100.
  • Reduce energy consumption: Less data movement and simpler computations often translate to lower power draw.

To leverage FP8, you typically need libraries and frameworks that support it, such as NVIDIA's Transformer Engine library (which sits on top of CUDA) together with a higher-level framework like PyTorch. The process involves quantizing your model's weights and activations to FP8 format, usually with a scaling factor per tensor so that values land inside the representable range. For training, the Transformer Engine handles much of this automatically, casting to FP8 where appropriate. For inference, you might explicitly load a quantized model or use an inference optimization library such as TensorRT-LLM.
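To make the quantization step concrete, here is a pure-Python simulation of per-tensor scaled quantization to E4M3. It is a sketch for intuition, not how any library implements it: the helper names (`e4m3_grid`, `round_to_e4m3`, `quantize_dequantize`) are made up for this example, ties are broken toward the smaller value rather than round-to-nearest-even, and real kernels do this on the GPU.

```python
import random

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def e4m3_grid():
    """All non-negative finite E4M3 values (4 exponent bits, bias 7, 3 mantissa bits)."""
    values = set()
    for exp_field in range(16):
        for man_field in range(8):
            if exp_field == 15 and man_field == 7:
                continue  # this bit pattern encodes NaN
            if exp_field == 0:
                v = (man_field / 8) * 2.0 ** (1 - 7)      # subnormal
            else:
                v = (1 + man_field / 8) * 2.0 ** (exp_field - 7)
            values.add(v)
    return sorted(values)

GRID = e4m3_grid()

def round_to_e4m3(x: float) -> float:
    """Snap x to the nearest representable magnitude, saturating at E4M3_MAX."""
    mag = min(abs(x), E4M3_MAX)
    nearest = min(GRID, key=lambda g: abs(g - mag))
    return nearest if x >= 0 else -nearest

def quantize_dequantize(values):
    """Scale so the largest magnitude maps to E4M3_MAX, quantize, then undo the scale."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    quantized = [round_to_e4m3(v * scale) for v in values]
    return [q / scale for q in quantized], scale

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]  # toy "weight tensor"
restored, scale = quantize_dequantize(weights)
amax = max(abs(w) for w in weights)
max_abs_err = max(abs(r - w) for r, w in zip(restored, weights))
print(f"scale factor: {scale:.1f}")
print(f"max abs error: {max_abs_err:.6f} (amax = {amax:.6f})")
```

Without the scale factor, weights this small would mostly collapse into E4M3's subnormal range or to zero; scaling first spreads them across the whole grid.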

The key levers you control are primarily the quantization strategy and the batch size. For training, you configure the Transformer Engine's behavior, such as which FP8 format and scaling recipe it uses (though the defaults are often a reasonable starting point). For inference, you might choose between different FP8 quantized versions of a model or tune the batch size to maximize throughput.
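For training, that configuration usually takes the form of a "recipe" passed to an autocast context. The sketch below follows the pattern in Transformer Engine's PyTorch quickstart (`te.Linear`, `recipe.DelayedScaling`, `te.fp8_autocast`); exact names and defaults may differ between library versions, and the layer sizes and `amax_history_len` value are arbitrary choices for illustration. It only runs on an FP8-capable GPU with the `transformer_engine` package installed, so the demo is guarded.

```python
import importlib.util

def fp8_linear_forward():
    """One FP8 forward/backward pass through a Transformer Engine linear layer."""
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # HYBRID selects E4M3 for the forward pass and E5M2 for backward-pass gradients.
    fp8_recipe = recipe.DelayedScaling(
        fp8_format=recipe.Format.HYBRID,
        amax_history_len=16,        # illustrative value, not a recommendation
        amax_compute_algo="max",
    )
    layer = te.Linear(768, 3072, bias=True)       # dimensions are multiples of 16
    inp = torch.randn(2048, 768, device="cuda")

    # Only code inside the context runs its matrix multiplications in FP8.
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = layer(inp)
    out.sum().backward()
    return tuple(out.shape)

TE_AVAILABLE = importlib.util.find_spec("transformer_engine") is not None

if TE_AVAILABLE:
    import torch
    # FP8 Tensor Cores need compute capability 8.9 (Ada) or 9.0 (Hopper) and above.
    if torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9):
        print(fp8_linear_forward())
    else:
        print("No FP8-capable GPU found; skipping FP8 demo")
else:
    print("transformer_engine not installed; skipping FP8 demo")
```

The batch size lever is independent of this: once the layer runs in FP8, you can raise the first dimension of the input until memory or latency becomes the limit.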

One thing that trips people up is understanding that FP8 isn't a single, static format like FP16. On H100, the Transformer Engine uses two distinct FP8 formats: E4M3 and E5M2. E4M3 has 4 exponent bits and 3 mantissa bits, which gives more precision but a smaller range (largest finite value 448). E5M2 has 5 exponent bits and 2 mantissa bits, which gives a wider range (up to 57344) at lower precision. In the commonly used hybrid setup, E4M3 holds weights and activations in the forward pass, where precision matters more, and E5M2 holds gradients in the backward pass, where dynamic range matters more. Per-tensor scaling factors, derived from recently observed maximum absolute values, shift each tensor into the representable range before the cast. This combination is what allows significant speedups, usually without a noticeable degradation in model quality.
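The trade-off between the two layouts can be checked by deriving their limits from the bit widths. This is a minimal sketch with a made-up helper name (`fp8_limits`), assuming the E4M3 variant that has no infinities and reserves only a single mantissa pattern for NaN, and an E5M2 that follows IEEE conventions (top exponent reserved for infinity and NaN).

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_like: bool):
    """Return (largest finite, smallest normal, smallest subnormal) for an FP8 layout."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp_field = 2 ** exp_bits - 1
    max_man_field = 2 ** man_bits - 1
    if ieee_like:
        # The all-ones exponent is reserved for inf/NaN.
        top_exp, top_man = max_exp_field - 1, max_man_field
    else:
        # Only exponent all-ones with mantissa all-ones is NaN, so the top
        # exponent is still usable with every other mantissa value.
        top_exp, top_man = max_exp_field, max_man_field - 1
    largest = (1 + top_man / 2 ** man_bits) * 2.0 ** (top_exp - bias)
    smallest_normal = 2.0 ** (1 - bias)
    smallest_subnormal = 2.0 ** (1 - bias - man_bits)
    return largest, smallest_normal, smallest_subnormal

e4m3 = fp8_limits(exp_bits=4, man_bits=3, ieee_like=False)
e5m2 = fp8_limits(exp_bits=5, man_bits=2, ieee_like=True)

print("E4M3 max / min normal / min subnormal:", e4m3)
print("E5M2 max / min normal / min subnormal:", e5m2)
```

E5M2 covers a far wider span of magnitudes, while E4M3 packs twice as many values into each power-of-two interval, which matches how the two formats are split between gradients and forward-pass tensors.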

The next frontier you’ll likely explore is how to further optimize inference speed using techniques like speculative decoding or advanced quantization schemes beyond the basic FP8 support.
