Flash Attention: Speed Through Memory Tricks

FlashAttention isn’t just a faster version of standard attention; it’s a fundamental re-architecting of the computation to bypass the slow memory operations that bottleneck traditional implementations.

Let’s see it in action, conceptually. Imagine you’re calculating attention for a sequence of 1024 tokens. Standard attention involves a massive matrix multiplication (Q @ K^T) that produces an intermediate N x N matrix (1024 x 1024 in this case). This matrix then needs to be read back to apply the softmax, and then another matrix multiplication with V. The problem is that this intermediate N x N matrix is huge and doesn’t fit into the GPU’s fast on-chip SRAM. It has to be written to and read from the much slower High Bandwidth Memory (HBM). FlashAttention avoids materializing this full N x N matrix.

Here’s a simplified view of how it works internally, focusing on the "tiling" and "recomputation" aspects. Instead of one giant matrix multiply, FlashAttention breaks the Q, K, and V matrices into smaller blocks. It loads a block of Q and a block of K into the fast SRAM. It computes a partial attention score for that block, scales it, and then applies a softmax incrementally. Crucially, it doesn’t store the intermediate result of this partial softmax. Instead, it updates running statistics (like the current maximum value and sum of exponentials) needed for the final softmax. Then, it loads a block of V and computes a partial output. This entire process is repeated for all blocks of K and V, with each step updating the running softmax statistics and accumulating the output.

The key insight is that the softmax operation, $S_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, can be computed incrementally. If you have a new set of scores $z’$, and you know the previous maximum $m$ and sum of exponentials $l$, you can update them to $m’$ and $l’$ without recomputing everything. FlashAttention uses this property. It loads blocks of Q and K, computes $Q_i K_j^T$, and then uses these partial results to update the softmax normalization constants. It does this in a way that the full N x N attention matrix is never written to HBM. The output is computed directly, block by block, using the incrementally updated softmax normalization. This significantly reduces HBM reads and writes, which are the primary bottleneck.

The "recomputation" aspect comes into play during the backward pass. Since the full attention matrix wasn’t stored, it needs to be recomputed on the fly. However, because this recomputation is also tiled and done within SRAM, it’s still much faster than reading the full matrix from HBM. This is why it’s called "FlashAttention" – it’s like a quick flash of computation that avoids the slow memory trip.

The exact levers you control are primarily through the attention implementation itself. When you use libraries like PyTorch with torch.nn.functional.scaled_dot_product_attention (which often defaults to FlashAttention if available and configured), you’re implicitly using it. If you’re building custom kernels or using lower-level libraries, you’d be orchestrating these block operations and incremental softmax updates yourself. The primary parameters that affect performance are the sequence length (N), the head dimension (d_k), and the batch size. Longer sequences and larger head dimensions exacerbate the memory bottleneck that FlashAttention solves.

The one thing most people don’t realize is that the recomputation in the backward pass isn’t a brute-force recalculation of the entire forward pass’s intermediate values. Instead, it cleverly recomputes only the necessary parts of the attention scores and gradients by re-accessing the tiled blocks of Q, K, and V from HBM, but performing the core computations within SRAM, updating the softmax normalization again on the fly for the backward pass. This avoids storing the large intermediate $N \times N$ attention matrix, which would have been the bottleneck if it were simply re-materialized.

The next frontier is understanding how these tiling strategies and memory-aware computations extend to more complex attention variants like multi-query attention or grouped-query attention.