vLLM’s architecture is built to maximize LLM inference throughput, and its core innovation is how it manages GPU memory for the KV cache.
Let’s watch it in action. Imagine you have a simple prompt: "Tell me a story about a brave knight." This prompt gets tokenized into something like [101, 534, 2345, 1996, 1189, 3456, 102]. When vLLM processes this, it doesn’t reserve one contiguous chunk of memory for the entire sequence. Instead, it breaks the sequence’s key-value (KV) cache (the per-token key and value tensors that attention reuses at every decoding step) into fixed-size blocks.
Here’s a simplified view of how a few requests might be handled, assuming 16-token blocks:
Request 1: "Tell me a story about a brave knight." (Prompt length 7 tokens)
- vLLM allocates 1 block for its KV cache (7 of 16 slots filled).
Request 2: "Write a poem about the sea." (Prompt length 6 tokens)
- vLLM allocates 1 block for its KV cache (6 of 16 slots filled).
Request 3: "Summarize this article: [long article text]" (Prompt length 500 tokens)
- vLLM allocates 32 blocks for its KV cache (500 / 16 = 31.25, rounded up).
Now, what happens when Request 1 generates its first output token? Say it becomes "Once". The KV cache for "Once" needs to be stored. Instead of reallocating a whole new sequence of blocks, vLLM writes the new KV state into the next free slot of the sequence’s last block, and allocates one more block only when that block is full. This is where PagedAttention shines.
The real magic of PagedAttention is that it decouples the logical sequence length from the physical memory allocation. It borrows the idea of virtual memory from operating systems to manage the KV cache. Each sequence sees its KV cache as a run of consecutive logical blocks, and a per-sequence block table maps each logical block to a physical block on the GPU. These blocks are a fixed size: 16 tokens by default, configurable through the `block_size` setting.
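This means that a sequence needing 130 tokens of KV cache does not get 130 tokens’ worth of contiguous memory. With 16-token blocks it occupies 9 blocks (130 / 16 = 8.125, rounded up): 8 full blocks plus a ninth holding only 2 tokens, and those 9 blocks can sit anywhere in GPU memory. The empty slots in that last block stay reserved for the sequence’s own upcoming tokens, so the waste is always less than one block per sequence, while every block that has not been handed out remains available to other sequences. This is the "paging" aspect: physical blocks are allocated on demand as a sequence grows and returned to the free pool when it finishes.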
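The logical-to-physical mapping described above can be sketched as a small per-sequence block table. This is a toy model under stated assumptions, not vLLM’s implementation: the class name `BlockTable`, its methods, the 16-token block size, and the arbitrary free-block ids are all hypothetical and chosen only for illustration.

```python
class BlockTable:
    """Toy per-sequence block table: logical block index -> physical block id.

    Hypothetical illustration only; not vLLM's actual data structure.
    """

    def __init__(self, free_blocks: list[int], block_size: int = 16):
        self.free_blocks = free_blocks          # shared pool of free physical block ids
        self.block_size = block_size            # tokens per block (assumed value)
        self.physical_blocks: list[int] = []    # entry i backs logical block i
        self.num_tokens = 0

    def append_token(self) -> tuple[int, int]:
        """Reserve a KV slot for one more token; return (physical block, offset)."""
        if self.num_tokens % self.block_size == 0:
            # The last block is full (or none exists yet): take any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            self.physical_blocks.append(self.free_blocks.pop())
        self.num_tokens += 1
        return self.slot(self.num_tokens - 1)

    def slot(self, token_index: int) -> tuple[int, int]:
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= token_index < self.num_tokens:
            raise IndexError(token_index)
        logical_block, offset = divmod(token_index, self.block_size)
        return self.physical_blocks[logical_block], offset


# Free physical blocks in arbitrary order: they need not be contiguous.
free_pool = [7, 3, 12, 0, 9, 5, 1, 8, 4, 11]
table = BlockTable(free_pool)
for _ in range(130):
    table.append_token()

print("blocks used:", len(table.physical_blocks))
print("token 129 lives at (block, offset):", table.slot(129))
```

The sequence only ever asks for one block at a time, and the lookup in `slot` is the whole translation step: a division to find the logical block, then one table read to find where that block physically lives.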
This memory management strategy is what makes continuous batching practical. Traditional (static) batching waits for all sequences in a batch to finish before starting the next batch. This is inefficient because shorter sequences finish much sooner than longer ones, leaving their share of the GPU idle until the longest one completes. Continuous batching, powered by PagedAttention’s flexible memory, lets vLLM rebuild the batch at every decoding step instead.
When a sequence finishes, its memory blocks are immediately reclaimed and can be used by new incoming requests. When a new request arrives, vLLM checks its available memory blocks and assigns them to the new sequence. This allows sequences to be added and removed from the "active batch" at any time, maximizing GPU utilization.
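To make the difference concrete, here is a minimal simulation that counts decoding steps under both policies. It is a sketch under simplifying assumptions: each request is reduced to the number of tokens it will generate, every step produces one token per active sequence, and a fixed number of batch slots stands in for available KV-cache blocks. The function names are hypothetical and this is not vLLM’s scheduler.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Finished sequences free their slot immediately; waiting ones join at once."""
    waiting = deque(output_lengths)
    running: list[int] = []          # tokens still to generate per active sequence
    steps = 0
    while waiting or running:
        # Admit new requests into any free slots before this step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding step: every active sequence emits one token.
        running = [remaining - 1 for remaining in running]
        # Drop finished sequences so their slots can be reused next step.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 8]               # tokens each request will generate (made up)
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With these made-up lengths the continuous policy needs fewer steps, because the slot freed by the 2-token request is handed to the next waiting request instead of sitting idle.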
Consider this: a sequence that is 2000 tokens long requires 125 blocks of 16 tokens (2000 / 16 = 125). If a system had to reserve that much contiguous memory upfront for every sequence in a traditional batch, sized for the longest output it might produce, it would lead to massive fragmentation and underutilization. PagedAttention instead lets these 125 blocks be scattered across the GPU’s memory and allocated one at a time as the sequence grows, so blocks a sequence has not reached yet remain available to other requests.
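A rough back-of-the-envelope comparison shows the scale of the savings. The sketch below assumes a baseline that reserves one contiguous region of 2048 token slots per sequence and a paged scheme with 16-token blocks; both figures, the example lengths, and the function names are assumptions chosen for illustration.

```python
def reserved_slots_contiguous(lengths: list[int], max_len: int) -> int:
    """Baseline: every sequence reserves max_len token slots upfront."""
    return len(lengths) * max_len


def reserved_slots_paged(lengths: list[int], block_size: int = 16) -> int:
    """Paged: each sequence holds just enough whole blocks for its tokens."""
    return sum(-(-n // block_size) * block_size for n in lengths)  # ceil division


lengths = [130, 500, 2000]           # actual KV-cache lengths in tokens (made up)
used = sum(lengths)

contiguous = reserved_slots_contiguous(lengths, max_len=2048)
paged = reserved_slots_paged(lengths)

print(f"contiguous: {contiguous} slots reserved, {contiguous - used} wasted")
print(f"paged:      {paged} slots reserved, {paged - used} wasted")
```

In the paged case the only waste is the unfilled tail of each sequence’s last block, which is why it stays below one block per sequence regardless of how long the sequences are.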
The "more" in the topic title refers to other optimizations that build on this foundation. For instance, the iteration-level parallelism and efficient attention kernels (like FlashAttention) are integrated to further speed up the computation within these memory-managed sequences.
The most surprising thing about vLLM’s PagedAttention is that it treats the KV cache not as one contiguous array per sequence, but as a collection of independent memory pages, similar to how an OS manages RAM. This shift lets different sequences share and reuse physical blocks (for example, a common prompt prefix can be stored once and referenced by several sequences, with a block copied only when one of them needs to write to it), which removes most of the wasted space that plagues traditional LLM inference. When a sequence requests memory, it’s not asking for "N tokens worth of KV cache"; it’s asking for the next available block, and then the one after that, while a per-sequence block table tracks which logical token lives in which physical block.
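Block sharing between sequences is usually handled with reference counts and copy-on-write, in the same spirit as OS pages. The following is a minimal sketch of that idea, not vLLM’s allocator: `BlockPool`, `fork`, and `writable_last_block` are hypothetical names, and the copy of the actual KV data is left out.

```python
class BlockPool:
    """Toy allocator that reference-counts physical KV-cache blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV-cache blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        """Let one more sequence reference an existing block."""
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        """Drop one reference; the block is freed when nobody uses it."""
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)


def fork(pool: BlockPool, block_table: list[int]) -> list[int]:
    """Create a sequence that reuses all of the parent's blocks (shared prefix)."""
    return [pool.share(block) for block in block_table]


def writable_last_block(pool: BlockPool, block_table: list[int]) -> int:
    """Copy-on-write: make the last block private before appending to it."""
    last = block_table[-1]
    if pool.ref_count[last] > 1:
        private = pool.allocate()    # a real system would also copy the KV data
        pool.release(last)
        block_table[-1] = private
    return block_table[-1]


pool = BlockPool(num_blocks=8)
parent = [pool.allocate(), pool.allocate()]   # two blocks holding a prompt
child = fork(pool, parent)                    # second sequence, same prompt
writable_last_block(pool, child)              # child is about to append a token

print("parent blocks:", parent)
print("child blocks: ", child)
print("ref counts:   ", pool.ref_count)
```

After the fork, the first block is stored once and referenced twice; only the block the child is about to write into gets a private copy, so the shared prefix costs no extra memory.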
The next step in understanding vLLM’s performance is to explore how it handles the sampling and decoding strategies that run after the model has produced the logits for the next token.