LLM prompt caching can noticeably reduce the latency of your API calls by reusing computation for identical prompt prefixes.
Let’s see this in action. Imagine you’re building a chatbot that remembers context. Each turn, you send the entire conversation history as the prompt.
{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "And what's its population?"}
],
"max_tokens": 50
}
The LLM processes this entire sequence. If the next user message is "How about its main river?", the prompt becomes:
{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "And what's its population?"},
{"role": "assistant", "content": "The population of Paris is approximately 2.1 million people."},
{"role": "user", "content": "How about its main river?"}
],
"max_tokens": 50
}
Notice that the first four messages (system, user 1, assistant 1, user 2) are identical. This is the "prompt prefix." Without caching, the LLM recomputes the attention keys and values, along with every other per-token activation, for all these tokens on every single request. This is wasteful.
Prompt caching, specifically for prefixes, intercepts these repeated sequences. When the LLM processes the second prompt, the caching layer recognizes the identical prefix. Instead of running the LLM from scratch, it loads the pre-computed internal states (like key/value caches for attention layers) from the previous computation of that prefix. The LLM then only needs to process the tokens that follow the prefix (here, the previous assistant reply and the new user message, "How about its main river?"), extending the cached state as it goes. This substantially reduces latency because the bulk of the prompt computation is skipped.
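As a rough illustration of the split between reusable and new content, the sketch below compares two message lists and separates the shared leading messages from the rest. The helper name `split_at_shared_prefix` is made up for this example, and real caching layers match at the token level rather than the message level, so treat this as a simplification.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def split_at_shared_prefix(
    previous: List[Message], current: List[Message]
) -> Tuple[List[Message], List[Message]]:
    """Split `current` into (messages shared with `previous`, remaining messages)."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return current[:shared], current[shared:]


turn_1 = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "And what's its population?"},
]
turn_2 = turn_1 + [
    {"role": "assistant", "content": "The population of Paris is approximately 2.1 million people."},
    {"role": "user", "content": "How about its main river?"},
]

cached, fresh = split_at_shared_prefix(turn_1, turn_2)
print(len(cached), len(fresh))  # 4 2
```

Only the two messages in `fresh` would need new computation under this simplified view.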
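The core cost this avoids is the prefill pass over the prompt, where the self-attention mechanism in transformers scales quadratically with sequence length. For long contexts, re-processing earlier tokens on every turn becomes computationally prohibitive. With a cached prefix, each new token still attends to every cached position, but the keys and values for those positions are read from the cache rather than recomputed. Caching exploits the commonality in sequential interactions, where much of the conversation history is repeated across turns.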
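A back-of-the-envelope count makes the saving concrete. The sketch below counts query-key pairs under a causal mask for a single attention head in a single layer; the prefix and suffix lengths are arbitrary numbers chosen for illustration, and real cost also includes projections and feed-forward layers that this ignores.

```python
def causal_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs when each token attends to itself and all earlier tokens."""
    return num_tokens * (num_tokens + 1) // 2


def pairs_with_cached_prefix(prefix_tokens: int, new_tokens: int) -> int:
    """Pairs still needed when the prefix's keys and values come from a cache."""
    total = prefix_tokens + new_tokens
    return causal_attention_pairs(total) - causal_attention_pairs(prefix_tokens)


full = causal_attention_pairs(1000 + 20)
cached = pairs_with_cached_prefix(1000, 20)
print(full, cached)  # 520710 20210
```

With a 1000-token prefix and 20 new tokens, the cached path computes roughly 4% of the attention pairs of the uncached path.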
Here’s how it works conceptually:
- Cache Key Generation: A unique hash or identifier is generated for each distinct prompt prefix encountered.
- Cache Lookup: Before sending a prompt to the LLM, the system checks if a cached state exists for the current prefix.
- Cache Hit: If a hit occurs, the cached internal states (specifically the Key-Value cache, or KV cache) are loaded. The LLM then processes only the new tokens appended to this cached state.
- Cache Miss: If there is no hit, the LLM processes the entire prompt, and its resulting internal states are stored in the cache for future use.
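The four steps above can be sketched with a toy in-memory cache. Everything here is hypothetical: `PrefixCache` and `process` are names invented for the example, the token IDs are arbitrary, the stored "state" is a placeholder string rather than real KV tensors, and the token tuple itself serves as the identifier. The linear scan over prefix lengths is for clarity only.

```python
from typing import Dict, List, Tuple

TokenIds = Tuple[int, ...]


class PrefixCache:
    """Maps a token-ID prefix to a placeholder for its KV state."""

    def __init__(self) -> None:
        self._states: Dict[TokenIds, str] = {}

    def longest_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest cached prefix of `tokens` (0 on a miss)."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._states:
                return end
        return 0

    def store(self, tokens: List[int]) -> None:
        """Record a stand-in state for the full token sequence."""
        self._states[tuple(tokens)] = f"kv-state-for-{len(tokens)}-tokens"


def process(tokens: List[int], cache: PrefixCache) -> int:
    """Pretend to run the model; return how many tokens needed fresh computation."""
    reused = cache.longest_prefix(tokens)  # cache lookup
    cache.store(tokens)                    # save state for future requests
    return len(tokens) - reused


cache = PrefixCache()
first = process([11, 12, 13, 14], cache)           # miss: all 4 tokens computed
second = process([11, 12, 13, 14, 15, 16], cache)  # hit on 4 tokens: 2 computed
third = process([99, 12, 13], cache)               # first token differs: miss
print(first, second, third)  # 4 2 3
```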
The "prefix" is crucial. Caching the entire prompt is usually ineffective because conversations are dynamic. Caching only the initial, unchanging part of the prompt is where the leverage lies. For example, in Retrieval Augmented Generation (RAG), the system instructions usually stay constant across several follow-up questions even when the retrieved documents change, and only the tokens before the first changed token can be reused.
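One practical consequence is that prompt layout matters: stable content should come first. The sketch below compares two hypothetical layouts for the same inputs and measures how many leading characters two prompts share. The function names and the character-level comparison are assumptions made for illustration; an actual cache would compare token IDs.

```python
from os.path import commonprefix

SYSTEM = "You are a helpful assistant that answers questions based on the provided context."
CONTEXT = "[Document 1 content]"


def stable_first(question: str) -> str:
    """System instructions and context lead; the changing question comes last."""
    return f"{SYSTEM}\n\nContext:\n{CONTEXT}\n\nQuestion: {question}"


def dynamic_first(question: str) -> str:
    """The changing question leads, so almost nothing after it can be shared."""
    return f"Question: {question}\n\nContext:\n{CONTEXT}\n\n{SYSTEM}"


q1 = "What is the main finding of the study?"
q2 = "Can you elaborate on the methodology?"

shared_stable = len(commonprefix([stable_first(q1), stable_first(q2)]))
shared_dynamic = len(commonprefix([dynamic_first(q1), dynamic_first(q2)]))
print(shared_stable, shared_dynamic)
```

In the stable-first layout the shared prefix covers the system instructions and the context; in the dynamic-first layout it stops right after `Question: `.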
Imagine a RAG system. Your prompt might look like this:
{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
{"role": "user", "content": "Context:\n[Document 1 content]\n\nQuestion: What is the main finding of the study?"}
],
"max_tokens": 100
}
If the user then asks a follow-up such as "Elaborate on the methodology?", the new prompt is:
{
"model": "gpt-4o",
"messages": [
{"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
{"role": "user", "content": "Context:\n[Document 1 content]\n\nQuestion: What is the main finding of the study?\nAssistant: The main finding is X.\n\nElaborate on the methodology?"}
],
"max_tokens": 100
}
Here, the system prompt and the context are identical as long as the retrieved documents haven’t changed. Flattened into a single string, the prompt prefix that can be cached is:
"You are a helpful assistant that answers questions based on the provided context.\n\nContext:\n[Document 1 content]\n\nQuestion: What is the main finding of the study?\nAssistant: The main finding is X.\n\n"
The LLM’s KV cache holds the computed representations for all tokens up to "X.". When the new user input "Elaborate on the methodology?" arrives, the system appends these new tokens to the existing KV cache. The LLM then only computes the self-attention and feed-forward layers for the new tokens, leveraging the pre-computed states for the entire preceding sequence.
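In self-hosted inference libraries, this state is often exposed directly. Hugging Face Transformers, for example, returns it as past_key_values and accepts it as an argument on the next forward pass, so the model can compute the next token without re-processing the prior ones. Hosted chat APIs generally do not expose the KV cache; where a provider offers prompt caching, the prefix matching happens on the server side.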
# Conceptual example only: my_llm_library is a hypothetical placeholder, not a real package
from my_llm_library import LLMClient

client = LLMClient(model="gpt-4o")

# First call: no cache exists yet, so the whole prompt is processed
prompt1 = "What is the capital of France?"
response1, cache1 = client.generate(prompt1)
print(f"Response 1: {response1}")

# Second call: send only the new text plus the cache from the first call
# cache1 stands in for the KV cache covering prompt1 and response1
prompt2 = "What is its population?"
response2, cache2 = client.generate(prompt2, past_key_values=cache1)
print(f"Response 2: {response2}")
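To see why reusing keys and values does not change the result, here is a minimal, dependency-free sketch of a single causal attention head. `ToyAttention` is a made-up class with random weights, not any library's API, and it models one layer only. It checks that processing a sequence in one pass gives the same outputs as processing a prefix, keeping its keys and values, and then processing the remaining tokens.

```python
import math
import random
from typing import List, Optional, Tuple

Vector = List[float]
KVCache = Tuple[List[Vector], List[Vector]]


def matvec(weights: List[Vector], x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax-weighted average of `values`, scored by query-key dot products."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class ToyAttention:
    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        make = lambda: [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
        self.wq, self.wk, self.wv = make(), make(), make()

    def forward(self, inputs: List[Vector], past: Optional[KVCache] = None):
        # Start from cached keys/values if given; copy so the cache is not mutated.
        keys, values = ([], []) if past is None else (list(past[0]), list(past[1]))
        outputs = []
        for x in inputs:
            keys.append(matvec(self.wk, x))
            values.append(matvec(self.wv, x))
            # Causal: each token sees only itself and earlier tokens.
            outputs.append(attend(matvec(self.wq, x), keys, values))
        return outputs, (keys, values)


rng = random.Random(1)
sequence = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
layer = ToyAttention(dim=4)

full_out, _ = layer.forward(sequence)               # no cache
_, prefix_cache = layer.forward(sequence[:4])       # process the prefix once
tail_out, _ = layer.forward(sequence[4:], prefix_cache)  # reuse it for new tokens

matches = all(
    math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
    for row_a, row_b in zip(tail_out, full_out[4:])
    for a, b in zip(row_a, row_b)
)
print(matches)  # True
```

The cached path runs the projections only for the last two tokens, yet produces the same outputs for them as the full pass.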
The crucial insight for developers is that the KV cache is tied to the specific sequence of token IDs processed. If the tokenization changes, or if even a single token within the prefix changes, the cache becomes invalid. This means you need robust hashing or comparison mechanisms for your prompt prefixes. For example, using the SHA-256 hash of the tokenized prompt prefix is a common strategy for cache keys.
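A minimal sketch of such a cache key follows, using Python's standard `hashlib`. The function name `prefix_cache_key`, the choice to mix a model identifier into the hash, and the token IDs are all assumptions made for illustration; the IDs are not the output of any real tokenizer.

```python
import hashlib
import struct
from typing import Sequence


def prefix_cache_key(token_ids: Sequence[int], model_id: str) -> str:
    """SHA-256 over the model identifier and the exact token-ID sequence."""
    digest = hashlib.sha256()
    # Assumption: include the model name, since KV states from one model
    # are meaningless to another.
    digest.update(model_id.encode("utf-8"))
    digest.update(b"\x00")
    for token_id in token_ids:
        # Fixed-width encoding so [1, 23] and [12, 3] cannot collide.
        digest.update(struct.pack("<I", token_id))
    return digest.hexdigest()


original = [101, 2023, 2003, 1037, 17576]
same = list(original)
one_changed = [101, 2023, 2003, 1037, 17577]

key_a = prefix_cache_key(original, "example-model")
key_b = prefix_cache_key(same, "example-model")
key_c = prefix_cache_key(one_changed, "example-model")
print(key_a == key_b, key_a == key_c)  # True False
```

Hashing token IDs rather than raw text means two strings that tokenize differently never share a key, even if they look similar.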
One common pitfall is assuming you can cache arbitrarily long prefixes. The KV cache itself consumes significant memory. For very long prefixes, the memory overhead of storing and retrieving the KV cache can outweigh the computational savings, especially if cache hits are infrequent. Therefore, it’s often beneficial to cache only the most stable and frequently repeated parts of your prompt, such as system instructions and initial context, and let the dynamic parts of the conversation miss the cache more often.
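To get a feel for the memory involved, the sketch below estimates KV cache size from model dimensions: each token stores one key and one value vector per layer per KV head. The dimensions used (32 layers, 32 KV heads, head size 128, 2 bytes per value) are illustrative assumptions, not the configuration of any model named in this article.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # e.g. 16-bit floats
) -> int:
    """Approximate bytes needed to hold keys and values for `num_tokens` tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


size = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 4.0 GiB
```

Under these assumptions a single 8192-token prefix occupies about 4 GiB, which is why caching many long prefixes at once becomes expensive quickly.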
The next challenge is managing cache invalidation when the underlying context (like retrieved documents) does change.