Caching context for Gemini API calls can drastically cut your operational expenses by avoiding redundant processing of identical input.

Let’s see this in action. Imagine a chatbot that needs to remember the last few turns of a conversation to maintain coherence. Without caching, each new turn would involve sending the entire conversation history to Gemini, incurring full token costs.

Here’s a Python snippet demonstrating a simplified caching mechanism:

import hashlib
import json

class GeminiCache:
    def __init__(self):
        self.cache = {}

    @staticmethod
    def _make_key(prompt, model_name):
        # Derive a stable cache key from the prompt and model name
        payload = json.dumps({"prompt": prompt, "model": model_name}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt, model_name="gemini-pro"):
        # Return the cached response, or None on a cache miss
        return self.cache.get(self._make_key(prompt, model_name))

    def set(self, prompt, response, model_name="gemini-pro"):
        # Store the response under the key for this prompt and model
        self.cache[self._make_key(prompt, model_name)] = response

# --- Example Usage ---
cache = GeminiCache()

# First call - will hit the Gemini API
prompt1 = "What is the capital of France?"
cached_response1 = cache.get(prompt1)
if cached_response1 is None:
    # Simulate API call
    print("Calling Gemini API for prompt 1...")
    api_response1 = "The capital of France is Paris."
    cache.set(prompt1, api_response1)
    cached_response1 = api_response1
print(f"Response 1: {cached_response1}\n")

# Second call with the same prompt - should hit the cache
prompt2 = "What is the capital of France?"
cached_response2 = cache.get(prompt2)
if cached_response2 is None:
    # This block should not be reached if cache is working
    print("Calling Gemini API for prompt 2...")
    api_response2 = "The capital of France is Paris."
    cache.set(prompt2, api_response2)
    cached_response2 = api_response2
else:
    print("Cache hit for prompt 2!")
print(f"Response 2: {cached_response2}\n")

# Third call with a different prompt - will hit the Gemini API
prompt3 = "What is the largest planet in our solar system?"
cached_response3 = cache.get(prompt3)
if cached_response3 is None:
    print("Calling Gemini API for prompt 3...")
    api_response3 = "The largest planet in our solar system is Jupiter."
    cache.set(prompt3, api_response3)
    cached_response3 = api_response3
print(f"Response 3: {cached_response3}\n")

In this example, prompt1 and prompt2 are identical. On the first call, cache.get() returns None, so we simulate an API call and store the result with cache.set(). When prompt2 arrives, cache.get() finds a match and returns the stored response immediately, bypassing the API call. prompt3 is different, so it triggers a new API call and a new cache entry.
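The check-then-store pattern repeated for each prompt above can be folded into a single helper. The following is a minimal sketch, not part of any Gemini SDK: `get_or_call`, `make_key`, and the stand-in `fake_gemini_call` are hypothetical names introduced here for illustration, and a plain dict plays the role of the cache.

```python
import hashlib
import json

def make_key(prompt, model_name):
    # Same keying idea as the example: hash the prompt together with the model name.
    payload = json.dumps({"prompt": prompt, "model": model_name}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def get_or_call(store, prompt, call_model, model_name="gemini-pro"):
    """Return (response, was_hit); call_model runs only on a cache miss."""
    key = make_key(prompt, model_name)
    if key in store:
        return store[key], True
    response = call_model(prompt)
    store[key] = response
    return response, False

# Hypothetical stand-in for a real API call; it records how often it is invoked.
api_calls = []

def fake_gemini_call(prompt):
    api_calls.append(prompt)
    return f"(simulated answer to: {prompt})"

store = {}
first, hit_first = get_or_call(store, "What is the capital of France?", fake_gemini_call)
second, hit_second = get_or_call(store, "What is the capital of France?", fake_gemini_call)
print(hit_first, hit_second, len(api_calls))
```

Passing the model call in as a function keeps the helper independent of any particular client library, so the simulated call can later be swapped for a real one without touching the caching logic.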

The core problem context caching solves is the cost associated with repeatedly processing the same contextual information. Large language models, including Gemini, charge based on the number of input and output tokens. When your application requires the model to "remember" previous interactions or specific pieces of information across multiple requests, you often resend that same context. This leads to redundant token consumption. By storing and reusing responses for identical contexts, you directly reduce the number of tokens sent to the API, thereby lowering costs.
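To make the savings concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the function name `estimate_cost` is hypothetical, and the request counts, hit rate, token counts, and per-million-token prices are made-up placeholders, not real Gemini rates.

```python
def estimate_cost(requests, hit_rate, input_tokens, output_tokens,
                  price_in_per_million, price_out_per_million):
    """Estimate spend with and without a response cache.

    All prices are placeholders supplied by the caller, not real Gemini rates.
    """
    per_call = (input_tokens * price_in_per_million
                + output_tokens * price_out_per_million) / 1_000_000
    without_cache = requests * per_call
    # Only cache misses reach the API; hits consume no tokens.
    with_cache = requests * (1 - hit_rate) * per_call
    return without_cache, with_cache

# Illustrative numbers only: 10,000 requests, 40% of them repeats, made-up prices.
baseline, cached = estimate_cost(
    requests=10_000, hit_rate=0.4,
    input_tokens=2_000, output_tokens=500,
    price_in_per_million=1.0, price_out_per_million=2.0,
)
print(f"without cache: ${baseline:.2f}, with cache: ${cached:.2f}")
```

With these placeholder inputs the estimate falls from $30.00 to $18.00, which simply mirrors the 40% hit rate: the saving scales with how often requests actually repeat.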

Internally, the caching strategy relies on identifying when a new API request is functionally identical to a previous one. The most straightforward way to do this is by hashing the input prompt. If the hash of the current prompt matches the hash of a previously seen prompt, and the target model is the same, we treat the earlier output as valid for the new request. This assumes a reused answer is acceptable; with sampling enabled, a fresh call could have produced different wording. The hashlib.sha256 call in the example serves this purpose, creating a fixed-size fingerprint that is, for practical purposes, unique to the input data. This fingerprint is used as a key in a dictionary (or a more sophisticated cache store like Redis or Memcached for production environments), where the value is the API’s response.
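The properties of the fingerprint are easy to check directly. This short sketch uses a hypothetical `cache_key` function that follows the same hashing approach as the example, and shows that the key length does not depend on prompt size, that identical inputs give identical keys, and that changing the model changes the key.

```python
import hashlib
import json

def cache_key(prompt, model_name):
    # Hash the prompt and model name together into one hex digest.
    payload = json.dumps({"prompt": prompt, "model": model_name}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

short_key = cache_key("Hi", "gemini-pro")
long_key = cache_key("Hi " * 10_000, "gemini-pro")
other_model_key = cache_key("Hi", "gemini-1.5-pro-latest")

print(len(short_key), len(long_key))                # both are 64 hex characters
print(short_key == cache_key("Hi", "gemini-pro"))   # True: same input, same key
print(short_key == other_model_key)                 # False: different model
```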

The primary lever you control is what you cache and how you define cacheability. For conversational agents, this often means caching full turns or specific summaries of past turns. For generative tasks where you might ask for variations of a theme, caching might be less effective unless the core prompt remains identical. You can also implement cache invalidation strategies, though for simple cost reduction on static prompts, this is often unnecessary. The model_name parameter in the cache functions is crucial for scenarios where you might use different Gemini models for different tasks; you wouldn’t want to serve a response from gemini-1.5-pro-latest when the request was for gemini-pro.
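If cached answers can go stale, one common invalidation strategy is a time-to-live (TTL) on each entry. The sketch below is one possible approach, not a prescribed one: `TTLCache` is a hypothetical class, the 60-second TTL is arbitrary, and the clock is injectable purely so the example runs without sleeping. The key here is a `(model, prompt)` tuple so that responses from different models never mix.

```python
import time

class TTLCache:
    """Dict-backed cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, response)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            # Stale entry: drop it and report a miss.
            del self._entries[key]
            return None
        return response

    def set(self, key, response):
        self._entries[key] = (self.clock() + self.ttl_seconds, response)

# Fake clock so the example runs instantly.
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
key = ("gemini-pro", "What is the capital of France?")
cache.set(key, "The capital of France is Paris.")

fresh = cache.get(key)     # still within the TTL
now[0] = 61.0              # advance the fake clock past expiry
expired = cache.get(key)   # entry has been evicted
print(fresh, expired)
```

Production stores such as Redis support per-key expiry natively, so a hand-rolled class like this is mainly useful for in-process caches or for tests.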

When implementing caching, especially for complex prompts involving structured data or user-specific information, simply hashing the raw string prompt might not be sufficient if the order of elements within the prompt can change without affecting the intended meaning. For instance, if your prompt is constructed by concatenating several pieces of information, and those pieces can arrive in a different order, a simple string hash would treat them as distinct. A more robust approach in such cases involves serializing the prompt components into a canonical, sorted format before hashing. This ensures that semantically equivalent prompts, regardless of the order of their constituent parts, generate the same cache key.
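A minimal sketch of such canonical keying follows, assuming the prompt is assembled from a dictionary of named components. The function name `canonical_key` and the sample fields (`user_id`, `question`, `tags`) are hypothetical. Note that `sort_keys=True` only orders dictionary keys; lists keep their order, so any list whose order carries no meaning should be sorted by the caller first.

```python
import hashlib
import json

def canonical_key(components, model_name):
    """Hash prompt components so that dict key order does not affect the result."""
    payload = json.dumps(
        {"model": model_name, "components": components},
        sort_keys=True,           # sorts dict keys at every nesting level
        separators=(",", ":"),    # removes whitespace variation
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The same information, assembled in a different order.
a = {"user_id": 42, "question": "Refund policy?", "tags": sorted(["billing", "urgent"])}
b = {"tags": sorted(["urgent", "billing"]), "question": "Refund policy?", "user_id": 42}

print(canonical_key(a, "gemini-pro") == canonical_key(b, "gemini-pro"))  # True
```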

The next step in optimizing API usage involves exploring techniques for prompt engineering that allow for more frequent cache hits by making prompts more consistent and predictable.
