The surprising thing about LlamaIndex costs is that the default settings tend to produce more paid API calls than you expect, not fewer.

Let’s see it in action. Imagine you have a few documents and you want to index them for RAG.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.indices.loading import load_index_from_storage

import os

# Assume you have a directory named 'data' with some text files
# For example, data/doc1.txt, data/doc2.txt

# Create a dummy data directory and files for demonstration
if not os.path.exists("data"):
    os.makedirs("data")
with open("data/doc1.txt", "w") as f:
    f.write("This is the first document about apples. Apples are red and grow on trees.")
with open("data/doc2.txt", "w") as f:
    f.write("This is the second document about bananas. Bananas are yellow and grow in bunches.")

# --- Indexing ---
print("--- Indexing Documents ---")
documents = SimpleDirectoryReader("data").load_data()
# SentenceSplitter packs whole sentences into chunks of up to chunk_size tokens.
# A smaller chunk_size is good for retrieval precision but produces more nodes,
# and every node has to be embedded.
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=20).get_nodes_from_documents(documents)

# A default LlamaIndex setup uses an OpenAI embedding model, so every node's text
# is sent to the embeddings API. Nodes are sent in batches, so the number of
# requests is usually lower than the number of nodes, but billing is per token.
# The two short files above should each fit in a single chunk, giving 2 nodes.
# Embedding the nodes by hand would look like this:
# from llama_index.embeddings.openai import OpenAIEmbedding
# embed_model = OpenAIEmbedding()
# embeddings = embed_model.get_text_embedding_batch([n.get_content() for n in nodes])  # API calls happen here

# For now, just note the number of nodes
print(f"Created {len(nodes)} nodes.")
# With OpenAI embeddings, the indexing cost is the total token count across all nodes
# multiplied by the per-token embedding price.

# Persisting the index writes the nodes and their embeddings to disk.
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    os.makedirs(PERSIST_DIR)
storage_context = StorageContext.from_defaults()
# Building the index is the step that actually embeds the nodes. It uses the globally
# configured embedding model (OpenAI unless you set Settings.embed_model or pass embed_model=).
index = VectorStoreIndex(nodes, storage_context=storage_context)
index.storage_context.persist(persist_dir=PERSIST_DIR)
print(f"Index persisted to {PERSIST_DIR}")

# --- Querying ---
print("\n--- Querying Index ---")
# Load the index from storage
storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
index = load_index_from_storage(storage_context)

# Create a query engine
# as_query_engine() uses the globally configured LLM for synthesis
# (an OpenAI model unless you configure another), so every query costs at least one LLM call.
query_engine = index.as_query_engine()

# Perform a query
response = query_engine.query("What fruits are mentioned?")
print("Query: What fruits are mentioned?")
print(f"Response: {response}")
# This query involves:
# 1. Embedding the query string ("What fruits are mentioned?").
# 2. Retrieving the top-k most similar nodes (using vector similarity on embeddings).
# 3. Passing the retrieved nodes and the original query to an LLM for synthesis.

# --- The Cost Trap ---
# If you re-run this script as written, or load the index and query again:
# - Indexing: the script rebuilds the index and re-embeds every node, even though
#   ./storage already holds the embeddings. Load the persisted index instead of rebuilding it.
# - Querying: each query embeds the query string and calls the LLM.

print("\n--- Re-querying Index ---")
response_2 = query_engine.query("Tell me about bananas.")
print("Query: Tell me about bananas.")
print(f"Response: {response_2}")
# This is another LLM call and another embedding call for the query.

The mental model LlamaIndex helps you build is one where you ingest data, embed it, and then query it. The core components are:

  1. Data Loading: SimpleDirectoryReader, UnstructuredReader, etc. These bring your raw data into LlamaIndex.
  2. Node Parsing: SentenceSplitter, TokenTextSplitter. This breaks down your documents into smaller, manageable chunks (nodes) that can be embedded. The chunk_size and chunk_overlap are critical here.
  3. Embedding: OpenAIEmbedding, HuggingFaceEmbedding, etc. This converts each node’s text into a vector representation. This is often the most expensive part of indexing, as it involves API calls or significant local compute.
  4. Vector Store: Where the embeddings are stored and indexed for efficient similarity search. LlamaIndex has in-memory stores, and integrations with services like Pinecone, Weaviate, or local options like Chroma.
  5. LLM: The large language model used for generating responses based on retrieved context. This is typically the most expensive part of querying.
  6. Retrieval: The process of finding the most relevant nodes from the vector store based on the query’s embedding.
  7. Synthesis: The LLM uses the retrieved nodes and the original query to formulate a coherent answer.

The default behavior, especially with many small documents or frequent re-indexing, can lead to a surprising number of embedding and LLM calls.
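
To make these cost drivers concrete, here is a rough back-of-the-envelope sketch in plain Python. The function names, the batch size, and every price are placeholders chosen for illustration; they are not provider rates or LlamaIndex APIs. It separates two things that are easy to conflate: the number of embedding requests (which depends on batching) and the number of billed tokens (which does not).

```python
import math


def estimate_indexing(node_token_counts, batch_size, embed_price_per_1k):
    """Return (requests, total_tokens, cost) for embedding a list of nodes.

    node_token_counts: token count of each node's text
    batch_size: how many node texts are sent per embedding request
    embed_price_per_1k: placeholder price per 1,000 embedded tokens
    """
    total_tokens = sum(node_token_counts)
    requests = math.ceil(len(node_token_counts) / batch_size)
    cost = total_tokens / 1000 * embed_price_per_1k
    return requests, total_tokens, cost


def estimate_query(query_tokens, retrieved_node_tokens, answer_tokens,
                   embed_price_per_1k, llm_in_price_per_1k, llm_out_price_per_1k):
    """Cost of one query: embed the question, then one LLM synthesis call."""
    prompt_tokens = query_tokens + sum(retrieved_node_tokens)
    return (query_tokens / 1000 * embed_price_per_1k
            + prompt_tokens / 1000 * llm_in_price_per_1k
            + answer_tokens / 1000 * llm_out_price_per_1k)


# Hypothetical corpus: 250 nodes of 400 tokens each, batched 100 per request.
requests, tokens, index_cost = estimate_indexing(
    [400] * 250, batch_size=100, embed_price_per_1k=0.0001
)
# Hypothetical query: 10-token question, two retrieved nodes, 150-token answer.
query_cost = estimate_query(
    10, [400, 400], 150,
    embed_price_per_1k=0.0001, llm_in_price_per_1k=0.0005, llm_out_price_per_1k=0.0015,
)
print(f"indexing: {requests} requests, {tokens} tokens, ${index_cost:.4f}")
print(f"one query: ${query_cost:.6f}")
```

With numbers like these, indexing is a one-off cost as long as you do not repeat it, while the per-query cost recurs on every call and is dominated by the LLM prompt, which is why both re-indexing and repeated queries are worth attacking.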

What many people miss is that most of this spend can be avoided through reuse, but little of that reuse happens by default. If your VectorStoreIndex is backed by a persistent vector store (Chroma on local disk, or a hosted service like Pinecone), the embeddings are already stored there. Loading the index from that store does not regenerate them. Rebuilding it from the same documents does: the nodes are embedded again and, depending on the store, may be inserted a second time. Querying is a separate matter. Each query still embeds the query string and calls the LLM unless you cache at that level too.

To optimize, you need to explicitly manage caching and reuse.
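
As an illustration of what explicit reuse means for embeddings, here is a minimal, library-independent sketch. CachingEmbedder and fake_embed are hypothetical names invented for this example; fake_embed stands in for a paid embedding API. The wrapper keys each text by its SHA-256 hash and only forwards texts it has not seen before. LlamaIndex's ingestion pipeline is documented as offering a comparable transformation cache, so check the documentation for your version before rolling your own.

```python
import hashlib


class CachingEmbedder:
    """Wrap an embedding function and skip texts that were embedded before."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: list of str -> list of vectors
        self._cache = {}           # sha256(text) -> vector
        self.api_calls = 0         # how many times embed_fn was actually invoked

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        # Collect each uncached text once, preserving order.
        missing, seen = [], set()
        for text in texts:
            key = self._key(text)
            if key not in self._cache and key not in seen:
                seen.add(key)
                missing.append(text)
        # One batched call for everything that is new.
        if missing:
            self.api_calls += 1
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(text)] for text in texts]


def fake_embed(texts):
    # Stand-in for a paid API: returns a tiny deterministic vector per text.
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]


embedder = CachingEmbedder(fake_embed)
docs = ["Apples are red.", "Bananas are yellow."]
first = embedder.embed(docs)                            # one API call
second = embedder.embed(docs)                           # served from the cache
third = embedder.embed(docs + ["Cherries are small."])  # one call, new text only
print(f"API calls made: {embedder.api_calls}")
```

The cache here lives in memory; to survive across runs it would need to be written to disk or to the vector store itself, which is what the persistent-store approach below gives you.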

Cache and Reduce API Calls

The primary drivers of cost are embedding calls during indexing and LLM calls during querying.

  1. Caching Embeddings (Implicit and Explicit):

    • Diagnosis: When you index, count the embedding calls or embedded tokens, for example with a TokenCountingHandler attached to the callback manager. If you’re re-indexing unchanged documents and the count is not zero, you’re paying for redundant work.
    • Fix: Use a persistent vector store (Chroma, Pinecone, Weaviate, etc.) and ensure you’re loading an existing index rather than creating a new one if the data hasn’t changed. LlamaIndex’s VectorStoreIndex.from_vector_store or load_index_from_storage with a pre-configured StorageContext pointing to your persistent store will reuse existing embeddings.
      from llama_index.core import VectorStoreIndex
      from llama_index.vector_stores.chroma import ChromaVectorStore
      import chromadb
      
      # Assuming you've previously persisted to Chroma
      db = chromadb.PersistentClient(path="./chroma_db")
      chroma_collection = db.get_or_create_collection("my_documents")
      vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
      
      # Wrap the existing collection in an index. No documents are passed in,
      # so nothing is re-embedded; the stored vectors are used as-is.
      # Use the same embedding model as at indexing time, since queries are embedded with it.
      index = VectorStoreIndex.from_vector_store(vector_store)
      
    • Why it works: The vector store holds the computed embeddings. When you load the index from this store, LlamaIndex uses the pre-computed vectors instead of calling the embedding API again.
  2. Caching LLM Responses (Query Caching):

    • Diagnosis: Every identical query to your QueryEngine results in a new LLM call and a new embedding call for the query.
    • Fix: Put a response cache in front of the query engine. Check your installed LlamaIndex version before relying on a built-in one; a thin application-level cache around query_engine.query works with any version.
      # Minimal in-process cache, keyed on the normalized query text.
      # query_engine is the engine created earlier; cached_query is a helper defined here.
      _response_cache = {}
      
      def cached_query(question):
          key = " ".join(question.lower().split())
          if key not in _response_cache:
              _response_cache[key] = query_engine.query(question)
          return _response_cache[key]
      
      response = cached_query("What fruits are mentioned?")
      # Asking the same question again returns the stored response
      # without embedding the query or calling the LLM.
      
    • Why it works: The cache stores each response under a key derived from the query text. Subsequent identical queries hit the cache and return the stored response without any downstream API calls. Only exact (normalized) repeats hit; a rephrased question still pays full price, and cached answers go stale when the index changes.
  3. Reducing Node Count / Granularity:

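    • Diagnosis: If your documents are split into excessively small chunks (e.g., very low chunk_size), you’ll have a massive number of nodes. Every node has to be embedded, and overlapping chunks embed the same text twice.
    • Fix: Adjust chunk_size in your NodeParser to a reasonable size (e.g., 512-1024 tokens for many models). Larger chunks mean fewer nodes, fewer embedding requests, and less duplicated overlap. The trade-off shows up at query time: each retrieved node is larger, so the synthesis prompt grows.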
      from llama_index.core.node_parser import SentenceSplitter
      # Larger chunk size reduces the number of nodes and thus embedding calls
      nodes = SentenceSplitter(chunk_size=1024, chunk_overlap=20).get_nodes_from_documents(documents)
      
    • Why it works: Fewer nodes mean fewer embedding requests and less overlap to pay for during the initial indexing phase. When pricing is per token the saving is modest, because the same text is still embedded once.
  4. Batching Embeddings:

    • Diagnosis: Embedding APIs typically charge per token, and each request also carries latency and counts against rate limits. Making an individual call for each node is less efficient than batching.
    • Fix: LlamaIndex’s embedding integrations batch node texts automatically; the embed_batch_size parameter on the embedding model controls how many texts go into each request. Ensure you’re using a recent version of LlamaIndex and the relevant embedding integration. If you’re implementing a custom embedding solution, look for batching capabilities.
    • Why it works: Many API endpoints are optimized for batched requests, reducing overhead per embedding. Batching mainly cuts request count and indexing time rather than per-token charges.
  5. Selective Re-indexing:

    • Diagnosis: Re-indexing an entire corpus when only a small subset of documents has changed.
    • Fix: Identify the changed documents and re-index only those, for example by comparing content hashes or modification metadata. With an index that is already loaded, index.insert_nodes adds the new nodes, and index.refresh_ref_docs does the comparison for you when your documents carry stable IDs. If you manage the process by hand, remove the old nodes of a changed document first (index.delete_ref_doc), or stale chunks will keep showing up in retrieval. For more advanced scenarios, consider tools that track document versions.
      # Simplified example: only re-index changed docs
      changed_docs = [...] # Identify which documents changed
      new_nodes = SentenceSplitter().get_nodes_from_documents(changed_docs)
      # Assuming index is already loaded from a persistent store
      index.insert_nodes(new_nodes)
      
    • Why it works: You avoid re-embedding and re-inserting unchanged data, saving on embedding costs and processing time.
  6. Using Local/Cheaper Embeddings:

    • Diagnosis: Relying on expensive proprietary embedding models when a sufficient local or cheaper alternative exists.
    • Fix: Switch to models like HuggingFaceEmbedding with locally hosted models (e.g., sentence-transformers/all-MiniLM-L6-v2) or other cost-effective cloud providers.
      from llama_index.core import Settings
      from llama_index.embeddings.huggingface import HuggingFaceEmbedding
      # Use a local model to avoid API costs for embeddings
      embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
      # Make it the default for both indexing and query embedding
      Settings.embed_model = embed_model
      
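    • Why it works: Eliminates or significantly reduces per-embedding costs associated with cloud APIs. Switching models means re-embedding the existing corpus once, since vectors from different models are not comparable.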
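
The change detection that selective re-indexing (item 5) depends on can be sketched with the standard library alone. This is one possible approach, not a LlamaIndex feature: content_hash and diff_corpus are hypothetical helpers, and the document IDs and texts are made up. The idea is to keep a manifest of content hashes from the previous run and compare it with the current corpus.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_corpus(previous_hashes, current_docs):
    """Compare the last run's manifest with the current documents.

    previous_hashes: dict mapping doc_id -> hash saved after the last run
    current_docs: dict mapping doc_id -> current text
    Returns (changed_or_new_ids, removed_ids, new_manifest).
    """
    new_manifest = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    changed = sorted(d for d, h in new_manifest.items() if previous_hashes.get(d) != h)
    removed = sorted(d for d in previous_hashes if d not in new_manifest)
    return changed, removed, new_manifest


# First run: everything is new, so everything gets indexed.
initial, _, manifest = diff_corpus({}, {"doc1.txt": "apples", "doc2.txt": "bananas"})

# Second run: doc1 is untouched, doc2 was edited, doc3 was added.
changed, removed, manifest = diff_corpus(
    manifest,
    {"doc1.txt": "apples", "doc2.txt": "bananas are yellow", "doc3.txt": "cherries"},
)
print("re-embed:", changed)
print("delete from index:", removed)
```

Only the IDs in changed need to be parsed and embedded again, and the IDs in removed should be deleted from the index. In practice the manifest would be saved between runs, for instance as a JSON file next to the persisted index.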

The next problem you’ll hit is likely a rate-limit error (RateLimitError with the OpenAI client) if you’re making too many calls too quickly to an API, or a context-length error if the retrieved context for LLM synthesis exceeds the model’s window.
