LlamaIndex’s token counting isn’t just about seeing how many tokens you’ve used; it’s a surprisingly effective way to force yourself to think about the value each piece of text brings to your LLM interaction.

Let’s see it in action. Imagine you’re building a RAG system that answers questions about a document.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.response.notebook_utils import display_response

# Set up token counting before any LLM or embedding call is made.
# Registering the handler on Settings makes it visible to the index,
# the embedding model, and the query engine's LLM.
token_counter = TokenCountingHandler()
Settings.callback_manager = CallbackManager([token_counter])

# Load documents
documents = SimpleDirectoryReader("./data").load_data()

# Build the index (this embeds every chunk)
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What is the main purpose of this document?")

# Display the response (display_response renders Markdown, so run this in a notebook)
display_response(response)

# Print the token counts recorded by the handler
print(f"Embedding tokens: {token_counter.total_embedding_token_count}")
print(f"Prompt tokens: {token_counter.prompt_llm_token_count}")
print(f"Completion tokens: {token_counter.completion_llm_token_count}")
print(f"Total LLM tokens: {token_counter.total_llm_token_count}")

When you run this, token_counter hooks into LlamaIndex’s callback system. It doesn’t wait until the end to count; it records token counts for each LLM and embedding call as that call completes. The handler exposes the number of tokens sent to the LLM as prompt_llm_token_count, the number of tokens the LLM returned as completion_llm_token_count, and their sum as total_llm_token_count. Embedding tokens are tracked separately in total_embedding_token_count.

This RAG system has a few implicit costs. First, the VectorStoreIndex.from_documents call embeds every chunk, which consumes embedding-model tokens; a basic vector index makes no LLM calls while it is being built. More importantly for LLM spend, the query_engine.query call is where the prompt and completion tokens are consumed.

Here’s the mental model:

  1. Retrieval: LlamaIndex fetches relevant Node objects (chunks of your document) based on your query. Each Node has associated text.
  2. Context Assembly: These retrieved Node texts are then formatted into a prompt. This prompt includes your original query and the retrieved context. Its length is what the prompt token count measures.
  3. LLM Call: The assembled prompt is sent to the LLM.
  4. Generation: The LLM generates a response based on the prompt. Its length is what the completion token count measures.
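
The four steps above can be sketched in plain Python to show where each count comes from. This is a toy model, not LlamaIndex’s actual implementation: count_tokens is a whitespace splitter standing in for a real tokenizer, and build_prompt, PROMPT_TEMPLATE, the sample chunks, and the faked completion are all hypothetical names and values made up for illustration.

```python
# Toy model of the query path; every name here is illustrative, not LlamaIndex API.

PROMPT_TEMPLATE = (
    "Context information is below.\n"
    "{context}\n"
    "Given the context, answer the query.\n"
    "Query: {query}\n"
    "Answer: "
)


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 2: merge the retrieved chunk texts and the query into one prompt."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, query=query)


# Step 1 (retrieval) is faked with two hard-coded chunks.
retrieved_chunks = [
    "The report describes the 2023 migration plan.",
    "Its purpose is to guide the operations team.",
]
query = "What is the main purpose of this document?"

# Step 2: assemble the prompt.
prompt = build_prompt(query, retrieved_chunks)

# Steps 3 and 4 are faked with a fixed string instead of a real LLM call.
completion = "It guides the operations team."

query_token_count = count_tokens(query)
prompt_token_count = count_tokens(prompt)
completion_token_count = count_tokens(completion)

print(f"query alone: {query_token_count}")
print(f"prompt:      {prompt_token_count}")
print(f"completion:  {completion_token_count}")
print(f"total:       {prompt_token_count + completion_token_count}")
```

Even in this toy version the prompt is several times longer than the query, because the template and every retrieved chunk are counted along with it.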

The key levers you control are:

  • Chunking Strategy: How you split your documents into Nodes. With a fixed number of retrieved Nodes, smaller chunks shrink the context but may cut off the surrounding detail an answer needs. Larger chunks carry more tokens per retrieved Node and can bury the relevant sentence in unrelated text.
  • Number of Retrieved Chunks: The similarity_top_k parameter, which you can pass to index.as_query_engine, determines how many Nodes are pulled. More chunks mean more context, and more tokens.
  • Prompt Engineering: How you structure the prompt sent to the LLM. Adding instructions, examples, or specific formatting can add tokens.
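  • LLM Choice: Different LLMs charge different prices per thousand tokens, often with separate rates for prompt and completion tokens. They also use different tokenizers, so the same text can produce different counts.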
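
To compare these levers before spending anything, a back-of-the-envelope estimate can help. The sketch below is an assumption-laden illustration: estimate_query_cost and ModelPrice are hypothetical helpers, the prices are made-up numbers rather than any provider’s real rates, and the overhead and completion lengths are guesses you would replace with figures measured by the token counter.

```python
from dataclasses import dataclass


@dataclass
class ModelPrice:
    """Hypothetical prices in dollars per 1,000 tokens; check your provider's real rates."""

    prompt_per_1k: float
    completion_per_1k: float


def estimate_query_cost(
    top_k: int,
    chunk_tokens: int,
    price: ModelPrice,
    overhead_tokens: int = 100,    # assumed: system prompt + template + query
    completion_tokens: int = 200,  # assumed: average answer length
) -> float:
    """Rough cost of one query when all top_k chunks land in a single prompt."""
    prompt_tokens = top_k * chunk_tokens + overhead_tokens
    return (
        prompt_tokens / 1000 * price.prompt_per_1k
        + completion_tokens / 1000 * price.completion_per_1k
    )


# Made-up price points for two imaginary models.
small_model = ModelPrice(prompt_per_1k=0.0005, completion_per_1k=0.0015)
large_model = ModelPrice(prompt_per_1k=0.01, completion_per_1k=0.03)

for top_k in (2, 5, 10):
    small = estimate_query_cost(top_k, chunk_tokens=500, price=small_model)
    large = estimate_query_cost(top_k, chunk_tokens=500, price=large_model)
    print(f"top_k={top_k:>2}  small=${small:.5f}  large=${large:.5f}")
```

Once you have real counts from a few queries, swapping them in for the assumed defaults turns this into a quick way to see which lever moves your bill the most.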

The one thing that trips people up is that the prompt token count isn’t just your query text. It includes the system prompt, any preamble LlamaIndex adds, and crucially, the text from all retrieved document chunks. If you retrieve 10 chunks of 500 tokens each, that’s 5,000 tokens before your query even gets added.

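When optimizing, remember that a single 500-token chunk costs about the same as 5 separate 100-token chunks if they all end up in the prompt; the five small chunks can cost slightly more because each one brings its own separator or metadata. The cost is in the final context window, not the intermediate steps of retrieval.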
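
The arithmetic behind that comparison is easy to check. In this minimal sketch, context_tokens is a hypothetical helper and the per-chunk overhead of 4 tokens is an assumed figure chosen only for illustration; the real overhead depends on your prompt template and any metadata attached to each Node.

```python
def context_tokens(chunk_sizes: list[int], per_chunk_overhead: int = 4) -> int:
    """Tokens the retrieved context adds to the prompt.

    per_chunk_overhead is an assumed figure for separators or metadata
    attached to each chunk; the real value depends on your prompt template.
    """
    return sum(chunk_sizes) + per_chunk_overhead * len(chunk_sizes)


one_large = context_tokens([500])
five_small = context_tokens([100] * 5)

print(f"one 500-token chunk:   {one_large}")
print(f"five 100-token chunks: {five_small}")
print(f"difference:            {five_small - one_large}")
```

The text itself contributes 500 tokens either way; only the per-chunk overhead separates the two cases.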
