Contextual compression in LlamaIndex is a technique for making Retrieval-Augmented Generation (RAG) systems smarter by filtering out irrelevant information before it reaches the language model.

Here’s a typical RAG flow: a user asks a question, the system retrieves relevant documents, and then a language model uses those documents to answer the question. The problem is, "relevant" is often a loose term. The retriever might pull back a dozen documents, but only a few sentences in one of them actually contain the answer. The rest is just noise that can confuse the LLM, leading to hallucinations or less accurate answers. Contextual compression tackles this by adding a "pre-filter" step.

Let’s see it in action. Imagine we have a document about the solar system.

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.postprocessor import LLMRerank, SentenceEmbeddingOptimizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Assume you have an OpenAI API key set as an environment variable
# import os; os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Configure the default LLM and embedding model (optional, but good practice)
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load documents (assumes a 'data' folder containing your files)
documents = SimpleDirectoryReader("./data").load_data()

# Parse nodes
node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
nodes = node_parser.get_nodes_from_documents(documents)

# Build an index
index = VectorStoreIndex(nodes)

# --- Contextual Compression Setup ---

# 1. Choose a base retriever (here, a vector index retriever).
#    Retrieve generously; the compression step trims the result afterwards.
base_retriever = index.as_retriever(similarity_top_k=5)

# 2. Choose the "compressors".
#    In LlamaIndex, compression runs as node postprocessors applied after retrieval.
#    Check the class names below against your installed version.
#    LLMRerank asks the LLM to score each retrieved node against the query
#    and keeps only the top_n most relevant ones.
reranker = LLMRerank(llm=Settings.llm, choice_batch_size=5, top_n=2)

#    SentenceEmbeddingOptimizer then prunes inside each surviving node, keeping
#    only the sentences whose embeddings are closest to the query
#    (here, the top 50%).
sentence_filter = SentenceEmbeddingOptimizer(
    embed_model=Settings.embed_model,
    percentile_cutoff=0.5,
)

# 3. Wrap the retriever and the postprocessors in a query engine.
#    Postprocessors run in list order: rerank nodes first, then prune sentences.
query_engine = RetrieverQueryEngine.from_args(
    retriever=base_retriever,
    llm=Settings.llm,
    node_postprocessors=[reranker, sentence_filter],
)

# --- Querying ---
response = query_engine.query("What is the red planet?")
print(response)

# Display the source nodes used for the response
print("\n--- Source Nodes ---")
for source in response.source_nodes:
    print(f"score={source.score}")
    print(source.node.get_content())
    print("-" * 40)

When you run this, look at response.source_nodes. Compared with a standard index.as_query_engine() that has no compression step, you should see fewer, more targeted nodes. In LlamaIndex this step is handled by node postprocessors. LLMRerank passes the initially retrieved nodes to the LLM with a prompt asking it to judge each node's relevance to the query and keeps only the top-ranked ones. SentenceEmbeddingOptimizer then drops the sentences inside each remaining node that are least similar to the query, so only the condensed text reaches the final answer step.

The core problem this solves is information overload for the LLM. In RAG, the LLM has a limited context window. If you stuff it with too much irrelevant text, it can:

  1. Hallucinate: Make up answers based on noise.
  2. Lose focus: Fail to pinpoint the correct answer within the noise.
  3. Cost more: Processing more tokens is more expensive.

Contextual compression gives you more control over what gets into the LLM’s context window. You’re not just retrieving documents; you’re retrieving the essential parts of those documents relevant to the specific query.

The mental model for contextual compression is a two-stage retrieval process:

  1. Initial Retrieval: A fast, often keyword- or embedding-based retriever (like a vector index) fetches a candidate set of documents or nodes. This set is usually larger than what the LLM can effectively process.
  2. Compression/Refinement: A more sophisticated, often LLM-powered, step then re-evaluates these candidate nodes. It filters, prunes, or summarizes them to produce a much smaller, highly relevant set of text chunks. This refined set is then passed to the LLM for final answer generation.
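
The two stages above can be sketched without any library at all. This is a toy illustration, not how LlamaIndex implements it: the function names (retrieve, compress), the stopword list, and the word-overlap scoring are all assumptions made for the example, standing in for embedding similarity and LLM judgment.

```python
import re

# Hypothetical stopword list, just large enough for this example.
STOPWORDS = {"the", "is", "a", "of", "which", "it", "as", "in", "at", "its", "has"}

def tokenize(text):
    """Lowercase the text and return its set of content words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def retrieve(query, chunks, top_k=2):
    """Stage 1: rank whole chunks by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]

def compress(query, chunks, min_overlap=2):
    """Stage 2: keep only sentences sharing at least min_overlap words with the query."""
    q = tokenize(query)
    kept = []
    for chunk in chunks:
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            if len(q & tokenize(sentence)) >= min_overlap:
                kept.append(sentence)
    return kept

chunks = [
    "Mars is known as the red planet. Its surface is rich in iron oxide. Mars has two small moons.",
    "Jupiter is the largest planet. It has a giant storm called the Great Red Spot.",
    "The Sun is a star at the center of the solar system.",
]
query = "Which planet is the red planet?"

candidates = retrieve(query, chunks, top_k=2)  # chunk-level: Jupiter slips in as noise
context = compress(query, candidates)          # sentence-level: the noise is filtered out
print(context)
```

Notice that the Jupiter chunk passes stage 1 because it mentions both "planet" and "red", just never in the same sentence. Stage 2 is what removes it.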

The key levers you control are:

  • base_retriever: How you initially find candidate documents. This could be a VectorIndexRetriever, a BM25Retriever, or any other retriever. The similarity_top_k parameter here determines how many initial candidates are passed to the compressor.
  • compressor: The algorithm or model that performs the filtering. In LlamaIndex this is a node postprocessor: LLMRerank uses an LLM to judge relevance, SentenceEmbeddingOptimizer prunes sentences by embedding similarity, and SimilarityPostprocessor or SentenceTransformerRerank are cheaper alternatives that need no LLM call.
  • Size budget: This is a crucial parameter. It tells the compressor how much context should survive. LLMRerank takes top_n, SentenceEmbeddingOptimizer takes percentile_cutoff or threshold_cutoff, and LongLLMLinguaPostprocessor (shipped as a separate integration package) takes target_token. A smaller value means more aggressive filtering.
  • Chunking (the SentenceSplitter chunk_size at indexing time, plus the sentence tokenizer inside SentenceEmbeddingOptimizer): How text is broken down into the pieces the compressor evaluates.
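
To get a feel for how the size budget interacts with the candidate set, here is a minimal sketch of a greedy packer. It is an assumption-laden toy: pack_to_budget is a hypothetical helper, the relevance scores are made up, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
def pack_to_budget(scored_sentences, max_tokens):
    """Greedily keep the highest-scoring sentences that fit within max_tokens.

    Token counts are approximated by whitespace-separated words.
    """
    kept = []
    used = 0
    for score, sentence in sorted(scored_sentences, key=lambda p: p[0], reverse=True):
        cost = len(sentence.split())
        if used + cost <= max_tokens:
            kept.append(sentence)
            used += cost
    return kept, used

# (relevance score, sentence) pairs; the scores are invented for illustration.
scored = [
    (0.91, "Mars is known as the red planet."),
    (0.40, "Jupiter has a storm called the Great Red Spot."),
    (0.35, "Mars has two small moons."),
    (0.05, "The Sun is a star."),
]

tight, tight_used = pack_to_budget(scored, max_tokens=12)
loose, loose_used = pack_to_budget(scored, max_tokens=30)
print(tight, tight_used)
print(loose, loose_used)
```

With the tight budget only two sentences survive; with the loose one, everything gets through, including the irrelevant sentence about the Sun. That is the trade-off you are tuning.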

The one thing most people don’t know is that "compression" here is less about making text shorter and more about semantic filtering. A compressor might take a long, relevant chunk and decide that only a single sentence within it is truly essential for the specific query. It’s not just truncating; it’s selecting the pieces of information that matter for this question. This is why the model used in the compressor, whether an LLM or an embedding model, needs to be good at judging relevance.
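
The difference between truncating and selecting is easy to see in a small sketch. Both helpers below (truncate and select_sentences) are hypothetical, and plain word matching stands in for the relevance judgment a real compressor would make.

```python
import re

def truncate(text, max_words):
    """Naive shortening: keep the first max_words words, whatever they say."""
    return " ".join(text.split()[:max_words])

def select_sentences(text, query_terms, max_sentences=1):
    """Query-aware shortening: keep the sentences that mention the most query terms."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def score(sentence):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        return len(words & query_terms)

    ranked = sorted(sentences, key=score, reverse=True)
    return [s for s in ranked[:max_sentences] if score(s) > 0]

chunk = (
    "The solar system formed about 4.6 billion years ago. "
    "It contains eight planets and many smaller bodies. "
    "Mars is often called the red planet because of iron oxide on its surface."
)
query_terms = {"red", "planet"}

truncated = truncate(chunk, max_words=9)
selected = select_sentences(chunk, query_terms)
print(truncated)
print(selected)
```

Truncation keeps the opening sentence and loses the answer, which sits at the end of the chunk. Selection keeps only the sentence that actually addresses the query.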

The next step after mastering contextual compression is exploring hybrid retrieval strategies, combining different retriever types for even more robust candidate generation.
