LangChain’s Contextual Compression is a technique to make your retrieval systems smarter by filtering out irrelevant information after you’ve already fetched it.
Let’s see it in action. Imagine we have a document about different types of pasta and we want to ask about "orecchiette."
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load a document and split it into chunks, so the retriever has several
# candidates to return (the chunk sizes below are illustrative, not tuned)
loader = TextLoader("pasta_recipes.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Embed and store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})  # Fetch top 5 chunks

# Define a simple LLM chain
llm = ChatOpenAI(model="gpt-3.5-turbo")
prompt = ChatPromptTemplate.from_template(
    "Answer the question based on the following context:\n\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into one plain-text context string
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

# Now, let's set up contextual compression
# We'll use EmbeddingsFilter to drop retrieved documents whose embeddings
# are not similar enough to the original query.
# This is a simple example, but you can use more complex filters.
# The 0.8 threshold is a starting point; tune it for your embedding model.
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings, similarity_threshold=0.8
)
compressed_retriever = ContextualCompressionRetriever(
    base_retriever=retriever,
    # This is the 'compressor' that filters the documents
    # We're using a simple filter here, but this could be another LLM call
    # or a more sophisticated embedding-based filter.
    base_compressor=embeddings_filter,
)

# Create a new chain with the compressed retriever
compressed_chain = (
    {"context": compressed_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

# Example question
question = "What is orecchiette?"

# Run the original chain (without compression)
print("--- Without Compression ---")
response_no_compression = chain.invoke(question)
print(response_no_compression.content)

# Run the compressed chain
print("\n--- With Compression ---")
response_with_compression = compressed_chain.invoke(question)
print(response_with_compression.content)
The core idea is that your initial retrieval step might pull back more documents than are actually useful. Think of it like a noisy signal – you get the signal, but there’s a lot of static. Contextual Compression acts as a secondary filter, after the initial retrieval, to clean up that signal by re-evaluating the retrieved documents based on their relevance to the specific query. This is crucial because a document might be generally relevant to your knowledge base but contain a paragraph or sentence that’s off-topic for the user’s immediate question.
Here’s how it works internally:
- Base Retrieval: You start with a standard retriever (e.g., a vector store retriever) that fetches a set of initial documents based on the query’s embedding. This is your "noisy" set.
- Compression Stage: Rather than going straight to the LLM, these documents first pass through a "compressor." In LangChain this is a document compressor object (for example an embeddings-based filter, an LLM-based extractor, or a re-ranker), not a second retriever. It takes the initial set of documents together with the original query and returns a smaller, more relevant result: fewer documents, shorter documents, or both.
- LLM Consumption: The LLM then receives only this compressed set of documents, along with the original query, to generate its final answer.
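The three stages above can be sketched without any LangChain dependency. This is a minimal illustration only: the bag-of-words `bow` "embedding", the toy corpus, and the `0.3` threshold are assumptions made for the example, standing in for a real embedding model and vector store.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Toy bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def base_retrieve(query: str, corpus: list, k: int) -> list:
    """Stage 1: fetch the top-k chunks by similarity (the 'noisy' set)."""
    q = bow(query)
    return sorted(corpus, key=lambda d: cosine(q, bow(d)), reverse=True)[:k]


def compress(query: str, docs: list, threshold: float) -> list:
    """Stage 2: keep only the chunks similar enough to this specific query."""
    q = bow(query)
    return [d for d in docs if cosine(q, bow(d)) >= threshold]


def build_prompt(query: str, docs: list) -> str:
    """Stage 3: only the compressed set is placed in the LLM prompt."""
    context = "\n".join(docs)
    return (
        "Answer the question based on the following context:\n\n"
        f"{context}\n\nQuestion: {query}"
    )


corpus = [
    "Orecchiette is a small ear-shaped pasta from Puglia.",
    "Spaghetti is a long thin pasta.",
    "Penne is a tube-shaped pasta cut on the diagonal.",
    "Store dried pasta in a cool dry place.",
]
query = "What is orecchiette?"

retrieved = base_retrieve(query, corpus, k=3)          # three candidate chunks
compressed = compress(query, retrieved, threshold=0.3)  # only the on-topic one
prompt_text = build_prompt(query, compressed)
print(prompt_text)
```

In this toy run the base retriever returns three chunks because `k=3` forces it to, while the compression stage keeps only the chunk that actually discusses orecchiette.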
The ContextualCompressionRetriever in LangChain orchestrates this. You give it your base_retriever (what fetches the initial documents) and a base_compressor (what filters them). The base_compressor can be various things:
- A scorer or LLM call: a document-scoring component, or even a small LLM call, can re-rank or filter documents. LangChain ships LLMChainFilter and LLMChainExtractor for the LLM-based case.
- A transformer or filter: for example EmbeddingsFilter, which drops retrieved documents whose embeddings are not similar enough to the query. Other transformers go further and remove individual sentences or paragraphs that are off-topic.
- A re-ranker: many specialized re-ranking models take a list of documents and a query and return the same documents in a more relevant order, optionally dropping the lowest-ranked ones.
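To make the difference between these compressor styles concrete, here is a small plain-Python sketch in which every compressor is a function from `(query, docs)` to a new list of documents, and several are chained. The names `sentence_extractor`, `reranker`, and `pipeline`, the stopword list, and the keyword-count score are all hypothetical simplifications for illustration, not LangChain APIs. LangChain itself offers `DocumentCompressorPipeline` for chaining real compressors.

```python
import re
from typing import Callable, List

Compressor = Callable[[str, List[str]], List[str]]
STOPWORDS = {"what", "is", "a", "an", "the"}


def score(query: str, text: str) -> int:
    """Toy relevance score: how many words in the text are query terms."""
    terms = set(re.findall(r"[a-z]+", query.lower())) - STOPWORDS
    return sum(1 for word in re.findall(r"[a-z]+", text.lower()) if word in terms)


def sentence_extractor(query: str, docs: List[str]) -> List[str]:
    """Transformer-style: keep only sentences that mention a query term."""
    out = []
    for doc in docs:
        kept = [s for s in re.split(r"(?<=\.)\s+", doc) if score(query, s) > 0]
        if kept:  # documents with no relevant sentence are dropped entirely
            out.append(" ".join(kept))
    return out


def reranker(query: str, docs: List[str]) -> List[str]:
    """Re-ranker-style: same documents, most relevant first."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)


def pipeline(*steps: Compressor) -> Compressor:
    """Chain compressors so each one sees the previous one's output."""
    def run(query: str, docs: List[str]) -> List[str]:
        for step in steps:
            docs = step(query, docs)
        return docs
    return run


docs = [
    "Spaghetti is long and thin. It pairs well with tomato sauce.",
    "Fusilli is a spiral pasta. Orecchiette holds chunky sauces well.",
    "Orecchiette is an ear-shaped pasta. Penne is a tube. Orecchiette comes from Puglia.",
]
result = pipeline(sentence_extractor, reranker)("What is orecchiette?", docs)
for doc in result:
    print(doc)
```

The extractor shortens or removes documents, while the re-ranker only changes their order, which is why the two combine naturally in one pipeline.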
The power comes from the fact that the compression step is contextual. It uses the specific query to decide which parts of the retrieved documents are most pertinent, rather than just relying on the initial similarity scores. This can dramatically improve the quality of answers from LLMs, especially when dealing with large or diverse knowledge bases where initial retrieval might pick up many superficially similar but ultimately irrelevant chunks.
One subtlety often overlooked is that the base_compressor can itself be LLM-based. For instance, you could have an initial vector search that pulls back 10 documents, and then use a separate LLM call to ask, "Given this query and these 10 documents, which 3 documents are MOST relevant?" This LLM acts as the compressor, performing a more nuanced, semantic filtering than embedding similarity alone might achieve.
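A rough sketch of that LLM-as-compressor idea follows. The `llm` argument is any callable that maps a prompt string to a reply string; `fake_llm` is a hypothetical stand-in with a canned answer so the example runs offline, and the prompt wording and reply format are assumptions rather than a LangChain convention.

```python
import re
from typing import Callable, List


def llm_select(
    query: str, docs: List[str], llm: Callable[[str], str], top_n: int = 3
) -> List[str]:
    """Ask an LLM which documents are most relevant and keep only those."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(docs))
    prompt = (
        f"Given this query and these {len(docs)} documents, which {top_n} "
        "documents are MOST relevant? Reply with document numbers only, "
        "comma-separated.\n\n"
        f"Query: {query}\n\nDocuments:\n{numbered}"
    )
    reply = llm(prompt)

    # Parse defensively: ignore out-of-range or repeated indices in the reply
    picked: List[int] = []
    for token in re.findall(r"\d+", reply):
        idx = int(token)
        if 0 <= idx < len(docs) and idx not in picked:
            picked.append(idx)
    return [docs[i] for i in picked[:top_n]]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat model; always replies '2, 0, 7'."""
    return "2, 0, 7"


docs = [
    "Orecchiette is an ear-shaped pasta.",
    "Store dried pasta in a cool dry place.",
    "Orecchiette comes from Puglia.",
    "Penne is a tube-shaped pasta.",
]
selected = llm_select("What is orecchiette?", docs, fake_llm, top_n=3)
print(selected)
```

Validating the indices matters in practice, since a model may return numbers that do not correspond to any document; here the out-of-range `7` is silently dropped.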
The next step in refining retrieval is often exploring different compressor types, like LLM-based re-rankers or more sophisticated filtering logic, to further tailor the context provided to the LLM.