LangChain’s cross-encoder reranking (the CrossEncoderReranker document compressor) can substantially improve the relevance of the documents your RAG system retrieves, often surfacing better results than vector similarity alone.
Let’s see this in action. Imagine we have a basic RAG setup retrieving documents based on a user’s query.
```python
# Requires: langchain, langchain-community, langchain-openai, langchain-text-splitters,
# faiss-cpu, and sentence-transformers (used by the Hugging Face cross-encoder).
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# LangChain 0.x import paths (in LangChain 1.x these moved to the langchain-classic package)
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker

# Assume you have documents in a file named 'my_docs.txt'
# and an OpenAI API key set in the OPENAI_API_KEY environment variable.

# 1. Load and split documents
loader = TextLoader("my_docs.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# 2. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})  # Retrieve top 10 candidates

# 3. Define RAG prompt and LLM chain (without reranking yet)
llm = ChatOpenAI(model="gpt-4o-mini")
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(prompt_template)

def format_docs(docs):
    # Join the retrieved chunks into a single context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_no_rerank = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- Now, let's add reranking ---
# 4. Initialize a cross-encoder model
# This model takes a (query, document) pair as input and outputs a relevance score.
# We'll use a pre-trained reranker from Hugging Face (downloaded on first use).
cross_encoder_model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3")
reranker = CrossEncoderReranker(model=cross_encoder_model, top_n=5)  # Docs to keep AFTER reranking

# 5. Create a reranked retriever
# This retriever first fetches the 10 candidates from the vector store,
# then reranks them with the cross-encoder and keeps the top 5.
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever,  # The original retriever
)

# 6. Define the RAG chain with reranking
rag_chain_with_rerank = (
    {"context": reranked_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# --- Example Usage ---
query = "What are the key benefits of using cross-encoders in RAG?"
print("--- RAG without Reranking ---")
print(rag_chain_no_rerank.invoke(query))
print("\n--- RAG with Reranking ---")
print(rag_chain_with_rerank.invoke(query))
```
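The magic happens because cross-encoders treat the query and document together as a single input. Bi-encoders embed the query and each document separately. Cross-encoders, by contrast, can model the interactions between query tokens and document tokens directly, which lets them assign a more accurate relevance score. The price is one full model forward pass for every query-document pair. That is why cross-encoders rerank a short candidate list rather than search the whole corpus.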
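To make that trade-off concrete, here is a back-of-the-envelope sketch that counts model forward passes. It assumes one pass per text for a bi-encoder and one pass per (query, document) pair for a cross-encoder. The function name and the corpus sizes are illustrative assumptions, not anything from LangChain.

```python
def forward_passes(num_docs: int, num_queries: int, rerank_k: int) -> dict[str, int]:
    """Count model forward passes for three retrieval strategies (illustrative only)."""
    # Bi-encoder: embed every document once (offline), then embed each query once.
    bi_encoder = num_docs + num_queries
    # Cross-encoder over the whole corpus: one pass per (query, document) pair.
    cross_encoder_full = num_docs * num_queries
    # Two-stage: bi-encoder retrieval, then cross-encode only the top rerank_k candidates.
    two_stage = bi_encoder + num_queries * rerank_k
    return {
        "bi_encoder": bi_encoder,
        "cross_encoder_full": cross_encoder_full,
        "two_stage": two_stage,
    }


if __name__ == "__main__":
    # Hypothetical sizes: 100,000 chunks, 1,000 queries, rerank the top 10 (k=10 above).
    print(forward_passes(num_docs=100_000, num_queries=1_000, rerank_k=10))
```

With these hypothetical numbers, the two-stage pipeline needs about 111,000 passes, versus 100 million for exhaustive cross-encoding. Each cross-encoder pass is also over a longer input (query plus document), so the real gap in compute is at least this large.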
When you run the code above, you’ll likely see that the RAG chain with reranking gives a more focused and accurate answer. The initial retriever may bring back several documents that are semantically similar to the query but do not directly answer it. The cross-encoder then re-evaluates these candidates and scores each one by how well it actually addresses the query. It’s like having a human read the query alongside each retrieved snippet and decide which ones fit best.
A core problem that reranking addresses in RAG is the "semantic drift" that can occur when you rely solely on vector similarity. Vector embeddings are good at capturing general meaning, but they can miss subtle contextual cues. A document might be about "apples" (the fruit) when you’re looking for "Apple" (the company), and a bi-encoder may still give the two a fairly high similarity score. A cross-encoder instead reads the pair together: the query "What are the latest products from Apple?" and the document "This article discusses the nutritional benefits of apples and pears." A well-trained cross-encoder should therefore assign this pair a low relevance score.
Internally, a cross-encoder is typically a transformer (BERT, RoBERTa, or a specialized reranking model such as BAAI/bge-reranker). The query and document are concatenated into one input: for BERT-style models this is [CLS] query [SEP] document [SEP], while other architectures use their own special tokens. This input passes through all the layers. The representation of the first token is then fed into a linear layer that outputs a single relevance logit. A sigmoid is often applied to that logit to give a score between 0 and 1. In LangChain, CrossEncoderReranker orchestrates this step. It takes the candidate documents from your base retriever, pairs each one with the query, and passes the pairs to the cross-encoder model. It then sorts the documents by score and keeps the highest-scoring ones, and ContextualCompressionRetriever returns them to the chain. The top_n parameter on CrossEncoderReranker controls how many reranked documents are passed to the LLM.
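The pair-score-sort-truncate logic described above can be sketched in plain Python. This is a simplified illustration, not LangChain’s source. The names rerank, PairScorer, and toy_overlap_scorer are hypothetical, and the toy scorer (shared-word count) only stands in for a real cross-encoder so the example runs offline.

```python
import math
from typing import Callable, Sequence

# A scorer takes (query, document) pairs and returns one relevance logit per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], list[float]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function; monotonic, so it never changes the ranking."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rerank(query: str, docs: Sequence[str], score_pairs: PairScorer, top_n: int) -> list[tuple[str, float]]:
    """Pair each doc with the query, score all pairs, sort descending, keep the top_n."""
    pairs = [(query, doc) for doc in docs]
    logits = score_pairs(pairs)
    scored = [(doc, sigmoid(logit)) for doc, logit in zip(docs, logits)]
    scored.sort(key=lambda item: item[1], reverse=True)  # stable: ties keep retriever order
    return scored[:top_n]


def toy_overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """HYPOTHETICAL stand-in for a cross-encoder: shared lowercase words, minus 1."""
    return [float(len(set(q.lower().split()) & set(d.lower().split())) - 1) for q, d in pairs]


if __name__ == "__main__":
    candidates = [
        "Apples and pears are rich in fibre.",
        "Apple announced new products this week.",
        "The latest products from Apple include a new laptop.",
    ]
    for doc, score in rerank("latest products from Apple", candidates, toy_overlap_scorer, top_n=2):
        print(f"{score:.3f}  {doc}")
```

In a real pipeline, score_pairs would be the cross-encoder call, and the candidate list would come from the vector-store retriever.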
Batching is crucial for performance. CrossEncoderReranker sends all query-document pairs to the model in a single score call. The underlying sentence-transformers CrossEncoder then processes them in batches rather than one pair at a time, which lets parallel hardware such as a GPU score many pairs at once and significantly speeds up reranking. You can also fine-tune cross-encoder models on your own domain data for even better results, though this requires a more advanced setup.
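A minimal sketch of the batching idea, assuming a model that scores a list of pairs per call. The pairs are split into fixed-size chunks, each chunk is scored in one call, and the scores are concatenated back in input order. The names score_in_batches and FakeModel are hypothetical. sentence-transformers does the equivalent internally through the batch_size argument of CrossEncoder.predict.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]


def score_in_batches(
    pairs: Sequence[Pair],
    score_batch: Callable[[Sequence[Pair]], list[float]],
    batch_size: int = 32,
) -> list[float]:
    """Score pairs in chunks of batch_size and return the scores in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    scores: list[float] = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start : start + batch_size]
        # One model call per batch; with a real model on a GPU, a batch runs in parallel.
        scores.extend(score_batch(batch))
    return scores


class FakeModel:
    """HYPOTHETICAL model that counts its calls and scores by document length."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, batch: Sequence[Pair]) -> list[float]:
        self.calls += 1
        return [float(len(doc)) for _, doc in batch]


if __name__ == "__main__":
    model = FakeModel()
    pairs = [("query", "d" * i) for i in range(10)]
    scores = score_in_batches(pairs, model, batch_size=4)
    print(model.calls, scores)
```

Ten pairs with batch_size=4 need three model calls instead of ten. Larger batches trade memory for fewer calls.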
The effectiveness of specialized reranker models such as BAAI/bge-reranker-v2-m3 comes largely from their training objective. They are trained specifically to output a single score indicating the relevance of a query-document pair, rather than to generate text or classify tokens. This focused training makes them strong at ranking, and on specific retrieval tasks they can outperform general-purpose models applied to the same job.
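To make "trained to output a single score" concrete, here is one simple pointwise formulation. The model emits one logit z per (query, document) pair. Binary cross-entropy then pushes z up for relevant pairs (label 1) and down for irrelevant ones (label 0). This is an illustrative assumption, not a claim about the exact loss used to train bge-reranker-v2-m3. Rerankers are also trained with other objectives, such as losses computed over a group of candidates per query.

```python
import math


def pointwise_bce(logit: float, label: int) -> float:
    """Binary cross-entropy on a single relevance logit.

    Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
    rewritten as softplus(z) - y*z for numerical stability.
    """
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


if __name__ == "__main__":
    # Loss for a relevant (y=1) and an irrelevant (y=0) pair at a few logit values
    for z in (-4.0, 0.0, 4.0):
        print(z, round(pointwise_bce(z, 1), 4), round(pointwise_bce(z, 0), 4))
```

A high logit on a relevant pair gives a small loss, and the same logit on an irrelevant pair gives a large one. Minimizing this loss therefore teaches the model to rank relevant documents above irrelevant ones.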
The next step after implementing reranking is to explore different cross-encoder models or even fine-tune one on your domain-specific data.