Reranking is a subtle but powerful optimization that can dramatically improve the precision of retrieval systems by re-scoring the top candidates from a fast first-pass search with a slower, more accurate model.
Let’s see this in action with LlamaIndex, a popular framework for building LLM applications with data. We’ll use a small example dataset of documents about different types of pasta.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response.notebook_utils import display_response
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.postprocessor.colbert_rerank import ColbertRerank
# SimpleDirectoryReader expects a directory named 'data' with text files.
# For this example, write one small file per pasta so that each description
# becomes its own document. A single six-line file would be indexed as one
# chunk, which would leave the reranker nothing to reorder.
import os

pasta_facts = {
    "spaghetti": "Spaghetti is a long, thin, solid, cylindrical pasta.",
    "fettuccine": "Fettuccine is a type of pasta popular in Roman and Tuscan cuisine.",
    "penne": "Penne is a type of pasta with cylinder-shaped pieces, their ends cut at an angle.",
    "lasagna": "Lasagna is a type of pasta, possibly of Greek origin, made of very wide, flat pasta sheets.",
    "rigatoni": "Rigatoni is a tube-shaped type of pasta with ridges on the outside.",
    "macaroni": "Macaroni is a dry pasta shaped like narrow tubes.",
}
os.makedirs("data", exist_ok=True)
for name, text in pasta_facts.items():
    with open(os.path.join("data", f"{name}.txt"), "w") as f:
        f.write(text + "\n")
# Load documents
documents = SimpleDirectoryReader("./data").load_data()
# Configure LlamaIndex settings
# Use a smaller embedding model for faster local testing
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
# Build a vector index
index = VectorStoreIndex.from_documents(documents)
# --- Basic Retrieval ---
print("--- Basic Retrieval ---")
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
query_engine = RetrieverQueryEngine(retriever=retriever)
response = query_engine.query("What kind of pasta is long and thin?")
print(response)
# The answer should mention Spaghetti, but the LLM also receives every other
# retrieved chunk as context, whether or not it is relevant to the query.
# --- Retrieval with Cohere Reranking ---
print("\n--- Retrieval with Cohere Reranking ---")
# Ensure you have COHERE_API_KEY set in your environment
# pip install llama-index-postprocessor-cohere-rerank
if "COHERE_API_KEY" in os.environ:
    # Keep only the 3 highest-scoring of the retrieved nodes
    cohere_reranker = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=3)
    query_engine_cohere = RetrieverQueryEngine(
        retriever=retriever,
        response_synthesizer=None,  # Use default synthesizer
        node_postprocessors=[cohere_reranker],
    )
    response_cohere = query_engine_cohere.query("What kind of pasta is long and thin?")
    print(response_cohere)
    # Expected output will be more precise, likely focusing on Spaghetti.
else:
    print("COHERE_API_KEY not set. Skipping Cohere reranking example.")
# --- Retrieval with ColBERT Reranking ---
print("\n--- Retrieval with ColBERT Reranking ---")
# pip install llama-index-postprocessor-colbert-rerank
# Note: ColBERT runs locally; the model weights are downloaded on first use
# and scoring can be resource-intensive.
# This example uses the pre-trained ColBERTv2 checkpoint.
colbert_reranker = ColbertRerank(
    model="colbert-ir/colbertv2.0",  # A common ColBERT model
    tokenizer="colbert-ir/colbertv2.0",
    top_n=3,  # Number of results to keep after reranking
)
query_engine_colbert = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=None,  # Use default synthesizer
    node_postprocessors=[colbert_reranker],
)
response_colbert = query_engine_colbert.query("What kind of pasta is long and thin?")
print(response_colbert)
# Expected output will also be precise, similar to Cohere.
The core problem retrieval systems face is that initial candidate selection, often done via vector similarity (like dot product or cosine similarity), is a coarse filter. It finds documents that are semantically similar on average, but it doesn’t deeply understand the nuance of a specific query’s intent against each document’s content. Think of it like a first-pass job application screener: they pull all resumes that mention "software engineering," but they haven’t yet read them to see if the candidate is a good fit for the specific role.
Reranking addresses this by taking a smaller set of promising candidates from the initial retrieval and applying a more sophisticated, often cross-encoder based, model to re-score them. These reranking models are designed to directly compare the query and a document snippet, understanding how well they align precisely.
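The two-stage flow described above can be sketched without any model at all. The snippet below is a minimal illustration, not how LlamaIndex implements it: `coarse_score` and `fine_score` are hypothetical toy stand-ins (word overlap) for vector similarity and a reranking model, and `retrieve_then_rerank` is a name invented for this example.

```python
import re
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def tokens(text: str) -> List[str]:
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z]+", text.lower())


def coarse_score(query: str, doc: str) -> float:
    """Toy stand-in for vector similarity: number of distinct shared words."""
    return float(len(set(tokens(query)) & set(tokens(doc))))


def fine_score(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: share of the document's words that match the query."""
    query_words = set(tokens(query))
    doc_words = tokens(doc)
    return sum(1 for word in doc_words if word in query_words) / len(doc_words)


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    coarse: Scorer,
    fine: Scorer,
    top_k: int = 5,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap score over the whole corpus, keep the top_k candidates.
    candidates = sorted(corpus, key=lambda d: coarse(query, d), reverse=True)[:top_k]
    # Stage 2: costlier score over the candidates only, keep the top_n.
    rescored = sorted(
        ((d, fine(query, d)) for d in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return rescored[:top_n]


corpus = [
    "Spaghetti is a long, thin, solid, cylindrical pasta.",
    "Fettuccine is a type of pasta popular in Roman and Tuscan cuisine.",
    "Macaroni is a dry pasta shaped like narrow tubes.",
]
results = retrieve_then_rerank(
    "long thin pasta", corpus, coarse_score, fine_score, top_k=3, top_n=2
)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

With these toy scorers, Fettuccine and Macaroni tie in the first stage and only the second stage separates them. A real reranker plays the same role with a far better notion of relevance.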
Cohere Reranking uses a powerful, proprietary transformer model that’s fine-tuned for relevance. When you pass a query and a list of documents to CohereRerank, it sends them to Cohere’s API. The API computes a relevance score for each document with respect to the query and returns the top_n highest-scoring documents, sorted by that score. This is a black-box approach, offering high quality with minimal local compute but requiring an API key and network access.
ColBERT (Contextualized Late Interaction over BERT) takes a different approach. It’s a local, open-source model. ColBERT generates an embedding for each token in the query and each token in a document. Then, instead of collapsing each side into a single averaged or summed vector, it compares every query token embedding with every document token embedding (the comparison happens after encoding, hence "late interaction"). For each query token it keeps the maximum similarity found among the document tokens, and the sum of these per-token maxima forms the document’s score. This allows for a much finer-grained, token-level understanding of relevance. It’s more computationally intensive locally than a simple vector search but doesn’t require external APIs.
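The late-interaction score can be written in a few lines. This is a minimal sketch of the scoring rule only, assuming token embeddings are already available. The 2-D vectors are hand-made for illustration rather than produced by a ColBERT model, and `maxsim_score` is a name chosen for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """For each query token keep its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Hand-made unit vectors standing in for token embeddings (illustrative only).
query_vecs = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]  # has an exact match for each query token
doc_b = [(0.6, 0.8)]                          # one token, partly aligned with both

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)
```

Document A scores higher because every query token finds a strong partner somewhere in it, whereas document B's single token only partly matches each query token.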
The key difference in how rerankers work is the comparison mechanism. A standard retriever (like a VectorIndexRetriever) typically uses a bi-encoder: the query and each document are encoded separately, usually by the same embedding model, and you compare the resulting vectors. This is fast, because document vectors can be computed once at index time, but it can miss nuanced matches. Rerankers instead use a cross-encoder or a specialized interaction mechanism. A cross-encoder takes the query and the document together as input to a single transformer model and produces a direct relevance score. This is typically more accurate because the model "sees" the query and document in context together, but it’s far too slow for initial retrieval across millions of documents, since nothing can be precomputed per document. ColBERT sits in between: it encodes the query and the document separately, like a bi-encoder, but compares them token by token rather than as single vectors. Reranking applies these more expensive comparisons to a small, pre-filtered set.
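A rough count of model forward passes at query time makes the speed argument concrete. The sketch below rests on simplifying assumptions: document embeddings are precomputed at index time, a cross-encoder needs one forward pass per (query, document) pair, and batching and vector-search cost are ignored. The function name and the corpus size are invented for illustration.

```python
from typing import Dict


def query_time_forward_passes(corpus_size: int, top_k: int) -> Dict[str, int]:
    """Count transformer forward passes needed to answer one query."""
    return {
        # Encode the query once; document vectors already exist in the index.
        "bi_encoder_only": 1,
        # Score every (query, document) pair with the cross-encoder.
        "cross_encoder_over_corpus": corpus_size,
        # Encode the query once, then cross-encode only the top_k candidates.
        "retrieve_then_rerank": 1 + top_k,
    }


costs = query_time_forward_passes(corpus_size=1_000_000, top_k=50)
for strategy, passes in costs.items():
    print(f"{strategy}: {passes:,}")
```

Under these assumptions, reranking 50 candidates costs 51 forward passes per query, against one million for cross-encoding the whole corpus.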
The true magic of reranking lies in its ability to disambiguate and prioritize. For example, if your query is "what about pasta shaped like tubes?", a basic retriever might return "Spaghetti is a long, thin, solid, cylindrical pasta" (good match on "cylindrical") and "Macaroni is a dry pasta shaped like narrow tubes" (good match on "tubes"). However, a reranker, especially ColBERT with its token-level interaction, can more precisely identify that "tubes" in the query directly maps to the description of Macaroni and Rigatoni, while "cylindrical" is a related but less direct match for Spaghetti in the context of "shaped like tubes."
The most surprising part about reranking is how much it helps despite being applied to only a few top results. It’s not just about picking the very best from the initial list; the reranker’s fine-grained view of query-document interaction changes the ordering itself. A document that scored 0.85 in initial retrieval might be demoted to 0.70 by the reranker, while another that scored 0.80 might be promoted to 0.92. These numbers are illustrative: the two stages usually score on different scales, so what matters is the change in order rather than the raw values. This shift, even among the top 3-5 results, is what leads to the precision gains. One caveat is that a reranker can only reorder what the retriever hands it, so retrieve more candidates than you plan to keep, as the example does with similarity_top_k=5 and top_n=3.
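One simple way to observe this shift is to compare each document's position before and after reranking, which sidesteps any difference in score scales. The sketch below uses hypothetical orderings for a query about tube-shaped pasta; they are made up for illustration and are not real model output. `rank_changes` is a helper written for this example.

```python
from typing import Dict, List


def rank_changes(before: List[str], after: List[str]) -> Dict[str, int]:
    """Positions gained (+) or lost (-) by each document kept after reranking."""
    before_pos = {doc_id: i for i, doc_id in enumerate(before)}
    return {doc_id: before_pos[doc_id] - i for i, doc_id in enumerate(after)}


# Hypothetical orderings: five retrieved documents, three kept by the reranker.
before = ["spaghetti", "macaroni", "penne", "rigatoni", "lasagna"]
after = ["macaroni", "rigatoni", "spaghetti"]

changes = rank_changes(before, after)
for doc_id, delta in changes.items():
    print(f"{doc_id}: {delta:+d}")
```

In this made-up case Rigatoni climbs two places and Spaghetti drops two, the kind of reordering the paragraph above describes.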
The next step after achieving high precision with reranking is to consider how to synthesize these refined results into a coherent answer, often involving more advanced response generation strategies or even hierarchical retrieval.