Hybrid search combines keyword matching with semantic retrieval so that each covers the other's blind spots, and LangChain makes it surprisingly straightforward to implement.
Let’s see it in action. Imagine you have a collection of documents and you want to find the most relevant ones for a query. Instead of relying on just keyword matching (like BM25) or just understanding the meaning of words (semantic search), hybrid search does both.
Here’s a simplified setup:
# Requires the langchain, langchain-community, langchain-openai, chromadb, and rank_bm25
# packages, plus an OPENAI_API_KEY environment variable.
# Import paths may differ across LangChain versions.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Assume documents are already loaded and split into chunks
# For demonstration, let's create dummy documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast, agile fox leaps across a sleepy canine.",
    "Natural language processing is a subfield of artificial intelligence.",
    "Machine learning algorithms are used in NLP.",
    "The lazy dog slept soundly in the sun.",
]
# 1. Set up Semantic Search (Vector Store)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(documents, embeddings)
retriever_semantic = vectorstore.as_retriever(search_kwargs={"k": 3})
# 2. Set up BM25 Search
# from_documents expects Document objects, so wrap each string
retriever_bm25 = BM25Retriever.from_documents(
    [Document(page_content=doc) for doc in documents]
)
retriever_bm25.k = 3  # return the top 3 keyword matches
# 3. Combine the retrievers
ensemble_retriever = EnsembleRetriever(
    retrievers=[retriever_semantic, retriever_bm25],
    weights=[0.5, 0.5],  # Equal weight for both
)
# 4. Set up a simple chain to use the retriever
prompt = ChatPromptTemplate.from_template(
    "Answer the question based on the following context:\n\n{context}\n\nQuestion: {question}"
)
model = ChatOpenAI()
def format_docs(docs):
    # Join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)
chain = (
    # The query string goes to the retriever (for context) and is passed through as the question
    {"context": ensemble_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
)
# Example Query
query = "What animal is fast?"
response = chain.invoke(query)
print(response.content)
This code demonstrates the core idea: you create two independent retrievers – one for semantic similarity and one for keyword relevance – and then combine their results. The EnsembleRetriever is the key component here, allowing you to specify weights for each retriever, effectively controlling how much influence each type of search has on the final output.
The problem hybrid search solves is the inherent limitation of each individual search method. Pure semantic search can sometimes miss highly specific keywords or jargon that aren’t well-represented in its embeddings. Conversely, pure keyword search (like BM25) can be very brittle; if a user’s query doesn’t use the exact keywords present in the document, even if the meaning is identical, the document might not be retrieved. Hybrid search bridges this gap by leveraging both approaches. The semantic search finds documents that are conceptually similar, while BM25 ensures that documents containing the precise terms are also considered.
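To make the keyword-side brittleness concrete, here is a toy sketch in plain Python. The keyword_overlap function is a hypothetical stand-in for a keyword scorer (far cruder than BM25, with no term weighting or length normalization), but it fails in the same way when a query and a document share meaning without sharing terms.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-letters; a deliberately naive tokenizer."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_overlap(query: str, doc: str) -> int:
    """Hypothetical stand-in for a keyword scorer: count the shared terms."""
    return len(tokenize(query) & tokenize(doc))

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast, agile fox leaps across a sleepy canine.",
]

query = "What animal is fast?"
scores = [keyword_overlap(query, d) for d in docs]
print(scores)  # the "quick brown fox" sentence shares no terms with the query
```

The first sentence describes a quick animal yet scores zero, because "quick" and "fast" are different tokens. An embedding-based retriever is the component expected to close that gap.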
Internally, when you query the ensemble_retriever, it sends the query to both the semantic retriever and the BM25 retriever. Each retriever returns its top-k results as a ranked list. The EnsembleRetriever then merges these lists using weighted Reciprocal Rank Fusion (RRF): every document receives a contribution of weight / (rank + c) from each list it appears in, where c is a smoothing constant (60 by default), and the contributions are summed. Duplicates are collapsed (by page content, unless you set an id_key), and the unique documents are returned sorted by fused score. Note that the fusion works from rank positions, not from the retrievers' raw scores, so BM25 scores and embedding similarities never need to be on a common scale. The weights parameter is your primary lever; adjusting it allows you to tune the balance. A higher weight for the semantic retriever will prioritize conceptual relevance, while a higher weight for BM25 will prioritize exact keyword matches.
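The merging step can be sketched in a few lines of plain Python. This is a minimal illustration of weighted Reciprocal Rank Fusion, the rank-based technique EnsembleRetriever uses, and not LangChain's actual source. The function name weighted_rrf, the string document IDs, and the two sample rankings are all assumptions made for the example.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        # Ranks start at 1; a document absent from a list contributes nothing
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; each document ID appears once
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from each retriever
semantic_ranking = ["doc_b", "doc_a", "doc_e"]
bm25_ranking = ["doc_e", "doc_b", "doc_d"]

balanced = weighted_rrf([semantic_ranking, bm25_ranking], [0.5, 0.5])
keyword_heavy = weighted_rrf([semantic_ranking, bm25_ranking], [0.1, 0.9])
print(balanced)       # ['doc_b', 'doc_e', 'doc_a', 'doc_d']
print(keyword_heavy)  # ['doc_e', 'doc_b', 'doc_d', 'doc_a']
```

With equal weights, doc_b wins because it ranks near the top of both lists. Shifting weight toward the keyword list promotes doc_e, its top hit, and lifts doc_d above doc_a.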
Here’s a less obvious but mechanically crucial point: BM25Retriever.from_documents expects Document objects, not plain strings. If you’re starting from a list of strings, either wrap each one in a Document, as the example does with [Document(page_content=doc) for doc in documents], or call BM25Retriever.from_texts(documents), which accepts the strings directly. Passing raw strings to from_documents fails because the method reads page_content and metadata from each item, and a str has neither attribute, so the call raises an AttributeError.
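If your loading code can produce a mix of strings and document objects, one option is to normalize the input before building the retriever. The sketch below uses a hypothetical Doc dataclass as a stand-in for LangChain's Document so that it runs without any dependencies; ensure_docs is likewise an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def ensure_docs(items):
    """Wrap plain strings in Doc objects; pass existing Doc objects through unchanged."""
    return [item if isinstance(item, Doc) else Doc(page_content=item) for item in items]

mixed = [
    "The lazy dog slept soundly in the sun.",
    Doc("Machine learning algorithms are used in NLP.", {"source": "notes"}),
]
docs = ensure_docs(mixed)
print([d.page_content for d in docs])
```

In real code you would swap Doc for the Document class and pass the normalized list to the retriever's constructor.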
The next logical step is exploring more advanced ensemble strategies beyond the ensemble's built-in weighted fusion, such as using different search_kwargs for each retriever or implementing custom re-ranking logic.