LangChain custom retrievers let you plug in any data source, not just the ones LangChain natively supports, by exposing it through the standard retriever interface: a query string goes in, and a list of Document objects comes out.
Here’s a simple example of a retriever that searches a list of dictionaries based on a keyword. Imagine you have a small, curated dataset of internal company policies.
from typing import Any, Dict, List

from langchain.schema import BaseRetriever, Document


class ListDictRetriever(BaseRetriever):
    data: List[Dict[str, Any]]
    search_key: str = "content"  # The key in your dicts to search against

    def _get_relevant_documents(self, query: str) -> List[Document]:
        results = []
        # Simple case-insensitive substring matching for demonstration
        for item in self.data:
            text = item.get(self.search_key, "")
            if query.lower() in text.lower():
                results.append(Document(page_content=text, metadata=item))
        return results


# Sample data
policy_data = [
    {"id": "1", "title": "Remote Work Policy", "content": "Employees can work remotely up to 2 days a week."},
    {"id": "2", "title": "Expense Reimbursement Policy", "content": "Submit all expenses within 30 days."},
    {"id": "3", "title": "Vacation Policy", "content": "Accrue 2 days of vacation per month."},
    {"id": "4", "title": "Remote Work Security", "content": "Ensure all remote connections are secured via VPN."},
]

# Instantiate and use the retriever
retriever = ListDictRetriever(data=policy_data, search_key="content")
documents = retriever.get_relevant_documents("remote")
for doc in documents:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}")

# Output:
# Content: Employees can work remotely up to 2 days a week.
# Metadata: {'id': '1', 'title': 'Remote Work Policy', 'content': 'Employees can work remotely up to 2 days a week.'}
# Content: Ensure all remote connections are secured via VPN.
# Metadata: {'id': '4', 'title': 'Remote Work Security', 'content': 'Ensure all remote connections are secured via VPN.'}

This ListDictRetriever acts like a search engine for your list of dictionaries. When you ask for documents related to "remote," it iterates through your policy_data, checks whether "remote" appears (ignoring case) in the content field of each dictionary, and if so, wraps that dictionary’s content and metadata into a LangChain Document object. This Document is what LangChain chains understand. Because the match is a plain substring test, a multi-word query such as "remote work" would return nothing here: neither matching policy contains that exact phrase in its content field.
The core problem custom retrievers solve is bridging the gap between your arbitrary data and the retriever interface that LangChain’s chains expect. LangChain’s RetrievalQA chain, for instance, expects to call a .get_relevant_documents(query) method on whatever you give it. If you have data in a SQL database, a CSV file, a custom API, or even just a structured list like in the example, you can build a retriever that translates your data access logic into this expected interface.
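As a rough illustration of that translation step, the sketch below puts a CSV source behind the same method name. It avoids importing LangChain so it can run on its own: the Document dataclass is a stand-in for LangChain’s class, and CsvRetriever, its text_column parameter, and the sample rows are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for LangChain's Document (assumption: same two fields)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class CsvRetriever:
    """Hypothetical retriever over CSV text; not part of LangChain."""

    def __init__(self, csv_text: str, text_column: str = "content") -> None:
        self.rows = list(csv.DictReader(io.StringIO(csv_text)))
        self.text_column = text_column

    def get_relevant_documents(self, query: str) -> List[Document]:
        needle = query.lower()
        docs = []
        for row in self.rows:
            text = row.get(self.text_column) or ""
            if needle in text.lower():
                # Every column other than the searched one becomes metadata.
                meta = {k: v for k, v in row.items() if k != self.text_column}
                docs.append(Document(page_content=text, metadata=meta))
        return docs


SAMPLE_CSV = """id,title,content
1,Remote Work Policy,Employees can work remotely up to 2 days a week.
2,Vacation Policy,Accrue 2 days of vacation per month.
"""

csv_docs = CsvRetriever(SAMPLE_CSV).get_relevant_documents("vacation")
for doc in csv_docs:
    print(doc.page_content, doc.metadata)
```

In a real LangChain retriever the same loop would live in _get_relevant_documents on a BaseRetriever subclass; only the data access changes.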
Internally, a custom retriever needs to inherit from langchain.schema.BaseRetriever and implement the _get_relevant_documents method. This method takes a query string (which is typically a natural language question or keywords) and must return a List[Document]. A Document object has a page_content (the text chunk) and metadata (a dictionary for additional information like source, ID, etc.).
The BaseRetriever class provides a public get_relevant_documents method that handles shared plumbing such as callbacks, while your custom logic lives in _get_relevant_documents. For vector stores, this method would typically involve converting the query into an embedding and then performing a similarity search against the stored embeddings. But for non-vector sources, you implement whatever search mechanism is appropriate.
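The split between the public method and the underscore-prefixed one follows the template-method pattern. Here is a minimal sketch of the idea, not LangChain’s actual implementation: SketchRetriever, the on_start and on_end hooks, and KeywordRetriever are hypothetical names, and plain strings stand in for Document objects.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Optional


class SketchRetriever(ABC):
    """Hypothetical base class illustrating the public/private split."""

    def __init__(
        self,
        on_start: Optional[Callable[[str], None]] = None,
        on_end: Optional[Callable[[List[str]], None]] = None,
    ) -> None:
        self.on_start = on_start
        self.on_end = on_end

    def get_relevant_documents(self, query: str) -> List[str]:
        # Shared plumbing lives here, so subclasses never repeat it.
        if self.on_start:
            self.on_start(query)
        docs = self._get_relevant_documents(query)
        if self.on_end:
            self.on_end(docs)
        return docs

    @abstractmethod
    def _get_relevant_documents(self, query: str) -> List[str]:
        """Subclasses put their search logic here."""


class KeywordRetriever(SketchRetriever):
    def __init__(self, texts: List[str], **hooks) -> None:
        super().__init__(**hooks)
        self.texts = texts

    def _get_relevant_documents(self, query: str) -> List[str]:
        return [t for t in self.texts if query.lower() in t.lower()]


events: List[str] = []
sketch = KeywordRetriever(
    ["Submit all expenses within 30 days.", "Accrue 2 days of vacation per month."],
    on_start=lambda q: events.append(f"start:{q}"),
    on_end=lambda docs: events.append(f"end:{len(docs)}"),
)
found = sketch.get_relevant_documents("expenses")
```

Callers only ever touch the public method, so the base class can add behavior around every retriever without any subclass changing.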
Here’s how you might extend this to integrate with a SQL database, assuming you have a table named documents with id, content, and metadata columns.
import json
import sqlite3  # Example for SQLite
from typing import List

from langchain.schema import BaseRetriever, Document


class SQLRetriever(BaseRetriever):
    db_path: str
    # table_name and search_column are interpolated into the SQL text,
    # so they must come from trusted configuration, never from user input.
    table_name: str = "documents"
    search_column: str = "content"

    def _get_relevant_documents(self, query: str) -> List[Document]:
        # Basic LIKE query for demonstration. In a real app, you'd use full-text search.
        sql_query = (
            f"SELECT id, content, metadata FROM {self.table_name} "
            f"WHERE {self.search_column} LIKE ?"
        )
        conn = sqlite3.connect(self.db_path)
        try:
            # Pass the SQL and the bound parameter tuple as separate arguments
            rows = conn.execute(sql_query, (f"%{query}%",)).fetchall()
        finally:
            conn.close()

        results = []
        for doc_id, content, metadata_str in rows:
            # Assuming metadata is stored as a JSON string
            metadata = json.loads(metadata_str) if metadata_str else {}
            results.append(Document(page_content=content, metadata={"id": doc_id, **metadata}))
        return results


# Example usage (requires a dummy SQLite DB)
# Create a dummy DB for testing
# conn = sqlite3.connect("my_docs.db")
# cursor = conn.cursor()
# cursor.execute("CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, content TEXT, metadata TEXT)")
# cursor.execute("INSERT INTO documents (content, metadata) VALUES (?, ?)", ("This is about AI.", '{"source": "wiki"}'))
# cursor.execute("INSERT INTO documents (content, metadata) VALUES (?, ?)", ("Learn about machine learning.", '{"source": "blog"}'))
# conn.commit()
# conn.close()
# sql_retriever = SQLRetriever(db_path="my_docs.db")
# docs = sql_retriever.get_relevant_documents("AI")
# for doc in docs:
#     print(f"Content: {doc.page_content}\nMetadata: {doc.metadata}\n")
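One caveat with the LIKE approach is that % and _ in a user’s query are treated as wildcards. Here is a small sketch of one way to neutralize them with SQLite’s ESCAPE clause, using an in-memory database; escape_like and like_search are hypothetical helper names, and the rows are made up for the demo.

```python
import sqlite3
from typing import List


def escape_like(term: str, escape_char: str = "\\") -> str:
    """Escape LIKE wildcards so the term is matched literally."""
    return (
        term.replace(escape_char, escape_char * 2)
        .replace("%", escape_char + "%")
        .replace("_", escape_char + "_")
    )


def like_search(conn: sqlite3.Connection, term: str) -> List[str]:
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        "SELECT content FROM documents WHERE content LIKE ? ESCAPE '\\' ORDER BY id",
        (pattern,),
    ).fetchall()
    return [content for (content,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO documents (content) VALUES (?)",
    [("Discount is 100% off.",), ("Discount is 100 dollars.",)],
)
literal_hits = like_search(conn, "100%")
conn.close()
print(literal_hits)
```

Without the escaping step, the trailing % in the query would act as a wildcard and both rows would match.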
The critical part here is _get_relevant_documents. For a vector store like Chroma or FAISS, this method would look very different. It would involve creating an embedding for the query using the same embedding model used for indexing, and then querying the vector store’s index for the nearest neighbors. For example, with FAISS:
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import FAISS

# Assuming you have a FAISS index saved locally
# If not, you'd first build it:
# docs = [Document(page_content="..."), ...]
# embeddings = OpenAIEmbeddings()
# vectorstore = FAISS.from_documents(docs, embeddings)
# vectorstore.save_local("faiss_index")

# Load the index
embeddings = OpenAIEmbeddings()  # Must match the embeddings used for indexing
# Note: newer LangChain releases may also require allow_dangerous_deserialization=True here
vectorstore = FAISS.load_local("faiss_index", embeddings)

# The vector store is not itself a retriever, but it exposes similarity_search,
# which embeds the query and returns the nearest stored documents
query = "What is the capital of France?"
retrieved_docs = vectorstore.similarity_search(query, k=3)  # k is the number of docs to return

# To use it where a chain expects a retriever, wrap it with as_retriever()
faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs_from_retriever = faiss_retriever.get_relevant_documents(query)
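To make the embed-then-search step concrete, here is a toy sketch of nearest-neighbor retrieval by cosine similarity. It is not how FAISS is implemented: the bag-of-words toy_embed function stands in for a real embedding model, and a brute-force scan stands in for an index. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Toy stand-in for an embedding model: lowercase word counts."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    return dict(Counter(w for w in words if w))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, texts: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Score every text against the query and keep the k best."""
    q_vec = toy_embed(query)  # the query uses the same embedding as the corpus
    scored = [(cosine(q_vec, toy_embed(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "FAISS performs similarity search.",
]
best = top_k("What is the capital of France?", corpus, k=2)
for score, text in best:
    print(f"{score:.3f}  {text}")
```

A real vector store replaces the word counts with dense model embeddings and the linear scan with an approximate or exact index, but the ranking idea is the same.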
The most surprising thing about building custom retrievers is that you don’t have to perform similarity search; you just need to return Document objects that are relevant to the query. This opens the door to any data source that can be queried, even if it’s just a simple lookup or a complex multi-stage process.
For example, imagine a retriever that queries a knowledge graph. It would take the query, parse it to identify entities and relationships, query the graph (e.g., using SPARQL), and then construct Document objects from the graph query results. The page_content could be a textual description of the graph triple, and the metadata could store the entities and relationship types.
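Here is a sketch of that flow, with heavy simplifications: an in-memory list of triples stands in for the graph store, naive substring matching of entity names stands in for entity parsing, and no SPARQL is involved. Document is a stand-in dataclass, and every name and triple here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


@dataclass
class Document:
    """Stand-in for LangChain's Document (assumption: same two fields)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class TripleRetriever:
    """Hypothetical retriever over an in-memory list of triples."""

    def __init__(self, triples: List[Triple]) -> None:
        self.triples = triples

    def get_relevant_documents(self, query: str) -> List[Document]:
        q = query.lower()
        docs = []
        for subj, rel, obj in self.triples:
            # "Entity linking" here is just: is the entity named in the query?
            if subj.lower() in q or obj.lower() in q:
                docs.append(
                    Document(
                        page_content=f"{subj} {rel.replace('_', ' ')} {obj}.",
                        metadata={"subject": subj, "relation": rel, "object": obj},
                    )
                )
        return docs


graph = [
    ("Alice", "manages", "Platform Team"),
    ("Platform Team", "owns", "Billing Service"),
    ("Bob", "reports_to", "Alice"),
]
kg_docs = TripleRetriever(graph).get_relevant_documents("Who manages the Platform Team?")
for doc in kg_docs:
    print(doc.page_content, doc.metadata)
```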
The final piece of the puzzle is how these retrievers integrate into LangChain applications. They are most commonly used with chains like RetrievalQA or ConversationalRetrievalChain. You simply pass an instance of your custom retriever to the retriever argument of these chain constructors. LangChain then handles calling retriever.get_relevant_documents(query) internally when it needs to fetch context for the LLM.
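Conceptually, a retrieval chain does something like the following with whatever retriever it is given. This is a simplified sketch, not LangChain’s internals: answer_with_context, fake_llm, and StaticRetriever are hypothetical, plain strings stand in for Document objects, and the prompt wording is made up.

```python
from typing import Callable, List, Protocol


class Retriever(Protocol):
    def get_relevant_documents(self, query: str) -> List[str]: ...


class StaticRetriever:
    """Hypothetical retriever that returns the same texts for any query."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = texts

    def get_relevant_documents(self, query: str) -> List[str]:
        return list(self.texts)


def answer_with_context(
    question: str, retriever: Retriever, llm: Callable[[str], str]
) -> str:
    # 1. Fetch context through the one method the chain relies on.
    docs = retriever.get_relevant_documents(question)
    # 2. Stuff the retrieved text into a prompt.
    context = "\n\n".join(docs)
    prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    # 3. Hand the prompt to the model.
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call; it just reports the prompt size."""
    return f"(prompt had {len(prompt.splitlines())} lines)"


reply = answer_with_context(
    "How many remote days are allowed?",
    StaticRetriever(["Employees can work remotely up to 2 days a week."]),
    fake_llm,
)
print(reply)
```

Because the chain depends only on that one method, any object that provides it can be swapped in, whether it is backed by a list, a database, or a vector index.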
When you implement a custom retriever for a data source that doesn’t have inherent semantic similarity (like a simple SQL LIKE query or a dictionary lookup), the "relevance" is defined by your query logic. LangChain doesn’t enforce that you use embeddings; it only requires that your retriever returns documents that are likely to help answer the user’s query.
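As one illustration of relevance defined by your own logic, the sketch below ranks texts by how many distinct query terms they contain and keeps the top k. The scoring rule, rank_by_term_overlap, and the k parameter are assumptions chosen for the example; plain strings stand in for Document objects.

```python
import re
from typing import List


def terms(text: str) -> List[str]:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_term_overlap(query: str, texts: List[str], k: int = 2) -> List[str]:
    query_terms = set(terms(query))
    scored = []
    for position, text in enumerate(texts):
        # Count distinct query terms that occur as substrings of the text.
        lowered = text.lower()
        score = sum(1 for term in query_terms if term in lowered)
        if score:
            # Negative score sorts best-first; position keeps ties stable.
            scored.append((-score, position, text))
    scored.sort()
    return [text for _, _, text in scored[:k]]


policies = [
    "Submit all expenses within 30 days.",
    "Ensure all remote connections are secured via VPN.",
    "Employees can work remotely up to 2 days a week.",
]
ranked = rank_by_term_overlap("remote work", policies)
for text in ranked:
    print(text)
```

Texts that match no query term are dropped entirely, which is itself a relevance decision you would tune for your own data.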
The next hurdle you’ll encounter is managing the performance and accuracy of custom retrievers, especially as data sources become more complex.