The LangChain Self-Query Retriever is less about "self-querying" and more about letting an LLM turn a natural language question into a structured query: a search string plus metadata filters.
Let’s see it in action. Imagine you have a bunch of documents about products, and you want to ask questions like:
"Show me laptops that cost less than $1000 and have at least 16GB of RAM."
Here’s a simplified setup:
# Requires an OpenAI API key in OPENAI_API_KEY, plus the chromadb and lark packages.
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chains.query_constructor.ir import Comparator, Operator

# Sample documents
docs = [
    Document(
        page_content="This is a powerful gaming laptop with a high refresh rate screen and an RTX 4080 GPU. It costs $1800.",
        metadata={"product_type": "laptop", "price": 1800, "ram_gb": 32, "gpu": "RTX 4080"},
    ),
    Document(
        page_content="A budget-friendly ultrabook perfect for students. It has 8GB of RAM and a solid-state drive. Price: $750.",
        metadata={"product_type": "laptop", "price": 750, "ram_gb": 8, "gpu": "Integrated"},
    ),
    Document(
        page_content="High-performance workstation laptop with 64GB RAM and a professional Quadro GPU. Ideal for engineers. Price: $3500.",
        metadata={"product_type": "laptop", "price": 3500, "ram_gb": 64, "gpu": "Quadro"},
    ),
    Document(
        page_content="A compact and lightweight laptop for everyday tasks. Features 16GB RAM and a very reasonable price of $950.",
        metadata={"product_type": "laptop", "price": 950, "ram_gb": 16, "gpu": "Integrated"},
    ),
    Document(
        page_content="This is a top-tier gaming desktop with an i9 processor and an RTX 4090. It costs $2500.",
        metadata={"product_type": "desktop", "price": 2500, "ram_gb": 32, "gpu": "RTX 4090"},
    ),
]

# Set up the LLM and embeddings
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
embeddings = OpenAIEmbeddings()

# Create a Chroma vector store
vectorstore = Chroma.from_documents(docs, embeddings)

# Describe the metadata fields: name, meaning, and type
metadata_field_info = [
    AttributeInfo(name="product_type", description="The kind of product, e.g. 'laptop' or 'desktop'", type="string"),
    AttributeInfo(name="price", description="The price in US dollars", type="integer"),
    AttributeInfo(name="ram_gb", description="The amount of RAM in gigabytes", type="integer"),
    AttributeInfo(name="gpu", description="The GPU model", type="string"),
]

# A short description of what the documents contain
document_content_description = "Short description of a computer product"

# Restrict the comparators and operators the LLM may use.
# GTE is included because "at least 16GB" needs greater-than-or-equal.
allowed_comparators = [Comparator.EQ, Comparator.GT, Comparator.GTE, Comparator.LT]
allowed_operators = [Operator.AND]

# Instantiate the SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectorstore,
    document_content_description,
    metadata_field_info,
    chain_kwargs={
        "allowed_comparators": allowed_comparators,
        "allowed_operators": allowed_operators,
    },
    verbose=True,
)

# Now, let's query!
query = "Show me laptops that cost less than $1000 and have at least 16GB of RAM."
results = retriever.invoke(query)
for doc in results:
    print(f"Content: {doc.page_content}")
    print(f"Metadata: {doc.metadata}\n")
When you run this, the SelfQueryRetriever passes your natural language query to the LLM. The LLM, given the metadata field descriptions and the allowed comparators and operators, doesn’t try to answer the question. Instead, it constructs a structured query, a search string plus a metadata filter, that LangChain translates into a filter for the vector store.
You’ll see output similar to this (the reasoning trace is illustrative; what verbose mode actually prints depends on the LangChain version and the model):
[LLM_THOUGHT]
The user is asking for laptops with a price less than $1000 and at least 16GB of RAM.
I need to filter the documents based on these criteria.
The metadata fields available are: product_type (String), price (Integer), ram_gb (Integer), gpu (String).
The allowed comparators are EQ, GT, GTE, and LT; the allowed operator is AND.
I will construct a query that filters by:
- product_type EQUAL TO "laptop"
- price LESS THAN 1000
- ram_gb GREATER THAN OR EQUAL TO 16
This translates to an AND operation combining these predicates.
[/LLM_THOUGHT]
Content: A compact and lightweight laptop for everyday tasks. Features 16GB RAM and a very reasonable price of $950.
Metadata: {'product_type': 'laptop', 'price': 950, 'ram_gb': 16, 'gpu': 'Integrated'}
Notice that the LLM correctly identified the relevant metadata fields, the desired comparisons (< for price, >= for RAM), and even implicitly added a filter for product_type="laptop" because the question was about "laptops." It then combined these into an AND operation. The retriever then uses this structured query to efficiently fetch only the relevant documents from the vector store.
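To check which documents should survive the filter, the three conditions can be applied by hand to the sample metadata. This is a plain-Python sketch of the filter's logic, not what LangChain or Chroma run internally; the `matches` helper and the `catalog` list are hypothetical names introduced here.

```python
# Metadata copied from the sample documents in the setup code.
catalog = [
    {"product_type": "laptop", "price": 1800, "ram_gb": 32, "gpu": "RTX 4080"},
    {"product_type": "laptop", "price": 750, "ram_gb": 8, "gpu": "Integrated"},
    {"product_type": "laptop", "price": 3500, "ram_gb": 64, "gpu": "Quadro"},
    {"product_type": "laptop", "price": 950, "ram_gb": 16, "gpu": "Integrated"},
    {"product_type": "desktop", "price": 2500, "ram_gb": 32, "gpu": "RTX 4090"},
]


def matches(meta: dict) -> bool:
    """Hypothetical helper: the AND of the three conditions in the query."""
    return (
        meta["product_type"] == "laptop"
        and meta["price"] < 1000
        and meta["ram_gb"] >= 16
    )


matching = [meta for meta in catalog if matches(meta)]
print(matching)
```

Only the $950 laptop with 16GB of RAM passes all three conditions. The $750 ultrabook is cheap enough but fails the RAM condition, so a correctly applied filter should exclude it.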
The core problem this solves is bridging the gap between human language and structured data retrieval. Instead of users having to learn specific query languages or complex boolean logic for filtering, they can simply ask questions naturally. The LLM acts as a translator, converting fuzzy natural language into precise, executable filter conditions.
Internally, the SelfQueryRetriever uses LangChain’s query constructor chain. The LLM’s raw output is text, which an output parser converts into a StructuredQuery object holding the search string and a filter tree. The tree is built from Comparison nodes (an attribute, a comparator such as Comparator.LT, and a value, as in price < 1000) and Operation nodes (an operator such as Operator.AND applied to child filters). A translator for the specific vector store then converts the tree into that store’s native filter syntax, and the store performs the actual filtering.
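The idea of a filter tree that gets translated for a particular store can be sketched without LangChain at all. The `Comparison` and `Operation` dataclasses below are simplified stand-ins written for this example, not LangChain's own classes, and `to_where` is a hypothetical translator that emits a dictionary in the style of Chroma's `where` filters.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Comparison:
    """Leaf node: compare one metadata attribute against a value."""
    comparator: str  # e.g. "eq", "lt", "gte"
    attribute: str
    value: Union[str, int, float]


@dataclass(frozen=True)
class Operation:
    """Inner node: combine child filters with a logical operator."""
    operator: str  # e.g. "and", "or"
    arguments: tuple


def to_where(node) -> dict:
    """Translate the tree into a Chroma-style `where` dictionary."""
    if isinstance(node, Comparison):
        return {node.attribute: {f"${node.comparator}": node.value}}
    if isinstance(node, Operation):
        return {f"${node.operator}": [to_where(arg) for arg in node.arguments]}
    raise TypeError(f"unsupported node: {node!r}")


# The filter from the laptop question, written out as a tree.
query_filter = Operation(
    "and",
    (
        Comparison("eq", "product_type", "laptop"),
        Comparison("lt", "price", 1000),
        Comparison("gte", "ram_gb", 16),
    ),
)
where = to_where(query_filter)
print(where)
```

Keeping the tree separate from the store-specific syntax is what lets one retriever work against different vector stores: only the translation step changes.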
The levers you control are primarily:
metadata_field_info: This tells the LLM which metadata fields exist, what they mean, and what their data types are (for example string, integer, float, boolean, or date). The LLM needs this to know what it can filter on and how to interpret values.
Allowed comparators and operators: These restrict the LLM to specific comparison operators (e.g., EQ, NE, GT, GTE, LT, LTE, CONTAIN, LIKE) and logical operators (AND, OR, NOT). By default they follow what the vector store’s translator supports; narrowing them further is a safety and predictability measure that keeps the LLM from generating unsupported query logic.
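The effect of restricting comparators and operators can be sketched as a small validation pass over a generated filter. This is an illustration under assumptions, not LangChain's parser: the dictionary layout of a filter node and the `check_filter` function are hypothetical.

```python
ALLOWED_COMPARATORS = {"eq", "gt", "gte", "lt"}
ALLOWED_OPERATORS = {"and"}


def check_filter(node: dict) -> None:
    """Raise ValueError if the filter uses anything outside the allow-lists.

    A node is either {"operator": ..., "arguments": [...]} or
    {"comparator": ..., "attribute": ..., "value": ...}.
    """
    if "operator" in node:
        if node["operator"] not in ALLOWED_OPERATORS:
            raise ValueError(f"operator not allowed: {node['operator']}")
        for child in node["arguments"]:
            check_filter(child)
    elif node["comparator"] not in ALLOWED_COMPARATORS:
        raise ValueError(f"comparator not allowed: {node['comparator']}")


ok = {"operator": "and", "arguments": [
    {"comparator": "lt", "attribute": "price", "value": 1000},
    {"comparator": "gte", "attribute": "ram_gb", "value": 16},
]}
bad = {"operator": "or", "arguments": [
    {"comparator": "lt", "attribute": "price", "value": 1000},
]}

check_filter(ok)  # passes silently

try:
    check_filter(bad)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

Rejecting a filter outright, rather than silently dropping the offending part, avoids returning results that look filtered but are not.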
The most surprising part is how the LLM can infer implicit filters. In the example, the query was "Show me laptops…". The LLM didn’t need an explicit instruction to filter product_type="laptop"; it inferred it from the context of the question itself. This makes it incredibly powerful for dynamic filtering where the user’s intent might not perfectly map to explicit metadata values.
The next step often involves chaining this retriever with a document question-answering chain, so the LLM answers specific questions based on the filtered results.
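As a rough sketch of that chaining step, the retrieved passages can be placed in front of the question before the model is asked to answer. The functions below are hypothetical and use stand-ins for the retriever and the model so the example runs offline; in a real pipeline `retrieve` would wrap `retriever.invoke(...)` and return each document's `page_content`, and `generate` would call the chat model.

```python
from typing import Callable, List


def build_prompt(question: str, passages: List[str]) -> str:
    """Put the retrieved passages in front of the question."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve filtered passages, then ask the model to answer from them."""
    return generate(build_prompt(question, retrieve(question)))


# Stand-ins so the sketch runs without network access.
def fake_retrieve(question: str) -> List[str]:
    return [
        "A compact and lightweight laptop for everyday tasks. "
        "Features 16GB RAM and a very reasonable price of $950."
    ]


def fake_generate(prompt: str) -> str:
    return f"(model reply to a {len(prompt)}-character prompt)"


question = "Which laptop fits my budget?"
prompt = build_prompt(question, fake_retrieve(question))
print(prompt)
print(answer(question, fake_retrieve, fake_generate))
```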