Metadata filters in LlamaIndex are how you tell your retrieval system to only look at a specific subset of your documents, making your searches faster and more accurate.
Let’s see this in action. Imagine you have a bunch of documents about different companies, and each document has metadata like company_name, industry, and year.
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.vector_stores import (
    ExactMatchFilter,
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

# Assume you have documents in a directory named 'data'
# Each document has metadata associated with it.
# For example, a document might look like:
# {"text": "...", "metadata": {"company_name": "Acme Corp", "industry": "Manufacturing", "year": 2023}}
# Note: SimpleDirectoryReader only attaches file-level metadata by default, so
# custom fields such as company_name must be added before indexing, e.g. via
# its file_metadata callable or by setting doc.metadata on each document.

# Load a persisted index, or build and persist one if none can be loaded
try:
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)
except Exception:
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir="./storage")

# Define a query
query = "What were the main challenges faced by the company last year?"

# --- Scenario 1: Retrieve from all documents ---
print("--- Retrieving from all documents ---")
retriever_all = index.as_retriever()
nodes_all = retriever_all.retrieve(query)
print(f"Found {len(nodes_all)} nodes.")
# For simplicity, we'll just print the metadata of the first node if found
if nodes_all:
    print(f"Example metadata: {nodes_all[0].metadata}")

# --- Scenario 2: Retrieve only from documents related to 'Acme Corp' ---
print("\n--- Retrieving only from 'Acme Corp' documents ---")
filters_acme = MetadataFilters(filters=[
    ExactMatchFilter(key="company_name", value="Acme Corp"),
])
retriever_acme = index.as_retriever(filters=filters_acme)
nodes_acme = retriever_acme.retrieve(query)
print(f"Found {len(nodes_acme)} nodes for Acme Corp.")
if nodes_acme:
    print(f"Example metadata: {nodes_acme[0].metadata}")

# --- Scenario 3: Retrieve only from documents in the 'Technology' industry in year 2022 ---
print("\n--- Retrieving only from 'Technology' industry, year 2022 ---")
filters_tech_2022 = MetadataFilters(filters=[
    ExactMatchFilter(key="industry", value="Technology"),
    ExactMatchFilter(key="year", value=2022),
])
retriever_tech_2022 = index.as_retriever(filters=filters_tech_2022)
nodes_tech_2022 = retriever_tech_2022.retrieve(query)
print(f"Found {len(nodes_tech_2022)} nodes for Technology in 2022.")
if nodes_tech_2022:
    print(f"Example metadata: {nodes_tech_2022[0].metadata}")

# --- Scenario 4: Using an operator for numerical comparison ---
# MetadataFilter takes an operator; FilterCondition is the AND/OR enum, not a filter.
print("\n--- Retrieving documents with year greater than 2021 ---")
filters_gt_2021 = MetadataFilters(filters=[
    MetadataFilter(key="year", value=2021, operator=FilterOperator.GT),
])
retriever_gt_2021 = index.as_retriever(filters=filters_gt_2021)
nodes_gt_2021 = retriever_gt_2021.retrieve(query)
print(f"Found {len(nodes_gt_2021)} nodes with year > 2021.")
if nodes_gt_2021:
    print(f"Example metadata: {nodes_gt_2021[0].metadata}")
This code builds a retriever that is aware of document metadata. Without filters, a query like "What were the main challenges faced by the company last year?" is matched against every node in the index. With filters, you can shrink the search space considerably. For instance, ExactMatchFilter(key="company_name", value="Acme Corp") tells LlamaIndex to consider only nodes whose company_name metadata field is exactly "Acme Corp". Similarly, the year filter in Scenario 4 (value=2021 with operator=FilterOperator.GT) restricts results to nodes whose year is strictly greater than 2021. Where the vector store supports it, the filters are applied as part of the vector query rather than afterwards, so the similarity ranking only covers nodes that already match, and the LLM downstream only receives context drawn from that subset. The usual payoff is fewer irrelevant results and, on large indexes, faster queries.
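To make the "filter first, then rank" idea concrete, here is a minimal plain-Python sketch. It is a conceptual model only, not LlamaIndex's implementation: the toy NODES list, the two-dimensional vectors, and the retrieve helper are all hypothetical stand-ins for an index, its embeddings, and a retriever.

```python
import math

# Toy corpus: each entry stands in for an indexed node (vector + metadata).
NODES = [
    {"id": "a1", "vec": [0.9, 0.1], "metadata": {"company_name": "Acme Corp", "year": 2023}},
    {"id": "a2", "vec": [0.2, 0.8], "metadata": {"company_name": "Acme Corp", "year": 2021}},
    {"id": "g1", "vec": [1.0, 0.0], "metadata": {"company_name": "Globex", "year": 2023}},
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def retrieve(query_vec, nodes, predicate=None, top_k=2):
    """Drop nodes that fail the metadata predicate, then rank the rest by similarity."""
    candidates = [n for n in nodes if predicate is None or predicate(n["metadata"])]
    ranked = sorted(candidates, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)
    return [n["id"] for n in ranked[:top_k]]


query_vec = [1.0, 0.0]
unfiltered = retrieve(query_vec, NODES)
acme_only = retrieve(query_vec, NODES, lambda m: m["company_name"] == "Acme Corp")
recent = retrieve(query_vec, NODES, lambda m: m["year"] > 2021)
print(unfiltered, acme_only, recent)
```

Note how the Globex node is the closest match overall, yet it never competes once the company_name predicate is in place: the filter changes which nodes are ranked, not how they are scored.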
The core problem metadata filters solve is scalability and relevance. As your knowledge base grows, searching through every single document becomes computationally expensive and prone to returning noisy, irrelevant results. Pre-filtering narrows the candidate set to documents that match your criteria, and the similarity search then ranks only those candidates against the query. This matters most in applications where the right context is non-negotiable, like customer support bots that need to pull up a specific account's information or legal research tools that must stick to particular document types and dates.
Internally, LlamaIndex integrates these filters with the underlying vector store. When you define MetadataFilters, LlamaIndex translates them into the appropriate query mechanism for your chosen vector database (e.g., Pinecone, Weaviate, Chroma, or the default in-memory store). For many vector databases, this means the filtering happens at the database level, often using dedicated indexes on metadata fields. This is significantly more efficient than retrieving all potential matches and then filtering them in Python.
The MetadataFilters object holds a list of filter objects. ExactMatchFilter is straightforward: it checks for equality. MetadataFilter is more general: it takes an operator from the FilterOperator enum, such as GT (greater than), LT (less than), GTE (greater than or equal to), LTE (less than or equal to), NE (not equal to), IN and NIN (membership in a list), CONTAINS, and TEXT_MATCH. Not every vector store supports every operator, so check the integration you are using. Despite its name, FilterCondition is not a filter: it is the enum (AND, OR) passed as the condition argument of MetadataFilters to control how the filters are combined. The default is AND, meaning all specified conditions must be met.
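The semantics of comparison operators and of combining filters can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not LlamaIndex code: the OPS table, the (key, op, value) tuple format, and the matches helper are hypothetical, and treating a missing key as a non-match is a choice made for this sketch.

```python
import operator

# Hypothetical operator table; it mirrors the idea of comparison operators
# such as GT or LTE but is not part of LlamaIndex.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda field, allowed: field in allowed,
}


def matches(metadata, filters, condition="and"):
    """Return True if metadata satisfies the (key, op, value) filters.

    condition="and" requires every filter to pass; "or" requires at least one.
    A filter on a key that is absent from metadata counts as a non-match.
    """
    results = (
        key in metadata and OPS[op](metadata[key], value)
        for key, op, value in filters
    )
    return all(results) if condition == "and" else any(results)


doc_meta = {"industry": "Technology", "year": 2022}
both = matches(doc_meta, [("industry", "==", "Technology"), ("year", ">", 2021)])
strict = matches(doc_meta, [("industry", "==", "Technology"), ("year", ">", 2022)])
either = matches(doc_meta, [("industry", "==", "Technology"), ("year", ">", 2022)], condition="or")
missing = matches(doc_meta, [("region", "==", "EU")])
print(both, strict, either, missing)
```

A real vector store would compile such conditions into its own query language instead of evaluating them node by node, which is where the efficiency gain comes from.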
A subtle but important aspect of metadata filtering is how it interacts with document chunking. If you're filtering by year, the year metadata must be captured accurately and stored at indexing time; a filter cannot match a field that was never attached. When a document is chunked, each chunk (node) typically inherits the document's metadata. However, if your filtering logic relies on something that varies within a document (e.g., which company a particular paragraph mentions), document-level metadata is too coarse, and you need chunk-level metadata or a different retrieval approach altogether. Filters operate on the metadata attached to the Node objects in the index, not on the source documents.
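The difference between inherited document metadata and chunk-level metadata can be sketched as follows. This is a simplified stand-in for what a node parser does, assuming naive sentence splitting; KNOWN_COMPANIES, chunk_document, and the mentions field are hypothetical names introduced for illustration.

```python
KNOWN_COMPANIES = ["Acme Corp", "Globex"]  # hypothetical lookup list


def chunk_document(text, metadata):
    """Split text into sentence chunks; each chunk inherits a copy of the document metadata.

    A chunk-specific "mentions" field is added so filters can target content
    that varies within a single document.
    """
    chunks = []
    sentences = [s for s in text.split(". ") if s]
    for index, sentence in enumerate(sentences):
        chunk_meta = dict(metadata, chunk_index=index)  # copy, so chunks do not share state
        chunk_meta["mentions"] = [c for c in KNOWN_COMPANIES if c in sentence]
        chunks.append({"text": sentence, "metadata": chunk_meta})
    return chunks


doc_text = "Acme Corp grew in 2023. Later sections discuss Globex and its suppliers."
chunks = chunk_document(doc_text, {"company_name": "Acme Corp", "year": 2023})

# Document-level field: every chunk matches, including the one about Globex.
by_document = [c["metadata"]["chunk_index"] for c in chunks
               if c["metadata"]["company_name"] == "Acme Corp"]

# Chunk-level field: only the chunk that actually mentions Globex matches.
by_mention = [c["metadata"]["chunk_index"] for c in chunks
              if "Globex" in c["metadata"]["mentions"]]
print(by_document, by_mention)
```

Filtering on company_name returns both chunks because the value was inherited from the document, while filtering on the per-chunk mentions field isolates the one relevant chunk.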
The next step in refining retrieval is often more complex logical combinations, such as passing condition=FilterCondition.OR to MetadataFilters or nesting filter groups where your LlamaIndex version and vector store support it, or applying custom filtering logic after an initial retrieval.
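For the last option, custom filtering after an initial retrieval, a rough sketch might look like this. The hits list imitates retriever output already sorted by score, and post_filter and tech_or_pre_2021 are hypothetical helpers; the example assumes you deliberately fetch more results than you need so that enough survive the filter.

```python
def post_filter(ranked_hits, predicate, top_k):
    """Keep the first top_k hits (already sorted by score) whose metadata passes predicate."""
    kept = [hit for hit in ranked_hits if predicate(hit["metadata"])]
    return kept[:top_k]


# Stand-in for retriever output, sorted by descending similarity score.
hits = [
    {"id": "n1", "score": 0.93, "metadata": {"industry": "Technology", "year": 2022}},
    {"id": "n2", "score": 0.90, "metadata": {"industry": "Manufacturing", "year": 2023}},
    {"id": "n3", "score": 0.84, "metadata": {"industry": "Finance", "year": 2020}},
    {"id": "n4", "score": 0.80, "metadata": {"industry": "Technology", "year": 2023}},
]


def tech_or_pre_2021(meta):
    """Custom rule: Technology documents, or anything published before 2021."""
    return meta["industry"] == "Technology" or meta["year"] < 2021


narrow = [h["id"] for h in post_filter(hits[:2], tech_or_pre_2021, top_k=2)]  # fetch 2, then filter
wide = [h["id"] for h in post_filter(hits, tech_or_pre_2021, top_k=2)]        # fetch 4, then filter
print(narrow, wide)
```

The trade-off is visible in the two results: filtering a small fetch can leave you with fewer hits than requested, so post-filtering usually means retrieving a larger candidate set and paying for the extra similarity work.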