Query engines don’t just fetch data; they actively negotiate with your indexes to get the best possible answers.

Let’s see how this plays out. Imagine you have a document about "The History of Coffee" and you ask a LlamaIndex query engine: "What are the origins of coffee?"

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.settings import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Load documents (replace with your actual document path)
documents = SimpleDirectoryReader("./data").load_data()

# Configure settings (using OpenAI for embeddings and LLM)
Settings.embed_model = OpenAIEmbedding()
Settings.llm = OpenAI(model="gpt-3.5-turbo")

# Build the index
index = VectorStoreIndex.from_documents(documents)

# Create a query engine with default retriever settings
query_engine = index.as_query_engine()

# Query the index
response = query_engine.query("What are the origins of coffee?")
print(response)

When you run this, a few things happen under the hood that aren’t immediately obvious. The query_engine doesn’t search the vector store itself. It delegates that work to a "retriever" component, which by default returns only a small number of top-scoring chunks (similarity_top_k defaults to 2 in current releases).

The retriever’s job is to take your natural language query, convert it into a vector, and then find the most relevant chunks of your original documents based on that vector. But "most relevant" can be defined in a few ways, and that’s where configuration comes in.
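To make "find the most relevant chunks based on that vector" concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is plain Python, not LlamaIndex’s implementation; the chunk names and the 3-dimensional vectors are made up for illustration, and a real embedding model produces far larger vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, invented for this sketch.
chunk_vecs = {
    "ethiopia_legend": [0.9, 0.1, 0.0],
    "espresso_machines": [0.1, 0.9, 0.1],
    "yemen_trade": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for the embedded question

results = top_k_chunks(query_vec, chunk_vecs, k=2)
print(results)
```

With k=2, the two chunks pointing in roughly the same direction as the query come back and the unrelated one is dropped. Raising k trades precision for recall.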

The core of retriever configuration revolves around two main ideas: how many chunks to fetch, and how to rank them. The defaults are a sensible starting point, but to tune answer quality you’ll want to adjust both.

Here’s where you can inject control. You can build the retriever yourself and hand it to the query engine. Let’s say you want up to 5 chunks per query instead of the default, and you want hybrid search so that keyword matches count alongside vector similarity.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.settings import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# ... (previous setup for documents, settings, index) ...

# Create a retriever with custom parameters
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,  # Return at most the 5 highest-scoring chunks
    # Hybrid search (vector + keyword); needs a vector store that supports hybrid queries
    vector_store_query_mode="hybrid",
    alpha=0.7,  # Weight for vector similarity in hybrid search
)

# Create a query engine using the custom retriever.
# index.as_query_engine() builds its own retriever, so wrap ours explicitly.
query_engine = RetrieverQueryEngine.from_args(retriever)

# Query the index
response = query_engine.query("What are the origins of coffee?")
print(response)

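In this example, similarity_top_k=5 tells the retriever to return at most five results; you get fewer if the index holds fewer chunks. The vector_store_query_mode="hybrid" and alpha=0.7 settings are more advanced, and they only take effect when the underlying vector store supports hybrid queries (the default in-memory store does not). Hybrid search combines the semantic matching of vector embeddings with keyword matching (like BM25), which can be more robust for queries that hinge on exact terms. The alpha parameter balances the two: in stores that follow this convention, an alpha of 1.0 means pure vector search, while 0.0 means pure keyword search.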
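One way to picture what alpha does is a weighted blend of two scores. The sketch below is an illustration only: it assumes both scores are already normalized to the range 0 to 1, and the chunk names and numbers are invented. Each vector store implements its own fusion, so treat this as a mental model, not as the formula any particular store uses.

```python
def hybrid_score(vector_score, keyword_score, alpha):
    """Blend two normalized scores; alpha is the weight on the vector side."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * vector_score + (1.0 - alpha) * keyword_score


# Hypothetical normalized scores for two chunks.
candidates = {
    "semantic_match": {"vector": 0.9, "keyword": 0.2},
    "exact_term_match": {"vector": 0.4, "keyword": 1.0},
}


def rank(alpha):
    """Order chunk names by blended score, best first."""
    scored = {
        name: hybrid_score(s["vector"], s["keyword"], alpha)
        for name, s in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


for alpha in (1.0, 0.7, 0.0):
    print(alpha, rank(alpha))
```

At alpha=1.0 only the vector score counts, at alpha=0.0 only the keyword score counts, and values in between shift the ranking gradually from one to the other.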

The mental model is this: your query enters the query_engine. It hands off the query to the retriever. The retriever consults the index’s underlying data structures (like the vector store). It uses its configured parameters (similarity_top_k, vector_store_query_mode, etc.) to decide which chunks are most promising. These chunks are then passed back to the query_engine, which feeds them to the LLM for final answer synthesis.
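The hand-off described above can be sketched with stand-in classes. Nothing here is LlamaIndex code: ToyRetriever, ToyQueryEngine, and echo_llm are hypothetical names, the retriever ranks by simple word overlap instead of embeddings, and the "LLM" just returns the prompt so you can see exactly what it was given.

```python
class ToyRetriever:
    """Stand-in retriever: ranks chunks by word overlap with the query."""

    def __init__(self, chunks, top_k=2):
        self.chunks = chunks
        self.top_k = top_k

    def retrieve(self, query):
        query_words = set(query.lower().replace("?", "").split())
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(query_words & set(chunk.lower().split())),
            reverse=True,
        )
        return ranked[: self.top_k]


class ToyQueryEngine:
    """Stand-in query engine: retrieve, build a prompt, call the LLM."""

    def __init__(self, retriever, llm):
        self.retriever = retriever
        self.llm = llm

    def query(self, question):
        context = "\n".join(self.retriever.retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return self.llm(prompt)


def echo_llm(prompt):
    # A real LLM would synthesize an answer; this stub returns the prompt as-is.
    return prompt


chunks = [
    "Coffee origins trace to Ethiopia.",
    "Espresso machines use high pressure.",
    "The origins of coffee trade run through Yemen.",
]
engine = ToyQueryEngine(ToyRetriever(chunks, top_k=2), echo_llm)
answer = engine.query("What are the origins of coffee?")
print(answer)
```

The point of the split is that the query engine never needs to know how chunks were chosen, which is what makes swapping retrievers possible.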

The real power here is that you can swap out different retriever types entirely. For instance, you might use a keyword-table retriever such as KeywordTableSimpleRetriever for keyword matches, or a RecursiveRetriever for multi-stage retrieval. You can even combine retrievers using a QueryFusionRetriever to get results from multiple strategies and then merge them.
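Merging results from several retrievers needs a rule for combining rankings. Reciprocal rank fusion is one common choice, sketched below in plain Python. This is a standalone illustration, not the QueryFusionRetriever source; the chunk ids are invented, and the constant k=60 is a conventional default assumed here.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of ids; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from two different retrievers.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
keyword_hits = ["chunk_b", "chunk_d", "chunk_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['chunk_b', 'chunk_a', 'chunk_d', 'chunk_c']
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one retriever still survives lower down. The method uses ranks only, so the two retrievers’ raw scores do not need to be comparable.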

The VectorIndexRetriever accepts more than similarity_top_k. There’s sparse_top_k for the keyword side of hybrid retrieval (if the vector store supports it), filters for restricting candidates by metadata, and doc_ids or node_ids for limiting the search to specific documents or nodes. All of these shape the candidate set before anything reaches the LLM. Understanding these granular controls allows you to fine-tune recall and precision.

Once you’ve mastered retriever configuration, you’ll start looking at how to blend multiple retriever strategies for even more robust information retrieval.
