The LlamaIndex Sub-Question Engine (the SubQuestionQueryEngine class) decomposes a complex, multi-part query into smaller sub-questions that can each be answered on their own, then combines the results. It helps most when a single query asks for several distinct pieces of information at once.

Let’s say you have a document about the history of AI, and you want to ask: "What were the major breakthroughs in AI research between 1950 and 1980, and how did these advancements influence the development of expert systems?"

The Sub-Question Engine, when given this query and your AI history document, might generate something like this internally:

  1. Sub-question 1: "What were the major breakthroughs in AI research between 1950 and 1980?"
  2. Sub-question 2: "How did the AI breakthroughs between 1950 and 1980 influence the development of expert systems?"

It then runs each sub-question against your indexed data, either one after another or concurrently, retrieving the relevant context and asking the LLM to answer it. Finally, it synthesizes the sub-answers into a single, coherent response to your original query.

This process is crucial because LLMs, while powerful, can struggle with queries that require multiple distinct pieces of information to be retrieved, processed, and then combined. They might "forget" earlier parts of the query or fail to make the necessary connections if asked to do too much at once. By breaking it down, you’re essentially guiding the LLM through a structured thought process, improving accuracy and completeness.

Here’s a peek at how you might set it up in Python:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI  # Or your preferred LLM

# Configure your LLM (ensure OPENAI_API_KEY is set in your environment)
Settings.llm = OpenAI(model="gpt-4o-mini")

# Load your documents
documents = SimpleDirectoryReader("data").load_data()

# Build an index from the documents
index = VectorStoreIndex.from_documents(documents)

# Get a query engine for the index
query_engine = index.as_query_engine()

# Create a QueryEngineTool for the sub-question engine to use
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="ai_history_index",
        description="Provides information about the history of AI research and development.",
    ),
)

# Initialize the SubQuestionQueryEngine
# We explicitly pass an LLMQuestionGenerator here for clarity; if you omit it,
# from_defaults picks a question generator for you.
question_gen = LLMQuestionGenerator.from_defaults(llm=Settings.llm)
sub_question_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[query_engine_tool],
    question_gen=question_gen,
    use_async=True,  # Run sub-questions concurrently where possible
)

# Now, ask your complex query
response = sub_question_query_engine.query(
    "What were the major breakthroughs in AI research between 1950 and 1980, "
    "and how did these advancements influence the development of expert systems?"
)

print(response)

Notice how we wrap the base query_engine in a QueryEngineTool. This is how the SubQuestionQueryEngine knows what it can query. The metadata matters: the tool’s name and description are shown to the LLM during question generation, so a precise description helps it decide which tool a sub-question should target.

The SubQuestionQueryEngine’s core logic is a three-stage pipeline. First, it uses the LLMQuestionGenerator (or whatever question generator you provide) to break the initial query into sub-questions, each tagged with the name of the QueryEngineTool that should answer it; this is where routing happens if you have multiple tools. Then it runs each sub-question against its assigned tool. Finally, it passes all the sub-answers to a response synthesizer, which produces the final answer.
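The flow described above can be sketched in plain Python, with no LlamaIndex imports. Everything in this sketch is a hypothetical stand-in rather than the library’s implementation: SubQuestion models a sub-question tagged with a tool name, generate_sub_questions hard-codes what an LLM would normally produce, the second tool (expert_systems_index) is invented for illustration, and the tools return canned strings instead of querying an index.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SubQuestion:
    """Hypothetical stand-in: a sub-question plus the tool chosen for it."""
    sub_question: str
    tool_name: str


def make_tool(name: str) -> Callable[[str], str]:
    """Build a fake tool that returns a canned, clearly labeled answer."""
    return lambda question: f"[{name}] canned answer to: {question}"


# Hypothetical tool registry, keyed by tool name.
TOOLS: Dict[str, Callable[[str], str]] = {
    "ai_history_index": make_tool("ai_history_index"),
    "expert_systems_index": make_tool("expert_systems_index"),
}


def generate_sub_questions(query: str) -> List[SubQuestion]:
    # Hard-coded here; a real question generator would prompt the LLM with
    # the query and the tool descriptions to produce this list.
    return [
        SubQuestion(
            "What were the major breakthroughs in AI research between 1950 and 1980?",
            "ai_history_index",
        ),
        SubQuestion(
            "How did the AI breakthroughs between 1950 and 1980 influence expert systems?",
            "expert_systems_index",
        ),
    ]


def answer_sub_questions(query: str) -> List[Tuple[SubQuestion, str]]:
    pairs = []
    for sq in generate_sub_questions(query):
        tool = TOOLS[sq.tool_name]  # dispatch by the name attached to the sub-question
        pairs.append((sq, tool(sq.sub_question)))
    return pairs


def synthesize(query: str, pairs: List[Tuple[SubQuestion, str]]) -> str:
    # A real synthesizer would prompt the LLM with these pairs; here we
    # simply list them so the data flow is visible.
    return "\n".join(f"{sq.sub_question} -> {answer}" for sq, answer in pairs)


if __name__ == "__main__":
    query = "What were the major breakthroughs, and how did they influence expert systems?"
    print(synthesize(query, answer_sub_questions(query)))
```

The point of the sketch is the shape of the data: routing is just a name lookup, so the quality of routing depends entirely on how well the generated tool names match the questions.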

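The use_async=True flag controls how the sub-questions are executed: when it is set, the engine dispatches them concurrently through the query engines’ asynchronous interface instead of running them one at a time. Because each sub-question involves its own retrieval and LLM call, this can noticeably reduce latency for queries that decompose into several sub-questions.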
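To make the difference concrete, here is a minimal asyncio sketch of the general pattern, not LlamaIndex’s actual implementation. The answer coroutine and the ConcurrencyTracker class are hypothetical; the tracker only exists to show how many calls are in flight at once.

```python
import asyncio
from typing import List


class ConcurrencyTracker:
    """Hypothetical helper that records the peak number of in-flight calls."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0


async def answer(sub_question: str, tracker: ConcurrencyTracker) -> str:
    """Stand-in for an async query engine call (retrieval plus LLM)."""
    tracker.active += 1
    tracker.peak = max(tracker.peak, tracker.active)
    await asyncio.sleep(0.05)  # simulated I/O latency
    tracker.active -= 1
    return f"answer to: {sub_question}"


async def answer_concurrently(qs: List[str], tracker: ConcurrencyTracker) -> List[str]:
    # gather() starts every coroutine before waiting and keeps input order.
    return list(await asyncio.gather(*(answer(q, tracker) for q in qs)))


async def answer_sequentially(qs: List[str], tracker: ConcurrencyTracker) -> List[str]:
    # Each call finishes before the next one starts.
    return [await answer(q, tracker) for q in qs]


if __name__ == "__main__":
    questions = ["sub-question 1", "sub-question 2", "sub-question 3"]
    concurrent, sequential = ConcurrencyTracker(), ConcurrencyTracker()
    asyncio.run(answer_concurrently(questions, concurrent))
    asyncio.run(answer_sequentially(questions, sequential))
    print("peak in flight (concurrent):", concurrent.peak)
    print("peak in flight (sequential):", sequential.peak)
```

With concurrent dispatch, total latency is roughly that of the slowest sub-question rather than the sum of all of them, which is why the flag matters more as the number of sub-questions grows.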

The key levers you control are:

  • The LLM: The quality of your question generation and synthesis is directly tied to the LLM you use. A more capable LLM will produce better sub-questions and more coherent final answers.
  • The QueryEngineTools: If you have multiple data sources or different indexing strategies (e.g., a vector index for general knowledge, a knowledge graph for structured facts), you can provide multiple QueryEngineTools. Each sub-question is then routed to the tool whose name and description fit it best, so write those descriptions carefully.
  • The LLMQuestionGenerator: You can customize how questions are broken down. While the default is usually good, you can supply your own prompt template or swap in a different question generator if the default isn’t producing the sub-questions you expect.
  • The ResponseSynthesizer (implicitly used): The final step of combining sub-answers is handled by LlamaIndex’s response synthesis modules. If you need a specific way to merge information, you can pass your own synthesizer via the response_synthesizer argument of from_defaults.

One thing that’s often overlooked is how the SubQuestionQueryEngine handles dependencies between sub-questions. It doesn’t explicitly pass the answer of sub-question A as context to sub-question B within the LLM’s prompt for sub-question B. Instead, it collects all answers from all sub-questions and then feeds them together to the final response synthesis step. This is a crucial distinction: the LLM doesn’t "remember" intermediate answers in the way a human might. The context for the final synthesis includes all retrieved pieces, allowing the LLM to weave them together.
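A small sketch makes this distinction visible. It is a simplified model under stated assumptions, not the library’s code: run_independently and build_synthesis_context are hypothetical helpers, and the "Sub question / Response" layout of the context is an assumed format chosen for illustration.

```python
from typing import Callable, List, Tuple


def run_independently(
    sub_questions: List[str], answer_fn: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Answer each sub-question on its own and collect (question, answer) pairs."""
    pairs = []
    for question in sub_questions:
        # answer_fn receives only this sub-question; no earlier answer is passed in.
        pairs.append((question, answer_fn(question)))
    return pairs


def build_synthesis_context(pairs: List[Tuple[str, str]]) -> str:
    """Join all pairs into one context block for the final synthesis step."""
    return "\n\n".join(f"Sub question: {q}\nResponse: {a}" for q, a in pairs)


if __name__ == "__main__":
    seen_inputs: List[str] = []

    def fake_answer(question: str) -> str:
        # Record exactly what the "LLM" was given for this sub-question.
        seen_inputs.append(question)
        return f"canned answer {len(seen_inputs)}"

    pairs = run_independently(["Sub-question A", "Sub-question B"], fake_answer)
    print(seen_inputs)  # each call saw only its own question
    print(build_synthesis_context(pairs))
```

The practical consequence is that every sub-question has to be self-contained. In the example earlier, sub-question 2 spells out "the AI breakthroughs between 1950 and 1980" instead of saying "these advancements", because it cannot rely on the answer to sub-question 1.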

The next step you’ll likely explore is how to handle situations where a sub-question requires information not present in any of your indexed data, or how to optimize tool selection when you have a very large number of QueryEngineTools.
