LlamaIndex doesn’t just return text; it can give you back structured data, and Pydantic models are its favorite way to do it.
Let’s see LlamaIndex churn out some structured JSON, then we’ll break down how it works.
```python
import os

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List, Optional

# Assume you have an OpenAI API key set as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Write the sample data to 'data.txt' so the example is self-contained
with open("data.txt", "w") as f:
    f.write("The quick brown fox jumps over the lazy dog.\n")
    f.write("The cat sat on the mat.\n")
    f.write("The dog chased the ball.\n")
    f.write("John Doe is a software engineer. He lives in New York. His email is john.doe@example.com.\n")
    f.write("Jane Smith is a product manager. She lives in San Francisco. Her email is jane.smith@example.com.\n")


# Define a Pydantic model for structured output
class PersonInfo(BaseModel):
    name: str = Field(..., description="Full name of the person")
    occupation: Optional[str] = Field(None, description="Person's occupation")
    location: Optional[str] = Field(None, description="City and state where the person lives")
    email: Optional[str] = Field(None, description="Person's email address")


# Define a Pydantic model for a list of people
class PeopleList(BaseModel):
    people: List[PersonInfo] = Field(..., description="A list of people and their information")


# Load documents (only data.txt, so the script itself is not indexed)
documents = SimpleDirectoryReader(input_files=["data.txt"]).load_data()

# Initialize LLM (using OpenAI as an example)
# You can adjust the model and temperature
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

# Build an index
index = VectorStoreIndex.from_documents(documents)

# Create a query engine with a Pydantic output parser.
# We tell the engine that we expect a PeopleList object.
# Note: keyword support varies across LlamaIndex versions; if output_parser is
# ignored in yours, the documented alternative is output_cls=PeopleList.
query_engine = index.as_query_engine(
    output_parser=PydanticOutputParser(PeopleList)
)

# Query the index
response = query_engine.query("Extract information about all people mentioned in the documents.")

# The response will be an instance of the PeopleList Pydantic model
print(response)
print(type(response))
print(response.people[0].name)
```
When you run this, you’ll see output like:
```
people=[PersonInfo(name='John Doe', occupation='software engineer', location='New York', email='john.doe@example.com'), PersonInfo(name='Jane Smith', occupation='product manager', location='San Francisco', email='jane.smith@example.com')]
<class '__main__.PeopleList'>
John Doe
```
LlamaIndex makes this possible by leveraging the LLM’s ability to understand and generate structured formats. When you specify PydanticOutputParser(PeopleList), LlamaIndex constructs a prompt for the LLM that explicitly asks it to return data conforming to the PeopleList schema. This prompt usually includes a JSON schema derived from your Pydantic model, along with instructions for the LLM to "output the JSON that conforms to this schema." The PydanticOutputParser then takes the LLM’s raw text response, attempts to parse it as JSON, and validates it against your Pydantic model. If validation fails, the parser raises an error; whether the LLM call is then retried depends on the surrounding component or on your own code.
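To make the parse-and-validate step concrete, here is a minimal sketch that uses only Pydantic (v2 is assumed). The helper parse_llm_reply and the sample reply are hypothetical, and the regex is a deliberately simple stand-in for whatever JSON extraction LlamaIndex performs internally.

```python
import re
from typing import List, Optional

from pydantic import BaseModel, Field


class PersonInfo(BaseModel):
    name: str = Field(..., description="Full name of the person")
    occupation: Optional[str] = Field(None, description="Person's occupation")


class PeopleList(BaseModel):
    people: List[PersonInfo]


def parse_llm_reply(raw_text: str) -> PeopleList:
    """Hypothetical helper: grab the outermost {...} span and validate it."""
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the reply")
    # model_validate_json parses the JSON text and checks it against the model.
    return PeopleList.model_validate_json(match.group(0))


# A made-up reply: models often wrap the JSON in extra prose.
raw_reply = (
    "Sure, here is the data:\n"
    '{"people": [{"name": "John Doe", "occupation": "software engineer"}]}\n'
    "Hope that helps!"
)
parsed = parse_llm_reply(raw_reply)
print(parsed.people[0].name)
```

The point is that the text returned by the model is treated as untrusted input: nothing becomes a PeopleList until it has passed through the model's validation.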
The core problem LlamaIndex solves here is bridging the gap between the LLM’s probabilistic text generation and your application’s need for deterministic, structured data. LLMs are excellent at understanding natural language and generating human-readable text, but directly asking them to produce a perfectly formatted JSON array of objects with specific types (like strings for names, optional strings for occupations) is unreliable. The PydanticOutputParser acts as a robust intermediary. It defines the contract for the output using Pydantic’s strong typing and validation, and LlamaIndex handles the complex prompt engineering and parsing logic to enforce that contract with the LLM.
Internally, when you set output_parser=PydanticOutputParser(YourModel), LlamaIndex does a few key things:
- Schema Generation: It introspects your Pydantic BaseModel (or list of models) to generate a JSON schema. This schema precisely describes the expected structure, field names, data types, and any constraints (like description fields that help the LLM understand context).
- Prompt Augmentation: It injects instructions and the generated JSON schema into the prompt sent to the LLM. The prompt will typically look something like: "Given the following text, extract the requested information and return it as a JSON object conforming to the schema below. … [JSON Schema] …".
- Response Parsing and Validation: It receives the LLM’s output, which is expected to be a JSON string. It attempts to parse this string into a Python object.
- Pydantic Validation: It then uses Pydantic to validate the parsed object against your original BaseModel. If the data doesn’t match the schema (e.g., a string is provided where an integer was expected, or a required field is missing), Pydantic will raise a ValidationError.
- Retry Mechanism (Optional): LlamaIndex often includes a retry mechanism. If Pydantic validation fails, it can send the original prompt back to the LLM with an error message indicating why the output was invalid, asking it to try again. This significantly increases the reliability of getting correctly structured data.
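The steps above can be sketched end to end without calling a real model. This is an illustration under stated assumptions, not LlamaIndex's actual implementation: build_prompt, query_structured, and fake_llm are hypothetical names, the prompt wording is invented, and Pydantic v2 is assumed.

```python
import json
from typing import Callable, List, Optional

from pydantic import BaseModel, Field, ValidationError


class PersonInfo(BaseModel):
    name: str = Field(..., description="Full name of the person")
    email: Optional[str] = Field(None, description="Person's email address")


class PeopleList(BaseModel):
    people: List[PersonInfo]


def build_prompt(query: str, error: Optional[str] = None) -> str:
    # Schema generation + prompt augmentation: append the schema to the query.
    schema = json.dumps(PeopleList.model_json_schema())
    prompt = f"{query}\n\nReturn JSON matching this schema:\n{schema}"
    if error:
        prompt += f"\n\nYour previous answer was invalid:\n{error}"
    return prompt


def query_structured(
    llm: Callable[[str], str], query: str, max_attempts: int = 2
) -> PeopleList:
    error = None
    for _ in range(max_attempts):
        raw = llm(build_prompt(query, error))
        try:
            # Parsing + Pydantic validation happen in one call.
            return PeopleList.model_validate_json(raw)
        except ValidationError as exc:
            # Retry: keep the error text so the next prompt can mention it.
            error = str(exc)
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {error}")


# Stand-in for a real model: the first reply omits the required "name" field.
replies = iter([
    '{"people": [{"email": "john.doe@example.com"}]}',
    '{"people": [{"name": "John Doe", "email": "john.doe@example.com"}]}',
])
prompts_seen = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


result = query_structured(fake_llm, "Extract all people.")
print(result.people[0].name)
print(len(prompts_seen))
```

Here the first reply fails validation, so the second prompt carries the error message and the stubbed model returns a valid answer on the second attempt.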
The exact levers you control are the Pydantic models themselves, the LLM configuration (model choice, temperature, context window), and the query text. By carefully defining your Pydantic models with descriptive fields and Field annotations, you guide the LLM towards extracting precisely the information you need. For example, adding a description="The primary color of the object" to a color: str field in your Pydantic model will prompt the LLM to be more specific about what "color" means in that context.
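A quick way to see why descriptions matter is to look at the JSON schema Pydantic generates, since that schema is the kind of thing that ends up in the prompt. The Item model below is a hypothetical example, and Pydantic v2 is assumed.

```python
from pydantic import BaseModel, Field


class Item(BaseModel):  # hypothetical model, for illustration only
    color: str = Field(..., description="The primary color of the object")


schema = Item.model_json_schema()

# The description travels with the field in the generated schema.
print(schema["properties"]["color"])
print(schema["required"])
```

Whatever text you put in description is carried into the schema, so it is effectively part of the instructions the LLM reads.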
What many people don’t realize is that the LLM isn’t just magically producing JSON. LlamaIndex is doing prompt engineering, schema serialization, and deserialization around the call. The LLM is only instructed to produce JSON; it’s the PydanticOutputParser that checks the output against your model, so whatever reaches your code either conforms to the schema or raises an error. Without this validation layer, the LLM’s output, even if it looks like JSON, could be subtly malformed or incomplete, leading to runtime errors downstream.
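As a small illustration of output that looks like valid JSON but is incomplete, the sketch below validates a hand-written reply in which one entry lacks the required name field. The reply string is made up, and Pydantic v2 is assumed.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class PersonInfo(BaseModel):
    name: str
    email: Optional[str] = None


class PeopleList(BaseModel):
    people: List[PersonInfo]


# Well-formed JSON, but the second entry lacks the required "name".
looks_fine = '{"people": [{"name": "Jane Smith"}, {"email": "john.doe@example.com"}]}'

try:
    PeopleList.model_validate_json(looks_fine)
    problems = []
except ValidationError as exc:
    # Each error records where in the structure it occurred and what went wrong.
    problems = [(err["loc"], err["type"]) for err in exc.errors()]

print(problems)
```

A plain json.loads call would accept this string without complaint; the missing field only surfaces because the data is checked against the model.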
Once you’ve successfully extracted structured data, the next logical step is often to use that data for further processing, such as performing database lookups, triggering actions based on extracted fields, or generating reports.
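As one possible follow-up, here is a minimal sketch that turns a validated PeopleList into a CSV report. The result object is built by hand as a stand-in for a parsed query response, and Pydantic v2 is assumed.

```python
import csv
import io
from typing import List, Optional

from pydantic import BaseModel


class PersonInfo(BaseModel):
    name: str
    occupation: Optional[str] = None
    location: Optional[str] = None
    email: Optional[str] = None


class PeopleList(BaseModel):
    people: List[PersonInfo]


# Stand-in for a parsed query result, built by hand from the sample data.
result = PeopleList(people=[
    PersonInfo(name="John Doe", occupation="software engineer",
               location="New York", email="john.doe@example.com"),
    PersonInfo(name="Jane Smith", occupation="product manager",
               location="San Francisco", email="jane.smith@example.com"),
])

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "occupation", "location", "email"])
writer.writeheader()
for person in result.people:
    # model_dump() turns each validated model into a plain dict.
    writer.writerow(person.model_dump())

report = buffer.getvalue()
print(report)
```

Because every row has already been validated, the report code can rely on the field names and types without defensive checks.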