LangChain’s Pydantic integration lets you pull structured, typed data out of LLM responses, moving beyond simple strings.
Let’s see it in action. Imagine you’re building a sentiment analysis tool and want to extract both the sentiment (positive, negative, neutral) and a brief justification from a user’s review.
First, define your desired output structure using Pydantic:
from pydantic import BaseModel, Field
from typing import Literal

class SentimentAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(description="The overall sentiment of the review.")
    justification: str = Field(description="A brief explanation for the assigned sentiment.")
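Before wiring this into a chain, it can help to see the validation the model gives you on its own. The sketch below assumes Pydantic v2 (it uses `model_validate_json`), and the sample JSON strings are made up for illustration.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class SentimentAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="The overall sentiment of the review."
    )
    justification: str = Field(
        description="A brief explanation for the assigned sentiment."
    )


# A well-formed payload parses into a typed object.
good = SentimentAnalysis.model_validate_json(
    '{"sentiment": "positive", "justification": "Strong acting and plot."}'
)
print(good.sentiment)

# A value outside the Literal is rejected instead of passing through silently.
try:
    SentimentAnalysis.model_validate_json(
        '{"sentiment": "ecstatic", "justification": "Loved it."}'
    )
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

This is the same check the output parser relies on later: anything the LLM returns has to satisfy these field types before you get an object back.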
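Now, integrate this with LangChain. You’ll use an LLMChain and attach a PydanticOutputParser built from your Pydantic model as its output parser. (LLMChain is deprecated in recent LangChain releases in favor of composing prompt | llm | parser, but the parser itself works the same way.)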
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

# Initialize the LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Define the prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that analyzes sentiment. \n{format_instructions}"),
    ("human", "{review}")
])

# Set up the Pydantic output parser
parser = PydanticOutputParser(pydantic_object=SentimentAnalysis)

# Create the LLMChain
chain = LLMChain(
    llm=llm,
    prompt=prompt,
    output_parser=parser,
    verbose=True  # Set to True to see the generated prompt with format instructions
)

# Example review
review = "This movie was absolutely fantastic! The acting was superb and the plot kept me on the edge of my seat. I highly recommend it."

# Run the chain
result = chain.invoke({"review": review, "format_instructions": parser.get_format_instructions()})
print(result)
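When you run this, the instructions returned by parser.get_format_instructions() are substituted into the {format_instructions} slot of the system message, telling the LLM how to format its output to match your Pydantic model. Nothing is injected automatically: the instructions reach the model only because the prompt has that placeholder and you pass a value for it. The LLMChain then uses the PydanticOutputParser to validate the LLM’s response and convert it into an instance of your SentimentAnalysis Pydantic model.
The output will look something like this (the exact wording of the format instructions and of the justification will vary with your LangChain version and model):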
...
Prompt after formatting:
...
You are a helpful assistant that analyzes sentiment.
The output should be a JSON object that conforms to the following schema. This JSON data should be enclosed in triple backticks (```json ... ```).
{"sentiment": "positive" | "negative" | "neutral", "justification": "string"}
This movie was absolutely fantastic! The acting was superb and the plot kept me on the edge of my seat. I highly recommend it.
...
{'review': 'This movie was absolutely fantastic! The acting was superb and the plot kept me on the edge of my seat. I highly recommend it.', 'format_instructions': 'The output should be a JSON object that conforms to the following schema. This JSON data should be enclosed in triple backticks (```json ... ```).\n{"sentiment": "positive" | "negative" | "neutral", "justification": "string"}', 'text': SentimentAnalysis(sentiment='positive', justification='The movie was fantastic with superb acting and an engaging plot.')}
Notice that result['text'] is now a SentimentAnalysis object, not just a string. You can access its attributes directly: result['text'].sentiment and result['text'].justification.
The fundamental problem this solves is the inherently unstructured nature of LLM output. LLMs are text generators. While they can be instructed to produce JSON, they can also produce malformed JSON, or values outside your schema, that break downstream parsing. Pydantic provides a robust schema and validation layer. LangChain’s integration automates the prompt engineering for output formatting and the parsing and validation of the LLM’s response, significantly reducing the boilerplate you’d otherwise write for reliable structured data extraction. Note that PydanticOutputParser does not retry on its own: if the response doesn’t conform to the requested format, it raises an OutputParserException. To re-prompt automatically, wrap it in a retrying parser such as OutputFixingParser or RetryOutputParser, which makes the process much more resilient.
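To make the validate-and-re-prompt idea concrete, here is a hand-rolled sketch of such a loop. It is not LangChain's implementation: `parse_with_retries`, `fake_ask`, and the wording of the retry prompt are all hypothetical, and the "model" is a stub that fails once and then succeeds. It assumes Pydantic v2.

```python
from typing import Callable, Literal

from pydantic import BaseModel, ValidationError


class SentimentAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    justification: str


def parse_with_retries(
    ask: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> SentimentAnalysis:
    """Call `ask`, validate the reply, and re-prompt with the error on failure."""
    current_prompt = prompt
    for _ in range(max_attempts):
        reply = ask(current_prompt)
        try:
            return SentimentAnalysis.model_validate_json(reply)
        except ValidationError as err:
            # Feed the validation error back so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid:\n{err}\n"
                "Reply with valid JSON only."
            )
    raise ValueError(f"No valid response after {max_attempts} attempts")


# Hypothetical stand-in model: prose on the first call, valid JSON on the second.
replies = iter([
    "Sure! The sentiment is positive.",
    '{"sentiment": "positive", "justification": "Enthusiastic praise."}',
])
calls = []


def fake_ask(prompt_text: str) -> str:
    calls.append(prompt_text)
    return next(replies)


result = parse_with_retries(fake_ask, "Classify the review: 'Loved it.'")
print(result.sentiment, len(calls))
```

Capping the number of attempts matters in practice, since each retry is another model call.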
What most people don’t realize is that the get_format_instructions() method is not just a simple string dump of the Pydantic model. It produces a set of instructions built around the model’s JSON schema, including your field descriptions, designed to guide the LLM’s generation. Depending on the parser and LangChain version, the wording may also ask for the JSON to be enclosed in a markdown code block (```json ... ```). Fencing the output this way helps isolate the structured data from any conversational preamble or postamble the LLM might otherwise generate.
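Why a code fence helps is easy to see in a few lines. The sketch below is a hand-written illustration using only the standard library, not LangChain's parser: `extract_json` and the sample reply are hypothetical. It pulls the fenced payload out of a chatty response and falls back to treating the whole reply as JSON when no fence is present.

```python
import json
import re

# Build the fence marker programmatically so this example embeds cleanly.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Return the JSON object inside the first fenced block, or the whole text."""
    match = FENCE_RE.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


reply = (
    "Here is the analysis you asked for:\n"
    f"{FENCE}json\n"
    '{"sentiment": "positive", "justification": "Strong praise."}\n'
    f"{FENCE}\n"
    "Let me know if you need anything else."
)

data = extract_json(reply)
print(data["sentiment"])
```

Without the fence, the preamble and postamble in `reply` would make the whole string invalid JSON.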
The next hurdle you’ll often face is handling complex, nested Pydantic models or dealing with lists of structured objects.