LangChain unit and integration tests are often skipped because the "chain" itself feels like the atomic unit, but in reality, it’s the interactions between components that are the true source of fragility.
Let’s see a simple chain in action. Imagine we have a RetrievalQA chain that takes a user’s question, retrieves relevant documents from a vector store, and then uses an LLM to answer the question based on those documents.
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import TextLoader
# Load documents and create a vector store
loader = TextLoader("my_document.txt")
documents = loader.load()
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
# Initialize LLM and RetrievalQA chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
# Ask a question
question = "What is the main topic of the document?"
result = qa_chain.run(question)
print(result)
This looks straightforward. The RetrievalQA chain orchestrates a few things:
- It takes the question and passes it to the retriever (our vectorstore).
- The retriever fetches documents relevant to the question.
- It then formats these documents and the question into a prompt for the llm.
- Finally, the llm generates an answer.
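The four steps above can be sketched in plain Python to make the seams between components visible. This is only an illustration of the orchestration, not LangChain's actual implementation: `format_prompt`, `answer_question`, `fake_retriever`, and `fake_llm` are hypothetical names, and the prompt wording is an assumption.

```python
from typing import Callable, List

# Hypothetical stand-ins for illustration; none of these names come from LangChain.
Retriever = Callable[[str], List[str]]
LLM = Callable[[str], str]


def format_prompt(documents: List[str], question: str) -> str:
    """Step 3: place all retrieved text and the question into one prompt."""
    context = "\n\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer_question(question: str, retriever: Retriever, llm: LLM) -> str:
    documents = retriever(question)              # steps 1 and 2: query the retriever
    prompt = format_prompt(documents, question)  # step 3: build the prompt
    return llm(prompt)                           # step 4: generate the answer


seen_prompts: List[str] = []  # records what the fake LLM was asked


def fake_retriever(question: str) -> List[str]:
    return ["This document is about LangChain testing."]


def fake_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    return "The main topic is LangChain testing."


answer = answer_question("What is the main topic?", fake_retriever, fake_llm)
print(answer)
```

Each arrow between steps is a place where a test can substitute a fake and inspect what was passed along.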
The problem arises when one of these components misbehaves, or when their interaction produces unexpected results. For instance, what if the retriever returns no documents? Or what if the llm hallucinates an answer despite having good context? This is where testing becomes crucial.
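The empty-retrieval case is easy to pin down with a test. The sketch below assumes a hypothetical wrapper, `answer_with_guard`, that returns a fallback message instead of calling the LLM when nothing is retrieved; the wrapper, the fallback text, and the prompt layout are all illustrative and not part of LangChain. The hallucination case is harder to assert on and is not covered here.

```python
from typing import Callable, List
from unittest.mock import Mock

# Hypothetical fallback text returned when retrieval finds nothing.
NO_CONTEXT_MESSAGE = "I could not find any relevant documents."


def answer_with_guard(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Hypothetical wrapper: skip the LLM when retrieval comes back empty."""
    documents = retrieve(question)
    if not documents:
        return NO_CONTEXT_MESSAGE
    context = "\n".join(documents)
    return generate(f"{context}\n\nQuestion: {question}")


def test_empty_retrieval_skips_llm() -> None:
    retrieve = Mock(return_value=[])
    generate = Mock(return_value="should never be used")

    result = answer_with_guard("Anything?", retrieve, generate)

    assert result == NO_CONTEXT_MESSAGE
    retrieve.assert_called_once_with("Anything?")
    generate.assert_not_called()  # the LLM must not run without context


test_empty_retrieval_skips_llm()
```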
Unit tests should focus on individual components. For example, you’d unit test your custom retriever logic to ensure it returns the correct documents for a given query, or you’d unit test a custom prompt formatter to ensure it constructs the prompt as expected.
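As a rough illustration of a component-level unit test, the sketch below tests a hypothetical keyword-based retriever in isolation. `keyword_retrieve` is invented for this example and does not use any LangChain API; a real custom retriever would be tested the same way, with a small fixed corpus and exact expectations.

```python
from typing import List


def keyword_retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Hypothetical retriever: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in corpus]
    # Drop documents with no overlap; the sort is stable, so ties keep corpus order.
    matches = [pair for pair in scored if pair[0] > 0]
    ranked = sorted(matches, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]


CORPUS = [
    "LangChain testing guide",
    "Cooking pasta at home",
    "Testing chains in LangChain with mocks",
]


def test_returns_only_relevant_documents() -> None:
    result = keyword_retrieve("langchain testing", CORPUS, k=2)
    assert result == [CORPUS[0], CORPUS[2]]


def test_returns_empty_list_when_nothing_matches() -> None:
    assert keyword_retrieve("quantum physics", CORPUS) == []


test_returns_only_relevant_documents()
test_returns_empty_list_when_nothing_matches()
```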
Integration tests, however, are where LangChain’s complexity surfaces and where testing pays off most. These tests verify that multiple components (like the retriever, the prompt template, and the LLM) work together as intended.
Consider an integration test for our RetrievalQA chain. We don’t want to hit a live LLM API for every test run; that’s slow, expensive, and non-deterministic. Instead, we use mocking or fake implementations.
from unittest.mock import MagicMock
from langchain.schema import Document
from langchain.chains import RetrievalQA
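# Mock the LLM and retriever.
# Caveat: depending on the LangChain version, the chain may validate component
# types and reject bare MagicMock objects; a fake implementation or
# MagicMock(spec=...) may be needed instead.
mock_llm = MagicMock()
mock_retriever = MagicMock()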
# Define expected behavior for the mock retriever
mock_retriever.get_relevant_documents.return_value = [
    Document(page_content="This document is about LangChain testing."),
    Document(page_content="LangChain allows you to build chains of components.")
]
# Define expected behavior for the mock LLM
mock_llm.predict.return_value = "The main topic is LangChain testing."
# Create a RetrievalQA chain with the mocked components
qa_chain = RetrievalQA.from_chain_type(
    llm=mock_llm,
    chain_type="stuff",
    retriever=mock_retriever,
    return_source_documents=True  # Let's test this too
)
# Run the chain with a test question
question = "What is the main topic?"
result = qa_chain({"query": question}) # Note: newer versions use dict input
# Assertions
assert result["result"] == "The main topic is LangChain testing."
assert len(result["source_documents"]) == 2
# Use assert_called_once_with, which raises if the call did not happen as
# described; called_once_with is not a Mock assertion method.
mock_retriever.get_relevant_documents.assert_called_once_with(question)
mock_llm.predict.assert_called_once_with(
    # The exact prompt string depends on the chain_type and prompt template.
    # This is a simplified example; in practice, construct the expected prompt.
    "..."  # Expected prompt string here
)
This integration test verifies that the RetrievalQA chain correctly calls the retriever, passes the retrieved documents and the question to the LLM (via a prompt), and returns the expected output. We’re testing the orchestration logic.
The one thing most people don’t realize is that when you use chain_type="stuff", LangChain takes all the retrieved documents and "stuffs" them into a single prompt for the LLM. If you have many documents, this can exceed the LLM’s context window. Testing the size of the generated prompt, or using different chain_types like "map_reduce" or "refine" in your integration tests, is critical for robust applications.
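A prompt-size check can be sketched without any LLM at all. The example below assumes a hypothetical `build_stuff_prompt` helper that mimics the "stuff" strategy, a made-up budget `MAX_PROMPT_TOKENS`, and a crude word count as a stand-in for tokens. A real test would use the tokenizer of the target model and the chain's actual prompt template.

```python
from typing import List

MAX_PROMPT_TOKENS = 50  # hypothetical budget; real context windows are far larger


def build_stuff_prompt(documents: List[str], question: str) -> str:
    """Hypothetical helper: concatenate every document into a single prompt."""
    context = "\n\n".join(documents)
    return f"Use the context to answer.\n\n{context}\n\nQuestion: {question}"


def rough_token_count(text: str) -> int:
    """Crude proxy: whitespace-separated words. Real tokenizers count differently."""
    return len(text.split())


def fits_budget(documents: List[str], question: str) -> bool:
    return rough_token_count(build_stuff_prompt(documents, question)) <= MAX_PROMPT_TOKENS


DOC = "LangChain lets you build chains of components."
QUESTION = "What is the main topic?"


def test_few_documents_fit() -> None:
    assert fits_budget([DOC] * 2, QUESTION)


def test_many_documents_overflow() -> None:
    # Stuffing 20 documents blows past the budget; this is the failure to catch early.
    assert not fits_budget([DOC] * 20, QUESTION)


test_few_documents_fit()
test_many_documents_overflow()
```

A check like this fails as soon as the retriever's result count or document size grows beyond what a single stuffed prompt can hold.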
The next concept you’ll want to explore is using LangChain’s Runnable interface for more granular control and testing of individual steps within complex chains.