LangChain’s RAG evaluation tools let you measure how well your Retrieval-Augmented Generation (RAG) system is actually retrieving relevant information.

Here’s a RAG system in action, processing a query and returning a response based on retrieved documents.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings  # the langchain_community version is deprecated
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Load documents
loader = WebBaseLoader("https://www.langchain.com/blog/2023-05-03-langchain-expressions/")
docs = loader.load()

# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

# Define the RAG chain
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:
{context}
Question: {question}
""")
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

# Example query
query = "What are LangChain Expressions?"
response = rag_chain.invoke(query)
print(response.content)

This setup allows a language model to answer questions by first retrieving relevant text chunks from a knowledge base (in this case, a blog post about LangChain Expressions) and then using those chunks as context to formulate an answer. The core problem RAG evaluation addresses is: how do we know if the retrieval part is doing its job effectively? If the retriever pulls back irrelevant or insufficient information, the LLM’s answer will be poor, regardless of how good the LLM itself is.

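The evaluation framework in LangChain typically involves defining a set of questions and their corresponding "ground truth" answers or expected outcomes. For retrieval quality, we’re particularly interested in metrics that assess how well the retrieved documents align with the user’s query. The langchain.evaluation module provides evaluators that grade generated answers (for example, QAEvalChain); retrieval metrics such as context precision and recall are usually computed with a dedicated library such as Ragas or with a few lines of your own code.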

The RetrievalQA chain (created with return_source_documents=True) or a custom RAG chain can be set up to return the retrieved documents for each query. These documents are then passed to the evaluation metrics. Key metrics for retrieval quality include:

  • Context Precision: Measures how many of the retrieved documents are actually relevant to the question. A high precision means the retriever isn’t bringing back a lot of "noise."
  • Context Recall: Measures how many of the necessary documents were retrieved. A high recall means the retriever is finding most of what it needs to answer the question.
  • Faithfulness: While often considered an LLM evaluation metric, it’s indirectly tied to retrieval. If retrieved context is not used or is contradicted by the LLM’s answer, it can indicate poor retrieval (either irrelevant docs were retrieved, or necessary docs were missed).
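Written as formulas, the first two metrics reduce to set overlaps. The notation is introduced here for illustration only: R_q is the set of documents retrieved for question q, and G_q is the ground-truth set for that question. Some libraries use rank-weighted or LLM-judged variants; this is the simplest set-based reading.

```latex
\[
\text{Context Precision}(q) = \frac{|R_q \cap G_q|}{|R_q|},
\qquad
\text{Context Recall}(q) = \frac{|R_q \cap G_q|}{|G_q|}
\]
```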

Let’s look at how you might set up an evaluation for retrieval quality.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings  # the langchain_community version is deprecated
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
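# NOTE: langchain.evaluation does not ship ContextPrecision / ContextRecall classes,
# and there is no langchain.evaluation.rag module. The minimal stand-ins below keep the
# same call shape and score by exact match on whitespace-normalized chunk text.
def _normalize(texts):
    # Collapse whitespace so formatting differences do not prevent a match
    return {" ".join(text.split()) for text in texts}

class ContextPrecision:
    def evaluate_strings(self, *, prediction, reference):
        retrieved, relevant = _normalize(prediction), _normalize(reference)
        # Fraction of retrieved chunks that appear in the ground truth
        return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

class ContextRecall:
    def evaluate_strings(self, *, prediction, reference):
        retrieved, relevant = _normalize(prediction), _normalize(reference)
        # Fraction of ground-truth chunks that were retrieved
        return len(retrieved & relevant) / len(relevant) if relevant else 0.0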

# Assume you have your RAG chain set up as before, but we'll modify it slightly for evaluation

# 1. Load and prepare data (same as before)
loader = WebBaseLoader("https://www.langchain.com/blog/2023-05-03-langchain-expressions/")
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

# 2. Define your evaluation dataset (questions and ground truth answers/relevant docs)
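# For simplicity, we'll create a small manual dataset. In practice, this would be much larger.
# The passages below are illustrative; for exact-match scoring, copy the relevant chunk text from `splits` verbatim.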
evaluation_data = [
    {
        "question": "What are LangChain Expressions?",
        "answer": "LangChain Expressions (LCEL) is a declarative way to build LLM applications.",
        "ground_truth_docs": [
            "LangChain Expression Language (LCEL) provides a declarative way to build LLM applications. It allows you to compose chains using standard Python idioms.",
            "LCEL allows developers to easily compose LLM applications. It is built on top of LangChain primitives and offers a flexible and powerful way to build complex chains.",
        ]
    },
    {
        "question": "How does LCEL help developers?",
        "answer": "LCEL helps developers build LLM applications by providing a declarative and composable way to chain together different components.",
        "ground_truth_docs": [
            "LCEL allows developers to easily compose LLM applications. It is built on top of LangChain primitives and offers a flexible and powerful way to build complex chains.",
            "The declarative nature of LCEL means that you can build complex chains with minimal boilerplate code.",
        ]
    }
]

# 3. Create a RAG chain that logs retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:
{context}
Question: {question}
""")
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

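def create_rag_chain_with_retrieval_logging(retriever):
    # Generate the answer from documents that have already been retrieved
    answer_from_docs = (
        RunnablePassthrough.assign(context=lambda x: format_docs(x["context"]))
        | prompt
        | llm
    )
    # Retrieve once, then keep the raw documents ("context") next to the answer,
    # so the output dict has the keys "context", "question", and "answer"
    return (
        {"context": retriever, "question": RunnablePassthrough()}
        | RunnablePassthrough.assign(answer=answer_from_docs)
    )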

# 4. Run the evaluation
eval_results = []
context_precision_metric = ContextPrecision()
context_recall_metric = ContextRecall()

for item in evaluation_data:
    question = item["question"]
    ground_truth_docs_content = item["ground_truth_docs"]

    # Manually retrieve docs for evaluation
    retrieved_docs = retriever.invoke(question)
    retrieved_docs_content = [doc.page_content for doc in retrieved_docs]

    # Evaluate Context Precision
    precision_score = context_precision_metric.evaluate_strings(
        prediction=retrieved_docs_content,
        reference=ground_truth_docs_content,
    )

    # Evaluate Context Recall
    recall_score = context_recall_metric.evaluate_strings(
        prediction=retrieved_docs_content,
        reference=ground_truth_docs_content,
    )

    eval_results.append({
        "question": question,
        "retrieved_docs": retrieved_docs_content,
        "ground_truth_docs": ground_truth_docs_content,
        "context_precision": precision_score,
        "context_recall": recall_score,
    })

# 5. Analyze results
for result in eval_results:
    print(f"Question: {result['question']}")
    print(f"  Context Precision: {result['context_precision']:.2f}")
    print(f"  Context Recall: {result['context_recall']:.2f}")
    # print(f"  Retrieved: {result['retrieved_docs'][:2]}...") # Print first 2 for brevity
    # print(f"  Ground Truth: {result['ground_truth_docs'][:2]}...") # Print first 2 for brevity
    print("-" * 20)

Context precision takes the set of retrieved documents and the set of documents that should have been retrieved (the ground truth) and calculates the proportion of retrieved documents that are present in the ground truth. If your retriever returns 5 documents, and 3 of them are truly relevant (i.e., they are in your ground truth set), the precision is 3/5 = 0.6.

Conversely, context recall calculates the proportion of the required documents that were successfully retrieved. If your ground truth specifies 4 relevant documents, but your retriever only found 3 of them (even if it also found 2 irrelevant ones), the recall is 3/4 = 0.75.
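The two worked numbers above can be reproduced in a few lines of plain Python. This is a minimal sketch that compares documents by ID; the function names and the IDs (d1, x1, and so on) are made up for illustration and are not part of LangChain.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved documents that are in the ground-truth set."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def context_recall(retrieved_ids, relevant_ids):
    """Share of ground-truth documents that were retrieved."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Precision example: 5 documents retrieved, 3 of them relevant
precision = context_precision(["d1", "d2", "d3", "x1", "x2"], ["d1", "d2", "d3"])

# Recall example: 4 relevant documents, 3 found (plus 2 irrelevant ones)
recall = context_recall(["d1", "d2", "d3", "x1", "x2"], ["d1", "d2", "d3", "d4"])

print(precision, recall)  # 0.6 0.75
```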

The evaluation dataset is crucial. It needs to be representative of the types of questions your RAG system will handle, and the ground_truth_docs should accurately reflect the content that must be present to answer the question correctly. This is often the most labor-intensive part of RAG evaluation.
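Because building the dataset is manual work, a quick sanity check can catch mistakes before they distort the metrics. The sketch below assumes the item layout used in this article (question, answer, ground_truth_docs) and a hypothetical helper, validate_eval_item, that flags missing fields and ground-truth passages that do not occur in the corpus at all.

```python
REQUIRED_KEYS = {"question", "answer", "ground_truth_docs"}


def validate_eval_item(item, corpus_chunks):
    """Return a list of problems found in one evaluation item (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - item.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    docs = item.get("ground_truth_docs", [])
    if not docs:
        problems.append("no ground-truth documents")
    # Compare on whitespace-normalized text so formatting differences are ignored
    normalized_corpus = {" ".join(chunk.split()) for chunk in corpus_chunks}
    for doc in docs:
        if " ".join(doc.split()) not in normalized_corpus:
            problems.append(f"ground-truth text not found in corpus: {doc[:40]!r}")
    return problems


corpus = ["LCEL is a declarative way to compose chains.", "Chains can be streamed."]
good = {
    "question": "What is LCEL?",
    "answer": "A declarative composition language.",
    "ground_truth_docs": ["LCEL is a declarative way to compose chains."],
}
bad = {
    "question": "What is LCEL?",
    "ground_truth_docs": ["A passage that is not in the corpus."],
}

print(validate_eval_item(good, corpus))  # []
print(validate_eval_item(bad, corpus))
```

In a real pipeline, corpus_chunks would be the page_content of every chunk in the vector store.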

The "ground truth" for retrieval is not always a perfect set of documents. Sometimes, multiple different documents could equally well support an answer. Metrics that judge relevance semantically (for example, with an LLM grader) tolerate some of this ambiguity, whereas strict exact-match comparison does not. Either way, the quality of your evaluation dataset directly impacts the reliability of your metrics.
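One way to soften exact matching is to count a ground-truth passage as found when most of its words appear in some retrieved chunk. The sketch below illustrates that idea; it is not a LangChain API. The helper names and the 0.8 threshold are assumptions that would need tuning on real data.

```python
import re


def tokens(text):
    """Lowercased word tokens of a passage, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(reference, candidate):
    """Fraction of the reference's tokens that also occur in the candidate."""
    ref = tokens(reference)
    return len(ref & tokens(candidate)) / len(ref) if ref else 0.0


def soft_recall(retrieved, ground_truth, threshold=0.8):
    """Share of ground-truth passages covered by at least one retrieved chunk."""
    if not ground_truth:
        return 0.0
    covered = sum(
        any(overlap(gt, chunk) >= threshold for chunk in retrieved)
        for gt in ground_truth
    )
    return covered / len(ground_truth)


retrieved = [
    "LCEL provides a declarative way to build LLM applications and compose chains.",
    "Unrelated text about pricing.",
]
ground_truth = [
    "LCEL provides a declarative way to build LLM applications.",
    "LCEL supports streaming out of the box.",
]

score = soft_recall(retrieved, ground_truth)
print(score)  # 0.5
```

Token overlap ignores word order and synonyms, so it only approximates semantic relevance.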

You’ll want to iterate on your retriever’s configuration (e.g., the k value passed to as_retriever() through search_kwargs, the chunking strategy, or the embedding model) based on these metrics. For instance, if context precision is low, you might lower k, adjust your chunking to be more granular, or use a different embedding model that better captures semantic similarity. If context recall is low, you might need to increase the number of documents retrieved (k) or reconsider your chunking strategy if relevant information is being split across too many small chunks.
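The trade-off behind tuning k can be seen without calling a retriever. The sketch below assumes you already have a ranked list of retrieved document IDs for one question; the IDs and the helper precision_recall_at_k are hypothetical. It shows precision typically falling and recall rising as k grows.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall computed over the top-k entries of a ranked list."""
    top_k = ranked_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Ranked retriever output (best match first) and the ground-truth set
ranked = ["d1", "x1", "d2", "x2", "d3", "x3"]
relevant = ["d1", "d2", "d3"]

sweep = {k: precision_recall_at_k(ranked, relevant, k) for k in (1, 3, 5)}
for k, (p, r) in sweep.items():
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
# k=1: precision=1.00 recall=0.33
# k=3: precision=0.67 recall=0.67
# k=5: precision=0.60 recall=1.00
```

Averaging these values over the whole evaluation set, rather than reading a single question, gives a steadier basis for picking k.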

The next step in evaluating your RAG system will likely involve assessing the quality of the generated answers themselves, using metrics like faithfulness and answer relevance, which build upon the retrieved context.
