The most surprising truth about LlamaIndex RAG evaluation is that "correctness" isn’t a single, monolithic concept; it’s a nuanced interplay of faithfulness, relevance, and the user’s underlying intent.

Let’s see this in action. Imagine a RAG system designed to answer questions about a company’s internal HR policies.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator as RelevanceEvaluator,  # the class is spelled "RelevancyEvaluator" in LlamaIndex
)
from llama_index.core.response.notebook_utils import display_response
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure LLM and Embedding models
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Load documents and build index
documents = SimpleDirectoryReader("./hr_policies").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a query engine
query_engine = index.as_query_engine()

# Define the question
question = "What is the policy on remote work for employees in the engineering department?"

# Generate a response
response = query_engine.query(question)
display_response(response)

Now, let’s break down how we evaluate this response using the triad metrics: Faithfulness, Relevance, and Correctness.

Faithfulness asks: "Does the generated answer stick to the provided source documents?" A high faithfulness score means the LLM isn’t hallucinating or bringing in outside knowledge. It’s grounded.

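faithfulness_evaluator = FaithfulnessEvaluator(llm=Settings.llm)
# aevaluate_response accepts the Response object and pulls the retrieved contexts from it.
# Top-level await assumes a notebook or another running event loop.
faithfulness_result = await faithfulness_evaluator.aevaluate_response(response=response)
print(f"Faithfulness Passing: {faithfulness_result.passing}")  # boolean verdict
print(f"Faithfulness Score: {faithfulness_result.score}")
print(f"Faithfulness Reason: {faithfulness_result.feedback}")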

Relevance asks: "Does the generated answer address the user’s question?" A highly faithful answer could still be irrelevant if it talks about the right documents but misses the point of the query.

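relevance_evaluator = RelevanceEvaluator(llm=Settings.llm)
# The query is needed here: the judge checks the answer and its sources against it.
relevance_result = await relevance_evaluator.aevaluate_response(query=question, response=response)
print(f"Relevance Passing: {relevance_result.passing}")  # boolean verdict
print(f"Relevance Score: {relevance_result.score}")
print(f"Relevance Reason: {relevance_result.feedback}")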

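Correctness is the overall judgment: does the answer actually satisfy the user’s intent? A correct answer is typically both faithful and relevant, and ideally it’s also comprehensive and helpful. In LlamaIndex, the CorrectnessEvaluator does not combine the outputs of the other two evaluators. It makes its own LLM call that grades the generated answer for the query, ideally against a reference answer you supply, and returns a score from 1 to 5. The result counts as passing when the score reaches the evaluator’s threshold (4.0 by default).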

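correctness_evaluator = CorrectnessEvaluator(llm=Settings.llm)
# Placeholder only: replace with a ground-truth answer written by someone who knows the policy.
reference_answer = "<reference answer for this question>"
correctness_result = await correctness_evaluator.aevaluate_response(
    query=question, response=response, reference=reference_answer
)
print(f"Correctness Score: {correctness_result.score} (passing: {correctness_result.passing})")
print(f"Correctness Reason: {correctness_result.feedback}")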

The underlying mechanism for these evaluators is a separate LLM call, often using a carefully crafted prompt. For instance, the faithfulness prompt might look something like: "Given the following context and response, does the response contain any information not supported by the context? Answer YES or NO and explain why." Similarly, relevance prompts guide the LLM to assess if the answer directly addresses the query.
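To make the "separate LLM call" idea concrete, here is a minimal sketch of an LLM-as-judge faithfulness check. It is not LlamaIndex’s implementation: the prompt wording, the `Verdict` type, `judge_faithfulness`, and the `fake_llm` stand-in are all hypothetical, and the example texts are made up. Note that this sketch phrases the question so that YES means "supported", the opposite polarity of the example prompt quoted above.

```python
from typing import Callable, NamedTuple

# Hypothetical prompt wording; LlamaIndex ships its own templates.
FAITHFULNESS_PROMPT = (
    "Context:\n{context}\n\n"
    "Response:\n{response}\n\n"
    "Is every claim in the response supported by the context? "
    "Answer YES or NO on the first line, then explain why."
)


class Verdict(NamedTuple):
    passing: bool
    feedback: str


def judge_faithfulness(context: str, response: str, llm: Callable[[str], str]) -> Verdict:
    """Format the prompt, call the judge LLM, and parse its YES/NO verdict."""
    raw = llm(FAITHFULNESS_PROMPT.format(context=context, response=response))
    first_line, _, rest = raw.strip().partition("\n")
    passing = first_line.strip().upper().startswith("YES")
    return Verdict(passing=passing, feedback=rest.strip() or first_line.strip())


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "NO\nThe response mentions a stipend that the context never states."


# Made-up context and response, used only to exercise the parser.
verdict = judge_faithfulness(
    context="Engineers may work remotely up to three days per week.",
    response="Engineers may work remotely and receive a home-office stipend.",
    llm=fake_llm,
)
print(verdict.passing, "-", verdict.feedback)
```

Swapping `fake_llm` for a function that calls a real model is the only change needed to try this against live output. The parsing step is deliberately strict about the first line so that a rambling judge response does not get misread as a pass.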

The CorrectnessEvaluator plays the role of a holistic grader rather than a literal aggregator: it never sees the faithfulness or relevance results, but its single score reflects many of the same concerns, along with helpfulness and completeness. It’s like a teacher grading an essay: they check if the facts are right (faithfulness), if it answers the prompt (relevance), and then give an overall score based on how well it all comes together. Because that overall score is produced independently, reading all three results side by side is more informative than relying on any one of them.
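One way to read the three results together is a small "report card" that names the most likely failure mode. The sketch below is purely illustrative: `TriadReport`, its field names, and the diagnosis strings are hypothetical, and it assumes a correctness score on a 1 to 5 scale with a passing threshold of 4.0.

```python
from dataclasses import dataclass


@dataclass
class TriadReport:
    """Hypothetical container for the three evaluator outcomes."""

    faithful: bool
    relevant: bool
    correctness_score: float  # assumed to be on a 1-5 scale

    def diagnosis(self, threshold: float = 4.0) -> str:
        """Name the first check that failed, in order of severity."""
        if not self.faithful:
            return "ungrounded: the answer makes claims the retrieved context does not support"
        if not self.relevant:
            return "off-target: the answer is grounded but does not address the question"
        if self.correctness_score < threshold:
            return "weak: grounded and on-topic, but incomplete or wrong overall"
        return "pass"


report = TriadReport(faithful=True, relevant=False, correctness_score=2.0)
print(report.diagnosis())
```

In practice the three fields would be filled from each evaluator result’s `passing` and `score` values. Checking faithfulness first is a design choice: an ungrounded answer is usually the most serious problem, so it should be reported even when the other checks also fail.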

A critical, yet often overlooked, aspect is how the retrieved context itself influences these metrics. If the RAG system retrieves documents that are off-topic or lack the specific information needed, even a perfect LLM will struggle to produce a faithful and relevant answer. The evaluation metrics are thus indirectly assessing the retrieval stage as well. The LLM evaluator is only as good as the data it’s given to judge against.
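Since the judge can only work with what was retrieved, it helps to look at the retrieved chunks before blaming the LLM. The sketch below assumes the response exposes `source_nodes`, each with a `score` and a `node` offering `get_content()`, as LlamaIndex responses do. The helper `summarize_retrieval` and the stand-in objects are hypothetical and exist only so the example runs offline.

```python
from types import SimpleNamespace


def summarize_retrieval(response, max_chars: int = 80):
    """Return (similarity score, text snippet) for each retrieved chunk."""
    rows = []
    for item in response.source_nodes:
        text = item.node.get_content().replace("\n", " ")
        rows.append((item.score, text[:max_chars]))
    return rows


def fake_node(score, text):
    """Stand-in shaped like a retrieved node with a similarity score."""
    return SimpleNamespace(score=score, node=SimpleNamespace(get_content=lambda: text))


# Placeholder chunks; with a real query engine, pass its response object instead.
fake_response = SimpleNamespace(
    source_nodes=[
        fake_node(0.82, "Remote work: placeholder policy text."),
        fake_node(0.41, "Parking permits: placeholder policy text."),
    ]
)

for score, snippet in summarize_retrieval(fake_response):
    print(f"{score:.2f}  {snippet}")
```

If none of the snippets mention the topic of the question, a failed faithfulness or relevance check most likely points at retrieval rather than generation.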

The next logical step in improving RAG quality is to move beyond static evaluation and explore techniques for adaptive retrieval, where the system dynamically adjusts its retrieval strategy based on the nuances of the query and the initial retrieved results.
