RAGAS metrics are more than a single score: together they work as a diagnostic tool that shows why your RAG pipeline is failing, not just that it is failing.
Let’s see RAGAS in action. Imagine you’ve got a RAG pipeline that pulls information from a knowledge base to answer user questions. You’re using LlamaIndex to orchestrate this.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.evaluation import RagasEvaluator
from llama_index.core.response.notebook_utils import display_response
# Load your documents
documents = SimpleDirectoryReader("./data").load_data()
# Build an index
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
# Define your question
question = "What is the capital of France?"
# Get a response
response = query_engine.query(question)
# Initialize RAGAS evaluator
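# RAGAS calls an LLM to compute its scores, so set the OPENAI_API_KEY
# environment variable to a valid OpenAI API key before running this.
# Note: confirm that your installed llama-index version exports RagasEvaluator;
# the ragas package also ships its own LlamaIndex integration.
evaluator = RagasEvaluator(
    raise_error=False,  # Don't stop if one metric fails
    verbose=True,
)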
# Evaluate the response
eval_results = evaluator.evaluate(
    query=question,
    response_text=str(response),
    contexts=[node.get_content() for node in response.source_nodes],
)
# Display results
print(eval_results)
This code snippet sets up a basic RAG pipeline, queries it, and then uses RagasEvaluator from LlamaIndex to score the response. The eval_results dictionary will contain scores for metrics such as faithfulness, answer relevancy, and context precision. Metrics that compare against a reference answer, such as context recall, also need a ground-truth answer, which this snippet does not supply.
The core problem RAGAS solves is the black-box nature of RAG. Before RAGAS, you’d get an answer and think, "Is this good?" You might manually check whether it’s correct, whether it draws on the right documents, and whether it directly answers the question. RAGAS automates these checks with one metric per question.
Here’s what each metric measures:
- Faithfulness: Checks whether the generated answer is factually consistent with the retrieved context. RAGAS uses an LLM to break the answer into individual claims and then judges whether each claim can be inferred from the context. The score is the fraction of claims the context supports, so any unsupported or contradicted claim lowers it.
- Answer Relevancy: Measures how directly the generated answer addresses the user’s question. RAGAS uses an LLM to generate questions that the answer would plausibly respond to, then compares them with the original question by embedding similarity. Incomplete or padded answers score lower; factual correctness is not part of this metric.
- Context Precision: Evaluates how much of the retrieved context is actually relevant to the question and whether the relevant chunks are ranked near the top. It measures retrieval signal versus noise, not how the LLM used the context.
- Context Recall: Assesses whether the retrieved context contains all the information needed to answer the question. RAGAS checks how many claims in a reference (ground-truth) answer can be attributed to the retrieved context, so this metric requires a reference answer.
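Once the LLM judgments are in hand, the arithmetic behind the four scores is simple. The sketch below is a simplified illustration based on the metric definitions in the RAGAS documentation, not the RAGAS implementation: it assumes the LLM has already produced per-claim and per-chunk verdicts (passed in as plain booleans) and that question embeddings are available as lists of floats. All function names here are hypothetical.

```python
import math


def faithfulness(claim_supported):
    """Fraction of the answer's claims that the context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy(question_vec, generated_question_vecs):
    """Mean similarity between the real question and questions generated from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


def context_precision(chunk_relevant):
    """Average of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def context_recall(reference_claim_found):
    """Fraction of reference-answer claims attributable to the retrieved context."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)
```

Because context precision is rank-aware, the same single relevant chunk scores very differently depending on where it lands: context_precision([True, False, False]) is 1.0, while context_precision([False, False, True]) is about 0.33.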
The RagasEvaluator in LlamaIndex wraps these core RAGAS metrics. When you call evaluator.evaluate(), it takes your question, the LLM’s answer, and the source nodes (the context) and passes them to the RAGAS framework. RAGAS then uses its own LLM calls to compute scores for each metric.
The real power comes from understanding what each metric tells you about your pipeline’s components:
- Low Faithfulness: Your LLM might be hallucinating or misinterpreting the retrieved context. This points to issues with the LLM’s reasoning capabilities or the quality/clarity of the retrieved documents.
- Low Answer Relevancy: Your LLM isn’t directly answering the question, even if the context is relevant. This could mean your prompt is poorly designed, or the LLM is going off-topic.
- Low Context Precision: Your retriever is pulling in a lot of noise, or it ranks irrelevant chunks above the relevant ones. This often means retrieval is too broad (for example, a top-k that is too large); a reranker or a stricter similarity cutoff can help.
- Low Context Recall: Your retriever isn’t finding the necessary information in the first place. This is a direct indictment of your retrieval mechanism (e.g., embedding model, chunking strategy, search algorithm).
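The mapping from a weak metric to the component worth inspecting first can be written down as a small lookup. This is a minimal sketch: the diagnose helper, the HINTS table, and the 0.7 threshold are all hypothetical choices for illustration, and the metric key names are assumed to match whatever your evaluation results use.

```python
# Hypothetical mapping from a low metric to the component to inspect first.
HINTS = {
    "faithfulness": "generation: the answer makes claims the context does not support",
    "answer_relevancy": "generation: the answer does not address the question; review the prompt",
    "context_precision": "retrieval: relevant chunks are buried in noise; try reranking or a smaller top-k",
    "context_recall": "retrieval: needed information was not retrieved; review embeddings and chunking",
}


def diagnose(scores, threshold=0.7):
    """Return (metric, score, hint) for each known metric below threshold, worst first."""
    low = [
        (metric, score, HINTS[metric])
        for metric, score in scores.items()
        if metric in HINTS and score < threshold
    ]
    return sorted(low, key=lambda item: item[1])


# Example scores (made up for illustration, not real results).
example_scores = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.90,
    "context_precision": 0.40,
    "context_recall": 0.60,
}
for metric, score, hint in diagnose(example_scores):
    print(f"{metric}={score:.2f} -> {hint}")
```

With these example scores, both flagged metrics point at retrieval, which suggests looking at the retriever before touching the prompt or the LLM.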
A common pitfall is treating RAGAS as a simple pass/fail system. The individual metric scores are far more valuable, especially when read together. For example, an answer might be highly faithful to the context but irrelevant to the question. If context precision is also high, the retriever found useful information and the generation step failed to use it. If context precision is low, the answer is probably a faithful summary of off-topic context, and the retriever is the place to look.
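One way to avoid collapsing everything into pass/fail is to keep per-metric scores for every test question and look at them side by side. The sketch below assumes you have collected the scores into a list of dicts yourself; the row layout, the helper names, and the 0.8 and 0.5 cutoffs are hypothetical, and the numbers are made up for illustration.

```python
from statistics import mean


def summarize(rows, metrics):
    """Mean score per metric across all evaluated questions."""
    return {m: mean(r[m] for r in rows) for m in metrics}


def faithful_but_irrelevant(rows, high=0.8, low=0.5):
    """Questions whose answers are grounded in the context yet miss the question."""
    return [
        r["question"]
        for r in rows
        if r["faithfulness"] >= high and r["answer_relevancy"] <= low
    ]


# Illustrative scores only, not real evaluation output.
rows = [
    {"question": "q1", "faithfulness": 1.0, "answer_relevancy": 0.9},
    {"question": "q2", "faithfulness": 0.9, "answer_relevancy": 0.3},
    {"question": "q3", "faithfulness": 0.5, "answer_relevancy": 0.9},
]

print(summarize(rows, ["faithfulness", "answer_relevancy"]))
print(faithful_but_irrelevant(rows))
```

Here only q2 is flagged: its answer sticks to the context but does not address the question, which an overall average would hide.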
The next step after evaluating your RAG pipeline with RAGAS is often to adjust your retrieval strategy, or your prompt, based on which specific metrics are failing.