The most surprising thing about LLM evaluation is that "quality" isn’t a single, objective metric, but rather a constellation of subjective, context-dependent user needs that we’re trying to approximate with automated signals.
Let’s see this in action. Imagine we have a simple LLM application that summarizes news articles.
```python
from transformers import pipeline

# Load a summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
The latest quarterly earnings report from TechCorp Inc. showed a significant increase in revenue,
driven by strong performance in their cloud computing division. Net income rose by 15% year-over-year,
exceeding analyst expectations. The company announced plans to invest heavily in AI research and development
in the coming fiscal year, with a focus on expanding their generative AI capabilities.
"""

summary = summarizer(article, max_length=50, min_length=20, do_sample=False)[0]['summary_text']
print(summary)
```

Output:

```
TechCorp Inc.'s latest quarterly earnings report showed a significant increase in revenue, driven by strong performance in their cloud computing division. Net income rose by 15% year-over-year, exceeding analyst expectations.
```
This looks good, but how do we know it’s good? And how do we track if it gets better or worse over time as we fine-tune our model or change prompts? This is where automated MLOps evaluation comes in.
The core problem we’re solving is bridging the gap between human judgment of LLM output quality and scalable, repeatable automated checks. Humans can read a summary and say, "This is concise, captures the main points, and is factually accurate." Automating this requires breaking down these human judgments into measurable signals.
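To make that decomposition concrete, here is a rough sketch that turns two of those judgments into numbers: conciseness as a word-count compression ratio, and main-point coverage as the fraction of hand-picked key terms that show up in the summary. The function names and the `key_terms` list are hypothetical, and both signals are crude proxies rather than established metrics.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def compression_ratio(source: str, summary: str) -> float:
    """Summary length divided by source length, in words (lower = more concise)."""
    source_len = len(tokenize(source))
    return len(tokenize(summary)) / source_len if source_len else 0.0

def key_term_coverage(summary: str, key_terms: list[str]) -> float:
    """Fraction of key terms whose words all appear somewhere in the summary."""
    if not key_terms:
        return 0.0
    summary_tokens = set(tokenize(summary))
    hits = sum(1 for term in key_terms if set(tokenize(term)) <= summary_tokens)
    return hits / len(key_terms)

# Toy inputs, invented for illustration.
source = "Revenue rose on strong cloud computing sales and net income grew 15 percent this quarter"
summary_text = "Revenue rose on cloud sales"

signals = {
    "compression_ratio": compression_ratio(source, summary_text),
    "key_term_coverage": key_term_coverage(summary_text, ["revenue", "cloud", "net income"]),
}
print(signals)
```

Neither number says anything about factual accuracy, which is exactly why a single signal is rarely enough.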
At a high level, LLM evaluation in MLOps involves these components:
- Golden Datasets (Test Sets): A curated set of prompts and their ideal or acceptable outputs. These are the ground truth against which we compare our LLM’s generations. For our summarization example, this would be articles paired with human-written, high-quality summaries.
- Evaluation Metrics: These are the specific, quantifiable measures we use to score the LLM’s output against the golden dataset. They fall into several categories:
  - Lexical Overlap: Metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) measure the overlap of n-grams between the generated text and the reference text. ROUGE-L, for example, focuses on the longest common subsequence.
  - Semantic Similarity: Metrics like BERTScore, or cosine similarity between sentence embeddings, capture meaning rather than exact word matches. These are crucial because a summary can be good even if it uses different words than the reference.
  - Task-Specific Metrics: For summarization, we might also look at factuality (is the summary true to the source?), conciseness (is it short enough?), and relevance (does it cover the most important points?). For question answering, we’d check accuracy. For code generation, we’d check if the code compiles and passes unit tests.
  - LLM-as-a-Judge: Increasingly, we use a powerful LLM itself to evaluate the output of another LLM. This can be framed as a prompt: "Given the following article and summary, rate the summary on a scale of 1-5 for conciseness and accuracy. Provide your reasoning."
- Evaluation Framework/Pipeline: This is the code that orchestrates the process:
  - Takes a prompt (or a batch of prompts).
  - Sends it to the LLM to get a generation.
  - Compares the generation against the golden dataset’s reference output using the chosen metrics.
  - Logs the scores and the generation for analysis.
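The four pipeline steps above can be sketched as a small, library-free loop. Everything here is an assumption made for illustration: `run_eval`, `fake_generate`, and the golden example are hypothetical, and `lcs_f1` is a simplified ROUGE-L-style score on whitespace tokens (no stemming), not the implementation from the `rouge-score` package.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-L-style F1: LCS-based precision and recall on lowercase tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def run_eval(golden, generate, metrics):
    """For each golden example: generate, score against the reference, log a row."""
    rows = []
    for example in golden:
        output = generate(example["prompt"])                      # steps 1 and 2
        scores = {name: fn(example["reference"], output)          # step 3
                  for name, fn in metrics.items()}
        rows.append({"prompt": example["prompt"], "output": output, **scores})  # step 4
    return rows

def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "revenue grew on cloud sales"

golden = [{
    "prompt": "Summarize the TechCorp earnings article.",
    "reference": "revenue grew on strong cloud sales",
}]

rows = run_eval(golden, fake_generate, {"lcs_f1": lcs_f1})
print(rows)
```

Passing `generate` and `metrics` in as arguments keeps the loop unchanged when we swap the model or add a metric.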
Here’s a simplified Python snippet using rouge-score and sentence-transformers for our summarization example.
```python
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

# Assume 'golden_summaries' is a list of human-written summaries for our articles
golden_summaries = [
    "TechCorp's Q3 earnings report showed revenue growth, especially in cloud computing, beating analyst forecasts. They plan AI R&D investment.",
    # ... other golden summaries
]
generated_summaries = [summary]  # Our LLM's output, from the summarization snippet above

# ROUGE-L: score(target, prediction) takes the reference first, then the generation
scorer_rouge = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
rouge_scores = [
    scorer_rouge.score(gold, gen)['rougeL'].fmeasure
    for gold, gen in zip(golden_summaries, generated_summaries)
]

# Semantic similarity via sentence embeddings. Note this is NOT BERTScore:
# it compares one embedding per summary rather than matching tokens.
# For actual BERTScore, use a dedicated library such as 'bert_score'.
model_st = SentenceTransformer('all-MiniLM-L6-v2')

# Convert to embeddings
embeddings_gold = model_st.encode(golden_summaries, convert_to_tensor=True)
embeddings_gen = model_st.encode(generated_summaries, convert_to_tensor=True)

# Calculate cosine similarity pair by pair; zip stops at the shorter list
cosine_scores = [
    util.cos_sim(emb_gold, emb_gen).item()
    for emb_gold, emb_gen in zip(embeddings_gold, embeddings_gen)
]

print(f"ROUGE-L F1 Scores: {rouge_scores}")
print(f"Cosine Similarity Scores: {cosine_scores}")
```
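This setup allows us to track metrics like ROUGE-L F1 and cosine similarity over time. The numbers here are illustrative: if ROUGE-L F1 sits around 0.78 and drops to 0.65 after a model update, that is a signal worth investigating. It is not proof that quality got worse, since a lower overlap score can also come from valid paraphrasing, so check it against the other metrics and a sample of the actual outputs.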
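One way to act on this is a simple regression check between two evaluation runs. The sketch below is a minimal version under stated assumptions: `find_regressions`, the `max_drop` threshold of 0.05, and all the scores are made up for illustration, and a real setup would likely also account for run-to-run noise.

```python
from statistics import mean

def find_regressions(baseline, current, max_drop=0.05):
    """Return metrics whose mean score fell by more than max_drop versus the baseline."""
    regressions = {}
    for name, base_scores in baseline.items():
        if name not in current:
            continue  # metric not computed in the current run; nothing to compare
        drop = mean(base_scores) - mean(current[name])
        if drop > max_drop:
            regressions[name] = round(drop, 4)
    return regressions

# Illustrative per-example scores from two runs (invented numbers).
baseline_run = {"rougeL_f1": [0.78, 0.80], "cosine": [0.85, 0.87]}
current_run = {"rougeL_f1": [0.65, 0.67], "cosine": [0.84, 0.86]}

regressions = find_regressions(baseline_run, current_run)
print(regressions)
```

Here only ROUGE-L is flagged while cosine similarity barely moves, which is the pattern we would expect if the new model is paraphrasing more rather than getting worse.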
The one thing most people don’t realize is that the choice of metric is a feature engineering step for your LLM evaluation. If you only use ROUGE, you’re implicitly defining "good" as "word overlap with the reference." If your LLM starts generating semantically equivalent but paraphrased summaries, ROUGE will penalize it. You need a suite of metrics that capture different facets of quality, and you might even weight them based on your specific application’s priorities. For example, for a medical summarizer, factuality might be weighted much higher than conciseness.
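As a sketch of that weighting idea, the snippet below combines per-metric scores into one number using application-specific weights. The `weighted_score` helper, the metric names, the scores, and both weight profiles are hypothetical, and it assumes every metric has already been scaled to the range 0 to 1.

```python
def weighted_score(scores, weights):
    """Weighted average of per-metric scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weight for name, weight in weights.items()) / total

# One summary's scores (invented): fluent and short, but shaky on facts.
scores = {"factuality": 0.60, "rougeL_f1": 0.80, "conciseness": 0.90}

# Two hypothetical priority profiles for the same metrics.
medical_weights = {"factuality": 0.7, "rougeL_f1": 0.2, "conciseness": 0.1}
news_weights = {"factuality": 0.4, "rougeL_f1": 0.3, "conciseness": 0.3}

medical = weighted_score(scores, medical_weights)
news = weighted_score(scores, news_weights)
print(f"medical: {medical:.2f}, news: {news:.2f}")
```

The same output scores lower under the medical profile because its weak factuality counts for more there. Keep the individual metrics in your logs as well, since a single blended number hides which facet moved.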
The next problem you’ll run into is how to handle the cost and latency of running these evaluations, especially when using LLM-as-a-Judge or complex semantic similarity models on large datasets.
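One common first step is to avoid paying for the same judgment twice. The sketch below wraps the judge prompt from the LLM-as-a-Judge bullet in an in-memory cache keyed by a hash of the article and summary. It is a sketch under assumptions: `CachedJudge`, `parse_score`, and `fake_judge` are hypothetical names, `call_judge` stands in for whatever client you use to reach the judge model, and the "Score: N" line is an extra instruction added to the prompt so the reply is easy to parse.

```python
import hashlib
import re

JUDGE_TEMPLATE = (
    "Given the following article and summary, rate the summary on a scale of 1-5 "
    "for conciseness and accuracy. Provide your reasoning, then finish with a line "
    "of the form 'Score: N'.\n\nArticle:\n{article}\n\nSummary:\n{summary}"
)

def parse_score(reply):
    """Extract the 1-5 rating from the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

class CachedJudge:
    """Calls the judge once per unique (article, summary) pair."""

    def __init__(self, call_judge):
        self.call_judge = call_judge  # hypothetical callable: prompt str -> reply str
        self.cache = {}
        self.calls = 0  # number of real judge calls made

    def score(self, article, summary):
        key = hashlib.sha256(f"{article}\x00{summary}".encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            prompt = JUDGE_TEMPLATE.format(article=article, summary=summary)
            self.cache[key] = parse_score(self.call_judge(prompt))
        return self.cache[key]

def fake_judge(prompt):
    """Stand-in for a real judge model so the sketch runs offline."""
    return "Covers the main points without padding.\nScore: 4"

judge = CachedJudge(fake_judge)
first = judge.score("Some article text.", "A short summary.")
second = judge.score("Some article text.", "A short summary.")  # served from cache
print(first, second, judge.calls)
```

The cache only helps when outputs repeat between runs, for example when a prompt change leaves most generations untouched, so it complements rather than replaces sampling a smaller evaluation set.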