MLflow LLM Evaluation: Automated Judging and Metrics (2026)

MLflow’s LLM evaluation features let you use an LLM as a judge, scoring your application’s outputs against predefined criteria.

Let’s see it in action. Imagine we have a simple LLM application that summarizes news articles. We want to evaluate how well it’s doing.

import mlflow
import pandas as pd
from mlflow.models import ModelSignature
from mlflow.types.schema import ColSpec, Schema

# Assume you have a function that generates summaries
def generate_summary(article_text):
    # This is a placeholder for your actual LLM call
    # In a real scenario, this would call an LLM API or model
    return f"Summary of: {article_text[:50]}..."

# Wrap the function in a PythonModel so it can be logged as a pyfunc model
class SummarizerModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input, params=None):
        # model_input is a DataFrame with an "article" column
        return model_input["article"].apply(generate_summary).tolist()

# Sample data
data = {
    "article": [
        "The latest advancements in quantum computing promise to revolutionize data processing.",
        "Scientists have discovered a new exoplanet with potential for liquid water.",
        "The stock market experienced a significant downturn today due to inflation fears."
    ],
    "expected_summary": [
        "Quantum computing advancements will transform data processing.",
        "New exoplanet found, possibly with water.",
        "Stock market drops amid inflation concerns."
    ]
}
df = pd.DataFrame(data)

# Declare the model signature explicitly (ColSpec takes the type first, then the name)
input_schema = Schema([ColSpec(type="string", name="article")])
output_schema = Schema([ColSpec(type="string", name="summary")])
signature = ModelSignature(inputs=input_schema, outputs=output_schema)

# Describe the conda environment as a plain dict, which log_model accepts
conda_env = {
    "name": "summarizer_env",
    "channels": ["conda-forge"],
    "dependencies": [
        "python=3.9.12",
        "pip",
        {"pip": ["mlflow", "pandas", "transformers"]},  # Add necessary libraries
    ],
}

# Log the model (in this case, a thin wrapper around our simple function)
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        "summarizer_model",
        python_model=SummarizerModel(),
        signature=signature,
        conda_env=conda_env,
        input_example=df[["article"]].head(1),
    )
    print(f"Model logged with run ID: {run.info.run_id}")

# Now, let's set up evaluation
from mlflow.models.evaluation import EvaluationResult
from mlflow.models.evaluation.base import ModelEvaluator

class CustomSummaryEvaluator(ModelEvaluator):
    def __init__(self, expected_summaries):
        self.expected_summaries = expected_summaries

    @classmethod
    def can_evaluate(cls, *, model_type, evaluator_config, **kwargs):
        # This evaluator only handles summarization models
        return model_type == "text-summarization"

    def evaluate(self, *, model, model_type, dataset, run_id, evaluator_config, **kwargs):
        # Here dataset is a plain DataFrame with an "article" column
        predictions = model.predict(dataset[["article"]])
        results = pd.DataFrame({
            "article": dataset["article"].tolist(),
            "predicted_summary": list(predictions),
            "expected_summary": self.expected_summaries,
        })
        # Store the side-by-side comparison as a table artifact on the run
        mlflow.log_table(data=results, artifact_file="summary_predictions.json", run_id=run_id)
        return EvaluationResult(
            metrics={"custom_metric": 0.5},  # Placeholder metric
            artifacts={},
        )

# Assuming you have the run_id from the previous step
run_id_to_evaluate = run.info.run_id  # Replace with your actual run ID

# Load the logged model
logged_model_uri = f"runs:/{run_id_to_evaluate}/summarizer_model"
loaded_model = mlflow.pyfunc.load_model(logged_model_uri)

# Prepare data for evaluation
eval_data = df.copy()

# Instantiate and run the evaluator. It is called directly here for illustration;
# evaluators registered as MLflow plugins are invoked through mlflow.evaluate().
evaluator = CustomSummaryEvaluator(eval_data["expected_summary"].tolist())
evaluation = evaluator.evaluate(
    model=loaded_model,
    model_type="text-summarization",
    dataset=eval_data,
    run_id=run_id_to_evaluate,
    evaluator_config={"custom_param": "value"},
)

# Log the evaluation metrics to the same run as the model
with mlflow.start_run(run_id=run_id_to_evaluate):
    mlflow.log_metrics(evaluation.metrics)

print("Evaluation logged successfully!")

MLflow’s LLM evaluation isn’t just about spitting out numbers; it’s about creating a framework to systematically assess the quality of your language model’s outputs. The core idea is to define what "good" looks like and then automate the process of checking if your model meets those standards. This goes beyond simple accuracy or F1 scores, allowing you to evaluate subjective qualities like coherence, relevance, and adherence to specific instructions.

One way to structure this is a custom ModelEvaluator class, as in the example above. An evaluator takes your model and a dataset (often with ground truth or desired outputs) and produces a set of predictions along with computed metrics. Once those results are logged to a run, they are visible and comparable across runs in the MLflow UI. Metrics can be standard ones such as ROUGE or BLEU, or custom metrics that you define for your specific use case. In day-to-day use you will more often call mlflow.evaluate(), which runs MLflow’s built-in evaluator plus any evaluator plugins you have registered, and add your own metrics through its extra_metrics argument.
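To make "predictions along with computed metrics" concrete, here is a minimal sketch of a reference-based metric in plain Python. It computes a unigram-overlap F1 in the spirit of ROUGE-1; it is not MLflow's ROUGE implementation, and the helper names (tokenize, unigram_f1, score_summaries) and metric keys are assumptions made for this illustration.

```python
from collections import Counter
from statistics import mean

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def unigram_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (ROUGE-1 style)."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def score_summaries(predictions, references):
    """Return per-row scores plus aggregate metrics suitable for logging."""
    scores = [unigram_f1(p, r) for p, r in zip(predictions, references)]
    metrics = {"unigram_f1_mean": mean(scores), "unigram_f1_min": min(scores)}
    return {"scores": scores, "metrics": metrics}


result = score_summaries(
    ["Stock market drops"],
    ["Stock market drops amid inflation concerns."],
)
print(result["metrics"])
```

The aggregate dictionary is a flat mapping of names to floats, which is the shape mlflow.log_metrics() accepts, so a metric like this could replace the placeholder value in an evaluator.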

The key levers you control are the ModelEvaluator implementation and the data you use for evaluation. For the ModelEvaluator, you’re essentially writing the logic for how to score your model’s output. This could involve comparing generated text against a reference, using another LLM to judge the output, or applying a set of predefined rules. The data is crucial; it needs to be representative of the real-world scenarios your model will encounter and should include the necessary ground truth or criteria for evaluation.
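The rule-based option can be sketched in a few lines. The rules below (a word limit, a non-empty check, and two guards against simply echoing the article) are illustrative assumptions, not rules MLflow ships with, and check_summary and pass_rates are hypothetical helper names.

```python
def check_summary(article, summary, max_words=30):
    """Apply a few illustrative rules and return a dict of rule name -> bool."""
    words = summary.split()
    return {
        "non_empty": len(words) > 0,
        "within_length": len(words) <= max_words,
        "shorter_than_article": len(summary) < len(article),
        "not_verbatim_copy": summary.strip() != article.strip(),
    }


def pass_rates(rows):
    """Aggregate per-row rule results into a pass rate per rule."""
    results = [check_summary(row["article"], row["summary"]) for row in rows]
    return {
        rule: sum(res[rule] for res in results) / len(results)
        for rule in results[0]
    }


rows = [
    {
        "article": "The stock market experienced a significant downturn today due to inflation fears.",
        "summary": "Stock market drops amid inflation concerns.",
    },
    # A degenerate "summary" that just repeats its input
    {"article": "Short text.", "summary": "Short text."},
]
print(pass_rates(rows))
```

Pass rates per rule are easy to track across runs, and a rule that suddenly drops points directly at the kind of failure that was introduced.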

One of the most powerful aspects is the ability to use LLMs themselves as evaluators. MLflow provides built-in support for using LLMs to score outputs based on prompts you define. For example, you can prompt an LLM like GPT-4 with instructions like: "Given the following article and its summary, rate the summary on a scale of 1 to 5 for factual accuracy and coherence. Provide a brief explanation for your rating." MLflow then orchestrates this process, sending the data and prompts to the LLM, collecting the scores, and aggregating them into metrics. This allows for nuanced, human-like evaluation at scale, even for qualitative aspects of text generation that are hard to capture with traditional metrics.
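The orchestration described above (format a prompt per row, call the judge, parse the score, aggregate) can be sketched without any MLflow dependency. This is an illustration of the pattern under stated assumptions, not MLflow's implementation: call_judge is a hypothetical callable that sends a prompt to whichever LLM you use, fake_judge is a stand-in so the sketch runs offline, and the "Score:/Explanation:" reply format is a convention chosen here. In MLflow 2.x, built-in helpers such as mlflow.metrics.genai.make_genai_metric cover this workflow for you.

```python
import re
from statistics import mean

JUDGE_PROMPT = (
    "Given the following article and its summary, rate the summary on a scale "
    "of 1 to 5 for factual accuracy and coherence. Provide a brief explanation "
    "for your rating.\n\n"
    "Article: {article}\nSummary: {summary}\n\n"
    "Reply in the form 'Score: <1-5>' followed by 'Explanation: <text>'."
)


def parse_judge_reply(reply):
    """Extract (score, explanation); score is None if no valid rating is found."""
    score_match = re.search(r"Score:\s*([1-5])\b", reply)
    score = int(score_match.group(1)) if score_match else None
    expl_match = re.search(r"Explanation:\s*(.+)", reply, flags=re.DOTALL)
    return score, (expl_match.group(1).strip() if expl_match else "")


def judge_summaries(rows, call_judge):
    """Send each row to the judge and aggregate the scores that could be parsed."""
    graded = []
    for row in rows:
        reply = call_judge(JUDGE_PROMPT.format(**row))
        score, explanation = parse_judge_reply(reply)
        graded.append({**row, "score": score, "explanation": explanation})
    valid = [g["score"] for g in graded if g["score"] is not None]
    metrics = {
        "judge_score_mean": mean(valid) if valid else None,
        "judge_unparsed": len(graded) - len(valid),
    }
    return graded, metrics


def fake_judge(prompt):
    # Stand-in for a real LLM call so the sketch runs offline
    return "Score: 4\nExplanation: Accurate but omits the cause."


graded, metrics = judge_summaries(
    [{"article": "Markets fell on inflation fears.", "summary": "Markets fell."}],
    fake_judge,
)
print(metrics)
```

Counting unparsed replies separately matters in practice: a judge that ignores the output format would otherwise silently shrink the sample behind the mean score.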

The next step is to explore integrating with various LLM providers and understanding how to define complex, multi-faceted evaluation prompts.