MLflow Model Evaluation: Compare Metrics Across Runs (2026)

MLflow’s metric comparison is useful because it puts runs side by side, so you can see the difference in performance between two models rather than just their individual scores.

Let’s say you’ve trained a few models for a classification task and logged their metrics using MLflow. You want to see which one is truly better.

In the MLflow UI, you’d open the "Experiments" view, select the runs you want to compare, and click "Compare". Here’s an illustration of what the resulting comparison might look like:

| Metric       | Run 1 (UUID: abcdef123) | Run 2 (UUID: fedcba987) | Difference |
|--------------|-------------------------|-------------------------|------------|
| `accuracy`   | 0.85                    | 0.92                    | +0.07      |
| `precision`  | 0.88                    | 0.91                    | +0.03      |
| `recall`     | 0.82                    | 0.93                    | +0.11      |
| `f1_score`   | 0.85                    | 0.92                    | +0.07      |

The table shows the raw metrics for each run, and the "Difference" column (Run 2 minus Run 1) shows how much better Run 2 is on each metric. For example, Run 2’s accuracy is 0.07 higher than Run 1’s. Seeing the gap directly, rather than reading two scores and subtracting in your head, makes it easier to decide which model to promote.
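The "Difference" column is plain subtraction, which is easy to reproduce outside the UI. Here is a minimal sketch that uses the values from the table above; the helper name `metric_differences` is made up for this example and is not part of MLflow.

```python
def metric_differences(baseline, candidate):
    """Return candidate - baseline for every metric that both runs logged."""
    shared = baseline.keys() & candidate.keys()
    # Round to hide floating-point noise such as 0.07000000000000006.
    return {name: round(candidate[name] - baseline[name], 4) for name in sorted(shared)}


# Metric values copied from the comparison table above.
run_1 = {"accuracy": 0.85, "precision": 0.88, "recall": 0.82, "f1_score": 0.85}
run_2 = {"accuracy": 0.92, "precision": 0.91, "recall": 0.93, "f1_score": 0.92}

diffs = metric_differences(run_1, run_2)
for name, delta in diffs.items():
    print(f"{name:<10} {run_1[name]:.2f} -> {run_2[name]:.2f}  ({delta:+.2f})")
```

Only metrics present in both runs are compared, so a metric logged by just one run is skipped rather than raising an error.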

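The arithmetic behind the "Difference" column is simple: for each metric, subtract Run 1’s logged value from Run 2’s. Depending on your MLflow version, the Compare page may list the values side by side without a computed difference column, in which case the subtraction is easy to do yourself. Either way, the metrics have to be logged first, which you do with `mlflow.log_metric(key, value)` in your training script.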

import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Load data and hold out 20% for evaluation
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)


def evaluate(model):
    """Compute the same four metrics for every run so the runs stay comparable."""
    y_pred = model.predict(X_test)
    # Iris has three classes, so precision, recall, and F1 are macro-averaged.
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
        "f1_score": f1_score(y_test, y_pred, average="macro"),
    }


# --- Run 1 ---
with mlflow.start_run(run_name="Logistic Regression - Run 1") as run1:
    lr = LogisticRegression(max_iter=200)
    lr.fit(X_train, y_train)
    mlflow.log_param("C", lr.C)  # log the hyperparameter so it appears in the comparison
    metrics = evaluate(lr)
    for name, value in metrics.items():
        mlflow.log_metric(name, value)
    print("Run 1 logged metrics: " + ", ".join(f"{k}={v:.2f}" for k, v in metrics.items()))


# --- Run 2 (with slightly different parameters or data) ---
with mlflow.start_run(run_name="Logistic Regression - Run 2") as run2:
    lr_tuned = LogisticRegression(max_iter=200, C=0.5)  # Example: tuned hyperparameter
    lr_tuned.fit(X_train, y_train)
    mlflow.log_param("C", lr_tuned.C)
    metrics_tuned = evaluate(lr_tuned)
    for name, value in metrics_tuned.items():
        mlflow.log_metric(name, value)
    print("Run 2 logged metrics: " + ", ".join(f"{k}={v:.2f}" for k, v in metrics_tuned.items()))

The "Compare" view is not just for metrics. You can also compare parameters and tags, which is essential for understanding why one run performed better than another. Did a change in `learning_rate` or `batch_size` lead to the improvement? The comparison table will show you.
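When a comparison has many parameters, it helps to list only the ones that actually changed. Below is a small sketch of that idea; the function name and the parameter values are hypothetical. The values are written as strings because MLflow stores parameters as strings.

```python
def differing_params(params_by_run):
    """Return {param: {run: value}} for parameters whose value is not identical across runs.

    `params_by_run` maps a run label to that run's parameter dict. A parameter
    missing from a run shows up with the value None for that run.
    """
    all_names = set().union(*(params.keys() for params in params_by_run.values()))
    changed = {}
    for name in sorted(all_names):
        values = {run: params.get(name) for run, params in params_by_run.items()}
        if len(set(values.values())) > 1:
            changed[name] = values
    return changed


# Hypothetical parameters for two runs.
params = {
    "run_1": {"learning_rate": "0.01", "batch_size": "32", "optimizer": "adam"},
    "run_2": {"learning_rate": "0.001", "batch_size": "32", "optimizer": "adam"},
}

changed = differing_params(params)
print(changed)
```

Here only `learning_rate` is reported, which points to it as the first candidate explanation for any metric difference between the two runs.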

A common pitfall is forgetting to log the same metrics for every run. If one run logs only `accuracy` and another logs `accuracy` and `precision`, the comparison has a gap for `precision` in the first run, and no difference can be computed for it. Keep your logging consistent across runs.
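A quick consistency check can catch this before you open the comparison. The sketch below assumes you already have each run’s metrics as a dict; `missing_metrics` is a made-up helper name, not an MLflow function.

```python
def missing_metrics(metrics_by_run):
    """Return {run: [metric names that other runs logged but this run did not]}."""
    all_names = set().union(*(metrics.keys() for metrics in metrics_by_run.values()))
    gaps = {}
    for run, metrics in metrics_by_run.items():
        absent = sorted(all_names - set(metrics))
        if absent:
            gaps[run] = absent
    return gaps


# The situation described above: run_1 never logged precision.
logged = {
    "run_1": {"accuracy": 0.85},
    "run_2": {"accuracy": 0.92, "precision": 0.91},
}

gaps = missing_metrics(logged)
print(gaps)
```

An empty result means every run logged the same set of metric names.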

A difference is only meaningful for numerical values. MLflow metrics are always stored as numbers, so any logged metric can be subtracted. Parameters and tags are stored as strings, so they are simply shown side by side and the "Difference" column does not apply to them.

To build a comparison, check the boxes next to the runs you’re interested in on the Experiments page, then click the "Compare" button that appears at the top. The comparison table is generated from your selection.

If you’re using MLflow’s Python API, you can retrieve runs and their metrics programmatically to build custom comparison reports. The `mlflow.search_runs()` function lets you filter and sort runs by various criteria. You can then iterate through the results, extract the metrics, and perform your own comparisons.
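As a rough sketch of such a report, the code below fetches runs with `mlflow.search_runs()` and computes each run’s metric differences against a chosen baseline. It asks for `output_format="list"` so that each result is a run object with `info.run_id` and `data.metrics`, instead of the default pandas DataFrame. The helper names, the stand-in runs, and the experiment name `"my-experiment"` are placeholders for this example.

```python
from types import SimpleNamespace


def fetch_runs(experiment_name):
    """Return an experiment's runs as a list of MLflow Run objects."""
    import mlflow  # local import: only this function needs MLflow installed

    return mlflow.search_runs(
        experiment_names=[experiment_name],
        output_format="list",  # Run objects instead of the default DataFrame
    )


def compare_to_baseline(runs, baseline_id, metric_names):
    """For each non-baseline run, return {metric: run value - baseline value}.

    Metrics missing from either the run or the baseline are skipped.
    """
    by_id = {run.info.run_id: run.data.metrics for run in runs}
    baseline = by_id[baseline_id]
    report = {}
    for run_id, metrics in by_id.items():
        if run_id == baseline_id:
            continue
        report[run_id] = {
            name: metrics[name] - baseline[name]
            for name in metric_names
            if name in metrics and name in baseline
        }
    return report


def fake_run(run_id, metrics):
    """Stand-in exposing the two attributes used above (demo only)."""
    return SimpleNamespace(
        info=SimpleNamespace(run_id=run_id),
        data=SimpleNamespace(metrics=metrics),
    )


# Demo with stand-in runs. Against a real tracking server you would use
# runs = fetch_runs("my-experiment") instead.
runs = [
    fake_run("abcdef123", {"accuracy": 0.85, "f1_score": 0.85}),
    fake_run("fedcba987", {"accuracy": 0.92, "f1_score": 0.92}),
]
report = compare_to_baseline(runs, "abcdef123", ["accuracy", "f1_score"])
for run_id, deltas in report.items():
    print(run_id, {name: f"{delta:+.2f}" for name, delta in deltas.items()})
```

Keeping the comparison logic separate from the fetch makes it easy to test without a tracking server and to reuse with runs obtained some other way.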

Consider a scenario where you’re comparing models trained on different datasets or subsets. The "Compare" feature still works, but it’s crucial to interpret the metric differences in the context of the data used. A higher accuracy on a harder subset might be more significant than a similar absolute gain on an easier one.

This comparative analysis is fundamental to the MLOps lifecycle, enabling rapid iteration and informed decision-making about model deployment. It shifts the focus from just knowing a score to understanding improvement and relative performance.

The next level of comparison involves visualizing metric trends over time or across different hyperparameter sweeps, which MLflow’s plotting features facilitate.