MLflow’s run comparison feature is less about comparing the results of your ML experiments and more about comparing the recipes that produced those results.
Let’s say you’ve trained a few versions of a machine learning model. You’ve logged parameters, metrics, and artifacts for each training run using MLflow. Now you want to see which set of parameters led to the best performance. You head to the MLflow UI, select a few runs, and hit "Compare."
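Here’s a simplified sketch of the data behind that view (the actual REST API returns params, metrics, and tags as lists of key/value entries rather than maps):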
    {
      "runs": [
        {
          "run_uuid": "a1b2c3d4e5f678901234567890abcdef",
          "info": {
            "experiment_id": "1234567890",
            "run_name": "experiment_v1",
            "start_time": 1678886400000,
            "end_time": 1678886500000,
            "lifecycle_stage": "active"
          },
          "data": {
            "params": {
              "learning_rate": "0.01",
              "num_epochs": "100"
            },
            "metrics": {
              "accuracy": "0.85",
              "precision": "0.88"
            },
            "tags": {
              "model_type": "random_forest"
            }
          },
          "inputs": {
            "dataset": [
              {
                "destination_path": "data/raw",
                "path": "s3://my-bucket/data/raw.csv"
              }
            ]
          },
          "artifacts": {
            "model_path": "model/model.pkl"
          }
        },
        {
          "run_uuid": "f0e9d8c7b6a543210fedcba987654321",
          "info": {
            "experiment_id": "1234567890",
            "run_name": "experiment_v2",
            "start_time": 1678887000000,
            "end_time": 1678887100000,
            "lifecycle_stage": "active"
          },
          "data": {
            "params": {
              "learning_rate": "0.005",
              "num_epochs": "120"
            },
            "metrics": {
              "accuracy": "0.87",
              "precision": "0.90"
            },
            "tags": {
              "model_type": "random_forest"
            }
          },
          "inputs": {
            "dataset": [
              {
                "destination_path": "data/raw",
                "path": "s3://my-bucket/data/raw.csv"
              }
            ]
          },
          "artifacts": {
            "model_path": "model/model_v2.pkl"
          }
        }
      ]
    }
This JSON represents two runs. When you compare them in the UI, you’re essentially getting a side-by-side view of this data structure. MLflow aggregates all logged parameters, metrics, and tags from the selected runs. It then presents them in a tabular format, highlighting differences.
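To make the side-by-side idea concrete, here is a minimal sketch in plain Python. It assumes each run is a dictionary shaped like the simplified JSON above; the helper name `comparison_rows` is hypothetical and is not part of MLflow, and the real UI may build its table differently.

```python
def comparison_rows(runs, section):
    """Build one row per key in `section` ("params", "metrics", or "tags").

    Each row is (key, [value per run], differs). A key that a run never
    logged shows up as None, and `differs` is True when the runs disagree.
    """
    # Collect every key logged by at least one run, in a stable order.
    keys = sorted({key for run in runs for key in run["data"].get(section, {})})
    rows = []
    for key in keys:
        values = [run["data"].get(section, {}).get(key) for run in runs]
        differs = len(set(values)) > 1
        rows.append((key, values, differs))
    return rows


# Two runs trimmed down from the example above (illustrative data only).
runs = [
    {"info": {"run_name": "experiment_v1"}, "data": {
        "params": {"learning_rate": "0.01", "num_epochs": "100"},
        "tags": {"model_type": "random_forest"}}},
    {"info": {"run_name": "experiment_v2"}, "data": {
        "params": {"learning_rate": "0.005", "num_epochs": "120"},
        "tags": {"model_type": "random_forest"}}},
]

for section in ("params", "tags"):
    for key, values, differs in comparison_rows(runs, section):
        marker = "*" if differs else " "  # flag rows where the runs disagree
        print(f"{marker} {section}.{key}: {values}")
```

Rows marked with `*` are the ones a comparison view would highlight; here both parameters differ while the `model_type` tag does not.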
The primary problem MLflow’s run comparison solves is experiment traceability and reproducibility. Imagine you found a great model, but you can’t remember the exact learning_rate or batch_size that produced it. Or, you want to share your work, and you need to provide the precise configuration that led to your reported results. The comparison view directly addresses this by showing you the exact inputs (parameters, datasets) and outputs (metrics, artifacts) for each run.
Internally, MLflow’s backend store (a local file store, a database, or a managed service) holds the run metadata. When you select runs for comparison, the MLflow UI queries the tracking server for the run IDs you’ve chosen and fetches each run’s params, metrics, and tags. Artifact files themselves live in a separate artifact store; the run only records where they are. The UI then renders the result as a diffable table.
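The same lookup can be done from code. The sketch below uses `MlflowClient.get_run`, which returns a run whose `data.params`, `data.metrics`, and `data.tags` are dictionaries. The wrapper `fetch_run_records` and its optional `client` argument are assumptions made for illustration, so the function can be tried with a stand-in client when no tracking server is available.

```python
def fetch_run_records(run_ids, client=None):
    """Fetch params, metrics, and tags for each run ID via an MLflow client."""
    if client is None:
        # Imported lazily so the helper can also be exercised with a stand-in.
        from mlflow.tracking import MlflowClient
        client = MlflowClient()  # talks to the configured tracking URI

    records = []
    for run_id in run_ids:
        run = client.get_run(run_id)
        records.append({
            "run_id": run.info.run_id,
            "params": dict(run.data.params),
            "metrics": dict(run.data.metrics),  # latest value per metric key
            "tags": dict(run.data.tags),
        })
    return records
```

Calling `fetch_run_records(["a1b2c3d4e5f678901234567890abcdef", "f0e9d8c7b6a543210fedcba987654321"])` against a server that holds those two runs would return one record per run, ready to be lined up side by side.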
The main lever you control is what you log inside your `mlflow.start_run()` context:
- Parameters (`mlflow.log_param`): These are your hyperparameters, model configuration settings, or any input variables that define a specific training experiment. For example, `mlflow.log_param("learning_rate", 0.01)`.
- Metrics (`mlflow.log_metric`): These are the performance indicators you track over time or at the end of training. They can be logged with or without timestamps. For example, `mlflow.log_metric("accuracy", 0.85)`.
- Tags (`mlflow.set_tag`): These are arbitrary key-value pairs for organizing and identifying runs, like `mlflow.set_tag("model_type", "resnet50")` or `mlflow.set_tag("data_version", "v1.2")`.
- Artifacts (`mlflow.log_artifact`): These are any files or directories produced by your run, such as trained models, plots, or data files. For example, `mlflow.log_artifact("model.pkl")`.
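Put together, a single run might log all four kinds of data as in the sketch below. It uses the batch variants `mlflow.log_params`, `mlflow.log_metrics`, and `mlflow.set_tags` alongside `mlflow.log_artifact`. The wrapper `log_training_run` and its `tracker` argument are hypothetical conveniences: by default the function imports `mlflow`, but any object exposing the same calls can be passed in for a dry run.

```python
def log_training_run(run_name, params, metrics, tags,
                     artifact_path=None, tracker=None):
    """Log one run's params, metrics, tags, and an optional artifact file."""
    if tracker is None:
        import mlflow as tracker  # default: the real MLflow fluent API

    with tracker.start_run(run_name=run_name) as run:
        tracker.log_params(params)    # e.g. {"learning_rate": 0.01}
        tracker.log_metrics(metrics)  # e.g. {"accuracy": 0.85}
        tracker.set_tags(tags)        # e.g. {"model_type": "random_forest"}
        if artifact_path is not None:
            # The file must already exist on local disk.
            tracker.log_artifact(artifact_path)
        return run.info.run_id
```

With MLflow installed, `log_training_run("experiment_v1", {"learning_rate": 0.01, "num_epochs": 100}, {"accuracy": 0.85}, {"model_type": "random_forest"})` would create a run that later shows up as one column in the comparison view.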
The comparison view is powerful because it collapses the noise. Instead of digging through individual run pages, you get a consolidated view. You can easily spot which parameter changes correlated with metric improvements. You can also see if a specific tag was associated with a better outcome.
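As a rough illustration of spotting which parameter changes accompany a metric change, the sketch below diffs two runs held as plain dictionaries. The helper `describe_change` is hypothetical, the numbers come from the example runs earlier in this article, and a single pair of runs shows co-occurrence only, not cause.

```python
def describe_change(baseline, candidate):
    """Return (changed_params, metric_deltas) between two run dictionaries."""
    base_params, cand_params = baseline["params"], candidate["params"]
    # Keep only the parameters whose values differ, as (old, new) pairs.
    changed_params = {
        key: (base_params.get(key), cand_params.get(key))
        for key in sorted(set(base_params) | set(cand_params))
        if base_params.get(key) != cand_params.get(key)
    }
    # Compute candidate minus baseline for metrics logged by both runs.
    shared = sorted(set(baseline["metrics"]) & set(candidate["metrics"]))
    metric_deltas = {
        key: round(float(candidate["metrics"][key])
                   - float(baseline["metrics"][key]), 6)
        for key in shared
    }
    return changed_params, metric_deltas


v1 = {"params": {"learning_rate": "0.01", "num_epochs": "100"},
      "metrics": {"accuracy": "0.85", "precision": "0.88"}}
v2 = {"params": {"learning_rate": "0.005", "num_epochs": "120"},
      "metrics": {"accuracy": "0.87", "precision": "0.90"}}

changed, deltas = describe_change(v1, v2)
print(changed)
print(deltas)
```

Because both parameters changed at once here, the comparison alone cannot say which one produced the gain; that takes runs that vary one parameter at a time.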
The most surprising thing about MLflow’s run comparison is how it defaults to showing all logged parameters and metrics, even those with no variation across the selected runs. This can be overwhelming when you have many runs with dozens of parameters. However, the UI provides filtering and search capabilities to narrow this down. More importantly, the "Compare" view is not just a static table; you can select specific columns to display and sort by metrics, which is where the real analysis happens. You can also click on a run’s name to jump directly to its detailed page.
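The two operations described here, hiding columns that never vary and sorting runs by a metric, can be approximated outside the UI as well. The following is a small sketch over plain dictionaries; `varying_keys` and `rank_runs` are hypothetical helper names, not MLflow functions.

```python
def varying_keys(runs, section):
    """Return the keys in `section` whose values are not identical across runs."""
    keys = {key for run in runs for key in run[section]}
    return sorted(
        key for key in keys
        if len({run[section].get(key) for run in runs}) > 1
    )


def rank_runs(runs, metric, higher_is_better=True):
    """Sort runs by `metric`, skipping runs that never logged it."""
    scored = [run for run in runs if metric in run["metrics"]]
    return sorted(
        scored,
        key=lambda run: float(run["metrics"][metric]),
        reverse=higher_is_better,
    )


runs = [
    {"run_name": "experiment_v1",
     "params": {"learning_rate": "0.01", "num_epochs": "100"},
     "metrics": {"accuracy": "0.85"},
     "tags": {"model_type": "random_forest"}},
    {"run_name": "experiment_v2",
     "params": {"learning_rate": "0.005", "num_epochs": "120"},
     "metrics": {"accuracy": "0.87"},
     "tags": {"model_type": "random_forest"}},
]

print(varying_keys(runs, "params"))  # columns worth showing
print(varying_keys(runs, "tags"))    # empty: the tag is constant
print([run["run_name"] for run in rank_runs(runs, "accuracy")])
```

Metrics are converted with `float` before sorting because the example data stores them as strings; values that come back from the MLflow client are already numeric.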
This leads to the next step: using the comparison to actually select a model for deployment, which involves understanding MLflow’s model registry.