MLflow’s search_runs function lets you query your logged ML experiments programmatically. Its real strength is filtering and sorting on logged parameters, metrics, and tags, which effectively turns your experiment logs into a structured database.
Let’s see it in action. Imagine you’ve run a series of hyperparameter-tuned models for a classification task and logged parameters like learning_rate, batch_size, and metrics like accuracy and f1_score.
```python
import mlflow
import pandas as pd
from mlflow.entities import ViewType

# Set your MLflow tracking URI if it's not the default (e.g., ./mlruns)
# mlflow.set_tracking_uri("http://localhost:5000")

# Define the experiment name you want to search within
experiment_name = "my_classification_experiment"
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    raise ValueError(f"Experiment '{experiment_name}' not found")

# Construct a search query. This is a SQL-like syntax.
# Metrics are numeric, so 'accuracy' greater than 0.9 is filtered by the backend.
search_query = "metrics.accuracy > 0.9"

# Perform the search
runs = mlflow.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string=search_query,
    order_by=["metrics.f1_score DESC"],  # Sort by f1_score in descending order
    max_results=100,  # Cap how many runs are fetched
    run_view_type=ViewType.ACTIVE_ONLY,  # Only search active (non-deleted) runs
)

# Params are stored as strings, so 'learning_rate' less than 0.01 is applied
# in pandas after casting; then keep the top 10 by f1_score.
if "params.learning_rate" in runs.columns:
    learning_rate = pd.to_numeric(runs["params.learning_rate"], errors="coerce")
    runs = runs[learning_rate < 0.01].head(10)
else:
    runs = runs.iloc[0:0]  # No matching run logged learning_rate

# Print the results
if not runs.empty:
    print(f"Found {len(runs)} runs matching the criteria:")
    for index, row in runs.iterrows():
        # Columns are named "params.<name>" / "metrics.<name>", so use bracket access
        print(f"  Run ID: {row['run_id']}")
        print(f"  Artifact URI: {row['artifact_uri']}")
        print(f"  Status: {row['status']}")
        print(f"  Params: learning_rate={row['params.learning_rate']}, batch_size={row.get('params.batch_size')}")
        print(f"  Metrics: accuracy={row['metrics.accuracy']:.4f}, f1_score={row.get('metrics.f1_score', float('nan')):.4f}")
        print("-" * 20)
else:
    print("No runs found matching the criteria.")
```
This code snippet fetches runs that meet a performance threshold (accuracy above 0.9) and a hyperparameter constraint (learning rate below 0.01), sorted by another metric (metrics.f1_score DESC). Because MLflow stores parameter values as strings, a numeric condition on a parameter is most reliably applied to the returned DataFrame after casting, rather than inside filter_string. The run_view_type parameter is useful for filtering out deleted runs.
The core problem search_runs solves is making your logged experiment data accessible and actionable after the runs have completed. Instead of manually sifting through the MLflow UI or writing custom parsing scripts for log files, you can treat your experiment history as a queryable dataset. This is crucial for tasks like:
- Reproducibility: Finding the exact parameters and code version that produced a successful model.
- Analysis: Identifying trends in performance based on hyperparameters or data splits.
- Model Selection: Automatically picking the best performing model based on predefined criteria.
- Retraining: Finding similar successful runs to inform the parameters for a new training job.
Internally, MLflow’s search_runs translates your filter_string and order_by clauses into queries against its backend store (which could be a local SQLite file, a PostgreSQL database, or a managed service). With a database-backed store, params, metrics, and tags live in their own tables, so the filtering happens in the backend rather than in your Python process. The artifact_uri is a pointer to where the model artifacts and any other logged files are stored. The run_id is the unique identifier for each execution.
You have direct control over the filter_string, which uses a SQL-like syntax. Conditions are combined with AND; open-source MLflow does not support OR. The available comparators depend on the field type: metrics.<metric_name> is numeric and accepts =, !=, >, <, >=, and <=, while params.<param_name> and tags.<tag_key> are strings and accept =, !=, LIKE, and ILIKE (case-insensitive LIKE) against a quoted value. IN and NOT IN are limited to a few fields such as attributes.run_id. The order_by clause works similarly, allowing you to specify which field to sort by and in which direction (ASC or DESC).
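As a minimal sketch of assembling such a filter, the helper below joins conditions with AND and quotes param and tag values as strings. The name build_filter and its arguments are hypothetical (not part of MLflow), and the sketch assumes that keys and values contain no single quotes or spaces.

```python
from typing import Mapping, Optional


def build_filter(
    metric_min: Optional[Mapping[str, float]] = None,
    params: Optional[Mapping[str, object]] = None,
    tags: Optional[Mapping[str, str]] = None,
) -> str:
    """Build an MLflow filter_string from simple conditions (hypothetical helper)."""
    clauses = []
    # Metrics are numeric, so the threshold is written unquoted.
    for name, threshold in (metric_min or {}).items():
        clauses.append(f"metrics.{name} > {threshold}")
    # Params are stored as strings: compare against a quoted string.
    for name, value in (params or {}).items():
        clauses.append(f"params.{name} = '{value}'")
    # Tags are strings as well.
    for key, value in (tags or {}).items():
        clauses.append(f"tags.{key} = '{value}'")
    return " AND ".join(clauses)


query = build_filter(
    metric_min={"accuracy": 0.9},
    params={"batch_size": 32},
    tags={"model_type": "resnet"},
)
print(query)
```

The resulting string can be passed as filter_string to mlflow.search_runs. Building it in one place keeps the quoting rules consistent across scripts.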
When you filter on a param or metric, a run that never logged that key simply won’t be included in the results, which is usually the desired behavior. Data types deserve more care: metrics are stored as numbers, but params are stored as strings, so a param condition compares text against a quoted value (params.batch_size = '32') rather than comparing numbers. The most robust way to ensure consistent filtering is to log your parameters and metrics with consistent, expected types, and to log any value you need to range-filter on as a metric.
The max_results parameter is your friend when dealing with large experiment histories, preventing you from overwhelming your system or your memory. It’s also good practice to keep run_view_type=ViewType.ACTIVE_ONLY (the default) unless you specifically need deleted runs, in which case ViewType.DELETED_ONLY or ViewType.ALL is available; restricting the view reduces the search space and can improve performance.
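For explicit control over memory, the lower-level MlflowClient.search_runs accepts max_results together with a page_token and returns a list that carries a token attribute for the next page. The generator below is a sketch built on that interface; iter_runs and page_size are hypothetical names, and client is assumed to be an MlflowClient instance (or any object with the same search_runs signature).

```python
from typing import Iterator, List


def iter_runs(
    client,
    experiment_ids: List[str],
    filter_string: str = "",
    page_size: int = 100,
) -> Iterator[object]:
    """Yield runs one page at a time instead of loading the whole history."""
    token = None
    while True:
        page = client.search_runs(
            experiment_ids=experiment_ids,
            filter_string=filter_string,
            max_results=page_size,
            page_token=token,
        )
        yield from page
        token = page.token  # Empty or None once the last page was returned
        if not token:
            break


# Usage sketch (requires an MLflow installation and a reachable tracking store):
# from mlflow.tracking import MlflowClient
# for run in iter_runs(MlflowClient(), ["1"], "metrics.accuracy > 0.9"):
#     print(run.info.run_id)
```

Unlike mlflow.search_runs, the client call yields Run objects rather than a DataFrame, so fields are read as run.info.run_id, run.data.params, and run.data.metrics.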
The search_runs function returns a Pandas DataFrame by default, which is convenient for further programmatic manipulation and analysis. Column names keep their prefix, so you access them with bracket notation, for example runs["params.learning_rate"] or runs["metrics.accuracy"], each of which is a Series. You can also iterate through rows to extract information for specific runs.
One subtle but powerful aspect of search_runs is its ability to filter on tags. Tags are often used for arbitrary metadata that doesn’t fit neatly into parameters or metrics, like git_commit, data_version, model_type, or user_id. You can query these just as easily: tags.git_commit = 'abcdef12345' or tags.model_type LIKE 'resnet%'. Tag values are strings, so they are compared with =, !=, LIKE, and ILIKE; matching one of several values takes one query per value. This makes tags exceptionally useful for tracing models back to their exact development context or for segmenting results by deployment strategy.
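Open-source MLflow accepts only =, !=, LIKE, and ILIKE on tags, so matching any of several tag values takes some client-side work. The sketch below runs one equality query per value and merges the results by run_id. The function name and the search callable are hypothetical; search is assumed to take a filter string and return an iterable of dict-like records with a run_id key.

```python
from typing import Callable, Dict, Iterable, List


def search_any_tag_value(
    search: Callable[[str], Iterable[Dict]],
    tag_key: str,
    values: Iterable[str],
) -> List[Dict]:
    """Run one equality query per tag value and merge the results by run_id."""
    merged: Dict[str, Dict] = {}
    for value in values:
        for record in search(f"tags.{tag_key} = '{value}'"):
            # Keep the first record seen for each run to avoid duplicates.
            merged.setdefault(record["run_id"], record)
    return list(merged.values())


# Usage sketch (requires MLflow; the experiment ID "1" is a placeholder):
# import mlflow
# def search(filter_string):
#     df = mlflow.search_runs(experiment_ids=["1"], filter_string=filter_string)
#     return df.to_dict("records")
# runs = search_any_tag_value(search, "model_type", ["resnet", "vgg"])
```

Passing the search function in as an argument keeps the merging logic independent of the tracking backend.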
Once you’ve mastered search_runs, the next logical step is to integrate this programmatic querying into automated model deployment pipelines, where you might automatically select the best performing model from a batch of recent runs based on production-ready metrics.
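A selection step in such a pipeline might look like the sketch below. It works on plain records (for example, the result of runs.to_dict("records")) and picks the run with the highest value of a metric above a threshold. The select_best name, the sample data, and the "model" artifact path are assumptions for illustration; only the runs:/<run_id>/<artifact_path> URI scheme comes from MLflow.

```python
import math
from typing import Dict, Iterable, Optional


def select_best(records: Iterable[Dict], metric: str, minimum: float) -> Optional[str]:
    """Return the run_id with the highest `metric` at or above `minimum`, else None."""
    best_id, best_value = None, -math.inf
    column = f"metrics.{metric}"
    for record in records:
        value = record.get(column)
        # Runs that never logged the metric appear as None or NaN; skip them.
        if value is None or math.isnan(value):
            continue
        if value >= minimum and value > best_value:
            best_id, best_value = record["run_id"], value
    return best_id


# Made-up records standing in for search_runs output.
candidates = [
    {"run_id": "a1", "metrics.f1_score": 0.91},
    {"run_id": "b2", "metrics.f1_score": 0.94},
    {"run_id": "c3", "metrics.f1_score": float("nan")},
]
best = select_best(candidates, "f1_score", minimum=0.9)
if best is not None:
    # Assumes the model was logged under the artifact path "model".
    model_uri = f"runs:/{best}/model"
    print(model_uri)
```

Returning None when nothing clears the threshold lets the pipeline skip deployment instead of promoting a weak model.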