MLflow Nested Runs: Track Cross-Validation Experiments (2026)

MLflow nested runs let you organize cross-validation and hyperparameter tuning by giving each cross-validation fold its own child run under a parent run, creating a clear hierarchy.

Let’s see this in action. Imagine you’re tuning a RandomForestClassifier and want to run it with 5-fold cross-validation. Without nested runs, all your fold results would clutter a single parent run’s artifact directory.

Here’s how you’d typically structure it:

import mlflow
import mlflow.sklearn
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Sample data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Define the model
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Define the cross-validation strategy
cv_strategy = 5

# Start the main parent run
with mlflow.start_run(run_name="RandomForest Hyperparameter Tuning") as parent_run:
    # Log parent run parameters (e.g., model type, CV folds)
    mlflow.log_param("model_type", "RandomForestClassifier")
    mlflow.log_param("cv_folds", cv_strategy)

    # Perform cross-validation
    # cross_validate returns a dictionary of scores
    scores = cross_validate(rf, X, y, cv=cv_strategy, return_estimator=True)

    # Iterate through each fold and log it as a nested run
    for i, estimator in enumerate(scores["estimator"]):
        # Start a nested run for this fold
        with mlflow.start_run(run_name=f"Fold {i+1}", nested=True) as child_run:
            # Log fold-specific parameters if any (e.g., if you were tuning per fold)
            mlflow.log_param("fold_index", i)

            # Log the trained estimator for this fold
            mlflow.sklearn.log_model(estimator, f"model_fold_{i+1}")

            # Log metrics for this fold
            mlflow.log_metric(f"fold_{i+1}_test_score", scores["test_score"][i])
            mlflow.log_metric(f"fold_{i+1}_fit_time", scores["fit_time"][i])

    # After all folds are logged, you can log aggregate metrics from the parent run
    avg_test_score = sum(scores["test_score"]) / len(scores["test_score"])
    avg_fit_time = sum(scores["fit_time"]) / len(scores["fit_time"])
    mlflow.log_metric("avg_test_score", avg_test_score)
    mlflow.log_metric("avg_fit_time", avg_fit_time)

print(f"Parent Run ID: {parent_run.info.run_id}")

Nested runs are more than an organizational nicety: the MLflow UI groups child runs under their parent, so you can expand or collapse a whole cross-validation at once, which makes it easier to isolate and analyze subsets of runs.

In the example above, the nested=True argument in mlflow.start_run() is the key. When nested=True, MLflow creates a child run that is a direct descendant of the current active run. This isn’t a flat list; it’s a tree structure. The parent run RandomForest Hyperparameter Tuning now has Fold 1, Fold 2, etc., as its children.

You can view this hierarchy in the MLflow UI. Each fold run will appear indented under the parent run. This is crucial for hyperparameter searches where you might have a parent run for a specific set of hyperparameters, and each nested run represents a fold of the cross-validation for that hyperparameter set.

The cross_validate function from scikit-learn conveniently returns the fitted estimators for each fold. We iterate through these, and for each estimator, we start a new nested run. Inside this nested run, we log the specific model artifact for that fold using mlflow.sklearn.log_model. This way, you can later load and inspect the model trained on any specific fold.

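The real power comes when you want to analyze results. In the MLflow UI, you can work at the level of parent runs or drill into their nested runs. If you’re looking for the best-performing fold across all your hyperparameter tuning attempts (structured as multiple parent runs, each containing nested fold runs), you can sort the nested runs by their fold score. Note that the example above embeds the fold number in the metric name (fold_1_test_score, fold_2_test_score, and so on), so a single query across all folds is simpler if every fold logs under one shared name such as test_score.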

Consider a grid search. A top-level run represents the search as a whole, each combination of hyperparameters is a child of that run, and each combination in turn holds N nested runs, one per fold of the N-fold cross-validation. The result is a three-level experiment tree: Grid Search Run -> Hyperparameter Set A -> Fold 1, Fold 2, … Fold N.

The fact that nested runs are actual, independent runs with their own run_id is significant. This means you can query them programmatically just like any other run. MLflow records the parent on each child run as the mlflow.parentRunId tag, so you could fetch all children of a parent by calling mlflow.search_runs with a filter such as tags.mlflow.parentRunId = '<parent_run_id>'. This allows for complex analyses and aggregations outside the UI.

When you log metrics within a nested run, like fold_1_test_score, they are associated with that specific fold’s run. This keeps your metrics granular. You can then, in the parent run, aggregate these metrics (e.g., calculate the average test score across all folds) and log them as parent-level metrics. This provides both detailed fold-level insights and a summary view.

The nested=True flag doesn’t change the fundamental behavior of mlflow.start_run beyond establishing the parent-child relationship. Each nested run still gets its own unique run_id, can log parameters, metrics, and artifacts independently, and can even have its own nested runs (though this is less common in typical CV scenarios and more for complex workflow orchestration).

What most people miss is that you can mix and match. A parent run doesn’t have to contain only nested runs. You can log parameters and metrics directly to the parent run alongside starting nested runs. This is useful for logging overall experiment settings or aggregate results that don’t belong to a specific fold.

Next, you’ll likely explore how to leverage these nested runs for distributed hyperparameter tuning, where each parent run might be executed on a different worker.