ML Reproducibility: Track Experiments Like a Pro

The most surprising thing about experiment tracking is that it’s not really about tracking experiments at all; it’s about tracking changes and outcomes so you can reconstruct a specific, successful training run.

Let’s walk through a quick run. Imagine you’re training a small PyTorch classifier on synthetic data and using MLflow for tracking.

import mlflow
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# --- Configuration ---
n_features = 20
n_samples = 1000
learning_rate = 0.01
epochs = 50
batch_size = 32
hidden_units = 64
random_seed = 42

# --- Data Generation ---
X, y = make_classification(n_samples=n_samples, n_features=n_features, random_state=random_seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_seed)

# --- Model Definition ---
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

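# Seed PyTorch so weight initialization and DataLoader shuffling are repeatable
torch.manual_seed(random_seed)
model = SimpleNN(n_features, hidden_units, 1)  # Single logit for binary classification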
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# --- MLflow Setup ---
mlflow.start_run(run_name="SimpleNN_Training")

# Log parameters
mlflow.log_param("learning_rate", learning_rate)
mlflow.log_param("epochs", epochs)
mlflow.log_param("batch_size", batch_size)
mlflow.log_param("hidden_units", hidden_units)
mlflow.log_param("random_seed", random_seed)
mlflow.log_param("n_features", n_features)
mlflow.log_param("n_samples", n_samples)

# Log the model code (optional but good practice)
mlflow.log_artifact(__file__)

# --- Training Loop ---
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.FloatTensor(y_train).unsqueeze(1)
X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.FloatTensor(y_test).unsqueeze(1)

dataset = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

for epoch in range(epochs):
    for i, (inputs, labels) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        if (i + 1) % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(dataloader)}], Loss: {loss.item():.4f}')
            mlflow.log_metric("train_loss", loss.item(), step=epoch * len(dataloader) + i)

    # Evaluate on test set
    with torch.no_grad():
        test_outputs = model(X_test_tensor)
        test_preds = (torch.sigmoid(test_outputs) > 0.5).float()
        accuracy = accuracy_score(y_test, test_preds.numpy())
        print(f'Epoch [{epoch+1}/{epochs}], Test Accuracy: {accuracy:.4f}')
        mlflow.log_metric("test_accuracy", accuracy, step=epoch)

# Log the final model
mlflow.pytorch.log_model(model, "model")

mlflow.end_run()

When this script runs, MLflow creates a run with a unique run_id. It records the parameters we logged explicitly (learning_rate, epochs, etc.), the metrics we logged during training (train_loss, test_accuracy), and the final trained model as an artifact. Because we passed __file__ to log_artifact, it stores a copy of the script too.

The core problem this solves is: "Which exact combination of code, data version, and hyperparameters produced that one amazing result from last Tuesday?" Without tracking, this is a guessing game. You might remember that you used learning_rate=0.01 and epochs=50, but did you use the same data preprocessing? Was the random_seed for the data split the same? Which exact versions of PyTorch and scikit-learn were installed? Experiment tracking stores these environment and configuration details alongside the model and its performance, provided you log them; anything you do not record is still a guess.
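The data-version question is the one a tracker cannot answer unless you give it something to record. Here is a minimal sketch, assuming the training data lives in a file on disk: hash the file and log the digest as a parameter. The helper name file_sha256 and the parameter name data_sha256 are hypothetical, not part of MLflow.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Inside an active run, mlflow.log_param("data_sha256", file_sha256("train.csv")) would store the fingerprint next to the other parameters; train.csv is a placeholder path. For in-memory data such as the arrays from make_classification, hashing X.tobytes() is one way to apply the same idea.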

The system works by creating a structured log for each distinct execution of your training code. This log, often called a "run," contains:

Parameters: Key-value pairs representing hyperparameters, configuration settings, and fixed inputs. These are static for a given run.
Metrics: Time-series or scalar values representing performance during or after training (loss, accuracy, F1-score). These can be logged at specific steps or epochs.
Artifacts: Files produced by the run. This is the most crucial part for reproducibility: the trained model weights, data preprocessing pipelines, plots, model architectures, and even the source code itself.
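To make the three categories concrete, here is a toy in-memory picture of a run. It is only an illustration: RunRecord and its methods are hypothetical and are not how MLflow stores data internally.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical in-memory picture of what a tracker stores for one run."""
    run_id: str
    params: dict = field(default_factory=dict)     # name -> value, fixed for the run
    metrics: dict = field(default_factory=dict)    # name -> list of (step, value)
    artifacts: list = field(default_factory=list)  # paths of files the run produced

    def log_param(self, name, value):
        # Parameters are static: refuse to overwrite one with a different value.
        if name in self.params and self.params[name] != value:
            raise ValueError(f"param {name!r} is already set for this run")
        self.params[name] = value

    def log_metric(self, name, value, step):
        # Metrics form a time series, so every (step, value) pair is kept.
        self.metrics.setdefault(name, []).append((step, value))

    def latest(self, name):
        # Value recorded at the highest step for this metric.
        return max(self.metrics[name])[1]


run = RunRecord(run_id="example")
run.log_param("learning_rate", 0.01)
run.log_metric("test_accuracy", 0.81, step=0)
run.log_metric("test_accuracy", 0.86, step=1)
run.artifacts.append("model/")
print(run.latest("test_accuracy"))  # 0.86
```

The split matters when you compare runs later: parameters are what you filter on, metrics are what you sort by, and artifacts are what you download once you have picked a run.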

To reproduce that run, you’d go to your MLflow UI, find the run with the best test_accuracy, and note its run ID. The logged model sits in that run’s artifact directory next to an MLmodel file, which MLflow generates to describe how the model was saved and which flavor can load it. You can download the artifacts from the UI, or load the model directly by passing a runs:/<run_id>/model URI to mlflow.<framework>.load_model. For many flavors, log_model stores more than the weights: it also serializes the model object and records the dependencies it inferred, which is what allows load_model to reconstruct it. If you also logged your data preprocessing steps or the dataset itself, you can recover those the same way.
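The same lookup can be scripted instead of clicked. This is a sketch, assuming the runs were logged to the active experiment with the metric name test_accuracy and the artifact path "model" used in the script above. The helper names model_uri and load_best_model are made up for this example.

```python
def model_uri(run_id, artifact_path="model"):
    """Build the runs:/ URI that points at a model logged inside a run."""
    return f"runs:/{run_id}/{artifact_path}"


def load_best_model(metric="test_accuracy"):
    """Load the PyTorch model from the run with the highest logged `metric`."""
    # Imported here so the URI helper above works without MLflow installed.
    import mlflow
    import mlflow.pytorch

    # search_runs returns a pandas DataFrame with one row per run;
    # the metric columns hold the most recently logged value for each run.
    runs = mlflow.search_runs(order_by=[f"metrics.{metric} DESC"], max_results=1)
    if runs.empty:
        raise LookupError("no runs found in the active experiment")
    best_run_id = runs.iloc[0]["run_id"]
    return mlflow.pytorch.load_model(model_uri(best_run_id))
```

Calling load_best_model() should return an nn.Module ready for inference. Depending on how the model was pickled, the SimpleNN class definition may still need to be importable in the loading process, so keeping the logged script around is worthwhile.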

One often-overlooked point is how critical it is to log the exact code and library versions. You can log the script file, but to truly replicate the environment, you should also log the output of pip freeze > requirements.txt or a Conda environment file as an artifact. Then, before loading the model, you’d recreate that exact environment. Some advanced systems even integrate with containerization tools like Docker to save a snapshot of the entire execution environment.
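If shelling out to pip is inconvenient, a similar snapshot can be taken from inside the training script with the standard library. This is a rough sketch, not a drop-in replacement for pip freeze: it ignores editable installs and direct-URL pins. The function name freeze_environment is hypothetical.

```python
import platform
from importlib import metadata
from pathlib import Path


def freeze_environment(output_path="requirements.txt"):
    """Write installed packages as name==version lines and return them."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    lines = sorted(pins, key=str.lower)
    # Record the interpreter version as a comment on the first line.
    header = f"# Python {platform.python_version()}"
    Path(output_path).write_text("\n".join([header, *lines]) + "\n")
    return lines
```

Inside an active run, calling freeze_environment() followed by mlflow.log_artifact("requirements.txt") attaches the snapshot to the run, so the environment travels with the model and metrics.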

The next hurdle is managing and comparing many such runs to select the best one systematically, often leading into hyperparameter optimization frameworks.