MLflow Callbacks: Log Metrics During Training Loops (2026)

Callbacks do more than tidy up MLflow logging: by logging on an interval instead of on every batch, they can also reduce how much data you record during training.

Let’s see this in action. Imagine you’re training a deep learning model with PyTorch and want to log training loss and accuracy every 10 steps and validation metrics every epoch. Without callbacks, you’d be scattering mlflow.log_metric() calls throughout your training script, making it messy and hard to maintain.

Here’s one way to structure it with a callback class that wraps the MLflow logging calls:

import mlflow
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# --- Mock Data and Model ---
input_size = 10
hidden_size = 5
num_classes = 2
num_samples = 1000

X_train = torch.randn(num_samples, input_size)
y_train = torch.randint(0, num_classes, (num_samples,))
X_val = torch.randn(num_samples // 5, input_size)
y_val = torch.randint(0, num_classes, (num_samples // 5,))

train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=32)
val_loader = DataLoader(val_dataset, batch_size=32)

model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, num_classes)
)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# --- MLflow Callback Setup ---
# A plain Python class: the training loop below calls its hooks explicitly,
# so no MLflow base class is needed.
class MyTrainingCallback:
    def __init__(self, log_interval_steps=10):
        self.log_interval_steps = log_interval_steps
        self.global_step = 0

    def on_train_batch_end(self, epoch, batch_idx, outputs, metrics=None):
        # Log training metrics every 'log_interval_steps'
        if self.global_step % self.log_interval_steps == 0:
            loss = outputs['loss'].item()
            # Only the loss is logged here; accuracy would have to be computed
            # in the training loop and passed in via `outputs` or `metrics`.
            mlflow.log_metric("train_loss", loss, step=self.global_step)
            print(f"Step {self.global_step}: Logged train_loss = {loss:.4f}")
        self.global_step += 1

    def on_validation_epoch_end(self, epoch, metrics=None):
        # Log validation metrics at the end of each epoch
        # In a real scenario, you'd compute validation loss/accuracy here
        # For simplicity, let's log a dummy value
        dummy_val_loss = 1.0 - (epoch / 100.0) # Simulate decreasing validation loss
        mlflow.log_metric("val_loss", dummy_val_loss, step=self.global_step)
        print(f"Epoch {epoch}: Logged val_loss = {dummy_val_loss:.4f}")

# --- Training Loop with Callback ---
callback = MyTrainingCallback(log_interval_steps=10)
num_epochs = 5

# The context manager ends the run even if training raises an exception
with mlflow.start_run():
    for epoch in range(num_epochs):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            outputs = model(data)
            loss = criterion(outputs, target)
            loss.backward()
            optimizer.step()

            # Prepare outputs for the callback
            batch_outputs = {'loss': loss}
            callback.on_train_batch_end(epoch, batch_idx, batch_outputs)

        # Simulate validation loop
        model.eval()
        # In a real scenario, you'd run validation here
        callback.on_validation_epoch_end(epoch)

This setup encapsulates the logging logic within the MyTrainingCallback class. The on_train_batch_end method handles logging training loss at specified intervals, and on_validation_epoch_end logs validation metrics after each epoch. The global_step is managed internally by the callback, ensuring metrics are logged against the correct training step.
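To make the effect of the logging interval concrete, here is a small arithmetic sketch. It assumes the same settings as the example (1000 training samples, batch size 32, 5 epochs, an interval of 10) and that the last partial batch is kept; `logged_steps` is a hypothetical helper written for this illustration, not an MLflow function.

```python
import math


def logged_steps(num_samples, batch_size, num_epochs, log_interval_steps):
    """Return the global steps at which the `step % interval == 0` check fires."""
    # The last, smaller batch still counts as a step (no batches are dropped).
    batches_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = batches_per_epoch * num_epochs
    return [s for s in range(total_steps) if s % log_interval_steps == 0]


steps = logged_steps(num_samples=1000, batch_size=32, num_epochs=5, log_interval_steps=10)
print(len(steps), steps[:3], steps[-1])  # 16 [0, 10, 20] 150
```

Under these assumptions the loop runs 160 optimizer steps but records only 16 train_loss points, because the step counter keeps counting across epochs instead of resetting.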

The core problem MLflow callbacks solve is decoupling logging logic from your core training or evaluation code. This makes your training script cleaner, more readable, and easier to refactor. You can swap out MLflow for another logging tool, or change how you log (e.g., only log every 50 steps instead of 10), without significantly altering your model training logic. The callback acts as an intermediary, intercepting events during the training process and performing logging actions.
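One way to make the "swap the logging tool" idea tangible is to inject the logging function into the callback. The sketch below is an illustration under assumptions: `IntervalLossLogger` and `record_metric` are hypothetical names, and the in-memory recorder stands in for a real backend so the example runs without a tracking server.

```python
class IntervalLossLogger:
    """Logs a loss value every `log_interval_steps` batches via an injected function."""

    def __init__(self, log_fn, log_interval_steps=10):
        # log_fn is expected to accept (key, value, step=...)
        self.log_fn = log_fn
        self.log_interval_steps = log_interval_steps
        self.global_step = 0

    def on_train_batch_end(self, loss):
        if self.global_step % self.log_interval_steps == 0:
            self.log_fn("train_loss", loss, step=self.global_step)
        self.global_step += 1


# In-memory stand-in for a logging backend (hypothetical, for illustration only)
records = []


def record_metric(key, value, step=None):
    records.append((key, value, step))


# Changing the interval from 10 to 50 is a one-argument change
logger = IntervalLossLogger(record_metric, log_interval_steps=50)
for i in range(120):
    logger.on_train_batch_end(loss=1.0 / (i + 1))

print([step for _, _, step in records])  # [0, 50, 100]
```

Since mlflow.log_metric takes a key, a value, and a step keyword, passing it as `log_fn` inside an active run should work without touching the training loop, though that wiring is not exercised here.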

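Because the loop calls the callback explicitly, you decide which stages of the training lifecycle it can hook into. Common choices, modeled on the callback interfaces of training frameworks such as PyTorch Lightning, include on_train_start, on_train_batch_start, on_train_batch_end, on_train_end, on_validation_start, on_validation_batch_start, on_validation_batch_end, on_validation_end, and on_epoch_end. This granular control allows you to log precisely what you need, when you need it.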
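A minimal sketch of how such hooks might be wired into a hand-written loop follows. The names `Callback`, `CallbackList`, `EventRecorder`, and `fire` are hypothetical and chosen for this illustration; they are not part of MLflow. The base class provides no-op hooks so that a subclass only overrides the events it cares about.

```python
class Callback:
    """Base class with no-op hooks; subclasses override only what they need."""

    def on_train_start(self):
        pass

    def on_train_batch_end(self, epoch, batch_idx, outputs):
        pass

    def on_validation_end(self, epoch, metrics):
        pass

    def on_train_end(self):
        pass


class CallbackList:
    """Dispatches a named hook to every registered callback, in order."""

    def __init__(self, callbacks):
        self.callbacks = list(callbacks)

    def fire(self, hook_name, *args, **kwargs):
        for cb in self.callbacks:
            getattr(cb, hook_name)(*args, **kwargs)


class EventRecorder(Callback):
    """Records which lifecycle events it saw; batch events are ignored."""

    def __init__(self):
        self.events = []

    def on_train_start(self):
        self.events.append("train_start")

    def on_validation_end(self, epoch, metrics):
        self.events.append(f"validation_end:{epoch}")

    def on_train_end(self):
        self.events.append("train_end")


recorder = EventRecorder()
callbacks = CallbackList([recorder])

callbacks.fire("on_train_start")
for epoch in range(2):
    for batch_idx in range(3):
        callbacks.fire("on_train_batch_end", epoch, batch_idx, {"loss": 0.5})
    callbacks.fire("on_validation_end", epoch, {"val_loss": 0.4})
callbacks.fire("on_train_end")

print(recorder.events)
# ['train_start', 'validation_end:0', 'validation_end:1', 'train_end']
```

Dispatching by hook name keeps the training loop unaware of what each callback does, so adding a second callback only means adding it to the list.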

A common misconception is that callbacks are only for logging. Logging is the primary use case, but the same hooks can do much more. For instance, you can implement early stopping by monitoring validation metrics in on_validation_end and signaling the training loop to stop when a condition is met. You could also use them to adjust learning rates, save model checkpoints based on performance, or even perform data augmentation dynamically.

What many people overlook is that a callback can directly influence the training process itself. For example, a custom callback can track validation loss and, once it has not improved for 3 epochs, set a stop flag or raise an exception that you define yourself. MLflow’s public API does not appear to provide an early-stopping exception, and nothing catches one automatically in a hand-written loop, so the training loop has to check the flag (or catch the exception) and break out. That saves compute time and helps prevent overfitting. This isn’t just about reporting metrics; it’s about acting on them programmatically during training.
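Here is a minimal sketch of patience-based early stopping for a hand-written loop. It uses a stop flag that the loop checks, not a library-provided exception. The `EarlyStopping` class, its `should_stop` attribute, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Sets `should_stop` once val_loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0
        self.should_stop = False

    def on_validation_end(self, epoch, metrics):
        val_loss = metrics["val_loss"]
        if val_loss < self.best - self.min_delta:
            # Strict improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.should_stop = True


# Made-up validation losses standing in for a real validation pass
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.5]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    # ... training and validation for this epoch would run here ...
    stopper.on_validation_end(epoch, {"val_loss": val_loss})
    if stopper.should_stop:
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 5 0.6
```

The loop stops after epoch 5 and never sees the later 0.5, which is the usual trade-off of a patience setting: a larger value wastes more compute but is less likely to stop just before an improvement.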

Once you have callbacks set up for logging, the next natural step is to explore their use for controlling training flow, like implementing early stopping or learning rate scheduling.