MLflow + HuggingFace: Log Transformers Models and Metrics (2026)

Hugging Face Transformers models are widely used, but logging them effectively in MLflow can be surprisingly tricky. MLflow's generic model saving does not inherently understand the object structure of a Hugging Face PreTrainedModel or the state held by a Trainer, so the weights, config, and tokenizer have to be saved through a flavor that knows how to put them back together.

Let’s see MLflow in action with a Hugging Face Trainer.

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import mlflow

# Load a small dataset and tokenizer
dataset = load_dataset("glue", "sst2", split="train[:100]")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Define model and training arguments
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
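training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    logging_dir="./logs",
    logging_steps=5,  # the default of 500 exceeds the ~13 steps in this tiny run
    report_to="mlflow",  # This is key for automatic logging
    run_name="hf_sst2_training",  # used only if the Trainer has to start the MLflow run itself
)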

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets,
)

# Start an MLflow run and train
with mlflow.start_run(run_name="hf_sst2_training_example") as run:
    print(f"MLflow Run ID: {run.info.run_id}")
    trainer.train()

    # Log the fine-tuned model through the transformers flavor. report_to="mlflow"
    # covers metrics and parameters, not a loadable model. log_model has no
    # tokenizer argument, so the model and tokenizer go in as a components dict.
    mlflow.transformers.log_model(
        transformers_model={"model": trainer.model, "tokenizer": tokenizer},
        artifact_path="huggingface_model",
        task="text-classification",
    )

    # Log custom metrics (placeholder values; compute them from real predictions)
    mlflow.log_metric("accuracy", 0.85)
    mlflow.log_metric("f1_score", 0.82)

print("Training complete. Check your MLflow UI.")

This snippet loads a dataset and tokenizer, then sets up a Transformers model and TrainingArguments. Crucially, report_to="mlflow" tells the Trainer to integrate with MLflow. When trainer.train() is called within an mlflow.start_run() context, the Trainer logs its hyperparameters and training metrics (loss, learning rate, and so on) to that run. It can also upload checkpoints as artifacts, but only when the HF_MLFLOW_LOG_ARTIFACTS environment variable is set, and those checkpoints are plain files rather than a loadable MLflow model. The explicit mlflow.transformers.log_model call fills that gap: it stores the model together with its tokenizer and task type, so mlflow.transformers.load_model can rebuild a working pipeline later. The accuracy and F1 values logged at the end are placeholders that show the log_metric call, not measured results.
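In a real run, the custom metrics would be computed from predictions rather than typed in. The following is a minimal sketch of one way to do that for a binary task such as SST-2; the helper names binary_classification_metrics and log_metrics_to_active_run are hypothetical and not part of MLflow or Transformers.

```python
from typing import Dict, Sequence


def binary_classification_metrics(
    predictions: Sequence[int], labels: Sequence[int]
) -> Dict[str, float]:
    """Return accuracy and F1 for the positive class (label 1)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("need at least one example")

    pairs = list(zip(predictions, labels))
    correct = sum(1 for p, y in pairs if p == y)
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)

    # F1 = 2*TP / (2*TP + FP + FN); reported as 0.0 when that denominator is zero.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1_score": f1}


def log_metrics_to_active_run(metrics: Dict[str, float]) -> None:
    """Send the metrics to the active MLflow run (MLflow starts one if none is active)."""
    import mlflow  # imported lazily so the helper above has no MLflow dependency

    mlflow.log_metrics(metrics)
```

Predicted labels could come from trainer.predict(...) on a held-out split, taking the argmax over the returned logits. Inside the mlflow.start_run() block, log_metrics_to_active_run(binary_classification_metrics(preds, labels)) would then replace the two hard-coded log_metric calls.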

The core problem MLflow solves here is managing the lifecycle of your machine learning experiments, especially when dealing with complex models like those from Hugging Face. It provides a centralized place to track:

Code Versions: What code was run for each experiment.
Parameters: Hyperparameters used (e.g., learning rate, batch size).
Metrics: Performance indicators (accuracy, loss, F1-score).
Artifacts: The trained model files, tokenizers, and any other generated outputs.

This allows you to reproduce runs, compare different experiments, and deploy models with confidence. The mlflow.transformers flavor is specifically designed to handle the serialization and deserialization of Hugging Face models. When you load a logged model with mlflow.transformers.load_model, you get a ready-to-use pipeline by default; passing return_type="components" returns the model and tokenizer separately, which is the form you want for further fine-tuning.

The Trainer's report_to="mlflow" argument registers a callback that hooks into MLflow's logging. It captures the training and evaluation metrics the Trainer logs, such as loss, learning_rate, and anything calculated during evaluation steps, and sends them to the active MLflow run. Saving the model is a separate step. When you call mlflow.transformers.log_model, it creates a directory structure within your MLflow artifacts that MLflow recognizes. This structure typically includes config.json, the weights file (model.safetensors in recent Transformers releases, pytorch_model.bin or tf_model.h5 in older ones), tokenizer_config.json, and vocab.txt (or spiece.model), along with any other necessary files.
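The callback behind report_to="mlflow" is configured through environment variables rather than Trainer arguments. Below is a minimal sketch, assuming the variable names documented for Transformers' MLflowCallback (MLFLOW_EXPERIMENT_NAME, MLFLOW_TRACKING_URI, HF_MLFLOW_LOG_ARTIFACTS); check them against your installed version. The helper name configure_hf_mlflow_env is hypothetical.

```python
import os
from typing import Optional


def configure_hf_mlflow_env(
    experiment_name: str,
    tracking_uri: Optional[str] = None,
    log_artifacts: bool = False,
) -> None:
    """Set the environment variables the Trainer's MLflow callback reads.

    Call this before constructing the Trainer, since the callback reads the
    environment when it sets itself up at the start of training.
    """
    # Experiment that the callback's runs are filed under.
    os.environ["MLFLOW_EXPERIMENT_NAME"] = experiment_name

    # Leave the tracking URI alone unless one is given explicitly.
    if tracking_uri is not None:
        os.environ["MLFLOW_TRACKING_URI"] = tracking_uri

    # Opt in to (or out of) copying checkpoints from output_dir into the run's artifacts.
    os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = "TRUE" if log_artifacts else "FALSE"
```

For the example above, configure_hf_mlflow_env("hf_sst2") before building the Trainer would group the run under an experiment named hf_sst2 (an illustrative name). Leaving log_artifacts off keeps checkpoint uploads out of the run, which is usually fine when the model is logged explicitly with mlflow.transformers.log_model.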

The mlflow.transformers.log_model function is the workhorse for saving Hugging Face models. Its transformers_model argument accepts either a pipeline or a dictionary of components, for example {"model": model, "tokenizer": tokenizer}; there is no separate tokenizer argument. It also takes an optional task string (like "text-classification", "token-classification", or "question-answering"). When you call it, MLflow serializes these components and saves them in a format that mlflow.transformers.load_model knows how to reconstruct. This matters because simply pickling the model might not capture all the configuration and tokenizer state needed for reproducibility or downstream use.
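The round trip can be sketched as follows, assuming the model was logged under the artifact path huggingface_model as in the example above. The helper names logged_model_uri and load_text_classifier are hypothetical; mlflow.transformers.load_model and the runs:/ URI scheme are MLflow's own.

```python
def logged_model_uri(run_id: str, artifact_path: str = "huggingface_model") -> str:
    """Build the runs:/ URI that points at a model logged inside a run."""
    return f"runs:/{run_id}/{artifact_path}"


def load_text_classifier(run_id: str, artifact_path: str = "huggingface_model"):
    """Load the logged model back; by default MLflow returns a Transformers pipeline."""
    import mlflow.transformers  # imported lazily so the URI helper has no dependencies

    return mlflow.transformers.load_model(logged_model_uri(run_id, artifact_path))
```

With the run ID printed during training, classifier = load_text_classifier(run_id) followed by classifier("A quietly moving film.") should return a label and score for the sentence. The log_model call also returns a model info object whose model_uri can be passed to load_model directly, which avoids assembling the URI by hand.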

When you log a model using mlflow.transformers.log_model, MLflow saves an MLmodel file in the artifact directory. This file is a small YAML configuration that tells MLflow how to load the model. It lists the flavors the model can be loaded with (here transformers, usually alongside the generic python_function flavor) and, under the transformers flavor, metadata such as the task, the model and tokenizer types, and the library version used at save time. The exact keys vary between MLflow versions. This metadata is what allows mlflow.transformers.load_model to correctly instantiate the model and tokenizer later.

The most surprising thing about MLflow’s integration with Hugging Face is how seamlessly it handles the specialized saving and loading requirements of these complex architectures, including their associated tokenizers and configurations, without requiring manual serialization of every single file.

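If you're not seeing metrics logged automatically when using Trainer, double-check that report_to is set to "mlflow" (or includes it), that the mlflow package is installed in the training environment, and that your tracking URI points at the server you are looking at. Also check logging_steps: the default is 500, so a run with fewer steps logs no intermediate loss values. If no run is active when training starts, the Trainer's MLflow callback starts its own run, which can leave metrics in a different run from your model; calling trainer.train() inside an mlflow.start_run() context keeps everything together.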

Next, you’ll want to explore how to use these logged models for inference with mlflow.transformers.load_model.