MLflow LLM Fine-Tuning: Track Runs and Parameters (2026)

Fine-tuning an LLM isn’t just about feeding it more data; it’s a high-stakes gamble where your entire training run could be lost to a forgotten parameter or a crashed process.

MLflow’s LLM fine-tuning capabilities are designed to prevent exactly that by treating your training jobs not as ephemeral scripts, but as first-class, traceable entities. Think of it as a super-powered Git for your machine learning experiments, but instead of code, you’re tracking models, data, and the hyperparameters that brought them into existence.

Let’s see this in action. Imagine you’re fine-tuning a Llama 2 model on your company’s internal documentation to build a more accurate internal chatbot.

First, you’ll need to install MLflow and the necessary LLM libraries:

pip install mlflow transformers datasets accelerate bitsandbytes peft

Now, you’ll wrap your fine-tuning script with MLflow. Here’s a simplified example using the transformers library and PEFT (Parameter-Efficient Fine-Tuning) for efficiency.

import mlflow
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# --- MLflow Configuration ---
# Set an experiment name. If it doesn't exist, MLflow will create it.
mlflow.set_experiment("LLM-FineTuning-Llama2")

# Define your hyperparameters
model_id = "meta-llama/Llama-2-7b-hf"
dataset_path = "your_internal_data.jsonl" # Replace with your actual dataset path
learning_rate = 2e-5
per_device_train_batch_size = 4
num_train_epochs = 3
lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
output_dir = "./llama2-finetuned"

# --- Start MLflow Run ---
with mlflow.start_run():
    # Log hyperparameters
    mlflow.log_param("model_id", model_id)
    mlflow.log_param("dataset_path", dataset_path)
    mlflow.log_param("learning_rate", learning_rate)
    mlflow.log_param("per_device_train_batch_size", per_device_train_batch_size)
    mlflow.log_param("num_train_epochs", num_train_epochs)
    mlflow.log_param("lora_r", lora_r)
    mlflow.log_param("lora_alpha", lora_alpha)
    mlflow.log_param("lora_dropout", lora_dropout)
    mlflow.log_param("output_dir", output_dir)

    # --- Model and Tokenizer Loading ---
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Add a padding token if it doesn't exist (common for Llama)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

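    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        # 8-bit quantization (via bitsandbytes) for memory efficiency. Newer transformers
        # releases prefer quantization_config=BitsAndBytesConfig(load_in_8bit=True).
        load_in_8bit=True,
        device_map="auto",
        torch_dtype=torch.float16 # float16 for the layers that are not quantized
    )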

    # --- PEFT Configuration ---
    peft_config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        # Specify which modules to apply LoRA to. This is crucial for Llama.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters() # This will show how few parameters are actually trained

    # --- Dataset Loading and Preprocessing ---
    dataset = load_dataset("json", data_files=dataset_path)

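    def tokenize_function(examples):
        # Assuming your dataset has a 'text' field. Adjust if needed.
        # You might need more complex tokenization and formatting for specific tasks.
        tokens = tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")
        # Causal LM training needs labels, otherwise the Trainer has no loss to optimize.
        # Copy input_ids and mask padding positions with -100 so they are ignored in the loss.
        tokens["labels"] = [
            [tok if keep else -100 for tok, keep in zip(ids, mask)]
            for ids, mask in zip(tokens["input_ids"], tokens["attention_mask"])
        ]
        return tokens

    tokenized_dataset = dataset.map(tokenize_function, batched=True)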

    # --- Training Arguments ---
    training_args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        learning_rate=learning_rate,
        num_train_epochs=num_train_epochs,
        logging_dir=f"{output_dir}/logs",
        logging_steps=10,
        save_steps=500, # Save checkpoints periodically
        # Evaluation is off by default. With a validation set, pass eval_strategy="steps"
        # (named evaluation_strategy in older transformers releases).
        # Add more arguments as needed for your specific training setup
    )

    # --- Trainer Initialization ---
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"], # Adjust if your dataset has multiple splits
        tokenizer=tokenizer, # Newer transformers releases rename this argument to processing_class
    )

    # --- Start Training ---
    print("Starting training...")
    trainer.train()
    print("Training finished.")

    # --- Save Model and Log Artifacts ---
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

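    # Log the output directory as run artifacts. With PEFT, save_pretrained writes
    # only the LoRA adapter weights and config, not the full base model.
    mlflow.log_artifacts(output_dir, artifact_path="model")
    print(f"Model saved to {output_dir} and logged to MLflow under artifact path 'model'")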

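When you run this script, MLflow records:

Parameters: Every mlflow.log_param call records a key-value pair. This includes your learning_rate, num_train_epochs, lora_r, and even the model_id.
Metrics: This minimal example does not log any explicitly. After training you could call mlflow.log_metric("train_loss", trainer.state.log_history[-1]["train_loss"]); with evaluation disabled, the last log_history entry is the training summary, which carries train_loss rather than loss. Per-step loss, accuracy, or perplexity can be logged the same way by passing a step argument. Depending on your transformers version and its report_to setting, the Trainer’s own MLflow integration may also log metrics to the active run.
Artifacts: The mlflow.log_artifacts(output_dir, artifact_path="model") line uploads everything in the output directory as artifacts of this specific run. Because the model is wrapped with PEFT, that means the LoRA adapter weights and config, the tokenizer files, and any intermediate checkpoints, not a full copy of the base model weights.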
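The per-step metric logging described in the list above can be sketched as follows. This is a minimal sketch, assuming the input has the shape of trainer.state.log_history: a list of dicts in which periodic training entries carry loss and step keys and the final summary entry carries train_loss instead. The helper names extract_loss_points and log_loss_curve are hypothetical, and the example values are hand-written for illustration.

```python
def extract_loss_points(log_history):
    """Return (step, loss) pairs from Trainer-style log entries.

    Entries without a "loss" key (for example the final summary entry,
    which reports "train_loss") are skipped.
    """
    return [
        (entry["step"], entry["loss"])
        for entry in log_history
        if "loss" in entry and "step" in entry
    ]


def log_loss_curve(log_history):
    """Log each recorded training loss to the active MLflow run."""
    import mlflow  # imported here so the parsing helper has no MLflow dependency

    for step, loss in extract_loss_points(log_history):
        mlflow.log_metric("loss", loss, step=step)


# Hand-written entries that mimic trainer.state.log_history (illustrative values).
example_history = [
    {"loss": 2.31, "learning_rate": 1.9e-05, "epoch": 0.1, "step": 10},
    {"loss": 1.87, "learning_rate": 1.8e-05, "epoch": 0.2, "step": 20},
    {"train_runtime": 120.0, "train_loss": 2.09, "epoch": 0.2, "step": 20},
]
points = extract_loss_points(example_history)
```

Inside the with mlflow.start_run(): block, calling log_loss_curve(trainer.state.log_history) after trainer.train() would attach the loss curve to the same run as the parameters.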

After running, you can launch the MLflow UI (mlflow ui) in your terminal from the directory where your mlruns folder is created. You’ll see a table of your experiments, and within each experiment, a list of runs. Clicking on a run shows all of its logged parameters and metrics and lets you download the saved artifacts.

The core problem MLflow solves here is reproducibility and comparison. Without it, you’d have to manually track which combination of learning_rate=2e-5, lora_r=16, and num_train_epochs=3 produced that surprisingly good chatbot. With MLflow, that run is a distinct entry, forever linked to its exact configuration.

MLflow’s capabilities extend beyond basic logging. Hugging Face’s Trainer ships an MLflow callback (enabled with report_to="mlflow") that logs training arguments and metrics automatically and can also upload checkpoints, and PyTorch Lightning has a comparable autologging integration. You can log the training dataset itself as an artifact if it’s small enough. The mlflow.autolog() function can also capture many parameters and metrics without explicit log_param calls, though for fine-grained control, manual logging is often preferred.
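One way to configure the Trainer integration is through the environment variables that the Hugging Face MLflowCallback reads. The sketch below assumes a transformers version that ships this callback and that report_to="mlflow" is passed to TrainingArguments. The helper name mlflow_trainer_env is hypothetical.

```python
import os


def mlflow_trainer_env(experiment_name, log_checkpoints=False):
    """Build the environment variables read by Hugging Face's MLflowCallback."""
    env = {
        # Experiment the callback uses when it starts its own run.
        "MLFLOW_EXPERIMENT_NAME": experiment_name,
        # Flatten nested config dicts into dotted parameter names.
        "MLFLOW_FLATTEN_PARAMS": "1",
    }
    if log_checkpoints:
        # Ask the callback to upload saved checkpoints as run artifacts.
        env["HF_MLFLOW_LOG_ARTIFACTS"] = "1"
    return env


settings = mlflow_trainer_env("LLM-FineTuning-Llama2", log_checkpoints=True)
os.environ.update(settings)  # set these before training starts
```

Keeping the settings in a small function makes it easy to log the same dict as run tags or to turn checkpoint uploads off for quick local experiments, where copying multi-gigabyte checkpoints is rarely worth the time.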

For example, imagine you want to compare the performance of LoRA with different lora_r values. You’d simply loop through your desired r values, opening a new run with mlflow.start_run() for each, logging the specific r value, training the model, and logging the artifacts. The MLflow UI then becomes your dashboard to visually compare these runs side by side, seeing which lora_r yielded the best validation loss (if you were logging it).
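The loop described above might look like the following. This is a sketch under stated assumptions: train_fn is a hypothetical callable that fine-tunes with a given config and returns the final training loss, lora_alpha is kept at twice the rank (the ratio used in the script earlier), and the helper names are illustrative.

```python
def sweep_configs(base_config, r_values):
    """Yield one config dict per LoRA rank, leaving base_config untouched."""
    for r in r_values:
        config = dict(base_config)
        config["lora_r"] = r
        # Assumption: keep lora_alpha at twice the rank, as in the script above.
        config["lora_alpha"] = 2 * r
        yield config


def run_sweep(base_config, r_values, train_fn):
    """Start one MLflow run per config and log its parameters and final loss.

    train_fn is a hypothetical callable that takes a config dict, fine-tunes
    the model, and returns the final training loss as a float.
    """
    import mlflow  # imported lazily so sweep_configs works without MLflow

    mlflow.set_experiment("LLM-FineTuning-Llama2")
    for config in sweep_configs(base_config, r_values):
        with mlflow.start_run(run_name=f"lora_r={config['lora_r']}"):
            mlflow.log_params(config)
            mlflow.log_metric("train_loss", train_fn(config))


def best_run(metric="train_loss"):
    """Return the run with the lowest value of the metric, or None if no runs exist."""
    import mlflow

    runs = mlflow.search_runs(
        experiment_names=["LLM-FineTuning-Llama2"],
        order_by=[f"metrics.{metric} ASC"],
        max_results=1,
    )
    return None if runs.empty else runs.iloc[0]


base = {"learning_rate": 2e-5, "num_train_epochs": 3, "lora_r": 16, "lora_alpha": 32}
configs = list(sweep_configs(base, [8, 16, 32]))
```

Each config gets its own run, so the comparison view in the UI and the search_runs query both see lora_r as an ordinary parameter column.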

The true power emerges when you start logging more than just the final model. The Trainer owns the training loop, so the natural hook for intermediate checkpoints is a callback: each time the Trainer saves a checkpoint, upload that directory with mlflow.log_artifacts(). You can then recover the model as it was at any saved step, which is invaluable for debugging or for resuming training from a specific checkpoint if a run was interrupted.
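A checkpoint-logging hook could be sketched as a TrainerCallback. The sketch assumes the Trainer’s default checkpoint-<step> directory naming and an active MLflow run. The names checkpoint_path, make_checkpoint_logger, and CheckpointLogger are hypothetical.

```python
import os


def checkpoint_path(output_dir, global_step):
    """Return the directory the Trainer writes for a given step.

    Assumes the Trainer's default "checkpoint-<step>" naming.
    """
    return os.path.join(output_dir, f"checkpoint-{global_step}")


def make_checkpoint_logger():
    """Build a TrainerCallback that uploads each saved checkpoint to MLflow."""
    # Imported lazily so checkpoint_path can be used without these packages.
    import mlflow
    from transformers import TrainerCallback

    class CheckpointLogger(TrainerCallback):
        def on_save(self, args, state, control, **kwargs):
            # on_save fires after the checkpoint has been written to disk.
            path = checkpoint_path(args.output_dir, state.global_step)
            # Upload from the main process only, and only if the directory exists.
            if state.is_world_process_zero and os.path.isdir(path):
                mlflow.log_artifacts(
                    path, artifact_path=f"checkpoints/step-{state.global_step}"
                )

    return CheckpointLogger()


example = checkpoint_path("./llama2-finetuned", 500)
```

To use it, pass callbacks=[make_checkpoint_logger()] when constructing the Trainer inside the with mlflow.start_run(): block. Uploading every checkpoint can be slow and storage-heavy, so a larger save_steps value is worth considering.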

One aspect that often surprises people is how MLflow handles large artifacts like LLM weights. MLflow keeps two kinds of storage: a backend store for run metadata (parameters, metrics, tags) and an artifact store for files. When you call mlflow.log_artifacts, the files are copied or uploaded to the artifact store configured for your tracking setup (e.g., local file system, S3, Azure Blob Storage), and the run records only the artifact URI. For local development, that is a simple directory copy into mlruns. With a remote artifact store, the files are uploaded to cloud storage, which can take a while for multi-gigabyte models. The key is that MLflow manages the storage and retrieval, so your run stays linked to its exact artifacts, regardless of where they physically reside.
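Retrieval works the same way regardless of where the artifact store lives. The sketch below assumes the artifacts were logged under the model artifact path, as in the script earlier. The helper names are hypothetical and "abc123" is a placeholder rather than a real run ID.

```python
import os


def run_artifact_uri(run_id, artifact_path="model"):
    """Build a runs:/ URI that points at an artifact directory of a run."""
    return f"runs:/{run_id}/{artifact_path}"


def download_model(run_id, dst_path="./downloaded-model"):
    """Copy a run's "model" artifacts from the artifact store to local disk."""
    # Imported lazily so run_artifact_uri can be used without MLflow installed.
    from mlflow.artifacts import download_artifacts

    os.makedirs(dst_path, exist_ok=True)
    return download_artifacts(
        artifact_uri=run_artifact_uri(run_id), dst_path=dst_path
    )


# "abc123" is a placeholder, not a real run ID.
uri = run_artifact_uri("abc123")
```

The returned local path can then be handed to the PEFT loading utilities together with the base model, since only the adapter was logged.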

The next step after mastering run tracking is understanding MLflow’s model registry, where you can version and stage your fine-tuned models for deployment.