MLflow Data Versioning: Log and Track Training Datasets (2026)

MLflow’s dataset tracking doesn’t store copies of your datasets; it records metadata about them, creating a direct, auditable link between your model’s training data and the resulting model artifact.

Let’s see this in action. Imagine you’re training a model to classify customer reviews.

First, you need to log your dataset. MLflow makes this straightforward:

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load your dataset (this example assumes the feature columns are already numeric)
data = pd.read_csv("customer_reviews.csv")

# Split data; keep the 'sentiment' column in each frame so MLflow can record it as the target
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
X_train, y_train = train_df.drop('sentiment', axis=1), train_df['sentiment']
X_test, y_test = test_df.drop('sentiment', axis=1), test_df['sentiment']

# Wrap each split as an MLflow Dataset; targets is the name of the label column, not a Series
train_dataset = mlflow.data.from_pandas(train_df, targets="sentiment", name="customer_reviews_train")
test_dataset = mlflow.data.from_pandas(test_df, targets="sentiment", name="customer_reviews_validation")

# Start your MLflow run, log the datasets inside it, and train your model
with mlflow.start_run() as run:
    mlflow.log_param("data_path", "customer_reviews.csv") # Optional: log original source
    mlflow.log_input(train_dataset, context="training")
    mlflow.log_input(test_dataset, context="validation")

    # Example: Train a simple Logistic Regression
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # Log the model artifact
    mlflow.sklearn.log_model(model, "customer_sentiment_model")

    # You can now see the logged datasets associated with this run
    print(f"MLflow Run ID: {run.info.run_id}")

When you run this, MLflow doesn’t copy your training DataFrame into its artifact store. It creates a Dataset object and logs that object’s metadata with the run: a name, a digest (a short hash of the data’s content), the inferred schema, a small profile such as the row count, and a description of the source the data came from. The data itself stays wherever it already lives. The context="training" tells MLflow this specific version of the data was used for training.

The problem MLflow data versioning solves is the "it worked on my machine" scenario, amplified. Without it, you’d have to manually track which version of customer_reviews.csv was used for which model training run. Did you modify customer_reviews.csv between training run A and run B? Which version of X_train was used for the model that’s now in production? This becomes a nightmare for reproducibility and debugging.

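MLflow’s mlflow.data.from_pandas() is a key component. It inspects the DataFrame, infers a schema, and computes a digest. To look up what a run was trained on, fetch the run with mlflow.get_run(run_id) and read run.inputs.dataset_inputs; each entry carries the dataset’s name, digest, schema, profile, and source, plus its context tag. If the dataset was logged with a resolvable source (for example a file path or URL), mlflow.data.get_source() returns a source object whose load() method fetches that source. Because MLflow records a reference rather than a copy, this reproduces the training data only if the source itself has not changed, and recomputing the digest on the reloaded data is a way to check that. This is crucial for auditing, debugging, and retraining.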

The context parameter isn’t just a label; it’s a way to organize your data inputs within a single run. You can log multiple datasets for different purposes: context="training", context="validation", context="testing", or even custom ones like context="feature_engineering_step_1". MLflow stores these as distinct inputs associated with the run.

Under the hood, when you create a dataset, MLflow computes a digest: a short hash of the data’s content. For a pandas DataFrame it is derived from the column names and row values, and for very large frames it may be computed from a subset of rows, so treat it as a fingerprint rather than a cryptographic guarantee. The digest is stored as part of the dataset’s metadata. Within an experiment, a dataset is identified by its name and digest, so logging the exact same data in several runs links those runs to the same dataset record instead of creating a new one each time. Since no data is copied, there is nothing to duplicate in storage, and comparing digests tells you quickly whether two runs saw the same data.

Most people don’t realize that MLflow’s dataset tracking covers more than pandas DataFrames. log_input takes a Dataset object (a subclass of mlflow.data.dataset.Dataset) rather than raw data, and MLflow provides helpers for several kinds of data: mlflow.data.from_pandas(), mlflow.data.from_numpy(), mlflow.data.from_spark(), mlflow.data.from_huggingface(), and mlflow.data.from_tensorflow(). For Parquet or CSV files, a common approach is to load the file into a DataFrame and pass its path or URL as the source argument so the origin is recorded. This flexibility allows you to version complex data pipelines, not just simple tabular data.

The next challenge is often managing model retraining pipelines based on updated datasets.