MLflow’s tracking backend doesn’t actually hold your artifacts; it only records where they are.

Let’s watch MLflow in action, specifically how it handles artifact storage. Imagine you’re running a training job locally and want to log a model and some metrics.

# Set up a local MLflow tracking server
mlflow ui

# In a separate terminal, run a Python script
python train_model.py

And here’s a snippet of train_model.py:

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import pandas as pd

# Point the client at the tracking server started by `mlflow ui`
# (it listens on http://localhost:5000 by default)
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("S3_Artifact_Example")

# Generate some sample data
X, y = make_regression(n_samples=100, n_features=1, noise=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
df = pd.DataFrame(X, columns=['feature'])
df['target'] = y

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_samples", 100)
    mlflow.log_param("noise", 10)

    # Train a simple model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Log the model
    mlflow.sklearn.log_model(model, "linear_model")

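    # Log a dataset artifact (log_artifact expects a local file path, not a DataFrame)
    df.to_csv("data.csv", index=False)
    mlflow.log_artifact("data.csv", "raw_data")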

    # Log a metric
    score = model.score(X_test, y_test)
    mlflow.log_metric("r2_score", score)

    print(f"Run completed with R2 score: {score}")

When you run this, MLflow records metadata (parameters, metrics, and artifact locations) in its backend store. Here that is whatever store sits behind the tracking server at http://localhost:5000, typically a local mlruns directory; the same directory is the client’s default if mlflow.set_tracking_uri is never called. The actual artifacts, meaning the saved model files (linear_model/model.pkl) and the CSV data (raw_data/data.csv), are written to the artifact store, which can be a separate location.

By default, if you’re not explicitly configuring a cloud storage backend, MLflow will store artifacts locally alongside the backend store. However, the real power comes when you configure it for cloud object storage.

To make MLflow store artifacts in Amazon S3, set the artifact root when you start the tracking server, or set an artifact location when you create an experiment. MLflow does not read an ARTIFACT_ROOT environment variable, and mlflow.log_artifact() has no artifact_location argument. Credentials come from the standard AWS mechanisms (for example AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or an IAM role).

# Only needed for S3-compatible stores such as MinIO; omit it for AWS S3 itself
export MLFLOW_S3_ENDPOINT_URL="https://s3.amazonaws.com"

# Start the server with a backend store and a default artifact root
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root s3://your-mlflow-bucket/path/to/artifacts

Or, from Python, set the artifact location per experiment:

import os
import mlflow

os.environ["MLFLOW_TRACKING_URI"] = "sqlite:///mlflow.db"  # Example backend
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

# artifact_location only takes effect when the experiment is first created
if mlflow.get_experiment_by_name("S3_Artifact_Example") is None:
    mlflow.create_experiment(
        "S3_Artifact_Example",
        artifact_location="s3://your-mlflow-bucket/path/to/artifacts",
    )
mlflow.set_experiment("S3_Artifact_Example")
# ... rest of your script; log_model() and log_artifact() now upload to S3

When mlflow.sklearn.log_model() or mlflow.log_artifact() is called, MLflow uploads the files under the run’s artifact URI, which looks like s3://your-mlflow-bucket/path/to/artifacts/<run_id>/artifacts/<artifact_path> (a server-level default artifact root adds an <experiment_id> segment before <run_id>). The MLflow backend store (your mlflow.db or mlruns directory) will then contain entries pointing to these S3 URIs. When you open a run in the MLflow UI and download an artifact, the tracking server uses the stored S3 URI to fetch the file from S3.
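To make the layout concrete, here is a minimal sketch of how a run’s artifact URI is composed when the experiment’s artifact location is an S3 prefix. The helper run_artifact_uri and the run ID are hypothetical and exist only for illustration; this is not MLflow’s own code, just the path convention (an artifacts folder inside the run’s directory) written out.

```python
def run_artifact_uri(artifact_location: str, run_id: str, artifact_path: str = "") -> str:
    """Compose <artifact_location>/<run_id>/artifacts/<artifact_path> (illustrative only)."""
    parts = [artifact_location.rstrip("/"), run_id, "artifacts"]
    if artifact_path:
        parts.append(artifact_path.strip("/"))
    return "/".join(parts)


# Hypothetical run ID; the bucket and prefix are the placeholders used in this article
uri = run_artifact_uri(
    "s3://your-mlflow-bucket/path/to/artifacts",
    "0123456789abcdef",
    "linear_model",
)
print(uri)
```

Inside a real run, mlflow.get_artifact_uri() returns the actual value, so you rarely need to build it by hand.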

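For Google Cloud Storage (GCS), it’s similar: use a gs:// artifact root. Credentials come from the standard Google mechanisms, such as GOOGLE_APPLICATION_CREDENTIALS:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
mlflow server \
  --backend-store-uri postgresql://user:password@host:port/database \
  --default-artifact-root gs://your-mlflow-bucket/path/to/artifacts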

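And for Azure Blob Storage, use a wasbs:// artifact root and supply credentials through AZURE_STORAGE_CONNECTION_STRING or AZURE_STORAGE_ACCESS_KEY:

export AZURE_STORAGE_CONNECTION_STRING="<your connection string>"
mlflow server \
  --backend-store-uri file:///path/to/local/mlruns \
  --default-artifact-root wasbs://your_container@your_storage_account.blob.core.windows.net/path/to/artifacts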
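In all three cases the URI scheme of the artifact root is what selects the storage service. The sketch below is a rough illustration of that idea using only the standard library; the SCHEME_TO_STORE table and describe_artifact_root helper are hypothetical and are not MLflow’s internal registry.

```python
from urllib.parse import urlparse

# Illustrative mapping from URI scheme to storage service (not MLflow's internal registry)
SCHEME_TO_STORE = {
    "s3": "Amazon S3",
    "gs": "Google Cloud Storage",
    "wasbs": "Azure Blob Storage",
    "file": "local filesystem",
    "": "local filesystem",  # a bare path has no scheme
}


def describe_artifact_root(artifact_root: str) -> str:
    """Return a human-readable name for the store an artifact root points at."""
    scheme = urlparse(artifact_root).scheme
    return SCHEME_TO_STORE.get(scheme, f"unrecognized scheme: {scheme}")


for root in (
    "s3://your-mlflow-bucket/path/to/artifacts",
    "gs://your-mlflow-bucket/path/to/artifacts",
    "wasbs://your_container@your_storage_account.blob.core.windows.net/path/to/artifacts",
    "./mlruns",
):
    print(f"{root} -> {describe_artifact_root(root)}")
```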

The key insight is that MLflow’s backend store (where metadata lives) and its artifact store (where the actual files live) are decoupled. You can use a local SQLite database for metadata and S3 for artifacts, or a PostgreSQL database for metadata and GCS for artifacts. This allows you to scale your metadata storage independently of your artifact storage.

When MLflow logs an artifact, it doesn’t compress or bundle it. It uploads the file or directory structure as-is to the configured object store. This means that if you log a directory with many small files, you’ll see many individual objects in your S3 or GCS bucket. The MLflow UI then reconstructs the directory view by listing objects under the stored URI in the artifact store, not in the backend store. Uploading directly avoids an intermediate packaging step, although many small files also mean many separate upload requests.
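The one-object-per-file behavior can be sketched with the standard library alone. The helper object_keys_for_directory below is hypothetical and only mimics the resulting key layout; it does not call MLflow or any cloud SDK, and the file names are invented for the demonstration.

```python
import tempfile
from pathlib import Path


def object_keys_for_directory(local_dir, dest_prefix):
    """Return one object key per file under local_dir, preserving relative paths."""
    root = Path(local_dir)
    keys = []
    for path in sorted(root.rglob("*")):
        if path.is_file():  # directories themselves do not become objects
            rel = path.relative_to(root).as_posix()
            keys.append(f"{dest_prefix.rstrip('/')}/{rel}")
    return keys


# Build a small demo directory with nested files
demo = Path(tempfile.mkdtemp())
(demo / "shards").mkdir()
for i in range(3):
    (demo / "shards" / f"part-{i}.csv").write_text("feature,target\n")
(demo / "schema.json").write_text("{}")

keys = object_keys_for_directory(demo, "path/to/artifacts/<run_id>/artifacts/raw_data")
for key in keys:
    print(key)
```

Four files produce four keys, which is why logging a directory of thousands of tiny files can be noticeably slower than logging a single archive you create yourself.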

The next thing you’ll grapple with is efficient retrieval and versioning of these artifacts, especially when dealing with large datasets or models.
