MLflow Projects are a convention for packaging machine learning code, together with its entry points, parameters, and environment specification, so that a run can be reproduced on another machine.

Let’s say you’ve trained a model and want to share it with a colleague, or run it again yourself months later. How do you ensure it runs exactly as it did before? You need to capture not just the code, but also the exact dependencies (libraries and versions), the data it used, and any parameters. MLflow Projects give you a structured way to record the code’s entry points, its dependencies, and its parameters. The data is passed in as a parameter, so versioning it remains your responsibility.

Imagine you have a Python script for training, train.py:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.sklearn  # explicit import for mlflow.sklearn.log_model below
import argparse

def train_model(data_path, alpha, max_iter):
    # Load data
    data = pd.read_csv(data_path)
    X = data[['feature1', 'feature2']]
    y = data['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Log parameters
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("max_iter", max_iter)

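    # Train model. LinearRegression accepts neither alpha nor max_iter; they are
    # logged above only to show parameter passing. A regularized estimator such
    # as Ridge(alpha=alpha, max_iter=max_iter) would actually use them.
    model = LinearRegression()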
    model.fit(X_train, y_train)

    # Log metrics
    score = model.score(X_test, y_test)
    mlflow.log_metric("r2_score", score)

    # Save model
    mlflow.sklearn.log_model(model, "model")
    print(f"Model trained with R2 score: {score}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", default="data.csv", help="Path to the training data CSV file.")
    parser.add_argument("--alpha", type=float, default=0.5, help="Regularization strength.")
    parser.add_argument("--max_iter", type=int, default=100, help="Maximum iterations for optimizer.")
    args = parser.parse_args()

    # Start an MLflow run
    with mlflow.start_run():
        train_model(args.data_path, args.alpha, args.max_iter)

To make this an MLflow Project, create an MLproject file in the same directory:

name: MyLinearRegressionProject

# Specify the entry point for running the project.
# This points to the main Python script and its arguments.
entry_points:
  train:
    parameters:
      # type: path makes MLflow convert a relative value to an absolute path
      data_path: {type: path, default: "data.csv"}
      alpha: {type: float, default: 0.5}
      max_iter: {type: int, default: 100}
    command: "python train.py --data_path {data_path} --alpha {alpha} --max_iter {max_iter}"

# Define the environment dependencies.
# This ensures the project runs with the same libraries.
# Conda is preferred for reproducibility.
conda_env: conda.yaml

And a conda.yaml file for dependencies:

name: mlflow-project-env
channels:
  - conda-forge
dependencies:
  - python=3.9.7
  - pandas=1.3.4
  - scikit-learn=1.0.0
  - mlflow=1.20.0

Now, if you have this directory structure:

my_project/
├── MLproject
├── conda.yaml
├── train.py
└── data.csv

You can run this project from outside the my_project directory using the mlflow run command:

mlflow run my_project -e train -P data_path=./my_project/data.csv -P alpha=0.8

MLflow will:

  1. Read the MLproject file.
  2. Create a Conda environment from conda.yaml, or reuse a previously created one if the file is unchanged.
  3. Execute the command specified in the entry_points.train.command, substituting the provided parameters.
  4. Log the run details (parameters, metrics, artifacts) to the configured tracking location: a local mlruns directory by default, or the tracking server set in MLFLOW_TRACKING_URI.

This mlflow run command is the key: it abstracts away environment setup and execution. Instead of manually installing libraries and running scripts, you declare your project’s needs and MLflow handles the rest. Someone else with MLflow and Conda installed can clone your repository, supply the same data.csv, and run the same mlflow run command with the same code, dependencies, and parameters. Whether the results are numerically identical still depends on the data being the same and on the training being deterministic, which the fixed random_state in train.py helps with.
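If you launch runs from a script rather than by hand, the same invocation can be assembled programmatically. The sketch below only builds the argument list for the mlflow CLI. The helper names build_mlflow_run_command and run_project are hypothetical, and the sketch assumes the mlflow executable is on your PATH and that parameters are passed as -P name=value pairs.

```python
import subprocess
from typing import Dict, List


def build_mlflow_run_command(
    uri: str, entry_point: str, params: Dict[str, object]
) -> List[str]:
    """Build the argv list for `mlflow run` (hypothetical helper, not part of MLflow)."""
    cmd = ["mlflow", "run", uri, "-e", entry_point]
    for name, value in params.items():
        # Each parameter becomes its own `-P name=value` pair.
        cmd += ["-P", f"{name}={value}"]
    return cmd


def run_project(uri: str, entry_point: str, params: Dict[str, object]) -> None:
    """Run the project; assumes the `mlflow` CLI is installed and on PATH."""
    subprocess.run(build_mlflow_run_command(uri, entry_point, params), check=True)


cmd = build_mlflow_run_command(
    "my_project", "train", {"data_path": "./my_project/data.csv", "alpha": 0.8}
)
print(" ".join(cmd))
```

Passing the command as a list avoids a shell, so values containing spaces need no extra quoting. MLflow also exposes a Python entry point, mlflow.projects.run, which may be a better fit when MLflow is importable in the calling process.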

The entry_points section in MLproject is crucial. It defines named commands you can run. Here, we have one named train. The parameters under train declare what inputs this entry point accepts, their types, and default values. These correspond to the arguments your train.py script expects. The command then specifies how to invoke the script using these parameters. The {parameter_name} syntax is a template that MLflow fills in.
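To make the template substitution concrete, here is a minimal sketch of how declared defaults and command-line overrides might be merged and filled into the command string. It is an illustration, not MLflow’s actual implementation: render_command is a hypothetical name, and the sketch skips the type validation that MLflow performs.

```python
import shlex
from typing import Dict, Optional


def render_command(
    template: str,
    defaults: Dict[str, object],
    overrides: Optional[Dict[str, object]] = None,
) -> str:
    """Fill {name} placeholders from defaults overlaid with user overrides."""
    values = {**defaults, **(overrides or {})}
    # Shell-quote each value so spaces or special characters stay one argument.
    quoted = {name: shlex.quote(str(value)) for name, value in values.items()}
    return template.format(**quoted)


template = "python train.py --data_path {data_path} --alpha {alpha} --max_iter {max_iter}"
defaults = {"data_path": "data.csv", "alpha": 0.5, "max_iter": 100}

print(render_command(template, defaults, {"alpha": 0.8}))
# python train.py --data_path data.csv --alpha 0.8 --max_iter 100
```

Only alpha is overridden here, so data_path and max_iter fall back to the defaults declared under parameters.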

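The conda_env section is where the reproducibility comes from. By pointing to a conda.yaml file, you let MLflow create an isolated environment with the pinned Python and package versions, which prevents "it worked on my machine" problems. MLflow does not infer dependencies from your code. If you don’t specify an environment, the fallback depends on the MLflow version (older releases build a Conda environment containing only Python), and you can opt into the current environment explicitly with --env-manager=local (or --no-conda in older 1.x releases). Both options are less reproducible.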
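To see why pinned versions matter, you can compare the pins against whatever happens to be installed in your current interpreter. The sketch below is a rough check, not something MLflow does for you: parse_pin, installed_version, and find_mismatches are hypothetical helpers, the comparison is an exact string match, and it assumes the Conda package name equals the Python distribution name, which is not always true.

```python
from importlib import metadata
from typing import Callable, Dict, List, Optional, Tuple


def parse_pin(spec: str) -> Tuple[str, str]:
    """Split a Conda-style pin such as 'pandas=1.3.4' into (name, version)."""
    name, _, version = spec.partition("=")
    return name.strip(), version.strip()


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: List[str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs to (pinned, found)."""
    mismatches = {}
    for spec in pins:
        name, wanted = parse_pin(spec)
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Pins copied from conda.yaml (python itself is left out of this check).
pins = ["pandas=1.3.4", "scikit-learn=1.0.0", "mlflow=1.20.0"]
for name, (wanted, found) in find_mismatches(pins).items():
    print(f"{name}: pinned {wanted}, installed {found}")
```

Any line this prints is a difference that an isolated, pinned environment would have avoided.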

The most surprising thing about MLflow Projects is how little boilerplate you need to add to your existing ML code to make it runnable anywhere. You don’t need to rewrite your training scripts to be MLflow-aware beyond parsing command-line arguments, which is already a common practice. The MLproject file acts as a declarative wrapper.

When you run mlflow run my_project, MLflow first looks for the MLproject file and then sets up the environment named by conda_env. It creates a Conda environment for that file, or reuses one it created earlier. The cached environment is keyed on the contents of conda.yaml rather than on the name field inside it, so any change to the file produces a fresh environment. Without a conda_env entry, the behavior depends on the MLflow version and the environment manager you select. Once the environment is ready, MLflow executes the command for the chosen entry point from the project directory, substituting the parameters you passed on the command line (or their defaults) for the placeholders in the command string. Parameters, metrics, and artifacts logged by the script are then associated with this specific run in MLflow.
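Environment reuse can be pictured as content-addressed naming. The sketch below assumes the cache key is derived from a hash of the conda.yaml text, which is how MLflow’s open-source implementation has named its environments. The function env_name_for is hypothetical, and the exact scheme may differ between MLflow versions.

```python
import hashlib


def env_name_for(conda_yaml_text: str, prefix: str = "mlflow-") -> str:
    """Derive a deterministic environment name from the file contents (a sketch)."""
    digest = hashlib.sha1(conda_yaml_text.encode("utf-8")).hexdigest()
    return prefix + digest


original = "dependencies:\n  - python=3.9.7\n  - pandas=1.3.4\n"
bumped = "dependencies:\n  - python=3.9.7\n  - pandas=1.3.5\n"

# Identical contents map to the same name, so the environment can be reused.
print(env_name_for(original) == env_name_for(original))
# Any edit to the file yields a different name, so a new environment is built.
print(env_name_for(original) == env_name_for(bumped))
```

Under this scheme, bumping a single version pin is enough to trigger a rebuild, while repeated runs against an unchanged file skip environment creation.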

One aspect that often trips people up is how MLflow handles paths. MLflow runs the entry point command with the project directory as its working directory. A parameter declared with type: string is passed to the script verbatim, so a relative value such as data.csv is resolved against the project root, not against the directory where you typed mlflow run. A parameter declared with type: path is checked and converted to an absolute path first, relative to the directory where you invoked mlflow run, so a value such as ./my_project/data.csv given from the parent directory reaches the script as an absolute path. For maximum robustness, use absolute paths or paths that are reliably located relative to the project root.
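One way to make a script indifferent to the caller’s working directory is to anchor relative paths to a known root yourself. The sketch below shows the idea; resolve_data_path is a hypothetical helper, and /srv/my_project is a made-up location used only for the demonstration.

```python
from pathlib import Path


def resolve_data_path(raw: str, project_root: Path) -> Path:
    """Return absolute paths unchanged; anchor relative ones to project_root."""
    path = Path(raw)
    if path.is_absolute():
        return path
    return project_root / path


# In train.py the root could be taken from the script's own location:
#     PROJECT_ROOT = Path(__file__).resolve().parent
root = Path("/srv/my_project")  # hypothetical project location

print(resolve_data_path("data.csv", root))
print(resolve_data_path("/data/shared/data.csv", root))
```

With this in place, the script reads the same file whether it is started by mlflow run or directly with python train.py from some other directory.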

The next logical step after mastering reproducible runs is understanding how to version and deploy these projects, often involving MLflow’s model registry and deployment tools.
