MLflow’s PostgreSQL backend store is often perceived as a simple database connection, but its real power lies in how it decouples experiment tracking from your application’s lifecycle, enabling robust historical analysis and auditability.

Let’s see MLflow and PostgreSQL working together. Imagine you’re running an experiment.

import mlflow
from mlflow.tracking import MlflowClient
import psycopg2

# Assuming PostgreSQL is running and accessible
# Replace with your actual connection details
db_host = "localhost"
db_port = "5432"
db_user = "mlflow_user"
db_password = "mlflow_password"
db_name = "mlflow_db"

# Construct the SQLAlchemy URI for MLflow
# This is the key piece for configuring the backend store
sqlalchemy_uri = f"postgresql://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}"

# Set the MLflow tracking URI to point to your PostgreSQL backend
mlflow.set_tracking_uri(sqlalchemy_uri)

# Initialize MlflowClient to interact with the backend
client = MlflowClient()

# Start a new MLflow run
with mlflow.start_run(run_name="pg_backend_test_run") as run:
    run_id = run.info.run_id
    print(f"MLflow Run ID: {run_id}")

    # Log a parameter
    mlflow.log_param("learning_rate", 0.01)

    # Log a metric
    mlflow.log_metric("accuracy", 0.95)

    # Log an artifact (e.g., a small text file)
    with open("model_info.txt", "w") as f:
        f.write("This is a dummy model info file.")
    mlflow.log_artifact("model_info.txt")

print("Experiment complete. Check MLflow UI or query PostgreSQL directly.")

# To verify directly in PostgreSQL (requires psycopg2 installed)
try:
    conn = psycopg2.connect(
        host=db_host,
        port=db_port,
        user=db_user,
        password=db_password,
        dbname=db_name
    )
    cur = conn.cursor()

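    # Query for the run. Values are passed as query parameters rather than
    # interpolated into the SQL string. MLflow's SQLAlchemy store names the
    # table "runs"; recent versions keep the run name in its "name" column.
    cur.execute("SELECT run_uuid, name FROM runs WHERE run_uuid = %s", (run_id,))
    run_data = cur.fetchone()
    if run_data is not None:
        print(f"\nDirect PostgreSQL Query: Found run - UUID: {run_data[0]}, Name: {run_data[1]}")

    # Query for a parameter (stored in the "params" table)
    cur.execute(
        "SELECT key, value FROM params WHERE run_uuid = %s AND key = %s",
        (run_id, "learning_rate"),
    )
    param_data = cur.fetchone()
    if param_data is not None:
        print(f"Direct PostgreSQL Query: Found parameter - Key: {param_data[0]}, Value: {param_data[1]}")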

    cur.close()
    conn.close()
except Exception as e:
    print(f"\nCould not connect to PostgreSQL or query data: {e}")

This script sets up MLflow to use a PostgreSQL database as its backend store. Once mlflow.set_tracking_uri() has been called with a SQLAlchemy-compatible PostgreSQL URI, MLflow writes experiment metadata (runs, parameters, metrics, tags) to tables in that database. Artifact files are not written to the database; only the location of each run's artifact directory is recorded. The fluent calls (mlflow.start_run(), mlflow.log_param(), and so on) do the logging here; the MlflowClient instance points at the same tracking URI and is what you would use to read runs back programmatically. The direct PostgreSQL queries at the end confirm that the rows exist in the tables MLflow manages.

The core problem MLflow’s backend store solves is the ephemeral nature of local file-based tracking. When you run experiments locally, MLflow typically logs to mlruns/. This is fine for initial development, but it becomes problematic for:

  1. Collaboration: Sharing mlruns/ directories is cumbersome and prone to conflicts.
  2. Productionization: You don’t want experiment tracking tied to a specific application instance’s filesystem. A dedicated backend store provides a centralized, persistent source of truth.
  3. Scalability: For many users and many experiments, a robust database backend is essential for performance and manageability.
  4. Auditing & Reproducibility: A central database makes it easier to query historical runs, understand model evolution, and reproduce previous results.

MLflow’s PostgreSQL backend store, configured via a SQLAlchemy URI, maps experiment components to specific database tables:

  • runs: Stores high-level information about each run (UUID, start/end times, status, name, user, artifact URI, etc.).
  • params: Stores key-value pairs for parameters logged during a run.
  • metrics: Stores logged metrics, including their history over time for a given run.
  • tags: Stores key-value pairs for tags associated with a run.
  • experiments: Stores information about experiments themselves, including each experiment's artifact location.
  • registered_models: Stores metadata for registered models.
  • model_versions: Stores information about specific versions of registered models.

When you call mlflow.log_param("learning_rate", 0.01), MLflow inserts a row into the params table keyed by the current run’s UUID. Similarly, mlflow.log_metric("accuracy", 0.95) writes to metrics. mlflow.log_artifact("model_info.txt") is different: it uploads the file to the run’s artifact location and writes nothing about the individual file to PostgreSQL. The database records only where the run’s artifacts live (the artifact_uri column on the run); the content itself sits in an artifact store such as S3, GCS, or a local directory.
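To make the row-level picture concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for PostgreSQL. The two tables are deliberately simplified versions of what MLflow's SQLAlchemy store creates (assumed here to be named params and metrics, with only a few of their columns), and the run ID is hypothetical. It is meant to show the shape of the data, not MLflow's actual DDL or write path.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the schema is a simplified
# assumption, not the DDL MLflow actually issues.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE params  (run_uuid TEXT, key TEXT, value TEXT);
    CREATE TABLE metrics (run_uuid TEXT, key TEXT, value REAL,
                          timestamp INTEGER, step INTEGER);
    """
)

run_uuid = "hypothetical-run-id"

# Roughly what log_param("learning_rate", 0.01) amounts to: one row per
# parameter, with the value stored as text.
conn.execute(
    "INSERT INTO params VALUES (?, ?, ?)", (run_uuid, "learning_rate", "0.01")
)

# Roughly what repeated log_metric("accuracy", ...) calls amount to: one row
# per logged value, so the history of the metric is preserved.
for step, accuracy in enumerate([0.90, 0.93, 0.95]):
    conn.execute(
        "INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
        (run_uuid, "accuracy", accuracy, 1_700_000_000_000 + step, step),
    )

param_value = conn.execute(
    "SELECT value FROM params WHERE run_uuid = ? AND key = ?",
    (run_uuid, "learning_rate"),
).fetchone()[0]

history = conn.execute(
    "SELECT step, value FROM metrics WHERE run_uuid = ? AND key = ? ORDER BY step",
    (run_uuid, "accuracy"),
).fetchall()

print(param_value)
print(history)
```

Because every metric value is its own row, a plain SQL query ordered by step is enough to reconstruct a training curve for any historical run.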

The levers you control are primarily within the SQLAlchemy URI and PostgreSQL configuration:

  • postgresql://user:password@host:port/database: This is the core of your configuration.

    • user: The PostgreSQL user MLflow will authenticate as. This user needs CREATE privileges on the target schema (for initial table creation and later migrations) and INSERT, SELECT, UPDATE, DELETE privileges on the MLflow tables.
    • password: The password for the specified user.
    • host: The hostname or IP address of your PostgreSQL server.
    • port: The port PostgreSQL is listening on (default is 5432).
    • database: The name of the PostgreSQL database MLflow will use. This database must exist before MLflow attempts to connect.
  • PostgreSQL Server Configuration:

    • max_connections: Ensure PostgreSQL is configured with enough connections to handle MLflow’s load, especially if you have many concurrent users or runs.
    • shared_buffers: Allocate sufficient memory for PostgreSQL’s buffer cache to speed up queries.
    • work_mem: Tune work_mem if you encounter performance issues with complex queries, particularly related to metrics or large parameter sets.
  • MLflow Environment Variable: You can set the tracking URI via an environment variable: export MLFLOW_TRACKING_URI="postgresql://user:password@host:port/database". This is often preferred in production environments.
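One practical detail when assembling the URI: a password containing characters such as @, / or : has to be percent-encoded, or the URI will be parsed incorrectly. The sketch below shows one way to do that with the standard library and then expose the result through MLFLOW_TRACKING_URI. The helper name build_tracking_uri and the credentials are hypothetical, chosen only for illustration.

```python
import os
from urllib.parse import quote


def build_tracking_uri(user, password, host="localhost", port=5432,
                       database="mlflow_db"):
    """Build a PostgreSQL SQLAlchemy URI with percent-encoded credentials."""
    # safe="" makes quote() encode every reserved character, including "/".
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical credentials; the password contains URI delimiter characters.
uri = build_tracking_uri("mlflow_user", "p@ss/word:1")
print(uri)

# Make the URI available to MLflow via the environment instead of
# hard-coding it in application code.
os.environ["MLFLOW_TRACKING_URI"] = uri
```

Setting the variable from a secrets manager or deployment configuration, rather than in source code, keeps credentials out of version control.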

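When you configure MLflow with mlflow.set_tracking_uri("postgresql://..."), the call itself does not touch the database. The first operation that uses the store connects with the provided credentials and, if the database is empty, creates the necessary schema (tables, indexes). This is why the user needs CREATE privileges. If the database already holds a schema from an older MLflow version, run mlflow db upgrade against the same URI to migrate it. If the database itself doesn’t exist, MLflow will not create it; you must create the database beforehand.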
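A quick way to confirm that schema creation has happened is to list the tables in the database and compare them with the ones you expect. The following is a minimal sketch: the helper names are hypothetical, the expected table names assume the unprefixed names used by MLflow's SQLAlchemy store, and the set is only a subset of what a real installation contains. list_public_tables takes an already opened connection (for example from psycopg2.connect) and assumes the tables live in the public schema.

```python
# Subset of core tables assumed to exist after MLflow initializes its schema.
EXPECTED_TABLES = {"experiments", "runs", "params", "metrics", "tags"}


def list_public_tables(conn):
    """Return table names in the public schema of an open PostgreSQL connection."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = %s",
            ("public",),
        )
        return [row[0] for row in cur.fetchall()]


def missing_tables(existing, expected=EXPECTED_TABLES):
    """Return the expected tables that are absent, sorted by name."""
    return sorted(set(expected) - set(existing))


# Example with a hard-coded listing, as if only part of the schema existed.
print(missing_tables(["runs", "params"]))
```

With a live connection, missing_tables(list_public_tables(conn)) returning an empty list suggests the schema has been initialized; a non-empty result before the first MLflow call is expected.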

The most surprising thing about MLflow’s PostgreSQL backend is how little it knows about artifacts. PostgreSQL holds only locations: the artifact location of each experiment and the artifact URI of each run. File names, sizes, and the content itself (model weights, checkpoints, plots) live in a separate artifact store such as S3, Azure Blob Storage, GCS, or a network filesystem. In the script above, where no artifact location is configured, the file typically lands in a local mlruns/ directory next to the script even though the metadata goes to PostgreSQL. This design keeps the database small and queries fast, but it means you need to manage, back up, and secure both the database and the artifact store for complete experiment reproducibility.
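When MLflow runs as a tracking server, the two stores are named separately at startup, which makes the split explicit. The sketch below only assembles and prints the command line; the bucket name is hypothetical, and the credentials are the placeholder values from the script earlier in this article. The --backend-store-uri and --default-artifact-root options belong to the mlflow server command.

```python
import shlex

# Metadata goes to PostgreSQL (placeholder credentials).
backend_store_uri = (
    "postgresql://mlflow_user:mlflow_password@localhost:5432/mlflow_db"
)

# Artifact files go elsewhere; this bucket name is hypothetical.
artifact_root = "s3://hypothetical-bucket/mlflow-artifacts"

server_cmd = [
    "mlflow", "server",
    "--backend-store-uri", backend_store_uri,
    "--default-artifact-root", artifact_root,
]

# Print the command for inspection rather than launching a server here.
print(shlex.join(server_cmd))
```

Keeping the two values in separate configuration entries also makes it harder to migrate one store and forget the other.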

The next step is usually integrating this backend with the MLflow UI for visualization, or configuring artifact storage for full experiment persistence.
