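MLflow does not ship a dedicated A/B testing feature, but its Model Registry and experiment tracking give you the building blocks for one: you keep multiple versions of a model in production simultaneously, direct a percentage of traffic to each, and then compare their performance to pick the best one.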

Imagine you have a recommendation system. You’ve trained a new, potentially better model, model-v2, and you want to see whether it actually outperforms your current production model, model-v1, before fully switching over. An A/B test does this by splitting your incoming user traffic. You might send 90% of users to model-v1 and 10% to model-v2, then track a metric such as click-through rate (CTR) for both.
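One common way to implement such a split is to hash each user ID into a bucket, so the same user always lands on the same model. The sketch below is a minimal illustration under that assumption; the function name assign_variant and the salt value are hypothetical and are not part of MLflow.

```python
import hashlib


def assign_variant(user_id: str, weights: dict[str, float], salt: str = "rec-ab-test") -> str:
    """Deterministically map a user to a variant in proportion to the weights."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a roughly uniform number in [0, 1].
    point = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return name
    # Fallback for floating-point rounding when the weights sum to just under 1.
    return list(weights)[-1]


weights = {"model-v1": 0.9, "model-v2": 0.1}
counts = {name: 0 for name in weights}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", weights)] += 1
print(counts)  # roughly a 90/10 split
```

Because the assignment depends only on the salt and the user ID, a returning user keeps seeing the same variant, which keeps per-user metrics such as CTR from being mixed across models. Changing the salt reshuffles users for a new experiment.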

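Here’s a simplified look at how you might set this up using MLflow’s Python API. First, your models need to be logged and registered in the MLflow Model Registry. So that the script runs end to end, it registers two dummy stand-ins, recommendation-model-v1 and recommendation-model-v2; in practice you would point at your own registered models.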

import mlflow
import mlflow.pyfunc
from mlflow.models.signature import infer_signature
import pandas as pd

# Assuming you have these models registered in MLflow Model Registry
# For demonstration, let's create dummy models
class DummyModelV1(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        return ["recommendation_A"] * len(model_input)

class DummyModelV2(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        return ["recommendation_B"] * len(model_input)

# Register the dummy models. Each log_model call with registered_model_name
# creates a new version in the Model Registry; it does not fail when the
# name already exists. The calls run inside explicit runs so that each run
# is closed afterwards: in MLflow 2.x, log_model with no active run starts
# one implicitly and leaves it open, which makes the later
# mlflow.start_run() raise.
input_example = pd.DataFrame({'user_id': [1]})
for artifact_path, model, name, output in [
    ("model_v1", DummyModelV1(), "recommendation-model-v1", "recommendation_A"),
    ("model_v2", DummyModelV2(), "recommendation-model-v2", "recommendation_B"),
]:
    with mlflow.start_run(run_name=f"register_{artifact_path}"):
        mlflow.pyfunc.log_model(
            artifact_path=artifact_path,
            python_model=model,
            registered_model_name=name,
            signature=infer_signature(input_example, pd.DataFrame({'recommendation': [output]})),
        )
    print(f"Registered {name}")

# URIs that resolve to the latest version of each registered model
model_v1_uri = "models:/recommendation-model-v1/latest"
model_v2_uri = "models:/recommendation-model-v2/latest"

# Define the experiment and the A/B test run
experiment_name = "recommendation_ab_test"
mlflow.set_experiment(experiment_name)

with mlflow.start_run() as run:
    # Define the A/B test configuration
    # This is where you specify the models and their traffic allocation
    ab_test_config = {
        "candidate_models": [
            {"uri": model_v1_uri, "name": "model_v1", "weight": 0.9}, # 90% traffic
            {"uri": model_v2_uri, "name": "model_v2", "weight": 0.1}  # 10% traffic
        ]
    }

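    # Log the A/B test configuration as a JSON artifact so its nested
    # structure is preserved (log_params would store the list as one string)
    mlflow.log_dict(ab_test_config, "ab_test_config.json")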

    # In a real scenario, you'd deploy this A/B test configuration.
    # MLflow doesn't directly deploy models. It helps you manage the experiment.
    # You would use MLflow's deployment tools or other orchestration
    # platforms (like Kubernetes, SageMaker, etc.) to actually serve
    # these models with the specified traffic split.

    # For demonstration, let's simulate receiving a request and
    # selecting a model based on weights.
    # In a real system, this logic would be in your inference service.
    import random

    def get_model_for_request(config):
        models = config["candidate_models"]
        weights = [m["weight"] for m in models]
        chosen_model_info = random.choices(models, weights=weights, k=1)[0]
        print(f"Simulating request: Choosing model {chosen_model_info['name']} with URI {chosen_model_info['uri']}")
        return mlflow.pyfunc.load_model(chosen_model_info['uri'])

    # Simulate a few requests
    print("\nSimulating inference requests:")
    for _ in range(5):
        model_to_use = get_model_for_request(ab_test_config)
        # Dummy input data for prediction
        input_data = pd.DataFrame({'user_id': [123]})
        prediction = model_to_use.predict(input_data)
        print(f"  -> Prediction: {prediction}")

    print(f"\nMLflow Run ID for this A/B test: {run.info.run_id}")
    # This link is only meaningful when the tracking URI is an HTTP(S) tracking server
    print(f"View this A/B test run in MLflow UI: {mlflow.get_tracking_uri()}/#/experiments/{run.info.experiment_id}/runs/{run.info.run_id}")

This code sets up a placeholder for an A/B test. The crucial part is the ab_test_config dictionary, where you list the MLflow model URIs (pointing to registered model versions) and their corresponding weight for traffic allocation. MLflow logs this configuration within a run.
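Since the serving layer will trust whatever configuration is logged, it can help to sanity-check it first. The following is a minimal sketch that assumes the same dictionary shape as ab_test_config above; validate_ab_config is a hypothetical helper, not an MLflow function.

```python
import math


def validate_ab_config(config: dict) -> None:
    """Raise ValueError if the A/B test configuration is not usable."""
    models = config.get("candidate_models", [])
    if len(models) < 2:
        raise ValueError("an A/B test needs at least two candidate models")

    names = [m["name"] for m in models]
    if len(set(names)) != len(names):
        raise ValueError("candidate model names must be unique")

    weights = [m["weight"] for m in models]
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # Weights are traffic fractions, so they should add up to 1.
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1, got {sum(weights)}")


good_config = {
    "candidate_models": [
        {"uri": "models:/recommendation-model-v1/latest", "name": "model_v1", "weight": 0.9},
        {"uri": "models:/recommendation-model-v2/latest", "name": "model_v2", "weight": 0.1},
    ]
}
validate_ab_config(good_config)  # passes silently
```

Running a check like this before logging the configuration means a typo in a weight surfaces at definition time rather than as a skewed traffic split in production.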

The actual serving and traffic splitting logic isn’t handled by MLflow itself. You’d typically integrate this MLflow configuration into your model serving infrastructure. For example, a custom inference server could read this configuration, use random.choices (or a more sophisticated load balancer) to pick a model based on weights, and then make the prediction. You’d also need to log metrics (like CTR, conversion rate) back to MLflow, associating them with the specific model version that served the request.
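To make the metric-logging step concrete, here is a rough sketch that aggregates click events per variant and hands each CTR to a logging callback. The event format, the function names, and the ctr_ metric prefix are assumptions made for illustration; the callback defaults to a plain print so the snippet runs without a tracking server.

```python
from collections import defaultdict
from typing import Callable, Iterable


def summarize_ctr(events: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute click-through rate per variant from (variant_name, clicked) pairs."""
    impressions: dict[str, int] = defaultdict(int)
    clicks: dict[str, int] = defaultdict(int)
    for variant, clicked in events:
        impressions[variant] += 1
        clicks[variant] += int(clicked)
    return {variant: clicks[variant] / impressions[variant] for variant in impressions}


def print_metric(key: str, value: float) -> None:
    print(f"{key} = {value:.4f}")


def log_ctr(ctr_by_variant: dict[str, float],
            log_metric: Callable[[str, float], None] = print_metric) -> None:
    """Send one metric per variant to the given logging callback."""
    for variant, ctr in sorted(ctr_by_variant.items()):
        log_metric(f"ctr_{variant}", ctr)


events = [
    ("model_v1", True), ("model_v1", False), ("model_v1", False), ("model_v1", False),
    ("model_v2", True), ("model_v2", False),
]
ctr = summarize_ctr(events)
log_ctr(ctr)
```

Inside an active MLflow run you could pass mlflow.log_metric as the callback, for example log_ctr(ctr, log_metric=mlflow.log_metric), so that the per-variant CTRs end up in the same run as the A/B test configuration.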

What MLflow contributes is a central place to define your A/B tests and to record which model version served which request, provided your serving infrastructure logs that correlation. You can then use MLflow’s tools to compare the logged metrics across model versions and determine which performed better.

The most counterintuitive aspect of MLflow A/B testing is that MLflow itself doesn’t serve the models or split the traffic. It acts as the central registry and experiment tracker. The A/B testing configuration you log is essentially a blueprint that your separate serving infrastructure must interpret and implement. You’re defining the intent of the A/B test within MLflow, and then building or configuring a system to execute that intent in production.

The next step after running your A/B test and collecting data is to analyze the results using MLflow’s comparison tools or by querying the logged metrics to make a data-driven decision about which model to promote to full production.
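A raw difference in CTR can be noise, especially when one arm receives only 10% of the traffic. One standard way to check is a two-proportion z-test, sketched below with the standard library only. This is a generic statistical test rather than something MLflow provides, and the click counts in the example are made up for illustration.

```python
import math


def two_proportion_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in CTR between B and A."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both variants share one CTR.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined for these counts")
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 9,000 impressions on model-v1, 1,000 on model-v2.
z, p_value = two_proportion_z_test(clicks_a=900, n_a=9000, clicks_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the CTR difference is unlikely to be chance alone. The significance threshold, and whether the lift is large enough to justify promoting the new model, remain judgment calls for your team.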
