MLflow Model Serving: Deploy REST API for Inference (2026)

MLflow Model Serving lets you deploy your trained machine learning models as REST APIs, making them accessible for real-time inference without needing to rerun your entire training pipeline.

Let’s see it in action. Imagine you’ve trained a scikit-learn model and logged it with MLflow.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Train a dummy model
X, y = make_classification(n_samples=100, n_features=4, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Log the model
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, "random-forest-model", registered_model_name="MyRandomForest")
    run_id = run.info.run_id
    print(f"Model logged with run_id: {run_id}")

Now, you want to serve this model. MLflow provides a built-in serving mechanism. You can launch a local server that exposes a REST API endpoint.

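First, ensure you have MLflow installed: pip install mlflow. The scoring server ships with the base package, so no extra is required. By default, mlflow models serve rebuilds the model’s dependencies in a separate virtualenv environment; pass --env-manager local if you want it to reuse the Python environment you are already in.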

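To serve the model, you’ll need its URI. If you registered the model, you can use its name followed by a version number or the special keyword latest: models:/MyRandomForest/latest (or models:/MyRandomForest/1 to pin a specific version). If you didn’t register it, you’d use the runs:/<run_id>/<artifact_path> format, where the artifact path is the name passed to log_model (random-forest-model in the example above).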

Let’s serve the registered model:

mlflow models serve -m models:/MyRandomForest/latest -p 5000

This command starts a web server on port 5000. The -m flag specifies the model URI, and -p sets the port.
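If you would rather launch the server from a script, for example in a test harness, the same command can be assembled in Python. This is a minimal sketch: build_serve_command and start_server are hypothetical helpers rather than part of MLflow, and the --env-manager local flag is an assumption added here so the server reuses the current Python environment.

```python
import subprocess


def build_serve_command(model_uri, port=5000, env_manager="local"):
    """Return the argv list for `mlflow models serve` (hypothetical helper)."""
    return [
        "mlflow", "models", "serve",
        "-m", model_uri,            # model URI, e.g. models:/<name>/<version>
        "-p", str(port),            # port the REST API listens on
        "--env-manager", env_manager,
    ]


def start_server(model_uri, port=5000):
    """Start the server as a child process; the caller must terminate it."""
    return subprocess.Popen(build_serve_command(model_uri, port))


if __name__ == "__main__":
    # Only print the command here; call start_server() to actually launch it.
    print(" ".join(build_serve_command("models:/MyRandomForest/latest")))
```

Keeping the command as a list avoids shell quoting problems, and returning the Popen handle lets the caller stop the server with terminate() once the requests are done.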

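Once the server is running, you can send POST requests to http://127.0.0.1:5000/invocations with your inference data in the request body. The body must be JSON, sent with a Content-Type: application/json header, and the data must sit under one of the top-level keys the server recognizes: dataframe_split, dataframe_records, instances, or inputs.

For our scikit-learn model, dataframe_split is a good fit. The columns field names the four features (here just their positional indices, since the model was trained on a plain NumPy array), and data is a list of lists, where each inner list is one sample with 4 feature values.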

{
  "dataframe_split": {
    "columns": [0, 1, 2, 3],
    "data": [
      [0.1, -0.2, 0.3, -0.4],
      [0.5, 0.6, -0.7, 0.8]
    ]
  }
}

The server will respond with the model’s predictions. For the example above, you might get something like:

{
  "predictions": [0, 1]
}
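The request and response above can be reproduced from Python with only the standard library. This is a sketch under a few assumptions: build_payload and predict are hypothetical helper names, the server is the local one on port 5000, and the response carries a predictions key as in the example.

```python
import json
import urllib.request


def build_payload(rows, columns=None):
    """Wrap rows of feature values in the dataframe_split envelope."""
    if columns is None:
        # Default to positional column names: 0, 1, 2, ...
        columns = list(range(len(rows[0])))
    return {"dataframe_split": {"columns": columns, "data": rows}}


def predict(rows, url="http://127.0.0.1:5000/invocations"):
    """POST rows to a running scoring server and return its predictions."""
    body = json.dumps(build_payload(rows)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["predictions"]


if __name__ == "__main__":
    sample = [[0.1, -0.2, 0.3, -0.4], [0.5, 0.6, -0.7, 0.8]]
    print(json.dumps(build_payload(sample), indent=2))
    # With the server running, uncomment to send the request:
    # print(predict(sample))
```

Separating payload construction from the HTTP call makes it easy to inspect exactly what is sent before involving the server.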

The core problem MLflow Model Serving solves is bridging the gap between a trained model artifact and a production-ready API. Instead of manually building Flask/FastAPI apps, managing dependencies, and configuring deployment infrastructure, MLflow abstracts much of this complexity. It understands various ML frameworks (scikit-learn, TensorFlow, PyTorch, ONNX, etc.) and knows how to load and run them for inference.

Internally, MLflow wraps the model in a small Python web application (Flask in older releases; recent releases use FastAPI) to create the REST API. When you run mlflow models serve, it loads the specified model artifact once at startup. For each incoming request to /invocations, it deserializes the JSON body, converts it into the format expected by the underlying model (e.g., a NumPy array or pandas DataFrame), calls the model’s predict method, and then serializes the predictions back into JSON. The dataframe_split format is a common way to send data because it states columns and data explicitly, which avoids ambiguity about feature order.
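The decode, predict, encode cycle described above can be sketched in a few lines. This is a simplified illustration, not MLflow’s actual server code: ThresholdModel is a made-up stand-in for a loaded model, handle_invocation is a hypothetical name, and the sketch passes plain lists to predict where MLflow would build a DataFrame.

```python
import json


class ThresholdModel:
    """Stand-in model (hypothetical): predicts 1 when the row sum is positive."""

    def predict(self, rows):
        return [1 if sum(row) > 0 else 0 for row in rows]


def handle_invocation(model, request_body):
    """Mimic the decode -> predict -> encode cycle for one request."""
    payload = json.loads(request_body)
    if "dataframe_split" not in payload:
        raise ValueError("this sketch only handles the dataframe_split format")

    split = payload["dataframe_split"]
    rows = split["data"]
    width = len(split["columns"])
    # Reject ragged input before it reaches the model.
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have one value per column")

    predictions = model.predict(rows)
    return json.dumps({"predictions": list(predictions)})


if __name__ == "__main__":
    body = json.dumps(
        {"dataframe_split": {"columns": [0, 1], "data": [[1.0, 2.0], [-3.0, 1.0]]}}
    )
    print(handle_invocation(ThresholdModel(), body))
```

The model is passed in as an argument to mirror how the server loads it once and reuses it for every request.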

The model’s predict method is the heart of the inference process. MLflow invokes this method on the loaded model object, and the input it expects is determined by the model itself and by how it was logged. If the model was logged with a signature, MLflow also checks incoming data against that schema before calling predict. The serving layer handles the translation from the incoming JSON request to the expected input, which is straightforward for dataframe_split because its columns and data fields correspond directly to the parts of a pandas DataFrame.

A key detail often overlooked is how MLflow handles different model flavors. When you log a model through a flavor module (e.g., mlflow.sklearn, mlflow.tensorflow), MLflow records the available flavors in the model’s MLmodel metadata file. Alongside the framework-specific flavor, the built-in flavors also record a generic "Python function" (python_function, or pyfunc) flavor. The mlflow models serve command reads this metadata and loads the model through the pyfunc interface, which exposes the same predict method regardless of framework. This allows a single serving command to work with models trained in vastly different frameworks, as long as they’ve been logged with MLflow.
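To make the flavor lookup concrete, here is a rough sketch of the decision the serving command has to make. The MLMODEL dictionary is an abbreviated, illustrative stand-in for a parsed MLmodel file (real files contain more fields), and pyfunc_loader is a hypothetical helper, not an MLflow function.

```python
# Abbreviated, illustrative contents of an MLmodel file after YAML parsing.
MLMODEL = {
    "flavors": {
        "python_function": {
            "loader_module": "mlflow.sklearn",
            "model_path": "model.pkl",
        },
        "sklearn": {
            "pickled_model": "model.pkl",
            "serialization_format": "cloudpickle",
        },
    },
}


def pyfunc_loader(mlmodel):
    """Return the module named by the python_function flavor."""
    flavors = mlmodel.get("flavors", {})
    if "python_function" not in flavors:
        # Without the generic flavor there is no uniform predict() to call.
        raise ValueError("model has no python_function flavor")
    return flavors["python_function"]["loader_module"]


if __name__ == "__main__":
    print(pyfunc_loader(MLMODEL))
```

In practice you rarely read this file yourself; calling mlflow.pyfunc.load_model with the model URI performs the equivalent lookup and returns an object with a predict method.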

The next step after basic serving is understanding how to configure the serving environment for production, including aspects like containerization and scaling.