MLflow Gateway: Route Requests to Multiple AI Models (2026)

MLflow Gateway lets you serve multiple ML models behind a single API endpoint, dynamically routing requests to the best model for the job.

Let’s see it in action. Imagine you have two different sentiment analysis models: one that’s fast and good for general text, and another that’s more accurate but slower, trained on a specific domain. You want to use both, but your application shouldn’t have to know which one to call.

Here’s how you’d set up MLflow Gateway to handle this:

First, you need to register your models with MLflow. Let’s say you have sentiment-model-fast and sentiment-model-accurate already logged and registered in the MLflow Model Registry.

Next, you define a Gateway configuration. This is a YAML file that tells MLflow Gateway how to route requests.

model_endpoints:
  - name: sentiment-router
    routes:
      # First route: no condition, so it acts as the default.
      - model_uri: "models:/sentiment-model-fast/production"
        alias: fast
      # Second route: no condition yet. A later revision of this file adds a
      # 'model_request_condition' so that domain-specific text is sent here.
      - model_uri: "models:/sentiment-model-accurate/production"
        alias: accurate

This configuration defines an endpoint named sentiment-router with two routes, one for each registered model’s production stage, aliased fast and accurate. Neither route has a routing condition yet, so which one serves a given request is left to the Gateway’s default selection (often the first route listed, or whichever one load balancing picks).

To make this configuration active, you start the MLflow Gateway service:

mlflow gateway start --config-path gateway_config.yaml

Your application can now send requests to the Gateway, which listens on port 5000 by default.

Let’s simulate a request to this sentiment-router endpoint. Your application would send a POST request to http://localhost:5000/invocations with a JSON payload.

If you send:

{
  "dataframe_split": {
    "columns": ["text"],
    "data": [["This is a great product!"]]
  }
}

MLflow Gateway receives this. Without explicit routing rules, it might send it to sentiment-model-fast. The response would come back, for example:

{
  "predictions": [
    {"label": "positive", "score": 0.95}
  ]
}
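From the application side, the call above could be wrapped in a small client. This is a minimal sketch using only the Python standard library; the URL and payload shape come from the example above, while the function names build_payload, build_request, and predict are hypothetical helpers introduced here for illustration.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:5000/invocations"  # URL used in the text above


def build_payload(texts):
    """Wrap raw strings in the dataframe_split format shown above."""
    return {"dataframe_split": {"columns": ["text"], "data": [[t] for t in texts]}}


def build_request(texts, url=GATEWAY_URL):
    """Create (but do not send) a JSON POST request for the gateway."""
    body = json.dumps(build_payload(texts)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(texts, url=GATEWAY_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(texts, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling predict(["This is a great product!"]) against a running gateway should return a dictionary shaped like the response above. The application never names a specific model, which is the point of putting the Gateway in front.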

Now, imagine you want to route requests for "specific domain" text to the accurate model. You’d modify your gateway_config.yaml to include routing conditions:

model_endpoints:
  - name: sentiment-router
    routes:
      - model_uri: "models:/sentiment-model-fast/production"
        alias: fast
      - model_uri: "models:/sentiment-model-accurate/production"
        alias: accurate
        # Example condition: request.json is a property on a Flask request, not a method call
        model_request_condition: "request.json['dataframe_split']['data'][0][0].startswith('Domain specific text')"

With this updated configuration, if you send a request like:

{
  "dataframe_split": {
    "columns": ["text"],
    "data": [["Domain specific text: The latest research shows significant improvements."]]
  }
}

MLflow Gateway would evaluate the model_request_condition. Since the text starts with "Domain specific text", it would route this request to sentiment-model-accurate. The response might be:

{
  "predictions": [
    {"label": "positive", "score": 0.99, "domain_confidence": "high"}
  ]
}
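The routing decision described above can be sketched in plain Python. This illustrates the idea only and is not MLflow Gateway’s implementation: the Route class, first_text, and select_route are hypothetical names, and the rule assumed here is "first matching conditional route wins, otherwise fall back to the first route without a condition".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Route:
    alias: str
    model_uri: str
    # Optional predicate over the parsed JSON payload; None marks a default route.
    condition: Optional[Callable[[dict], bool]] = None


def first_text(payload: dict) -> str:
    """Return the first cell of the first row in a dataframe_split payload."""
    return payload["dataframe_split"]["data"][0][0]


ROUTES = [
    Route("fast", "models:/sentiment-model-fast/production"),
    Route(
        "accurate",
        "models:/sentiment-model-accurate/production",
        condition=lambda p: first_text(p).startswith("Domain specific text"),
    ),
]


def select_route(payload: dict, routes=ROUTES) -> Route:
    """Pick the first conditional route that matches, else the first default."""
    for route in routes:
        if route.condition is not None and route.condition(payload):
            return route
    for route in routes:
        if route.condition is None:
            return route
    raise LookupError("no route matched and no default route is configured")
```

Checking conditional routes before defaults means the order of entries in the list does not accidentally shadow a more specific rule, which is one reasonable way to read the configuration above.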

This demonstrates how MLflow Gateway acts as an intelligent proxy, abstracting away the complexity of managing multiple model endpoints. It allows you to deploy diverse models and have a single point of access for your applications, with the flexibility to route requests based on various criteria.

The core problem MLflow Gateway solves is model orchestration at inference time. Instead of downloading a model and running inference itself, your application simply calls the Gateway. The Gateway then uses its configuration to select, load (if necessary), and invoke the appropriate model from your MLflow Model Registry. This is particularly useful when you manage multiple model versions, run A/B tests, or need to route on request features or business logic.

A key aspect is how MLflow Gateway handles model loading and unloading. The first time a model is requested, or when it is requested again after a period of inactivity, the Gateway fetches it from the MLflow Model Registry and loads it into memory. Actively used models stay loaded. If memory becomes constrained or a model goes unused for a while, the Gateway might unload it to free resources and reload it on the next request. MLflow’s underlying serving infrastructure manages this, so steady traffic sees little added latency; only the first request after a load or reload pays the loading cost.
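The load-on-demand and unload-when-idle behavior can be pictured as a least-recently-used cache. The sketch below is an illustration under that assumption, not MLflow’s actual serving code; ModelCache and its capacity parameter are hypothetical. In practice the loader argument could be a function such as mlflow.pyfunc.load_model, which accepts a models:/ URI.

```python
from collections import OrderedDict


class ModelCache:
    """Keep at most `capacity` loaded models, evicting the least recently used."""

    def __init__(self, loader, capacity=2):
        self._loader = loader          # callable: model_uri -> loaded model
        self._capacity = capacity
        self._models = OrderedDict()   # model_uri -> model, oldest first

    def get(self, model_uri):
        if model_uri in self._models:
            self._models.move_to_end(model_uri)   # mark as most recently used
            return self._models[model_uri]
        model = self._loader(model_uri)           # cold start: load on demand
        self._models[model_uri] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)      # unload the least recently used
        return model

    def loaded(self):
        """Return the URIs currently held in memory, oldest first."""
        return list(self._models)
```

A cache hit returns immediately, while a miss calls the loader, which is where the cold-start latency comes from.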

The model_request_condition uses a Python expression that has access to the incoming request. The request object is a Flask request object, so you can access request.json, request.headers, etc. This allows for very granular control over routing, like checking specific fields in the payload, HTTP headers, or even performing simple text analysis directly within the condition.
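To make the idea of a condition expression concrete, here is a minimal sketch of evaluating such a string against a request-like object. It is an assumption about how this could work, not the Gateway’s code: make_request builds a hypothetical stand-in that exposes only .json and .headers, matches is a hypothetical helper, and the X-Domain header is invented for the example.

```python
from types import SimpleNamespace


def make_request(payload, headers=None):
    """Hypothetical stand-in exposing the two attributes the text mentions."""
    return SimpleNamespace(json=payload, headers=headers or {})


def matches(condition: str, request) -> bool:
    """Evaluate a condition string with only `request` in scope.

    Builtins are removed to limit what an expression can reach, but eval is
    still not a sandbox: only evaluate conditions from trusted config files.
    """
    try:
        return bool(eval(condition, {"__builtins__": {}}, {"request": request}))
    except (KeyError, IndexError, TypeError, AttributeError):
        # A payload that lacks the expected fields simply does not match.
        return False
```

For example, the condition "request.headers.get('X-Domain') == 'legal'" would match any request carrying that header, regardless of the body. Treating malformed payloads as "no match" lets such requests fall through to a default route instead of failing at the routing step.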

The next step is exploring advanced routing strategies like weighted routing for A/B testing or canary deployments, and integrating with external systems for more complex decision-making.
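As a preview of weighted routing, the sketch below picks a route alias at random in proportion to a weight. It is a generic illustration, not a documented Gateway feature; pick_weighted and the 90/10 split are assumptions chosen for the example.

```python
import random


def pick_weighted(routes, rng=random):
    """Choose an alias at random, in proportion to its weight.

    `routes` maps alias -> non-negative weight; weights need not sum to 1.
    """
    aliases = list(routes)
    weights = [routes[a] for a in aliases]
    return rng.choices(aliases, weights=weights, k=1)[0]
```

With pick_weighted({"fast": 0.9, "accurate": 0.1}), roughly one request in ten would reach the accurate model, which is the basic shape of a canary rollout: start with a small weight on the new model and raise it as confidence grows.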