MLOps Blue-Green: Deploy New Models with Zero Downtime (2026)

Deploying a new machine learning model without impacting your users is a critical but often tricky part of MLOps. Blue-green deployments offer a robust solution.

Imagine you have a live model serving predictions; call it "Blue." You’ve just trained a new model, "Green," and you want to switch over. Instead of replacing Blue in place and hoping for the best, a blue-green strategy runs both environments side by side. In its classic form, the load balancer moves all traffic from Blue to Green in one step. The variant walked through here borrows from canary releases: it shifts traffic to Green gradually until you’re confident it’s stable, then decommissions Blue.

Here’s a simplified setup using a load balancer (like Nginx or a cloud provider’s load balancer) and two separate deployment environments for your model inference service.

The Setup

Let’s say your model inference service is a Python Flask app.

Blue Environment:
- Running on http://localhost:5001
- Serving the current, stable model.
Green Environment:
- Running on http://localhost:5002
- Serving the new model you want to deploy.
Load Balancer (Nginx):
- Configured to listen on http://localhost:8080
- Directs traffic to either the Blue or Green environment.

Nginx Configuration (nginx.conf)

# Nginx refuses to start without an events block; the defaults are fine here
events {}

http {
    upstream blue_backend {
        server localhost:5001;
    }

    upstream green_backend {
        server localhost:5002;
    }

    server {
        listen 8080;
        server_name localhost;

        location /predict {
            # Initially, send all traffic to the blue environment
            proxy_pass http://blue_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}

To start this, you’d have your Blue model running on port 5001, your Green model on 5002, and Nginx running with the above config, listening on 8080.

The Gradual Rollout

Initial State: All traffic goes to Blue.
- proxy_pass http://blue_backend;
Testing Green: You can manually test the Green environment by directly accessing http://localhost:5002/predict.
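
The manual check of Green can also be scripted. The following is a minimal sketch using only the Python standard library; the payload shape ({"features": [...]}) and the helper names build_predict_request and smoke_test are assumptions for illustration, since the article does not define the service’s request format.

```python
import json
import urllib.request


def build_predict_request(base_url, payload):
    """Build a JSON POST request for the /predict endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url, payload, timeout=5.0):
    """Send one request and return (HTTP status, decoded JSON body).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    normal return already means the service answered successfully.
    Assumes the service replies with a JSON body.
    """
    request = build_predict_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))
```

With Green running, smoke_test("http://localhost:5002", {"features": [1.0, 2.0]}) would hit it directly, bypassing the load balancer. Pointing the same call at http://localhost:5001 gives a Blue baseline to compare against.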

Shifting a Small Percentage: Let’s say you want to send 10% of traffic to Green. You’d update your Nginx configuration. This is often done dynamically with tools like lua-nginx-module (the Lua module shipped with OpenResty) or by editing the configuration and reloading Nginx. For simplicity, we’ll show a static config change.

Updated Nginx Configuration (Example: 10% to Green)

events {}

http {
    # Weights only apply among servers inside the same upstream group,
    # so Blue and Green share one group here.
    upstream model_backend {
        server localhost:5001 weight=9; # Blue: ~90%
        server localhost:5002 weight=1; # Green: ~10%
    }

    server {
        listen 8080;
        server_name localhost;

        location /predict {
            # Nginx picks a server from the group according to the weights
            proxy_pass http://model_backend;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}

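After saving this, validating it (sudo nginx -t), and reloading Nginx (sudo nginx -s reload), roughly 10% of requests to http://localhost:8080/predict will go to Green. The reload is graceful: requests already in flight finish on the old worker processes.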
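
To see what a 9:1 weighting means in practice, here is a small simulation of smooth weighted round-robin in Python. Treat it as an illustrative model rather than Nginx’s actual code: Nginx’s balancer also reacts to failed servers and keeps its state per worker process, so real traffic will only approximate these counts. The function name and backend labels are made up for the example.

```python
from collections import Counter


def smooth_weighted_round_robin(weights, num_requests):
    """Return the sequence of backends chosen for num_requests requests.

    weights maps a backend name to a positive integer weight.
    """
    total = sum(weights.values())
    current = {name: 0 for name in weights}
    chosen = []
    for _ in range(num_requests):
        # Every backend earns credit proportional to its weight.
        for name, weight in weights.items():
            current[name] += weight
        # The backend with the most credit serves the request...
        winner = max(current, key=current.get)
        # ...and pays back the total weight so the others can catch up.
        current[winner] -= total
        chosen.append(winner)
    return chosen


picks = smooth_weighted_round_robin({"blue": 9, "green": 1}, 1000)
print(Counter(picks))  # Counter({'blue': 900, 'green': 100})
```

In this model Green receives exactly one request in every ten, spread through the sequence rather than bunched at the end.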

Monitoring: Watch your metrics closely: error rates, latency, and prediction quality (if you have a feedback loop or offline validation). If Green shows any issues, you can revert immediately by restoring the configuration that sends all traffic to Blue and reloading Nginx.
Increasing Traffic: If Green is stable, you gradually increase its weight.
- 50/50 split: weight=5 for both.
- 90/10 split (Green gets 90%): weight=1 for Blue, weight=9 for Green.
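Full Switchover: Once you’re ready for Green to handle 100% of traffic, mark the Blue server line as down or remove it, leaving Green as the only active server behind proxy_pass. Nginx rejects weight=0, so a weight cannot express a 0% share. Keep Blue running until you no longer need it as a rollback target, then stop it.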
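
Each step of the rollout changes only the upstream block, so it can be generated instead of edited by hand. The sketch below is one possible helper, not a standard tool: render_upstream and the upstream name model_backend are hypothetical, and the addresses are the ones used in this article. It expresses a 0% share with the down parameter, because Nginx does not accept weight=0.

```python
def render_upstream(green_percent, name="model_backend",
                    blue_addr="localhost:5001", green_addr="localhost:5002"):
    """Return an Nginx upstream block giving Green green_percent of traffic.

    green_percent must be an integer from 0 to 100.
    """
    if not isinstance(green_percent, int) or not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be an integer between 0 and 100")

    def server_line(addr, share):
        # Nginx rejects weight=0, so a 0% share is expressed with "down".
        if share == 0:
            return f"    server {addr} down;"
        return f"    server {addr} weight={share};"

    lines = [
        f"upstream {name} {{",
        server_line(blue_addr, 100 - green_percent),
        server_line(green_addr, green_percent),
        "}",
    ]
    return "\n".join(lines)


print(render_upstream(10))
```

Writing the result into the configuration and running sudo nginx -t before the reload keeps a bad template from taking the load balancer down. Rolling back is the same call with green_percent=0.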

The less obvious point is that the "downtime" you’re preventing isn’t only about the service being unavailable; it’s also about the quality of service. A model that starts returning incorrect predictions is effectively "down" for your users, even if the API is still responding. Splitting traffic this way lets you observe the behavior of the new model in production with real traffic before it affects everyone.

Internally, the load balancer is the key. It acts as a traffic director, splitting requests based on predefined rules (like weights) or more advanced criteria (like session affinity, which sends a user’s subsequent requests to the same model version). The Blue and Green environments stay completely independent of each other; the load balancer is the only component that knows about both.

The levers you control are primarily:

Traffic Splitting Strategy: How you define the percentages or rules for directing traffic (e.g., Nginx weights, canary analysis tools).
Rollback Mechanism: How quickly and easily you can divert traffic back to the stable version if issues arise.
Monitoring and Alerting: The metrics you track and the thresholds that trigger alerts or automated rollbacks.
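
The monitoring and rollback levers can be tied together with a simple decision rule. The following sketch compares Green against Blue over one monitoring window; the WindowStats type, the function name, and all thresholds are assumptions chosen for illustration and would need tuning against your own traffic and metrics source.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated metrics for one environment over a monitoring window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(blue, green, min_requests=200,
                     max_error_rate_increase=0.01, max_latency_ratio=1.5):
    """Return True when Green looks clearly worse than Blue."""
    # Too little Green traffic to judge: keep observing.
    if green.requests < min_requests:
        return False
    # Green's error rate exceeds Blue's by more than the allowed margin.
    if green.error_rate > blue.error_rate + max_error_rate_increase:
        return True
    # Green's p95 latency is more than max_latency_ratio times Blue's.
    if green.p95_latency_ms > blue.p95_latency_ms * max_latency_ratio:
        return True
    return False
```

Comparing Green to Blue over the same window, rather than to a fixed threshold, keeps a traffic-wide incident (which hurts both environments) from being blamed on the new model. Prediction quality is not covered here, since it usually arrives later through a feedback loop.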

A critical aspect often overlooked is how you handle stateful predictions or session management. If your model relies on context from previous requests within a user’s session, simply splitting traffic by weight might break this. You’d need your load balancer or the inference service itself to implement sticky sessions (or "session affinity"), ensuring a user’s requests consistently go to the same model instance (Blue or Green) throughout their interaction. In open-source Nginx, for example, the ip_hash directive, or the hash directive keyed on a session cookie or header, can pin a client to one upstream server.
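
If the routing decision lives in the inference service or a gateway rather than in the load balancer, session affinity can be implemented by hashing the session identifier. This is a minimal sketch under the assumption that every request carries a stable session ID; assign_environment is a hypothetical helper, not part of any library.

```python
import hashlib


def assign_environment(session_id, green_percent):
    """Map a session ID to "blue" or "green", stably across requests."""
    # SHA-256 is stable across processes and restarts, unlike Python's
    # built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "green" if bucket < green_percent else "blue"
```

Because each session always lands in the same bucket, raising green_percent only ever moves sessions from Blue to Green, never back, so a user is switched at most once during the rollout.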

After mastering blue-green deployments, you’ll likely explore advanced deployment strategies like A/B testing for models.