Shadow deployments are the secret sauce for rolling out new machine learning models without risking your live system.
Here’s a new model, model-v2, being deployed alongside the current production model, model-v1.
```json
{
  "model_id": "model-v1",
  "version": "1.0.0",
  "is_production": true
}
```

```json
{
  "model_id": "model-v2",
  "version": "1.0.0",
  "is_production": false
}
```
When an inference request comes in, it’s sent to both model-v1 and model-v2. model-v1’s response is what actually gets returned to the user. model-v2’s response is logged and compared against model-v1’s.
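```yaml
# Example Kubernetes Service for model-v1 (production).
# The selector includes the version label; a selector on app alone would match
# both Deployments and load-balance user traffic across v1 and v2.
# A matching ml-inference-v2 Service (version: v2) is defined the same way.
apiVersion: v1
kind: Service
metadata:
  name: ml-inference-v1
spec:
  selector:
    app: ml-inference
    version: v1
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
```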
```yaml
# Example Kubernetes Deployment for model-v1 (production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
      version: v1
  template:
    metadata:
      labels:
        app: ml-inference
        version: v1
    spec:
      containers:
        - name: inference-container
          image: your-docker-repo/ml-inference:v1.0.0
          ports:
            - containerPort: 8080
```
```yaml
# Example Kubernetes Deployment for model-v2 (shadow)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-inference
      version: v2
  template:
    metadata:
      labels:
        app: ml-inference
        version: v2
    spec:
      containers:
        - name: inference-container
          image: your-docker-repo/ml-inference:v1.0.0 # Same image, different model weights
          ports:
            - containerPort: 8080
```
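In this setup, a gateway or ingress layer receives all incoming traffic and forwards a copy of each request to both the v1 and v2 deployments. A plain Kubernetes Service cannot do this duplication by itself, because it load-balances each request to a single matching pod; the mirroring has to live in the gateway, a service mesh, or proxy code that addresses each version separately. The key is that only the response from the v1 deployment is sent back to the client.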
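```python
# Simplified request handling logic within the gateway/service
import logging
from concurrent.futures import ThreadPoolExecutor

import requests

logger = logging.getLogger("shadow")
# Background workers so the shadow call never delays the user-facing response
# (pool size and timeouts below are illustrative values)
shadow_pool = ThreadPoolExecutor(max_workers=4)


def call_shadow(request_data, production_result):
    try:
        shadow_response = requests.post("http://ml-inference-v2:8080/predict", json=request_data, timeout=5)
        shadow_result = shadow_response.json()
        # Log both results side by side for later comparison
        logger.info("production=%s shadow=%s", production_result, shadow_result)
    except Exception as e:
        # Log error for shadow model, but don't impact user
        logger.error("Shadow model inference failed: %s", e)


def handle_inference_request(request_data):
    # Send request to production model
    prod_response = requests.post("http://ml-inference-v1:8080/predict", json=request_data, timeout=5)
    production_result = prod_response.json()
    # Send the same request to the shadow model on a background thread
    shadow_pool.submit(call_shadow, request_data, production_result)
    # Return the production model's result to the user
    return production_result
```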
The problem this solves is the inherent risk of deploying a new ML model. If the new model performs poorly, has bugs, or introduces unexpected behavior, it could degrade user experience, cause financial loss, or even crash the entire service. Shadow deployments isolate this risk by allowing you to observe the new model’s behavior on real-world traffic without affecting live users. You can collect data on its predictions, latency, and error rates, compare it against the current production model, and only promote the new model once you’re confident in its performance.
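As a rough illustration of what "collecting data" can look like, the logged shadow results can be aggregated offline into a few summary numbers. This is a minimal sketch, assuming each logged record holds the production prediction, the shadow prediction (`None` if the shadow call failed), and the shadow latency in milliseconds. The `ShadowRecord` type, its field names, and the `summarize` function are hypothetical and not tied to any particular logging tool.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class ShadowRecord:
    """One logged request (hypothetical schema)."""
    production_prediction: str
    shadow_prediction: Optional[str]  # None means the shadow call failed
    shadow_latency_ms: float


def summarize(records):
    """Return agreement rate, shadow error rate, and median shadow latency."""
    if not records:
        raise ValueError("no records to summarize")
    succeeded = [r for r in records if r.shadow_prediction is not None]
    agreed = sum(r.shadow_prediction == r.production_prediction for r in succeeded)
    return {
        # Agreement is measured only over requests the shadow answered
        "agreement_rate": agreed / len(succeeded) if succeeded else 0.0,
        "shadow_error_rate": 1 - len(succeeded) / len(records),
        "median_shadow_latency_ms": median(r.shadow_latency_ms for r in records),
    }


# Made-up records, purely to show the shape of the input
records = [
    ShadowRecord("cat", "cat", 12.0),
    ShadowRecord("dog", "cat", 15.0),
    ShadowRecord("dog", "dog", 11.0),
    ShadowRecord("cat", None, 50.0),
]
print(summarize(records))
```

Agreement with production is not the same thing as accuracy: a better model is expected to disagree with the old one on some inputs, so disagreements are usually worth inspecting rather than treating as failures.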
Internally, this requires a routing layer that can mirror traffic, often an API gateway, a service mesh that supports traffic mirroring (such as Istio), or custom proxy logic. This layer duplicates each incoming request and sends it to multiple backend services (your model deployments), which process it independently. The routing layer then decides which response is returned to the client and how the other responses are handled (e.g., logged for analysis). You control how much traffic is mirrored (production still serves 100% of user-facing responses and the shadow serves none), the comparison metrics (e.g., agreement with production, latency, error rate, and accuracy or bias once ground-truth labels are available), and the promotion criteria.
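Promotion criteria are easier to apply consistently when they are written down as explicit thresholds. The sketch below is one possible shape for such a gate; the `PromotionCriteria` class, the metric names, and every threshold value are assumptions chosen for illustration, and real values depend on the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionCriteria:
    """Illustrative thresholds only (hypothetical values)."""
    min_agreement_rate: float = 0.98
    max_error_rate: float = 0.01
    max_latency_ratio: float = 1.2  # shadow p95 latency / production p95 latency


def promotion_blockers(metrics, criteria=PromotionCriteria()):
    """Return the list of failed checks; an empty list means ready to promote."""
    blockers = []
    if metrics["agreement_rate"] < criteria.min_agreement_rate:
        blockers.append("agreement rate below threshold")
    if metrics["shadow_error_rate"] > criteria.max_error_rate:
        blockers.append("shadow error rate above threshold")
    latency_ratio = metrics["shadow_p95_ms"] / metrics["production_p95_ms"]
    if latency_ratio > criteria.max_latency_ratio:
        blockers.append("shadow latency too high relative to production")
    return blockers


# Made-up metrics: good agreement and error rate, but the shadow is 30% slower
metrics = {
    "agreement_rate": 0.991,
    "shadow_error_rate": 0.002,
    "shadow_p95_ms": 130.0,
    "production_p95_ms": 100.0,
}
print(promotion_blockers(metrics))
```

Returning the list of failed checks, rather than a single boolean, makes it clear why a model was held back.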
A common oversight is assuming that "shadowing" means the shadow model is a perfect replica in terms of environment. In reality, the shadow model might be running on slightly different hardware, or with a different batch size if your inference endpoint supports it, or even with a slightly different preprocessing step if that’s part of the inference service. This can lead to subtle discrepancies that aren’t due to the model weights themselves but rather the execution context. Always ensure your shadow deployment environment closely mirrors your production inference environment to get the most meaningful comparisons.
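One lightweight way to catch this kind of drift is to compare the runtime settings of the two deployments before trusting any comparison. The following sketch assumes each deployment can report its settings as a flat dictionary; the `config_drift` helper and all keys and values shown are hypothetical.

```python
def config_drift(production, shadow, ignore=("model_weights",)):
    """Return {key: (production_value, shadow_value)} for settings that differ.

    Keys listed in `ignore` are expected to differ (the model itself) and are skipped.
    """
    keys = (set(production) | set(shadow)) - set(ignore)
    return {
        key: (production.get(key), shadow.get(key))
        for key in sorted(keys)
        if production.get(key) != shadow.get(key)
    }


# Hypothetical environment descriptions
production_env = {
    "model_weights": "model-v1",
    "instance_type": "gpu-a",
    "batch_size": 8,
    "preprocessing_version": "3",
}
shadow_env = {
    "model_weights": "model-v2",
    "instance_type": "gpu-a",
    "batch_size": 16,
    "preprocessing_version": "3",
}
print(config_drift(production_env, shadow_env))  # {'batch_size': (8, 16)}
```

Any non-empty result is a reason to treat prediction or latency differences with caution until the environments are aligned.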
The next step after validating a shadow deployment is often a canary release.