Serving the same machine learning model across multiple cloud providers simultaneously is a surprisingly common requirement, often driven by enterprise-level needs for redundancy, disaster recovery, or regulatory compliance that demands data residency in specific regions.

Let’s see how this looks in practice. Imagine you have a trained fraud-detection-model that you want to serve from both AWS and GCP.

First, you’d package your model and its serving code into a container. For example, using FastAPI for the API and Uvicorn for the server:

# main.py
from fastapi import FastAPI
import joblib
import numpy as np

app = FastAPI()

# Load the pre-trained model
# In a real scenario, this would be loaded from a persistent store like S3 or GCS
model = joblib.load("model.pkl")

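@app.post("/predict")
def predict(data: list[list[float]]):
    # Declared with plain `def` so FastAPI runs the blocking, CPU-bound
    # model.predict call in its threadpool instead of on the event loop.
    predictions = model.predict(np.array(data))
    return {"predictions": predictions.tolist()}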

# To run this locally for testing:
# uvicorn main:app --reload
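
Because the endpoint declares a single list-typed body parameter, a request carries the feature rows as a bare JSON array. The following is a minimal client sketch using only the standard library; the helper name build_predict_request, the localhost URL, and the three feature values per row are illustrative assumptions, not part of the original service.

```python
import json
import urllib.request


def build_predict_request(base_url: str, rows: list[list[float]]) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Two hypothetical transactions with three made-up features each.
request = build_predict_request(
    "http://localhost:8000",
    [[120.5, 3.0, 0.0], [9800.0, 1.0, 1.0]],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a locally running server:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.load(response))
```

Keeping the request construction separate from the network call makes it easy to point the same payload at either cloud's endpoint later.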

This main.py file, along with model.pkl and a Dockerfile, would be built into an image:

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

COPY main.py main.py
COPY model.pkl model.pkl

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

Once the image is built (e.g., docker build -t my-fraud-detector .), you’d push it to a container registry that each cloud can pull from. A common pattern is to use a separate registry per cloud, such as Amazon ECR and Google Artifact Registry.

# On AWS
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag my-fraud-detector:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-fraud-detector:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-fraud-detector:latest

# On GCP
gcloud auth configure-docker us-central1-docker.pkg.dev
docker tag my-fraud-detector:latest us-central1-docker.pkg.dev/my-gcp-project/my-repo/my-fraud-detector:latest
docker push us-central1-docker.pkg.dev/my-gcp-project/my-repo/my-fraud-detector:latest

Now, you deploy this containerized application to each cloud. You’ll typically use managed Kubernetes services like Amazon EKS and Google Kubernetes Engine (GKE), or serverless container platforms like AWS Fargate (running Amazon ECS tasks) and Google Cloud Run.

For EKS, you might have a deployment configuration like this:

# eks-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-deployment
  labels:
    app: fraud-detector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
      - name: fraud-detector
        image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-fraud-detector:latest
        ports:
        - containerPort: 80

And for GKE, a similar deployment:

# gke-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-deployment
  labels:
    app: fraud-detector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
      - name: fraud-detector
        image: us-central1-docker.pkg.dev/my-gcp-project/my-repo/my-fraud-detector:latest
        ports:
        - containerPort: 80
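
Since the two manifests differ only in the image reference, one way to keep them from drifting is to render both from a single template. The sketch below is an assumption about how that could be done, not a required step; BASE_DEPLOYMENT, REGISTRIES, and render_deployment are hypothetical names, and the output is JSON, which kubectl accepts in place of YAML.

```python
import copy
import json

# Hypothetical single source of truth; only the container image differs per cloud.
BASE_DEPLOYMENT = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {
        "name": "fraud-detector-deployment",
        "labels": {"app": "fraud-detector"},
    },
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "fraud-detector"}},
        "template": {
            "metadata": {"labels": {"app": "fraud-detector"}},
            "spec": {
                "containers": [
                    {
                        "name": "fraud-detector",
                        "image": None,  # filled in per target
                        "ports": [{"containerPort": 80}],
                    }
                ]
            },
        },
    },
}

# Repository paths taken from the push commands earlier in the article.
REGISTRIES = {
    "eks": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-fraud-detector",
    "gke": "us-central1-docker.pkg.dev/my-gcp-project/my-repo/my-fraud-detector",
}


def render_deployment(target: str, tag: str) -> dict:
    """Return a copy of the base manifest with the target's image filled in."""
    manifest = copy.deepcopy(BASE_DEPLOYMENT)
    container = manifest["spec"]["template"]["spec"]["containers"][0]
    container["image"] = f"{REGISTRIES[target]}:{tag}"
    return manifest


for target in REGISTRIES:
    print(f"# {target}-deployment.json")
    print(json.dumps(render_deployment(target, "latest"), indent=2))
```

Passing an explicit version tag instead of latest to render_deployment would also make it easier to tell which build each cluster is running.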

The crucial part for multi-cloud serving is how you route traffic. You’ll need a global load balancer or a DNS-based traffic management system that sits in front of both clouds. Services like Amazon Route 53 with latency-based routing or traffic policies, or Google Cloud DNS with routing policies (weighted round robin, geolocation, or failover), are commonly used.

A simple approach is to have a primary endpoint (e.g., api.mycompany.com) that resolves to different IP addresses based on the user’s location or a predefined failover strategy. For instance, Route 53 might be configured to send traffic to an AWS Application Load Balancer (ALB) in one region and a Google Cloud Load Balancer in another.
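
To make the routing decision concrete, here is a small simulation of the logic a latency-based policy with failover applies: drop unhealthy endpoints, then pick the one with the lowest measured latency for the caller's region. It is only a sketch of the idea, not how Route 53 or Cloud DNS is implemented; the Endpoint class, the region names, the latency figures, and the documentation-range IP addresses are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Endpoint:
    name: str
    address: str
    healthy: bool = True
    # Hypothetical measured latency in milliseconds, keyed by client region.
    latency_ms: dict = field(default_factory=dict)


def choose_endpoint(endpoints: list, client_region: str) -> Optional[Endpoint]:
    """Pick the healthy endpoint with the lowest latency for the client's region."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    # Endpoints with no measurement for this region sort last.
    return min(candidates, key=lambda e: e.latency_ms.get(client_region, float("inf")))


aws = Endpoint("aws-alb", "192.0.2.10", latency_ms={"us-east": 20, "eu-west": 95})
gcp = Endpoint("gcp-lb", "198.51.100.20", latency_ms={"us-east": 45, "eu-west": 60})

print(choose_endpoint([aws, gcp], "us-east").name)  # aws-alb
print(choose_endpoint([aws, gcp], "eu-west").name)  # gcp-lb

aws.healthy = False  # simulate an outage on the AWS side
print(choose_endpoint([aws, gcp], "us-east").name)  # gcp-lb
```

In a real deployment this decision lives in the DNS provider's configuration rather than in application code.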

The ALB in AWS would target your EKS cluster (or Fargate tasks), and the Google Cloud Load Balancer would target your GKE cluster (or Cloud Run services).

+-------------------+     +--------------------------+     +----------------------+
|                   |     |                          |     |                      |
|  User Request     | --> | Global Load Balancer     | --> | Cloud Provider A     |
|                   |     | (e.g., Route 53/GCP DNS) |     | (e.g., AWS EKS/ALB)  |
+-------------------+     +--------------------------+     +----------------------+
                                       |
                                       |
                                       v
                          +--------------------------+
                          |                          |
                          | Cloud Provider B         |
                          | (e.g., GKE/GCLB)         |
                          +--------------------------+

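This setup allows for high availability, provided health checks are attached to each endpoint. If one cloud provider’s service becomes unavailable, the failed checks cause the global load balancer to redirect traffic to the healthy endpoints in the other provider. With DNS-based routing, clients switch over only as cached records expire, so short TTLs matter.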
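
Automatic failover depends on deciding when an endpoint counts as down. A common approach is to require several consecutive failed probes before removing it, so a single dropped request does not trigger a switch. The sketch below illustrates that bookkeeping under simplifying assumptions: HealthTracker and active_targets are hypothetical names, the threshold of 3 is arbitrary, and one successful probe is enough to mark an endpoint healthy again.

```python
class HealthTracker:
    """Tracks consecutive probe failures for a single endpoint."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_succeeded: bool) -> None:
        # A success resets the streak; a failure extends it.
        if probe_succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def active_targets(trackers: dict) -> list:
    """Return the endpoints that should currently receive traffic."""
    healthy = [name for name, tracker in trackers.items() if tracker.healthy]
    # If everything looks down, fail open and keep all targets rather than none.
    return healthy or list(trackers)


trackers = {"aws": HealthTracker(), "gcp": HealthTracker()}
for probe_result in [False, False, False]:  # three failed probes against AWS
    trackers["aws"].record(probe_result)
trackers["gcp"].record(True)

print(active_targets(trackers))  # ['gcp']
```

Failing open when every endpoint looks unhealthy is a design choice in this sketch: it favors serving some traffic over serving none when the probes themselves may be at fault.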

The complexity arises not just in deployment but in managing the consistency of your models and data across clouds. For instance, if your model requires frequent updates, you need a robust CI/CD pipeline that can deploy to multiple environments seamlessly. Tools like Spinnaker, Argo CD, or even custom scripting with cloud SDKs can orchestrate these deployments.
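
One lightweight consistency check a pipeline could run after deploying is to compare a digest of the released model artifact with the digest each environment reports. This is a sketch under assumptions: how a deployment reports its digest (for example through a metadata endpoint) is left open, and sha256_hex, find_drift, and the environment names are hypothetical.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(payload).hexdigest()


def find_drift(expected_digest: str, reported: dict) -> list:
    """Return the environments whose reported digest differs from the release."""
    return sorted(name for name, digest in reported.items() if digest != expected_digest)


# Stand-ins for the bytes of model.pkl; a real pipeline would read the file.
released = sha256_hex(b"model-v2 bytes")
reported = {
    "aws-eks": sha256_hex(b"model-v2 bytes"),
    "gcp-gke": sha256_hex(b"model-v1 bytes"),  # this cluster is still on the old model
}

print(find_drift(released, reported))  # ['gcp-gke']
```

A non-empty result could fail the pipeline or trigger a redeploy of the lagging environment.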

A subtle but critical aspect of multi-cloud serving is managing the inference latency and cost. While you gain resilience, you also introduce the overhead of maintaining infrastructure and data synchronization across different providers. The performance of your model might differ slightly between clouds due to variations in hardware, network, and managed service implementations. You’ll want to continuously monitor inference times and costs on each provider to ensure optimal performance and budget adherence.
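
As a rough illustration of the monitoring described here, the following sketch summarizes latency percentiles and cost per 1,000 requests for each provider so they can be compared side by side. All numbers are invented for the example, and the summarize helper is hypothetical; in practice the inputs would come from your metrics and billing systems.

```python
import statistics


def summarize(latencies_ms: list, hourly_cost_usd: float, requests_per_hour: int) -> dict:
    """Summarize latency percentiles and unit cost for one provider."""
    return {
        "p50_ms": statistics.median(latencies_ms),
        # With n=20 the last cut point is the 95th percentile.
        "p95_ms": statistics.quantiles(latencies_ms, n=20, method="inclusive")[-1],
        "usd_per_1k_requests": hourly_cost_usd * 1000 / requests_per_hour,
    }


# Made-up samples purely for illustration.
samples = {
    "aws": {"latencies_ms": [12.0, 14.0, 15.0, 13.0, 40.0], "hourly_cost_usd": 3.0, "requests_per_hour": 60000},
    "gcp": {"latencies_ms": [11.0, 16.0, 18.0, 12.0, 25.0], "hourly_cost_usd": 2.4, "requests_per_hour": 60000},
}

for provider, data in samples.items():
    print(provider, summarize(**data))
```

Tracking a tail percentile alongside the median matters because occasional slow requests are often where providers differ most.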

The next challenge you’ll encounter is synchronizing model state and feature stores across these distributed deployments.
