Blue-Green vs. Canary: Production Release Showdown

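Kubernetes’s built-in rolling updates are often pitched as zero-downtime, but by default they only reduce downtime: without readiness probes and graceful shutdown handling, requests can still reach pods that are starting up or shutting down. That’s a critical distinction.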

Let’s see this in action. Imagine we have a simple Nginx deployment running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: nginx:1.25.0 # Our current version
        ports:
        - containerPort: 80

When we want to update the image to nginx:1.25.1, a standard rolling update will do something like this:

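Start one new pod (the default maxSurge of 25% rounds up to 1 for 3 replicas).
Wait for the new pod to be ready.
Terminate one old pod (the default maxUnavailable of 25% rounds down to 0 here, so no old pod is removed before its replacement is ready).
Repeat until all pods are updated.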

This means that for a brief period old and new pods serve traffic side by side, and any given request may land on either version. If your application isn’t designed for this transient state (e.g., database schema changes, API incompatibilities), you’re in for trouble.
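
The pacing of the rollout described above is controlled by the Deployment’s strategy field, and "ready" only means something if the container defines a readiness probe. A minimal sketch, assuming the Nginx container answers HTTP on / at port 80 and that you want a new pod ready before any old one is removed (the probe timings are illustrative values, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # allow at most one extra pod during the rollout
      maxUnavailable: 0  # never drop below 3 ready pods
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: nginx:1.25.1 # The version being rolled out
        ports:
        - containerPort: 80
        readinessProbe:      # without this, a pod counts as ready once its container starts
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 2 # illustrative value
          periodSeconds: 5       # illustrative value
```

Even with these settings, old and new pods still serve traffic at the same time; the settings only control how many of each exist at once.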

Blue-Green and Canary deployments are strategies to mitigate this risk. They offer more control by separating the deployment of the new version from the moment traffic is switched over.

Blue-Green Deployment

In a blue-green deployment, you run two identical production environments: "Blue" (the current version) and "Green" (the new version).

Deploy Green: You deploy the new version of your application to the "Green" environment. This environment is not yet receiving live traffic.
Test Green: You can thoroughly test the "Green" environment without impacting users.
Switch Traffic: When you’re confident, you switch your load balancer (or Ingress controller) to point all traffic from "Blue" to "Green."
Keep Blue (Optional): You keep the "Blue" environment running for a period as a rollback option. If issues arise with "Green," you can quickly switch traffic back to "Blue."

In Kubernetes, this usually means two separate Deployments plus a Service or Ingress that decides which of them receives traffic.

Here’s a conceptual setup:

Service my-app-v1: Points to pods with label version: "1.25.0".
Deployment my-app-v1: Manages pods with version: "1.25.0".
Deployment my-app-v2: Manages pods with version: "1.25.1". This deployment is scaled up but not yet receiving traffic.
Service my-app: This is the single entry point for users. Initially, it targets pods labeled version: "1.25.0".

To switch to the new version:

Update the my-app Service’s selector to target pods with version: "1.25.1".
Scale down the my-app-v1 Deployment once you no longer need Blue as a rollback target.

# Original Service targeting Blue
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: "1.25.0" # Targets Blue
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

---
# New Deployment for Green
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: "1.25.1" # New version
  template:
    metadata:
      labels:
        app: my-app
        version: "1.25.1" # New version
    spec:
      containers:
      - name: my-app-container
        image: nginx:1.25.1 # Updated image
        ports:
        - containerPort: 80
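
The "Test Green" step needs some way to reach the Green pods before the switch. One option, sketched here on the assumption that in-cluster access is enough for testing, is a separate preview Service that selects only the new version label. The name my-app-green-preview is hypothetical and not part of the setup described above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-green-preview # hypothetical name, used only for pre-switch testing
spec:
  type: ClusterIP            # reachable from inside the cluster only
  selector:
    app: my-app
    version: "1.25.1"        # matches Green pods only
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```

Requests to my-app-green-preview reach only Green pods, while the main my-app Service keeps sending users to Blue.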

After my-app-v2 is running and ready, you’d change the main my-app Service:

# Updated Service to target Green
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: "1.25.1" # Now targets Green
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

Then, once you’re sure you won’t need to roll back, you’d scale down the old deployment: kubectl scale deployment my-app-v1 --replicas=0.

Canary Deployment

Canary deployments introduce the new version to a small subset of users first, then gradually roll it out.

Deploy New Version: Deploy the new version alongside the old version.
Route Small Traffic: Configure your Ingress or Service Mesh to send a small percentage of live traffic (e.g., 5%) to the new version.
Monitor: Closely monitor the new version for errors, latency, and user feedback.
Gradually Increase Traffic: If the new version performs well, incrementally increase the traffic percentage (e.g., 10%, 25%, 50%, 100%).
Rollback: If issues are detected at any stage, immediately shift all traffic back to the old version and address the problems.

This approach is less disruptive than a full cutover and allows for real-world testing with minimal impact.

In Kubernetes, this is commonly achieved with an Ingress controller that supports traffic splitting (like Nginx Ingress or Traefik) or using a Service Mesh (like Istio or Linkerd).

Here’s how it might look with Nginx Ingress:

# Deployment for the old version (stable)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: "1.25.0"
  template:
    metadata:
      labels:
        app: my-app
        version: "1.25.0"
    spec:
      containers:
      - name: my-app-container
        image: nginx:1.25.0
        ports:
        - containerPort: 80

---
# Deployment for the new version (Canary)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v2
spec:
  replicas: 1 # Start with fewer replicas for canary
  selector:
    matchLabels:
      app: my-app
      version: "1.25.1"
  template:
    metadata:
      labels:
        app: my-app
        version: "1.25.1"
    spec:
      containers:
      - name: my-app-container
        image: nginx:1.25.1
        ports:
        - containerPort: 80
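
The Ingress configuration for this setup routes to Services named my-app-v1-service and my-app-v2-service, which the manifests above do not define. A minimal sketch of what they might look like, assuming plain ClusterIP Services that select pods by the version label:

```yaml
# Service for the stable version
apiVersion: v1
kind: Service
metadata:
  name: my-app-v1-service
spec:
  selector:
    app: my-app
    version: "1.25.0" # stable pods only
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
# Service for the canary version
apiVersion: v1
kind: Service
metadata:
  name: my-app-v2-service
spec:
  selector:
    app: my-app
    version: "1.25.1" # canary pods only
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```

Keeping one Service per version lets the Ingress controller decide the split, independent of how many replicas each Deployment runs.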

And the Ingress resources to manage the split. With ingress-nginx, the canary annotations go on a second Ingress that uses the same host and path as the main one and points at the new version’s Service:

# Main Ingress: serves the stable version by default
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-v1-service # Default service for v1
            port:
              number: 80
---
# Canary Ingress: same host and path, marked as canary
# (the name my-app-ingress-canary is illustrative)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10" # Send 10% of traffic to v2
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app-v2-service # Assumed to exist and target v2 pods
            port:
              number: 80
# This example is illustrative; details such as ingressClassName vary by
# cluster and ingress controller version.

The key here is that the Ingress controller intercepts incoming requests and, based on its configuration (like canary-weight), forwards them to either the my-app-v1-service or my-app-v2-service. You’d incrementally update the canary-weight annotation and/or scale up my-app-v2 as confidence grows.

A common, often overlooked, detail in Service Mesh canary deployments is how the mesh handles session affinity. If your application relies on sticky sessions, you need to ensure your Service Mesh or Ingress is configured to respect that affinity for both the old and new versions during the transition, or you risk breaking user sessions. Without explicit configuration, the mesh might independently route requests from the same user to different versions, leading to unexpected behavior.
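
For the service mesh route, a weighted split might look like the following in Istio. This is a sketch under several assumptions: Istio is installed with sidecars injected, a Service named my-app selects app: my-app without a version label (so it covers both versions), and the networking.istio.io/v1beta1 API is available in your mesh version.

```yaml
# Define one subset per version, keyed on the pod's version label
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app # assumed Service covering both versions
  subsets:
  - name: v1
    labels:
      version: "1.25.0"
  - name: v2
    labels:
      version: "1.25.1"
---
# Split requests 90/10 between the two subsets
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
  - my-app
  http:
  - route:
    - destination:
        host: my-app
        subset: v1
      weight: 90
    - destination:
        host: my-app
        subset: v2
      weight: 10
```

The weights here apply to each request independently, so nothing in this configuration pins a given user to one version. That is exactly the situation in which the session affinity concern above shows up.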