Kubernetes’s built-in rolling updates are often pitched as zero-downtime, but they’re really just "less-downtime" by default, and that’s a critical distinction.
Let’s see this in action. Imagine we have a simple Nginx deployment running:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app-container
          image: nginx:1.25.0 # Our current version
          ports:
            - containerPort: 80
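When we update the image to nginx:1.25.1, a rolling update with the default settings (maxSurge and maxUnavailable both at 25%, which for 3 replicas rounds to one extra pod and zero unavailable pods) proceeds roughly like this:
- Scale up one new pod.
- Wait for the new pod to be ready.
- Scale down one old pod.
- Repeat until all pods are updated.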
This means for a brief period, you might have a mix of old and new pods, and traffic hitting the old ones will get the old version, while traffic hitting the new ones gets the new. If your application isn’t designed for this transient state (e.g., database schema changes, API incompatibilities), you’re in for trouble.
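How many pods are added or removed at each step, and what counts as "ready," is controlled by the Deployment's strategy fields and the container's readiness probe. The fragment below is a minimal sketch of those settings for the my-app Deployment; the specific values (probe path, delays, minReadySeconds) are illustrative assumptions, not recommendations.

```yaml
# Fragment of the my-app Deployment spec; all values are illustrative assumptions.
spec:
  replicas: 3
  minReadySeconds: 5          # a new pod must stay ready this long before it counts as available
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1             # at most one pod above the desired replica count
      maxUnavailable: 0       # never drop below the desired replica count
  template:
    spec:
      containers:
        - name: my-app-container
          image: nginx:1.25.1
          readinessProbe:     # without a probe, "ready" only means the container started
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 2
            periodSeconds: 5
```

Even with these settings, old and new pods still serve traffic side by side during the rollout, so the compatibility concern above remains.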
Blue-Green and Canary deployments are strategies to mitigate this risk. They offer more control by separating the deployment of the new version from the moment traffic is switched over.
Blue-Green Deployment
In a blue-green deployment, you run two identical production environments: "Blue" (the current version) and "Green" (the new version).
- Deploy Green: You deploy the new version of your application to the "Green" environment. This environment is not yet receiving live traffic.
- Test Green: You can thoroughly test the "Green" environment without impacting users.
- Switch Traffic: When you’re confident, you switch your load balancer (or Ingress controller) to point all traffic from "Blue" to "Green."
- Keep Blue (Optional): You keep the "Blue" environment running for a period as a rollback option. If issues arise with "Green," you can quickly switch traffic back to "Blue."
In Kubernetes, this often involves two separate Deployments and using a Service or Ingress to manage traffic.
Here’s a conceptual setup:
- Service my-app-v1: Points to pods with label version: "1.25.0".
- Deployment my-app-v1: Manages pods with version: "1.25.0".
- Deployment my-app-v2: Manages pods with version: "1.25.1". This deployment is scaled up but not yet receiving traffic.
- Service my-app: This is the single entry point for users. Initially, it targets pods labeled version: "1.25.0".
To switch to the new version:
- Update the my-app Service’s selector to target pods with version: "1.25.1".
- Scale down the my-app-v1 Deployment.
# Original Service targeting Blue
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: "1.25.0" # Targets Blue
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
# New Deployment for Green
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: "1.25.1" # New version
  template:
    metadata:
      labels:
        app: my-app
        version: "1.25.1" # New version
    spec:
      containers:
        - name: my-app-container
          image: nginx:1.25.1 # Updated image
          ports:
            - containerPort: 80
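To test Green before it receives live traffic, one option is a second Service that selects only the Green pods. The sketch below assumes a hypothetical Service name, my-app-green-preview, and reuses the labels from the my-app-v2 Deployment; it is not part of the setup described above.

```yaml
# Hypothetical preview Service: reaches only Green pods and is not exposed to users.
apiVersion: v1
kind: Service
metadata:
  name: my-app-green-preview # assumed name, for illustration only
spec:
  type: ClusterIP
  selector:
    app: my-app
    version: "1.25.1" # Same labels as the my-app-v2 pods
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```

You could then reach Green from your workstation with something like kubectl port-forward service/my-app-green-preview 8080:80 and run smoke tests against localhost:8080.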
After my-app-v2 is running and ready, you’d change the main my-app Service:
# Updated Service to target Green
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    version: "1.25.1" # Now targets Green
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
Then, you’d scale down the old deployment: kubectl scale deployment my-app-v1 --replicas=0.
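Rolling back is the same selector change in reverse, which is why it pays to delay the scale-down until Green has proven itself. As a rough sketch, the change can be expressed as a small patch that touches only the version key of the selector; the file name here is a hypothetical choice.

```yaml
# rollback-to-blue.yaml (hypothetical file name)
# Partial Service spec: only the selector key that changes.
spec:
  selector:
    version: "1.25.0" # Back to Blue
```

This could be applied with kubectl patch service my-app --patch-file rollback-to-blue.yaml. If my-app-v1 has already been scaled to zero, it needs to be scaled back up and ready before the switch, or the Service will have no endpoints.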
Canary Deployment
Canary deployments introduce the new version to a small subset of users first, then gradually roll it out.
- Deploy New Version: Deploy the new version alongside the old version.
- Route Small Traffic: Configure your Ingress or Service Mesh to send a small percentage of live traffic (e.g., 5%) to the new version.
- Monitor: Closely monitor the new version for errors, latency, and user feedback.
- Gradually Increase Traffic: If the new version performs well, incrementally increase the traffic percentage (e.g., 10%, 25%, 50%, 100%).
- Rollback: If issues are detected at any stage, immediately shift all traffic back to the old version and address the problems.
This approach is less disruptive than a full cutover and allows for real-world testing with minimal impact.
In Kubernetes, this is commonly achieved with an Ingress controller that supports traffic splitting (like Nginx Ingress or Traefik) or using a Service Mesh (like Istio or Linkerd).
Here’s how it might look with Nginx Ingress:
# Deployment for the old version (Blue)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: "1.25.0"
  template:
    metadata:
      labels:
        app: my-app
        version: "1.25.0"
    spec:
      containers:
        - name: my-app-container
          image: nginx:1.25.0
          ports:
            - containerPort: 80
---
# Deployment for the new version (Canary)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v2
spec:
  replicas: 1 # Start with fewer replicas for canary
  selector:
    matchLabels:
      app: my-app
      version: "1.25.1"
  template:
    metadata:
      labels:
        app: my-app
        version: "1.25.1"
    spec:
      containers:
        - name: my-app-container
          image: nginx:1.25.1
          ports:
            - containerPort: 80
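Each Deployment needs its own Service so the Ingress controller has two distinct backends to split between. The Ingress example in this section refers to my-app-v1-service and my-app-v2-service without defining them; a minimal sketch of both, assuming the version labels used above, could look like this:

```yaml
# Service for the stable version
apiVersion: v1
kind: Service
metadata:
  name: my-app-v1-service
spec:
  selector:
    app: my-app
    version: "1.25.0"
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
# Service for the canary version
apiVersion: v1
kind: Service
metadata:
  name: my-app-v2-service
spec:
  selector:
    app: my-app
    version: "1.25.1"
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```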
And the Ingress configuration to manage the split:
# Primary Ingress: routes to v1 by default
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-v1-service # Default service for v1
                port:
                  number: 80
---
# Canary Ingress: same host and path as the primary Ingress.
# With Nginx Ingress, the canary annotations go on a separate Ingress whose
# backend is the canary Service, not on the primary Ingress.
# This example is illustrative; actual implementation varies by ingress controller version.
# For demonstration, imagine 'my-app-v2-service' exists and targets v2 pods.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10" # Send 10% of traffic to v2
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-v2-service # Canary service for v2
                port:
                  number: 80
The key here is that the Ingress controller intercepts incoming requests and, based on its configuration (like canary-weight), forwards them to either the my-app-v1-service or my-app-v2-service. You’d incrementally update the canary-weight annotation and/or scale up my-app-v2 as confidence grows.
A common, often overlooked, detail in Service Mesh canary deployments is how the mesh handles session affinity. If your application relies on sticky sessions, you need to ensure your Service Mesh or Ingress is configured to respect that affinity for both the old and new versions during the transition, or you risk breaking user sessions. Without explicit configuration, the mesh might independently route requests from the same user to different versions, leading to unexpected behavior.
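For comparison, here is a rough sketch of a weighted split in a Service Mesh, using Istio as the example. It assumes Istio is installed and that a Kubernetes Service named my-app selects pods of both versions through the app: my-app label; the resource names and the 90/10 weights are illustrative assumptions.

```yaml
# Assumes Istio is installed and a Service named my-app selects both versions.
# Newer Istio releases also accept networking.istio.io/v1 for these resources.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
spec:
  host: my-app
  subsets:
    - name: v1
      labels:
        version: "1.25.0"
    - name: v2
      labels:
        version: "1.25.1"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app
  http:
    - route:
        - destination:
            host: my-app
            subset: v1
          weight: 90 # Stable version
        - destination:
            host: my-app
            subset: v2
          weight: 10 # Canary
```

The weights here are applied per request, so nothing in this sketch pins a given user to one version. That is exactly the gap the session affinity caveat describes, and closing it requires additional, mesh-specific configuration.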