Istio’s outlier detection limits cascading failures by temporarily removing instances that keep failing from a service’s load-balancing pool, before they can degrade the rest of the system.
Let’s see it in action. Imagine we have two versions of a productpage service, v1 and v2, and v2 is currently experiencing intermittent network issues. We want Istio to automatically stop sending traffic to the unhealthy v2 instances.
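Here are simplified productpage Deployments, one per version. The manifests do not mention the Istio sidecars; this assumes they are injected automatically (for example, because the namespace is labeled istio-injection=enabled):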
apiVersion: apps/v1
kind: Deployment
metadata:
  name: productpage-v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: productpage
      version: v1
  template:
    metadata:
      labels:
        app: productpage
        version: v1
    spec:
      containers:
      - name: productpage
        image: your-repo/productpage:v1
        ports:
        - containerPort: 9080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: productpage-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: productpage
      version: v2
  template:
    metadata:
      labels:
        app: productpage
        version: v2
    spec:
      containers:
      - name: productpage
        image: your-repo/productpage:v2
        ports:
        - containerPort: 9080
And a VirtualService to route traffic:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: productpage
spec:
  hosts:
  - productpage
  http:
  - route:
    - destination:
        host: productpage
        subset: v1
      weight: 50
    - destination:
        host: productpage
        subset: v2
      weight: 50
Now, let’s introduce the DestinationRule with outlier detection. This tells Istio how to determine an outlier and what to do when it finds one.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: productpage
spec:
  host: productpage
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
    # Outlier detection for the v2 subset only (nested under the subset)
    trafficPolicy:
      outlierDetection:
        consecutive5xxErrors: 3   # Eject after 3 consecutive 5xx errors
        interval: 10s             # Run the ejection analysis sweep every 10 seconds
        baseEjectionTime: 60s     # Base ejection time, multiplied by the instance's ejection count
        maxEjectionPercent: 50    # Don't eject more than 50% of instances
With this DestinationRule, the Envoy sidecars of productpage’s callers track the responses they receive from each instance in the v2 subset. If a v2 instance returns 3 consecutive 5xx errors (such as HTTP 503 Service Unavailable), Envoy treats it as an outlier and temporarily ejects it from the load-balancing pool. The first ejection lasts 60 seconds; repeated ejections of the same instance last longer, because the ejection time is baseEjectionTime multiplied by the number of times that instance has been ejected. At most 50% of the v2 instances can be ejected at once. While an instance is ejected, the v2 share of traffic is spread across the remaining v2 instances, and the v1 share is unaffected because the VirtualService weights do not change. When the ejection time expires, the instance returns to the pool and receives regular traffic again; Envoy does not send it a separate probe request first. If it keeps failing, it is ejected again, for longer.
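If the same policy should cover every subset rather than only v2, one option is to declare trafficPolicy at the top level of the DestinationRule spec, where it applies to all subsets that do not override it. The following is a minimal sketch that reuses the values from the example above; whether v1 should share these thresholds is an assumption, not something the scenario requires.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: productpage
spec:
  host: productpage
  # Top-level policy: applies to every subset that does not define its own
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 60s
      maxEjectionPercent: 50
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```

A subset-level trafficPolicy takes precedence over the top-level one, so the two forms can be combined when one version needs stricter settings.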
This mechanism prevents a single faulty instance from dragging down the entire productpage service. It’s a form of passive health checking: Envoy infers health from the responses to real traffic instead of sending separate probe requests, and stops routing to instances that keep failing. The consecutive5xxErrors setting is crucial: it sets how many 5xx responses in a row mark an instance as unhealthy, so a single transient blip does not cause an unnecessary ejection. The interval sets how often Envoy runs its ejection analysis sweep, balancing responsiveness against monitoring overhead. baseEjectionTime provides a cooldown period, giving the instance time to recover before it receives traffic again. maxEjectionPercent is a safety valve, ensuring that Envoy doesn’t remove too many instances at once, which could cause capacity problems or even a self-inflicted outage if the detection thresholds are too aggressive.
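To make the trade-offs between these settings concrete, here is a hedged variant of the v2 policy. It uses two further fields from Istio’s outlierDetection API, consecutiveGatewayErrors and minHealthPercent; the specific numbers are illustrative assumptions, not recommendations.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: productpage
spec:
  host: productpage
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
    trafficPolicy:
      outlierDetection:
        consecutiveGatewayErrors: 3  # Eject sooner on 502/503/504 responses
        consecutive5xxErrors: 5      # Tolerate more of the other 5xx errors
        interval: 5s                 # Shorter sweep interval (illustrative)
        baseEjectionTime: 30s        # Shorter base cooldown (illustrative)
        maxEjectionPercent: 50       # Keep at least half the pool in rotation
        minHealthPercent: 50         # Below 50% healthy hosts, outlier detection is disabled
```

Because gateway errors are a subset of 5xx errors, consecutiveGatewayErrors only has an effect when it is lower than consecutive5xxErrors, as it is here.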
The real power comes from how Istio uses Envoy’s built-in outlier detection. Envoy does the work: it observes the responses and manages the ejections. Istio’s DestinationRule is simply the configuration interface that tells Envoy which criteria and actions to apply. The client application doesn’t know about the ejection. It still sees the failed responses that triggered it (unless they were retried), but once an instance is ejected, Envoy stops selecting it and load-balances new requests across the remaining instances. A 503 generated by Envoy itself typically appears only when no healthy instance is left to route to, which is the situation maxEjectionPercent is meant to prevent.
The next thing you’ll want to configure is how to handle failing services beyond just ejecting instances, like implementing retries.
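As a preview, retries are configured on the VirtualService rather than the DestinationRule. The sketch below adds a retries block to the route defined earlier; the attempt count, per-try timeout, and retryOn conditions are assumed values chosen for illustration.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: productpage
spec:
  hosts:
  - productpage
  http:
  - route:
    - destination:
        host: productpage
        subset: v1
      weight: 50
    - destination:
        host: productpage
        subset: v2
      weight: 50
    retries:
      attempts: 3          # Retry a failed request up to 3 times
      perTryTimeout: 2s    # Timeout for each individual attempt
      retryOn: gateway-error,connect-failure,refused-stream
```

With retries in place, a request that fails against an unhealthy instance can succeed on another attempt, while outlier detection removes the failing instance from rotation.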