A circuit breaker doesn’t prevent a service from failing; it prevents a failing service from taking down everything else.

Imagine you have three services: User Service, Order Service, and Payment Service. A user requests their order history. User Service calls Order Service, which in turn calls Payment Service to verify transaction details.

// User Service Request Flow
GET /users/{userId}/orders
  -> calls Order Service GET /orders?userId={userId}
    -> calls Payment Service GET /payments?orderId={orderId}

If Payment Service starts timing out or returning errors, Order Service’s calls to it fail or hang, tying up threads and connections until Order Service itself starts failing. If User Service doesn’t do anything about Order Service failing, it will also start failing. Now, any other service that relies on User Service (e.g., Notification Service to send an email, Analytics Service to log the request) will also fail. This is a cascading failure: one small problem spreads like wildfire.

A circuit breaker, typically implemented within Order Service when it calls Payment Service, acts like an electrical circuit breaker. It monitors calls to Payment Service.

The Three States of a Circuit Breaker

  1. Closed: Everything is working. Calls to Payment Service are allowed through. The breaker monitors for failures. If failures exceed a certain threshold (e.g., 50% of requests fail within a 10-second window), the breaker "trips" and moves to the Open state.

  2. Open: The breaker has tripped. All further calls to Payment Service are immediately rejected with an error (e.g., 503 Service Unavailable or a custom CircuitBreakerOpenException). No actual calls are made to Payment Service. This prevents Order Service from wasting resources on requests that are guaranteed to fail and gives Payment Service time to recover. After a configured timeout (e.g., 30 seconds), the breaker moves to the Half-Open state.

  3. Half-Open: The breaker allows a limited number of test requests (often just one) through to Payment Service. If the test requests succeed, the breaker resets to Closed, assuming Payment Service has recovered. If a test request fails, the breaker immediately trips back to Open, and the timeout period restarts.
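The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration, not how any particular library implements it. The class name `SimpleCircuitBreaker`, its constructor parameters, and the tumbling (rather than sliding) window of calls are assumptions made for brevity. The current time is passed in explicitly so the transitions are easy to trace.

```java
import java.util.function.Supplier;

/** Illustrative circuit breaker; not thread-safe and not production code. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;              // calls evaluated per window
    private final double failureRateThreshold; // e.g. 0.5 for 50%
    private final long openTimeoutMillis;      // time spent OPEN before probing
    private State state = State.CLOSED;
    private int calls = 0;
    private int failures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold, long openTimeoutMillis) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    State state() { return state; }

    /** Runs remoteCall through the breaker; now is the current time in milliseconds. */
    <T> T call(Supplier<T> remoteCall, long now) {
        if (state == State.OPEN) {
            if (now - openedAt < openTimeoutMillis) {
                // Fail fast: the downstream service is not contacted at all.
                throw new IllegalStateException("circuit open: call rejected");
            }
            state = State.HALF_OPEN; // timeout elapsed: let one test request through
        }
        T result;
        try {
            result = remoteCall.get();
        } catch (RuntimeException e) {
            record(true, now);
            throw e;
        }
        record(false, now);
        return result;
    }

    private void record(boolean failed, long now) {
        if (state == State.HALF_OPEN) {
            // The single test request decides the next state.
            if (failed) trip(now); else close();
            return;
        }
        calls++;
        if (failed) failures++;
        if (calls >= windowSize) {
            if ((double) failures / calls >= failureRateThreshold) {
                trip(now);
            } else {
                calls = 0;     // start a fresh window
                failures = 0;
            }
        }
    }

    private void trip(long now) { state = State.OPEN; openedAt = now; calls = 0; failures = 0; }

    private void close() { state = State.CLOSED; calls = 0; failures = 0; }
}
```

In this sketch, Order Service would route every Payment Service request through `call(...)` and treat the `IllegalStateException` as the signal to return a fallback or an error immediately.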

Common Causes and Fixes for Circuit Breaker Issues

1. Under-provisioned Downstream Service Resources

  • Diagnosis: Monitor metrics for Payment Service (CPU, memory, network I/O, connection pool usage). Look for sustained high utilization. Simultaneously, check Order Service logs for a high rate of connection timeouts or read timeouts when calling Payment Service.
  • Fix: Increase capacity for Payment Service. For example, if using Kubernetes, scale out the deployment: kubectl scale deployment payment-service --replicas=5. If the bottleneck is its database, increase the instance size or provision more read replicas.
  • Why it works: The Payment Service was simply overwhelmed. Giving it more capacity allows it to process requests within the timeout windows, preventing failures that would trip the breaker.

2. Network Latency or Congestion Between Services

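  • Diagnosis: Use ping or traceroute from the Order Service pods/VMs to the Payment Service endpoints. Look for high latency (e.g., consistently over 100ms, or anything close to the timeout configured in Order Service) or packet loss. If ICMP is blocked in your environment, time a TCP connection to the service port instead. Check network monitoring tools in your cloud provider or on-premise infrastructure.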
  • Fix: Optimize network routing, upgrade network hardware, or consider placing services in the same availability zone/region if they are geographically separated. If using a service mesh like Istio, ensure traffic is routed efficiently.
  • Why it works: Long network round trips exceed the configured timeouts in Order Service, causing failures. Reducing latency allows requests to complete within their allotted time.
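As a complement to ping, connection latency can be measured at the TCP level from the calling side, which is closer to what Order Service actually experiences. The following is a rough sketch using only the Java standard library. `ConnectLatencyProbe` and `connectMillis` are hypothetical names, and the host and port you pass in depend on your deployment.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ConnectLatencyProbe {
    /**
     * Returns how long it took, in milliseconds, to open a TCP connection.
     * Throws an IOException if the connection fails or exceeds timeoutMillis.
     */
    static double connectMillis(String host, int port, int timeoutMillis) throws IOException {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return (System.nanoTime() - start) / 1_000_000.0;
        }
    }
}
```

Running this a few dozen times from an Order Service host and comparing the results with the configured call timeout gives a quick indication of whether the network alone is consuming most of the time budget. It measures connection setup only, not request processing time.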

3. Inefficient Queries or Heavy Load on Downstream Service

  • Diagnosis: Analyze Payment Service logs for slow queries or operations. Use database performance monitoring tools to identify queries taking longer than expected. Check application performance monitoring (APM) for Payment Service to pinpoint slow code paths.
  • Fix: Optimize database queries (add indexes, rewrite queries), implement caching in Payment Service for frequently accessed, non-critical data, or optimize the code logic.
  • Why it works: Slow operations in Payment Service cause requests to exceed Order Service’s timeouts. Making Payment Service faster means it responds within the timeout, preventing failures.

4. Incorrect Circuit Breaker Configuration (Too Sensitive)

  • Diagnosis: Examine the circuit breaker configuration in Order Service. Look for very low thresholds for failure rate (e.g., 10% failures) or very short time windows (e.g., 5 seconds).
  • Fix: Increase the failure threshold and/or the size of the sliding window. For example, in Resilience4j (Java), you might configure:
    CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
        .failureRateThreshold(50) // 50% failures
        .waitDurationInOpenState(Duration.ofSeconds(30)) // 30 seconds in open state
        .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
        .slidingWindowSize(100) // Look at the last 100 calls
        .build();
    
  • Why it works: The breaker was tripping too easily on transient, minor glitches. With a higher threshold and a larger window, it only trips for sustained, significant problems.
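To see why a small window or a low threshold makes a breaker twitchy, it helps to work out how many failed calls are needed to trip it. The helper below is a hypothetical illustration. It assumes a full count-based window and a breaker that trips when the failure rate reaches or exceeds the threshold. `BreakerSensitivity` and `failuresToTrip` are names invented here, not part of any library.

```java
class BreakerSensitivity {
    /**
     * Smallest number of failed calls in a full window of windowSize calls
     * whose failure rate is at least thresholdPercent.
     */
    static int failuresToTrip(int windowSize, double thresholdPercent) {
        int needed = (int) Math.ceil(windowSize * thresholdPercent / 100.0);
        return Math.max(1, needed); // at least one failure is always required
    }
}
```

With a 10-call window and a 10% threshold, `failuresToTrip(10, 10)` is 1, so a single transient error opens the circuit. With the 100-call window and 50% threshold from the configuration above, `failuresToTrip(100, 50)` is 50.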

5. Incorrect Circuit Breaker Configuration (Not Sensitive Enough)

  • Diagnosis: The opposite of the above. The breaker rarely trips even when Payment Service is clearly struggling, and cascading failures are observed. Thresholds are too high (e.g., 90% failures) or window too large.
  • Fix: Lower the failure threshold and/or shorten the sliding window. For instance, in Resilience4j:
    CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
        .failureRateThreshold(25) // 25% failures
        .waitDurationInOpenState(Duration.ofSeconds(15)) // 15 seconds in open state
        .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.TIME_BASED)
        .slidingWindowSize(30) // an int; for TIME_BASED windows the unit is seconds
        .build();
    
  • Why it works: The breaker wasn’t detecting actual problems quickly enough. Lowering thresholds makes it responsive to genuine issues, preventing them from escalating.

6. Resource Leaks in Downstream Service

  • Diagnosis: Monitor Payment Service for steadily increasing memory usage or a growing number of open file handles/connections over time, even under moderate load. This points to a leak.
  • Fix: Identify and fix the resource leak in Payment Service’s codebase (e.g., not closing database connections, not releasing memory). This often requires code review and debugging.
  • Why it works: A resource leak degrades Payment Service performance over time, eventually leading to failures and timeouts. Fixing the leak restores stable performance.
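A common source of such leaks in Java services is a connection that is closed on the happy path but not when an exception is thrown. The sketch below contrasts the two patterns. `FakeConnection` is a stand-in invented for this example (it only counts how many instances are open) and is not a real driver or pool class.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LeakDemo {
    /** Hypothetical connection that tracks how many instances are still open. */
    static class FakeConnection implements AutoCloseable {
        static final AtomicInteger open = new AtomicInteger();

        FakeConnection() { open.incrementAndGet(); }

        String query(String sql) {
            if (sql.isEmpty()) throw new IllegalArgumentException("empty query");
            return "result of " + sql;
        }

        @Override
        public void close() { open.decrementAndGet(); }
    }

    /** Leaky: if query() throws, close() is never reached. */
    static String leaky(String sql) {
        FakeConnection connection = new FakeConnection();
        String result = connection.query(sql);
        connection.close();
        return result;
    }

    /** Safe: try-with-resources closes the connection on every path. */
    static String safe(String sql) {
        try (FakeConnection connection = new FakeConnection()) {
            return connection.query(sql);
        }
    }
}
```

Each failed call to `leaky` leaves one connection open, which matches the slow, steady growth described in the diagnosis. `safe` releases the connection whether or not the query throws.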

7. Circuit Breaker Implementation Bugs/Misuse

  • Diagnosis: This is rare but possible. Check the circuit breaker library’s documentation. Are you wrapping every call to Payment Service? Is the state management behaving as expected (e.g., not staying open indefinitely, and returning to Closed once test requests succeed in Half-Open)?
  • Fix: Correctly implement the circuit breaker pattern. Wrap every call to the downstream service, not just some code paths. If you combine the breaker with retries, retry only idempotent calls, and do not retry calls that an open breaker has rejected. Consult the specific library’s best practices.
  • Why it works: The breaker itself was faulty or misapplied, not the underlying service. Correct implementation ensures it functions as intended.

Once all of these are fixed and the breaker stays closed, full traffic reaches the downstream service again. The next error you’ll likely see is a RateLimitExceededException from a service that is now receiving more load than before, because the circuit breaker is no longer shedding requests during temporary downstream issues.
