The retry pattern doesn’t actually solve the problem of transient failures; it just hides them, often making debugging harder.

Imagine a user trying to place an order. Their browser makes a request to the OrderService.

POST /orders
{
  "userId": "user-123",
  "items": [...]
}

The OrderService needs to check inventory, so it calls the InventoryService.

// Request from OrderService to InventoryService
{
  "items": [...]
}

Now, let’s say the InventoryService is temporarily overloaded. It might respond with a 503 Service Unavailable error. This is a transient failure – it’s not that the InventoryService is broken, just that it’s swamped right now.

This is where the retry pattern kicks in. The OrderService is configured to retry the InventoryService call a few times if it gets a 503.

// OrderService configuration snippet
{
  "retryPolicy": {
    "maxAttempts": 3,
    "delay": "5s",
    "retryableErrors": ["503"]
  }
}

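If the InventoryService recovers before the retries run out (with this policy, roughly 10 seconds, assuming maxAttempts counts the initial call), the OrderService’s call succeeds, and the order proceeds. The user never even knows there was a hiccup, apart from a slower response. The retry pattern, in this case, successfully masked a temporary issue.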

But what if the InventoryService doesn’t recover? After 3 attempts, the OrderService gives up, but the error that surfaces might be confusing. Instead of a clear "Inventory Service unavailable," the user might see a generic "Order failed" message, or worse, a timeout error, because the OrderService spent so long retrying the InventoryService call that the original request ran out of time.

The mental model here is about decoupling and resilience. Services don’t need to be perfectly available all the time. They just need to be available enough, and the retry pattern is one way to achieve that "enough."

Internally, when the OrderService makes its call to InventoryService, it wraps that call in a try-catch block. If the catch block encounters a 503 (or whatever is configured), it waits for the specified delay (e.g., 5 seconds) and then tries the call again, up to maxAttempts. This happens entirely within the OrderService’s process.
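The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the OrderService’s actual code: HttpError and call_with_retry are hypothetical names, and the sketch assumes maxAttempts counts the first call as well as the retries.

```python
import time


class HttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(call, max_attempts=3, delay=5.0, retryable=(503,), sleep=time.sleep):
    """Run call(); on a retryable status, wait `delay` seconds and try again.

    Assumption: max_attempts includes the initial call, so 3 means
    one call plus at most two retries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except HttpError as err:
            # Re-raise immediately for non-retryable statuses,
            # or when this was the last allowed attempt.
            if err.status not in retryable or attempt == max_attempts:
                raise
            sleep(delay)
```

Note that the caller only ever sees the final outcome. Unless each failed attempt is logged inside the except block, the earlier 503s leave no trace, which is exactly how retries end up hiding transient failures.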

The levers you control are:

  • maxAttempts: How many times to retry. Too few, and you don’t handle transient issues. Too many, and you can overload the downstream service further and increase latency for your own service.
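  • delay: How long to wait between retries. A fixed delay can lead to "thundering herd" problems if many clients retry simultaneously. Exponential backoff (e.g., 5s, 10s, 20s) is often better, ideally with some random jitter added so that clients which failed at the same moment don’t all retry at the same moment.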
  • retryableErrors: Which HTTP status codes or error types are considered transient. 503 is common, but 429 Too Many Requests is another. You don’t want to retry 400 Bad Request or 404 Not Found, as those indicate a client-side problem or a resource that doesn’t exist, and retrying won’t help.
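To make the delay and retryableErrors levers concrete, here is a rough sketch of capped exponential backoff with "full jitter" (each wait is a random value between zero and the current ceiling). The function names, the 60-second cap, and the choice of full jitter are assumptions made for illustration; they are not part of the configuration shown earlier.

```python
import random

# Statuses treated as transient in this sketch (taken from the list above).
RETRYABLE_STATUSES = {429, 503}


def is_retryable(status):
    """Return True if a failed call with this HTTP status is worth retrying."""
    return status in RETRYABLE_STATUSES


def backoff_delays(max_attempts=3, base=5.0, factor=2.0, cap=60.0, rng=random.random):
    """Return the wait (in seconds) before each retry.

    The ceiling grows as base * factor**retry (5s, 10s, 20s, ...) up to `cap`.
    Each actual wait is drawn uniformly from [0, ceiling) so that clients
    which failed together spread their retries out.
    """
    delays = []
    for retry in range(max_attempts - 1):  # one fewer wait than attempts
        ceiling = min(cap, base * factor ** retry)
        delays.append(rng() * ceiling)
    return delays
```

Passing rng=lambda: 1.0 disables the jitter and yields the plain exponential schedule, which is handy in tests.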

The real magic happens when you combine retries with circuit breakers. If a service keeps failing, you don’t want to keep retrying indefinitely. A circuit breaker will "trip" after a certain number of failures, preventing further calls to the failing service for a period. This gives the failing service time to recover and prevents your service from wasting resources on calls that are very likely to fail.
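A circuit breaker can be sketched as a small wrapper around the outgoing call. The version below is deliberately simplified and makes several assumptions: the class and parameter names are hypothetical, it counts consecutive failures only, it treats any exception as a failure, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: fail fast without touching the downstream service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            # Trip on reaching the threshold, or re-trip if the trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

One way to combine the two patterns is to route every retry attempt through the breaker, so that once it trips, the remaining retries fail fast instead of adding load to a service that is already struggling.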

If you’ve implemented retries and are still seeing intermittent failures, you’re likely looking at an observability problem: the original cause of the transient failure isn’t being propagated through your logs and distributed traces, so all you see is the final, generic error after the retries have been exhausted.
