The most surprising thing about microservice latency is that the fastest way to reduce it often involves making your services do more work.

Let’s watch this in action. Imagine a simple e-commerce checkout flow. A user clicks "Place Order."

Here’s a simplified view of the requests hitting our backend services:

  1. POST /orders (frontend -> Order Service)
  2. POST /payments (Order Service -> Payment Service)
  3. POST /inventory (Order Service -> Inventory Service)
  4. POST /shipping (Order Service -> Shipping Service)

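Each of these calls adds latency. If the Order Service makes the downstream calls one after another, the total is the sum of every call plus the network hops. If it issues them in parallel, the slowest call determines how long the downstream phase takes.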

Frontend: User clicks "Place Order"
  |
  v
Order Service (150ms: validates order, creates DB entry)
  |
  +--- Payment Service (200ms: processes credit card)
  |
  +--- Inventory Service (100ms: decrements stock)
  |
  +--- Shipping Service (250ms: creates shipping label)
  |
  v
Order Service (responds to frontend)

Total latency here is roughly 150ms (Order Service initial) + max(200ms, 100ms, 250ms) + Order Service final processing. Let’s say the Order Service final processing is 50ms. Total: 150 + 250 + 50 = 450ms.
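The arithmetic above can be written out as a small sketch. The function name is made up for illustration, network hop time is ignored, and the parallel case assumes the Order Service really does issue all three downstream calls at once.

```python
def fan_out_latency_ms(initial_ms, downstream_ms, final_ms, parallel=True):
    """Estimate end-to-end latency for an orchestrating service.

    With parallel fan-out, the downstream phase costs as much as the
    slowest call; with sequential calls, it costs the sum of all calls.
    """
    downstream_phase = max(downstream_ms) if parallel else sum(downstream_ms)
    return initial_ms + downstream_phase + final_ms


calls = [200, 100, 250]  # Payment, Inventory, Shipping (from the diagram)
parallel_total = fan_out_latency_ms(150, calls, 50)
sequential_total = fan_out_latency_ms(150, calls, 50, parallel=False)
print(parallel_total, sequential_total)  # 450 750
```

The gap between the two totals is what the orchestration choice alone is worth, before any individual service gets faster.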

Now, what if the Order Service handles the whole flow synchronously, inside a single request from the frontend?

  1. POST /orders (frontend -> Order Service)
    • Order Service internally calls Payment, Inventory, Shipping.
    • Order Service waits for all of them to complete.
    • Order Service responds to frontend.

Frontend: User clicks "Place Order"
  |
  v
Order Service (150ms initial + 250ms internal calls + 50ms final)
  |
  +--- Payment Service (200ms)
  |
  +--- Inventory Service (100ms)
  |
  +--- Shipping Service (250ms)
  |
  v
Order Service (responds to frontend)

The total time from the frontend’s perspective is still around 450ms. But the individual service calls are now happening within the Order Service’s overall request. This isn’t about making individual services faster; it’s about how you orchestrate them.
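As a rough illustration of orchestration inside one request, here is a minimal sketch using Python's asyncio. The `call_service` helper is a hypothetical stand-in that sleeps instead of making a network call, and the delays are the diagram's figures scaled down by a factor of ten.

```python
import asyncio


async def call_service(name, latency_s):
    """Stand-in for a downstream HTTP call; sleeps instead of doing I/O."""
    await asyncio.sleep(latency_s)
    return f"{name}: ok"


async def place_order():
    # Fan out to all three services concurrently and wait for every reply.
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(
        call_service("payment", 0.020),
        call_service("inventory", 0.010),
        call_service("shipping", 0.025),
    )


results = asyncio.run(place_order())
print(results)
```

Because the three calls overlap, the wait is governed by the slowest one (shipping) rather than by the sum of all three.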

The key levers you control are:

  • Service Granularity: How small are your services? Too small means too many network hops.
  • Communication Patterns: Synchronous vs. Asynchronous. Request/Reply vs. Event-driven.
  • Data Ownership & Consistency: How do services share or access data?
  • Caching: Where and how do you cache data to avoid redundant calls?

Let’s dive into profiling. You need to see where the time is actually spent.

Tools and Techniques

  1. Distributed Tracing: This is your bread and butter. Tracing backends like Jaeger and Zipkin, fed by instrumentation such as OpenTelemetry, stitch together a single request as it crosses services.

    • Diagnosis: Look for traces where a single service call dominates the overall request duration. In our example, the Shipping Service at 250ms is the bottleneck.
    • Fix: If a specific downstream service is consistently slow, you have a few options:
      • Optimize the slow service: Profile the Shipping Service itself. Is it a database query? A third-party API call?
      • Introduce Caching: If the shipping calculation is deterministic for certain inputs, cache the results.
        # Example: Redis cache for shipping rates
        GET shipping_rate:usps:zip12345
        
        This returns a cached value if present, avoiding the downstream call.
      • Asynchronous Processing: If shipping label generation isn’t strictly required for the immediate user response, make it asynchronous. The Order Service publishes an OrderCreated event, and the Shipping Service subscribes to it.
        // Order Service publishes
        {
          "event": "OrderCreated",
          "orderId": "abc-123",
          "shippingAddress": "...",
          "items": [...]
        }
        
        The Order Service can then respond to the user much faster, while shipping happens in the background. This shifts latency from the user-facing path to a background process.
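The caching option in the list above can be sketched as a cache-aside lookup. This is a minimal in-memory version, assuming a fixed time-to-live. The `TTLCache` class, the `fetch_rate_from_shipping_service` function and the rate it returns are all hypothetical placeholders; in production the dictionary would typically be replaced by Redis or a similar store.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(key, None)  # drop expired or missing entries
            return None
        return entry[1]

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)


downstream_calls = 0


def fetch_rate_from_shipping_service(carrier, zip_code):
    """Hypothetical slow downstream call; returns a made-up rate."""
    global downstream_calls
    downstream_calls += 1
    return 7.95


def get_shipping_rate(cache, carrier, zip_code):
    key = f"shipping_rate:{carrier}:zip{zip_code}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: no downstream call
    rate = fetch_rate_from_shipping_service(carrier, zip_code)
    cache.set(key, rate)
    return rate


cache = TTLCache(ttl_seconds=300)
first = get_shipping_rate(cache, "usps", "12345")
second = get_shipping_rate(cache, "usps", "12345")
print(first, second, downstream_calls)  # 7.95 7.95 1
```

The second lookup is served from the cache, so the slow service is called only once. The time-to-live bounds how stale a cached rate can get.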
  2. Application Performance Monitoring (APM) Tools: Products such as Datadog, New Relic, and Dynatrace provide both high-level views and deep dives.

    • Diagnosis: APM dashboards will show you service dependencies and the latency of each connection. They’ll highlight NFR (Non-Functional Requirement) violations for latency.
    • Fix: Use APM insights to identify chatty services (too many calls between them) or services with high error rates that might be retrying and causing delays. For example, if the Payment Service is intermittently failing, it might be retrying, adding significant latency to the Order Service.
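To see how quickly retries inflate latency, here is a back-of-the-envelope sketch. It assumes a fixed per-attempt timeout and a fixed backoff between attempts; the function name and the numbers are illustrative, not taken from any particular client library.

```python
def worst_case_latency_ms(timeout_ms, max_attempts, backoff_ms):
    """Upper bound on time spent when every attempt times out.

    Each attempt may run for the full timeout, and a fixed backoff
    is slept between attempts (not after the last one).
    """
    return max_attempts * timeout_ms + (max_attempts - 1) * backoff_ms


single = worst_case_latency_ms(timeout_ms=200, max_attempts=1, backoff_ms=0)
with_retries = worst_case_latency_ms(timeout_ms=200, max_attempts=3, backoff_ms=50)
print(single, with_retries)  # 200 700
```

Three attempts with a 200ms timeout can hold the caller for 700ms, which is why retry policies belong in the same latency budget as the calls themselves.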
  3. Service Mesh (e.g., Istio, Linkerd): These can provide detailed metrics on inter-service communication without application code changes.

    • Diagnosis: The service mesh’s telemetry will show you request volumes, success rates, and latencies between specific service instances. You can see if a particular instance of the Payment Service is performing poorly.
    • Fix: You can use the service mesh to implement circuit breakers or request timeouts. If the Payment Service is consistently slow, you can configure a timeout of, say, 150ms for calls to it.
      # Istio VirtualService example
      spec:
        hosts:
        - payment.namespace.svc.cluster.local
        http:
        - route:
          - destination:
              host: payment.namespace.svc.cluster.local
          timeout: 0.15s # 150ms; set on the HTTP route, alongside "route"
      
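      This prevents a single slow payment request from blocking the entire order flow: the call fails fast with a timeout error, which the Order Service must handle. Pick a value above the service's normal latency, or healthy requests will be cut off too.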
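A service mesh applies circuit breaking at the proxy level, but the idea is easy to see in application code. The following is a minimal sketch, not how Istio or Linkerd implement it; the class, its parameters and the `flaky_payment` function are hypothetical, and the clock is injected so the demo does not have to sleep.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_after_s, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("failing fast; downstream marked unhealthy")
            self._opened_at = None  # cool-down elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success resets the failure count
        return result


def flaky_payment():
    raise TimeoutError("payment timed out")  # simulated slow dependency


fake_now = [0.0]  # mutable fake clock, in seconds
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=30,
                         clock=lambda: fake_now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_payment)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("timeout")
print(outcomes)  # ['timeout', 'timeout', 'rejected']
```

After two consecutive failures the third call is rejected immediately, so the caller stops paying the timeout on every request while the dependency is unhealthy.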
  4. Profiling Individual Services: When distributed tracing points to a specific service as the culprit, you need to profile that service.

    • Diagnosis: Use language-specific profilers (e.g., pprof for Go, cProfile for Python, JProfiler for Java) to find CPU hotspots or memory allocation issues within the service.
    • Fix: If profiling shows excessive time spent in JSON serialization/deserialization, consider using a faster library or a binary format like Protocol Buffers or MessagePack for internal communication.
      // Go example using protobuf
      // Instead of: json.Marshal(data)
      // Use: proto.Marshal(msg) // msg must be a generated proto.Message, not an arbitrary struct
      
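      This reduces CPU overhead. Binary formats are often several times faster to encode and decode than JSON, but the gain depends on payload shape, so benchmark with your own messages before switching.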
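As an example of profiling a single service, here is a minimal sketch using Python's built-in cProfile and pstats modules. The `serialize_orders` function is a made-up workload that spends most of its time in JSON encoding, standing in for a real request handler.

```python
import cProfile
import io
import json
import pstats


def serialize_orders(n):
    """Deliberately serialization-heavy work to give the profiler a hotspot."""
    orders = [{"orderId": f"abc-{i}", "items": list(range(20))} for i in range(n)]
    return [json.dumps(order) for order in orders]


profiler = cProfile.Profile()
profiler.enable()
payloads = serialize_orders(2000)
profiler.disable()

# Render the ten most expensive entries by cumulative time into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the function that owns the slow path at the top, with the JSON encoder calls it triggers listed beneath it.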

The one thing most people don’t know about optimizing latency is that the path with the most total work can still be the fastest from the end user’s perspective. If a synchronous call takes 450ms but an asynchronous one returns an immediate response in 100ms and finishes the remaining 350ms of work in the background, the flow feels faster to the user. The total system work might be similar, but decoupling drastically reduces the perceived latency.
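The respond-early pattern can be sketched with an in-process queue. This is only an illustration: a real system would publish the OrderCreated event to a message broker rather than an `asyncio.Queue`, and the function names and label format here are hypothetical.

```python
import asyncio


async def shipping_worker(events, labels):
    """Background consumer: creates a label for each OrderCreated event."""
    while True:
        event = await events.get()
        await asyncio.sleep(0.025)  # stand-in for slow label generation
        labels.append(f"label-for-{event['orderId']}")
        events.task_done()


async def place_order(events, order_id):
    # Publish the event and return immediately; shipping is not awaited.
    await events.put({"event": "OrderCreated", "orderId": order_id})
    return {"orderId": order_id, "status": "accepted"}


async def main():
    events = asyncio.Queue()
    labels = []
    worker = asyncio.create_task(shipping_worker(events, labels))

    response = await place_order(events, "abc-123")
    labels_at_response = list(labels)  # still empty: work is in the background

    await events.join()  # demo only: wait for the background work to finish
    worker.cancel()
    return response, labels_at_response, labels


response, labels_at_response, labels = asyncio.run(main())
print(response, labels_at_response, labels)
```

The response is ready before any label exists, which is the trade being made: the user gets an answer sooner, and the system accepts that shipping completes (or fails) later.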

The next problem you’ll encounter is managing the complexity introduced by asynchronous communication and eventual consistency.
