The most surprising thing about distributed tracing is that it doesn’t actually track requests; it tracks spans, which are discrete units of work within a request.
Let’s watch a request flow through a simple e-commerce system. Imagine a user clicks "Add to Cart." This initiates a chain of events:
- Frontend Service: Receives the click and starts a new trace by creating the root span.
- API Gateway: Receives the request from the frontend. This is a new span, child of the frontend’s span.
- Cart Service: The gateway forwards the request. This is a new span, child of the gateway’s span.
- Product Service: The cart service needs product details. It calls the product service. This is a new span, child of the cart service’s span.
- Database: The product service queries its database. This is a new span, child of the product service’s span.
Here’s what that looks like in a tracing tool (like Jaeger or Zipkin), simplified:
Trace ID: a1b2c3d4e5f6...
Span 1: Frontend (HTTP Request)
    ID: 001
    Parent ID: - (Root span)
    Start Time: 2023-10-27T10:00:00.000Z
    Duration: 50ms
    Tags: http.method=POST, http.url=/cart/add
Span 2: API Gateway (Receive Request)
    ID: 002
    Parent ID: 001
    Start Time: 2023-10-27T10:00:00.010Z
    Duration: 30ms
    Tags: http.method=POST, http.url=/cart/add
Span 3: Cart Service (Add Item)
    ID: 003
    Parent ID: 002
    Start Time: 2023-10-27T10:00:00.012Z
    Duration: 25ms
    Tags: service.name=cart-service, operation=addItem
Span 4: Product Service (Get Product Details)
    ID: 004
    Parent ID: 003
    Start Time: 2023-10-27T10:00:00.017Z
    Duration: 15ms
    Tags: service.name=product-service, operation=getProduct
Span 5: Product DB (Query)
    ID: 005
    Parent ID: 004
    Start Time: 2023-10-27T10:00:00.022Z
    Duration: 5ms
    Tags: db.type=postgres, db.statement="SELECT * FROM products WHERE id = 'XYZ'"
The trace ID links all these spans together. The parent ID shows the causal relationship. Tracing tools render this data as a "waterfall" or "Gantt" chart, with one bar per span, and that view is the core of distributed tracing. You can immediately see that the product database query took 5ms, the product service took 15ms total (including the DB call), and the entire operation from the gateway’s perspective took 30ms. If the product service were slow, you’d see Span 4 taking much longer, pointing you directly to that component.
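To make the parent/child arithmetic concrete, here is a minimal Python sketch that rebuilds the tree from the parent IDs and computes how much time each span spent in its own code rather than waiting on a child. The `Span` class and `self_times` function are hypothetical names for illustration, not part of Jaeger or Zipkin, and the calculation assumes each span's children run one after another inside it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: int


# IDs, parents, and durations copied from the example trace.
SPANS = [
    Span("001", None, "Frontend", 50),
    Span("002", "001", "API Gateway", 30),
    Span("003", "002", "Cart Service", 25),
    Span("004", "003", "Product Service", 15),
    Span("005", "004", "Product DB", 5),
]


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Return each span's duration minus the time spent in its direct children."""
    children = defaultdict(list)
    for span in spans:
        if span.parent_id is not None:
            children[span.parent_id].append(span)
    return {
        span.name: span.duration_ms - sum(c.duration_ms for c in children[span.span_id])
        for span in spans
    }


for name, ms in self_times(SPANS).items():
    print(f"{name}: {ms}ms of its own work")
```

For the example trace, this attributes 10ms to the product service itself and 5ms to its database query, which is the kind of breakdown the waterfall view lets you read off at a glance.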
The problem distributed tracing solves is understanding performance and errors in complex, multi-service architectures. Without it, debugging a request that fails or is slow means SSHing into multiple servers, checking logs, and trying to piece together the timeline manually. It’s a nightmare. Tracing automates this by correlating events across service boundaries.
Internally, it works by passing the trace ID and a parent span ID with every outgoing request. This is usually done via HTTP headers: either the standard W3C Trace Context header, traceparent, or vendor-specific headers such as Zipkin’s B3 headers (X-B3-TraceId, X-B3-SpanId). When a service receives a request with these headers, it uses them to create its own span, linking it to the incoming request’s trace and parent span. If it makes an outgoing request, it injects the same trace ID, along with its own span ID as the parent ID for the next service.
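The extract-then-inject cycle can be sketched in a few lines of Python. The header layout follows the W3C traceparent format (version, 32-hex-digit trace ID, 16-hex-digit parent span ID, 2-hex-digit flags); everything else, including the `SpanContext`, `start_span`, and `outgoing_headers` names, is a hypothetical illustration rather than any tracing library's API. The sketch assumes header names are already lowercased and skips some validation a real implementation would need, such as rejecting all-zero IDs.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for a root span
    flags: str


def start_span(incoming_headers: Dict[str, str]) -> SpanContext:
    """Continue the caller's trace if a valid header is present, else start a new one."""
    match = _TRACEPARENT.match(incoming_headers.get("traceparent", ""))
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex digits
    if match:
        trace_id, parent_id, flags = match.groups()
        return SpanContext(trace_id, span_id, parent_id, flags)
    # No usable context: this service becomes the root of a new trace.
    return SpanContext(secrets.token_hex(16), span_id, None, "01")


def outgoing_headers(ctx: SpanContext) -> Dict[str, str]:
    """Keep the trace ID; this service's span ID becomes the callee's parent ID."""
    return {"traceparent": f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"}


# Gateway starts a trace, then calls the cart service.
gateway = start_span({})
cart = start_span(outgoing_headers(gateway))
print(cart.trace_id == gateway.trace_id, cart.parent_id == gateway.span_id)
```

Note that the trace ID never changes along the chain; only the span ID in the header is replaced at each hop.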
The levers you control are primarily:
- Instrumentation: How you add the tracing code to your services. Libraries like OpenTelemetry, Jaeger clients, or Zipkin clients handle the header propagation and span creation.
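- Sampling: You can’t afford to trace every single request in a high-throughput system. You configure sampling rates (e.g., trace 1% of all requests, or 100% of requests from a specific user ID, or 100% of requests that resulted in an error). The first two can be decided when the trace starts (head-based sampling); keeping only errors requires deciding after the request completes (tail-based sampling), since the outcome isn’t known up front.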
- Data Export: Where and how the trace data (spans) is sent. This could be to a local agent, a cloud-based collector, or directly to a tracing backend like Jaeger, Zipkin, or Datadog.
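The three levers can be sketched together in plain Python. This is an illustration under stated assumptions, not how OpenTelemetry or any client library is implemented: `should_sample`, `traced`, and `finished_spans` are hypothetical names, the sampler assumes trace IDs are random hex strings, and the "exporter" is just an in-memory list standing in for an agent or collector.

```python
import secrets
import time
from contextlib import contextmanager
from typing import Callable, Dict, List


def should_sample(trace_id: str, rate: float) -> bool:
    """Sampling: keep roughly `rate` of traces, decided from the trace ID alone."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Treat the low 64 bits of the hex trace ID as a uniform random number.
    return int(trace_id[-16:], 16) < rate * 2**64


@contextmanager
def traced(name: str, trace_id: str, rate: float, export: Callable[[Dict], None]):
    """Instrumentation: time a block of work and hand the finished span to `export`."""
    start = time.monotonic()
    try:
        yield
    finally:
        if should_sample(trace_id, rate):
            export({
                "trace_id": trace_id,
                "span_id": secrets.token_hex(8),
                "name": name,
                "duration_ms": (time.monotonic() - start) * 1000,
            })


# Data export: an in-memory list here; a real exporter would send spans
# to a local agent, a collector, or a tracing backend.
finished_spans: List[Dict] = []

with traced("addItem", trace_id=secrets.token_hex(16), rate=1.0,
            export=finished_spans.append):
    time.sleep(0.01)  # stand-in for real work

print(finished_spans)
```

Deciding from the trace ID rather than a fresh random number means every service that sees the same trace ID reaches the same keep-or-drop decision, so sampled traces stay complete.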
The most counterintuitive part of distributed tracing is that a simple, synchronous mechanism (passing IDs along with each call) is what lets you build a picture of asynchronous operations. Each span represents a discrete, often synchronous, unit of work, but because every span carries the same trace context, you can connect the dots between services that might be responding to events, queuing messages, or otherwise operating in a non-linear fashion. The trace context is the thread that stitches these disparate actions into a coherent narrative of a single logical operation.
The next concept you’ll encounter is dealing with asynchronous operations like message queues, where trace context propagation requires special adapters or middleware.