Observability isn’t about seeing more data; it’s about understanding the right data to answer any question you have about your system’s behavior.

Let’s watch a simple microservice interaction unfold. Imagine two services: Frontend and User-Service.

// Request to Frontend: GET /users/123
{
  "traceId": "a1b2c3d4e5f6",
  "spanId": "f6e5d4c3b2a1",
  "service": "Frontend",
  "operation": "getUser",
  "startTime": "2023-10-27T10:00:00Z",
  "duration": 50, // ms
  "tags": {
    "http.method": "GET",
    "http.url": "/users/123",
    "userId": "123"
  }
}

// Frontend makes a call to User-Service
{
  "traceId": "a1b2c3d4e5f6", // Same traceId!
  "spanId": "1a2b3c4d5e6f",
  "parentId": "f6e5d4c3b2a1", // Links back to Frontend's span
  "service": "User-Service",
  "operation": "fetchUserById",
  "startTime": "2023-10-27T10:00:00.015Z", // 15 ms after the Frontend span starts
  "duration": 30, // ms
  "tags": {
    "db.query": "SELECT * FROM users WHERE id = '123'",
    "userId": "123"
  }
}

// User-Service responds to Frontend
{
  "traceId": "a1b2c3d4e5f6",
  "spanId": "1a2b3c4d5e6f",
  "parentId": "f6e5d4c3b2a1",
  "service": "User-Service",
  "operation": "fetchUserById",
  "startTime": "2023-10-27T10:00:00.015Z",
  "duration": 30, // ms
  "tags": {
    "db.query": "SELECT * FROM users WHERE id = '123'",
    "userId": "123",
    "response.status": 200,
    "user": {"id": "123", "name": "Alice"}
  }
}

// Frontend responds to the client
{
  "traceId": "a1b2c3d4e5f6",
  "spanId": "f6e5d4c3b2a1",
  "service": "Frontend",
  "operation": "getUser",
  "startTime": "2023-10-27T10:00:00Z",
  "duration": 50, // ms
  "tags": {
    "http.method": "GET",
    "http.url": "/users/123",
    "userId": "123",
    "response.status": 200
  }
}

This is a simplified trace. Each entry is a snapshot of a "span," which represents a unit of work; the four entries above describe just two spans, each shown once when it starts and once when it completes. Notice how traceId links everything together, and parentId shows the call hierarchy. The total time for Frontend’s getUser operation was 50ms, and within that, User-Service’s fetchUserById took 30ms. The tags give us context: the HTTP method, the SQL query, the user ID.
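To make that arithmetic concrete, here is a minimal Go sketch that computes each span’s "self time" (its duration minus the time spent in its direct children) for the two spans above. The Span struct and the selfTimes helper are illustrative names, not part of any tracing library, and the sketch assumes child spans do not overlap each other.

```go
package main

import "fmt"

// Span is a hypothetical, stripped-down span record mirroring the JSON above.
type Span struct {
	SpanID     string
	ParentID   string // empty for the root span
	Service    string
	DurationMs int
}

// selfTimes returns, per span ID, the span's duration minus the total
// duration of its direct children. It assumes children do not overlap.
func selfTimes(spans []Span) map[string]int {
	self := make(map[string]int, len(spans))
	for _, s := range spans {
		self[s.SpanID] = s.DurationMs
	}
	for _, s := range spans {
		// Only subtract when the parent is part of this trace.
		if _, ok := self[s.ParentID]; ok {
			self[s.ParentID] -= s.DurationMs
		}
	}
	return self
}

func main() {
	spans := []Span{
		{SpanID: "f6e5d4c3b2a1", Service: "Frontend", DurationMs: 50},
		{SpanID: "1a2b3c4d5e6f", ParentID: "f6e5d4c3b2a1", Service: "User-Service", DurationMs: 30},
	}
	self := selfTimes(spans)
	for _, s := range spans {
		fmt.Printf("%s: total=%dms self=%dms\n", s.Service, s.DurationMs, self[s.SpanID])
	}
}
```

For this trace, Frontend’s self time comes out at 20ms: the part of its 50ms that was not spent waiting on User-Service.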

Logs are the unstructured or semi-structured narrative of what happened. Metrics are aggregated, numerical measurements over time. Traces are the end-to-end view of a single request’s journey. They work together. A trace might show a high latency for User-Service, and then you’d dive into User-Service’s logs for that specific traceId to see why it was slow (e.g., a database timeout).

The problem observability solves is the "unknown unknowns" in distributed systems. In a monolith, you might attach a debugger or profile a single process. In microservices, a single request can touch dozens of independent services. You need a way to stitch that request’s journey together and understand its performance characteristics across service boundaries.

The core components of observability are:

  • Logs: Detailed, event-based records. Think of them as the "diary" of each service. They should ideally include context like traceId, userId, and serviceName.
    • Example Log Entry (JSON format):
      {
        "timestamp": "2023-10-27T10:00:00.045Z",
        "level": "INFO",
        "message": "Database query executed",
        "service": "User-Service",
        "traceId": "a1b2c3d4e5f6",
        "spanId": "1a2b3c4d5e6f",
        "durationMs": 25,
        "query": "SELECT * FROM users WHERE id = '123'"
      }
      
  • Metrics: Time-series data representing aggregated measurements. These are your "dashboards" showing system health. Examples: request latency (p95, p99), error rates, CPU utilization, queue lengths.
    • Example Metric (Prometheus format):
      http_requests_total{service="User-Service", method="GET", path="/users/{id}", status_code="200"} 1500
      http_request_duration_seconds_bucket{service="User-Service", method="GET", path="/users/{id}", status_code="200", le="0.1"} 1400
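      http_request_duration_seconds_bucket{service="User-Service", method="GET", path="/users/{id}", status_code="200", le="0.5"} 1480
      http_request_duration_seconds_bucket{service="User-Service", method="GET", path="/users/{id}", status_code="200", le="+Inf"} 1500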
      
  • Traces: End-to-end request flows across multiple services, showing the causal relationships and latency contributions. These are your "flight recorders."
    • Example Trace Visualization: A directed acyclic graph (DAG) showing the Frontend service calling User-Service, with timings for each operation.

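The key to making these work together is correlation. Every log line and trace span needs common identifiers, and the most critical is traceId. Without it, you’re looking at isolated islands of data. Metrics are aggregates, so they correlate differently: through shared labels such as the service name, through time windows, and optionally through exemplars that attach a sample traceId to a measurement. Putting traceId itself into a metric label would create a new time series per request.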

When you instrument your services (using libraries like OpenTelemetry or the Prometheus client libraries; the older Jaeger clients have been deprecated in favor of OpenTelemetry), you configure them to:

  1. Generate Spans: For incoming requests and outgoing calls.
  2. Propagate Context: Crucially, they must pass the traceId and the caller’s spanId (which becomes the callee’s parentId) in request headers (e.g., traceparent for W3C Trace Context) or in message metadata when the call goes through a queue.
  3. Export Data: Send spans to a tracing backend, metrics to a time-series database, and logs to a log aggregation system.
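To show what step 2 actually puts on the wire, here is a minimal Go sketch that parses and builds a version-00 W3C traceparent header using only the standard library. The helper names (parseTraceparent, formatTraceparent, validID) are hypothetical and the validation is simplified; in a real service the OpenTelemetry propagator does this for you. Note that real trace IDs are 32 hex characters and span IDs 16, longer than the shortened IDs used in this article.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// TraceContext holds the fields carried by a version-00 traceparent header.
type TraceContext struct {
	TraceID  string // 32 lowercase hex characters
	ParentID string // 16 lowercase hex characters: the caller's span ID
	Sampled  bool
}

// validID reports whether s is n lowercase hex characters and not all zeros.
func validID(s string, n int) bool {
	if len(s) != n || s != strings.ToLower(s) || strings.Trim(s, "0") == "" {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// parseTraceparent parses "00-<trace-id>-<parent-id>-<flags>".
// Only version 00 is handled in this sketch.
func parseTraceparent(h string) (TraceContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceContext{}, fmt.Errorf("unsupported traceparent %q", h)
	}
	if !validID(parts[1], 32) || !validID(parts[2], 16) {
		return TraceContext{}, fmt.Errorf("invalid IDs in traceparent %q", h)
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil || len(flags) != 1 {
		return TraceContext{}, fmt.Errorf("invalid flags in traceparent %q", h)
	}
	return TraceContext{TraceID: parts[1], ParentID: parts[2], Sampled: flags[0]&0x01 == 1}, nil
}

// formatTraceparent builds the header for an outgoing call: same trace ID,
// with the calling span's own ID in the parent-id position.
func formatTraceparent(traceID, spanID string, sampled bool) string {
	flags := "00"
	if sampled {
		flags = "01"
	}
	return "00-" + traceID + "-" + spanID + "-" + flags
}

func main() {
	incoming := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	tc, err := parseTraceparent(incoming)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("trace=%s parent=%s sampled=%v\n", tc.TraceID, tc.ParentID, tc.Sampled)

	// This service starts its own span (ID chosen here for illustration)
	// and forwards the context downstream.
	fmt.Println(formatTraceparent(tc.TraceID, "b7ad6b7169203331", tc.Sampled))
}
```

The trace ID never changes along the request path; only the parent-id segment is rewritten at each hop, which is what lets the backend rebuild the call hierarchy.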

The magic happens when these systems are integrated. You can click on a slow trace in your tracing UI and see the associated logs for that specific traceId and spanId, or view the metrics for the User-Service during the time the slow trace occurred.

Here’s a practical example of configuration for propagating trace context using OpenTelemetry in a Go application making an HTTP request:

package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func main() {
    // Set up the tracer provider (omitted for brevity)
    // ...

    // Register the W3C Trace Context propagator so traceparent headers are written
    otel.SetTextMapPropagator(propagation.TraceContext{})

    // Create an HTTP client that injects trace context into requests
    client := &http.Client{
        Transport: otelhttp.NewTransport(http.DefaultTransport),
    }

    // Create a request. Inside a handler, pass the incoming request's context
    // instead of context.Background() so the call joins the current trace.
    ctx := context.Background()
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://user-service/users/123", nil)
    if err != nil {
        log.Fatalf("building request: %v", err)
    }

    // Injecting the trace context into the request headers is done
    // automatically by otelhttp.NewTransport, but if you were manually
    // creating requests, you'd do:
    // otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

    // Perform the request
    resp, err := client.Do(req)
    if err != nil {
        log.Fatalf("calling user-service: %v", err)
    }
    defer resp.Body.Close()

    // The trace context (traceId, spanId) is now in the outgoing request headers
    // and will be picked up by the user-service if it's also instrumented.
}

The otelhttp.NewTransport wrapper automatically handles injecting the traceparent header (which contains traceId and spanId) into outgoing HTTP requests. If the User-Service is also instrumented with OpenTelemetry, it will automatically extract this header, establish the parent-child relationship, and continue the trace.

The most surprising thing about distributed tracing is how often a missing trace, or one that stops abruptly, is itself the critical signal: something is fundamentally broken at the network or service boundary, not just inside the application.
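A related symptom that is easy to detect is the orphan span: a span whose parentId points at a span the backend never received, which usually means some hop dropped the context or failed to export. The following Go sketch flags such spans. The Span struct, the orphans helper, and the sample IDs are hypothetical; tracing backends typically surface this for you as "missing parent" warnings or broken trace views.

```go
package main

import "fmt"

// Span is a hypothetical minimal span record.
type Span struct {
	TraceID  string
	SpanID   string
	ParentID string // empty for a root span
	Service  string
}

// orphans returns spans whose ParentID is set but does not match any span
// collected for the same trace.
func orphans(spans []Span) []Span {
	seen := make(map[string]bool, len(spans))
	for _, s := range spans {
		seen[s.TraceID+"/"+s.SpanID] = true
	}
	var out []Span
	for _, s := range spans {
		if s.ParentID != "" && !seen[s.TraceID+"/"+s.ParentID] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	spans := []Span{
		{TraceID: "a1b2c3d4e5f6", SpanID: "f6e5d4c3b2a1", Service: "Frontend"},
		// The parent below was never exported, e.g. by an uninstrumented proxy.
		{TraceID: "a1b2c3d4e5f6", SpanID: "1a2b3c4d5e6f", ParentID: "0000aaaa1111", Service: "User-Service"},
	}
	for _, s := range orphans(spans) {
		fmt.Printf("orphan span %s in %s: parent %s is missing\n", s.SpanID, s.Service, s.ParentID)
	}
}
```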

When you’re debugging a latency issue, you often look at http_request_duration_seconds. But an often more revealing metric is http_requests_total{status_code="500"}. Because it is a counter that only goes up, what matters is its rate of increase. If that rate spikes, you then correlate it with traces that have a response.status of 500 and look at the logs for those traces to find the root cause.

The next step after mastering logs, metrics, and traces is understanding how to automate alerting based on these signals, especially when dealing with complex service dependencies.
