Structured logging is the key to debugging distributed systems, but it’s surprisingly hard to get right when you have dozens or hundreds of microservices.
Imagine you’ve got a user request that hops between AuthService, UserService, OrderService, and PaymentService. If each service logs in its own format, good luck tracing that single request. You’d be staring at a wall of text, trying to manually correlate timestamps and request IDs.
Here’s a real-world example of what a single request looks like across multiple services when logging is not structured:
AuthService Log:
2023-10-27 10:00:01 INFO User 'alice' authenticated successfully.
UserService Log:
10:00:02.123 INFO Processing request for user ID 12345.
OrderService Log:
[2023-10-27 10:00:03] INFO Received order creation request. Order ID: 98765.
PaymentService Log:
10:00:04.456 WARN Payment failed for order 98765. Reason: Insufficient funds.
Now, try to figure out why the payment failed for Alice’s order. You’d have to:
- Find Alice’s authentication log in
AuthService. - Find the corresponding user ID (12345) in
UserService. - Find the order ID (98765) associated with user ID 12345 in
OrderService. - Finally, see the payment failure for order ID 98765 in
PaymentService.
This is a nightmare. Structured logging aims to solve this by ensuring every log entry, regardless of the service, has a consistent set of key-value pairs.
Let’s say we adopt a JSON-based structured logging format. Our services would now log like this:
AuthService Log (Structured):
{
"timestamp": "2023-10-27T10:00:01.000Z",
"level": "INFO",
"service": "auth-service",
"message": "User authenticated successfully",
"user_id": "alice",
"request_id": "req-abc-123"
}
UserService Log (Structured):
{
"timestamp": "2023-10-27T10:00:02.123Z",
"level": "INFO",
"service": "user-service",
"message": "Processing request for user",
"user_id": "alice",
"request_id": "req-abc-123",
"user_internal_id": "12345"
}
OrderService Log (Structured):
{
"timestamp": "2023-10-27T10:00:03.000Z",
"level": "INFO",
"service": "order-service",
"message": "Received order creation request",
"user_id": "alice",
"request_id": "req-abc-123",
"order_id": "98765"
}
PaymentService Log (Structured):
{
"timestamp": "2023-10-27T10:00:04.456Z",
"level": "WARN",
"service": "payment-service",
"message": "Payment failed",
"user_id": "alice",
"request_id": "req-abc-123",
"order_id": "98765",
"reason": "Insufficient funds"
}
Now, to trace that request, you just search your log aggregation system (like Elasticsearch, Splunk, or Loki) for request_id: "req-abc-123". All the logs from all services for that specific request appear together, sorted by timestamp. You can instantly see the flow and pinpoint the failure.
The core idea is to define a common schema for your logs. This schema usually includes fields like:
timestamp: ISO 8601 format is standard.level: e.g.,DEBUG,INFO,WARN,ERROR.service: The name of the microservice generating the log.request_id(ortrace_id): A unique identifier that flows through all services for a single user request.user_id(oractor): The identifier of the user performing the action.message: A human-readable description of the event.span_id(for distributed tracing): Identifies a specific operation within a trace.- Any domain-specific fields:
order_id,product_id,customer_ip, etc.
You’ll need a logging library in each language your services are written in that supports structured logging. For example, in Go, you might use logrus or zap. In Python, structlog or python-logstash-formatter. In Java, Logback or Log4j2 with a JSON appender.
The critical piece is ensuring that the request_id is propagated correctly across service boundaries. This is typically done by passing it in HTTP headers. When AuthService calls UserService, it includes X-Request-ID: req-abc-123 in its outgoing request. UserService then reads this header and includes it in its outgoing call to OrderService, and so on.
This propagation mechanism is often handled by a common middleware or SDK that all your services depend on. Without it, you’ll have structured logs, but you still won’t be able to correlate them across services.
A common pitfall is inconsistent casing or naming of fields across services. If one service logs request_id and another logs requestId, your log aggregation system might treat them as different fields, hindering your ability to search effectively. Use a linter or a shared configuration to enforce field names.
When you’re setting up your logging, ensure your log aggregation system is configured to parse these JSON logs correctly. For instance, in Fluentd, you’d use a parser directive like parser { @type json }. In Logstash, you’d use a json codec on your input.
The really interesting part of structured logging, beyond just getting JSON out of your app, is how it enables powerful analytics. You can now ask questions like: "What’s the average latency for payment-service when order_id is null?" or "Show me all ERROR logs from user-service in the last hour that don’t have a user_id field." This level of introspection is impossible with plain text logs.
The next step after getting structured logging right is implementing distributed tracing, which builds upon the request_id (or trace_id) to track the end-to-end latency of requests across all services.