OpenTelemetry makes distributed tracing feel like magic, but the real trick is how it weaves a single request’s journey across multiple services into a coherent, actionable story.
Let’s see it in action. Imagine a simple e-commerce checkout flow:
- frontend-app receives a user’s request to check out.
- It calls order-service to create an order.
- order-service then calls payment-service to process the payment.
- Finally, order-service might call inventory-service to update stock.
Without distributed tracing, if the payment-service call times out, you’d see a timeout in order-service. But you wouldn’t know why order-service was trying to talk to payment-service in the first place, or what the original request from the frontend was. OpenTelemetry links these events by stamping every span in a request with the same trace ID and recording each span’s parent.
Here’s a simplified Node.js setup for order-service to send traces to a collector.
```javascript
// order-service/index.js
const { trace, SpanKind, SpanStatusCode } = require('@opentelemetry/api');
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

// 1. Initialize the OpenTelemetry SDK before loading the libraries you want
//    auto-instrumented. Set OTEL_SERVICE_NAME=order-service in the environment
//    so the backend can label these spans.
const sdk = new NodeSDK({
  instrumentations: [
    getNodeAutoInstrumentations(), // Auto-instrument common libraries like http, express, pg, etc.
  ],
  traceExporter: new OTLPTraceExporter({
    // Default is http://localhost:4318/v1/traces
    // Configure if your collector is elsewhere
    url: 'http://localhost:4318/v1/traces',
  }),
});
sdk.start();

// 2. Load express after the SDK has started so the instrumentation can patch it
const express = require('express');

const app = express();
const port = 3000;
const tracer = trace.getTracer('order-service-tracer');

app.use(express.json());

app.post('/checkout', async (req, res) => {
  // startActiveSpan makes 'checkout-process' the active span, so spans started
  // inside the callback become its children
  await tracer.startActiveSpan('checkout-process', {
    kind: SpanKind.SERVER,
    attributes: { 'http.method': 'POST', 'http.url': '/checkout' },
  }, async (span) => {
    try {
      const orderDetails = req.body;
      console.log('Received checkout request:', orderDetails);

      // Simulate calling payment service
      const paymentResult = await tracer.startActiveSpan('process-payment', {
        attributes: { 'peer.service': 'payment-service' },
      }, async (paymentSpan) => {
        try {
          paymentSpan.addEvent('Payment processing started');
          // In a real app, this would be an HTTP call to payment-service
          await new Promise(resolve => setTimeout(resolve, 150)); // Simulate network latency
          // Simulate a successful payment
          paymentSpan.setStatus({ code: SpanStatusCode.OK });
          paymentSpan.addEvent('Payment processed successfully');
          return { success: true, transactionId: 'txn_12345' };
        } finally {
          paymentSpan.end(); // Spans from startActiveSpan still need an explicit end()
        }
      });

      // Simulate calling inventory service
      const inventoryUpdate = await tracer.startActiveSpan('update-inventory', {
        attributes: { 'peer.service': 'inventory-service' },
      }, async (inventorySpan) => {
        try {
          inventorySpan.addEvent('Inventory update initiated');
          await new Promise(resolve => setTimeout(resolve, 50)); // Simulate network latency
          // Simulate successful update
          inventorySpan.setStatus({ code: SpanStatusCode.OK });
          inventorySpan.addEvent('Inventory updated');
          return { success: true, itemsUpdated: 2 };
        } finally {
          inventorySpan.end();
        }
      });

      const order = {
        orderId: `ORD-${Math.random().toString(36).substring(2, 9)}`,
        ...orderDetails,
        transactionId: paymentResult.transactionId,
        inventoryStatus: inventoryUpdate.success,
        timestamp: new Date().toISOString(),
      };

      span.setStatus({ code: SpanStatusCode.OK });
      span.addEvent('Order created successfully');
      res.status(201).json(order);
    } catch (error) {
      console.error('Error during checkout:', error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      res.status(500).json({ error: 'Failed to process checkout' });
    } finally {
      span.end(); // End the main checkout span
    }
  });
});

app.listen(port, () => {
  console.log(`Order service listening at http://localhost:${port}`);
});
```
This setup does a few key things:
- NodeSDK: This is the core of OpenTelemetry in Node.js. It wires together the instrumentations, the exporter, and context management.
- getNodeAutoInstrumentations(): This is the magic sauce for getting started. It automatically wraps common Node.js libraries (http, express, pg, mongoose, etc.) so that incoming and outgoing requests, database queries, and more are captured as spans without you writing explicit code for them. Start the SDK before those libraries are loaded, or they will not be patched.
- OTLPTraceExporter: This tells the SDK where to send the collected trace data. OTLP (OpenTelemetry Protocol) is the standard way to send telemetry data to a collector. http://localhost:4318/v1/traces is the default OTLP/HTTP traces endpoint for an OpenTelemetry Collector running locally.
- tracer.startActiveSpan (and its lower-level sibling tracer.startSpan): These are for manual instrumentation. startSpan creates a span but does not make it active. startActiveSpan creates a span and makes it the "active" span in the current context for the duration of its callback. Either way, you have to end() the span yourself. Making the span active is crucial for creating parent-child relationships in your trace: when you make an instrumented call (like an outgoing HTTP request), the auto-instrumentation picks up the active span as the parent.
When order-service makes an outgoing HTTP call to payment-service, the http instrumentation bundled in @opentelemetry/auto-instrumentations-node finds the active span set by tracer.startActiveSpan('process-payment', ...), creates a client span beneath it, and injects the trace context (trace ID, span ID, sampling flag) into the outgoing request headers. With the default propagator, that context travels in the W3C traceparent header. The payment-service (if also instrumented) reads the header and continues the same trace.
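To make the propagation step concrete, here is a minimal sketch that builds and parses a W3C traceparent header using only Node's standard library. It is not OpenTelemetry's implementation: the helpers formatTraceparent and parseTraceparent are hypothetical names, and in a real service the http instrumentation and the SDK's propagator do this for you.

```javascript
const { randomBytes } = require('node:crypto');

// Hypothetical helper: serialize a span context into a W3C traceparent value.
// Layout: version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
function formatTraceparent({ traceId, spanId, sampled }) {
  const flags = sampled ? '01' : '00';
  return `00-${traceId}-${spanId}-${flags}`;
}

// Hypothetical helper: parse a traceparent value; returns null if malformed.
function parseTraceparent(header) {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid in the W3C format
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// order-service side: the outgoing span's context goes into the request headers.
const outgoing = {
  traceId: randomBytes(16).toString('hex'),
  spanId: randomBytes(8).toString('hex'),
  sampled: true,
};
const headers = { traceparent: formatTraceparent(outgoing) };

// payment-service side: read the header, keep the trace ID, mint a new span ID,
// and remember the caller's span ID as the parent.
const parent = parseTraceparent(headers.traceparent);
const paymentSpanContext = {
  traceId: parent.traceId,
  spanId: randomBytes(8).toString('hex'),
  parentSpanId: parent.spanId,
};

console.log(headers.traceparent);
console.log(paymentSpanContext);
```

The trace ID stays constant across both services while each hop gets its own span ID, which is what lets a backend stitch the spans into one tree.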
The most surprising thing about distributed tracing is how effectively it reveals latency bottlenecks that are invisible in isolated service logs. You can see that order-service spent 150ms waiting for payment-service, even though payment-service itself might have only taken 50ms of CPU time. This distinction between wall-clock time and actual work is what makes tracing so powerful for performance optimization.
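A rough sketch of that wall-clock versus own-work distinction, using hand-written span records rather than real exporter output. The span shape (id, parentId, startMs, endMs), the timings, and the selfTimeMs helper are all assumptions for illustration, and the calculation assumes a span's children do not overlap in time.

```javascript
// Hypothetical span records loosely mirroring the checkout example (times in ms).
const spans = [
  { id: 'a', parentId: null, name: 'checkout-process', startMs: 0, endMs: 205 },
  { id: 'b', parentId: 'a', name: 'process-payment', startMs: 2, endMs: 152 },
  { id: 'c', parentId: 'b', name: 'payment-service: charge card', startMs: 60, endMs: 110 },
  { id: 'd', parentId: 'a', name: 'update-inventory', startMs: 153, endMs: 203 },
];

// Self time = the span's own duration minus the time covered by its direct children.
function selfTimeMs(span, allSpans) {
  const childTime = allSpans
    .filter((s) => s.parentId === span.id)
    .reduce((sum, s) => sum + (s.endMs - s.startMs), 0);
  return (span.endMs - span.startMs) - childTime;
}

const report = spans.map((s) => ({
  name: s.name,
  durationMs: s.endMs - s.startMs,
  selfMs: selfTimeMs(s, spans),
}));

console.table(report);
```

In this made-up data, process-payment lasts 150 ms but payment-service accounts for only 50 ms of it; the remaining 100 ms is time spent outside payment-service's own work, such as network transit or queueing.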
The active span is key to propagating context, and trace.getActiveSpan() returns whichever span is currently active. When you use tracer.startActiveSpan(), it creates a span and runs your callback in a context where that span is active. Any instrumented operation started within that callback, including after an await, automatically uses the span as its parent. If you create a span with tracer.startSpan() and don’t make it active, you’ll have to pass its context to child operations yourself, which is more cumbersome and error-prone.
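One way to picture the active span is as a value held in async-scoped storage. The sketch below imitates that with Node's built-in AsyncLocalStorage. It is a toy model, not the OpenTelemetry API: startActiveSpan, getActiveSpan, and the finished array here are hypothetical stand-ins, though OpenTelemetry's Node.js context managers are built on the same kind of async context tracking.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const activeSpanStore = new AsyncLocalStorage();
const finished = []; // toy spans, in the order they end
let nextId = 1;

// Toy counterpart of trace.getActiveSpan()
function getActiveSpan() {
  return activeSpanStore.getStore();
}

// Toy counterpart of tracer.startActiveSpan(): the new span's parent is whatever
// span is active right now, and the new span is active while the callback runs.
async function startActiveSpan(name, fn) {
  const parent = getActiveSpan();
  const span = { id: nextId++, name, parentId: parent ? parent.id : null };
  try {
    return await activeSpanStore.run(span, () => fn(span));
  } finally {
    finished.push(span); // stands in for span.end()
  }
}

async function main() {
  await startActiveSpan('checkout-process', async () => {
    await startActiveSpan('process-payment', async () => {
      await new Promise((resolve) => setTimeout(resolve, 10));
      // The active span survives the await
      console.log('active after await:', getActiveSpan().name);
    });
    // Back in checkout-process, so this span is a sibling of process-payment
    await startActiveSpan('update-inventory', async () => {});
  });
  console.log(finished);
  return finished;
}

const done = main();
```

Neither child receives its parent as an argument; each one finds it through the ambient context, which is the convenience startActiveSpan buys you over passing contexts around by hand.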
The next step is typically setting up an OpenTelemetry Collector to receive these traces and then sending them to a backend like Jaeger, Zipkin, or a cloud-based observability platform.