The Linkerd Grafana dashboard isn’t just a collection of pretty charts; it’s a real-time diagnostic tool that exposes the health of your service mesh by visualizing the "golden signals" of latency, traffic, errors, and saturation.

Let’s see it in action. Imagine you’ve got a simple service, webapp, that talks to a second service, user-service. You’ve installed Linkerd, and now you want to see how webapp is performing from the mesh’s perspective.

First, make sure Prometheus is scraping Linkerd’s metrics and that Grafana is configured to query that Prometheus. Grafana does not scrape anything itself. Each linkerd-proxy sidecar exposes Prometheus metrics at /metrics on its admin port (4191 by default), and the control-plane components expose their own metrics on separate admin ports. You’d then set up a Grafana datasource that points to your Prometheus instance.
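
Before opening Grafana, it can help to confirm that Prometheus actually holds proxy metrics for webapp. The sketch below only builds the URL for an instant query against Prometheus’s HTTP API (/api/v1/query). The Prometheus address and the label values are assumptions for this example, so adjust them to match your cluster.

```python
from urllib.parse import urlencode


def build_query_url(prometheus_base: str, promql: str) -> str:
    """Return the URL for an instant query against Prometheus's HTTP API."""
    base = prometheus_base.rstrip("/")
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical in-cluster Prometheus address; replace with your own.
PROMETHEUS = "http://prometheus.linkerd-viz.svc.cluster.local:9090"

# request_total is a counter exported by the Linkerd proxy. The label values
# (deployment="webapp", direction="inbound") are assumed for this example.
promql = 'sum(rate(request_total{deployment="webapp", direction="inbound"}[1m]))'

url = build_query_url(PROMETHEUS, promql)
print(url)
```

Fetching that URL (with curl or a browser, for instance) should return a JSON result with a non-empty value if the proxy metrics are being scraped.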

Once Grafana is connected to your Prometheus, you can import the Linkerd-provided Grafana dashboard. You can usually find this in the Linkerd documentation or directly within the Linkerd installation’s Grafana configuration. This dashboard is pre-populated with panels designed to show you the golden signals for your meshed services.

Here’s what you’d typically see when you select your webapp service:

Latency:

  • Request Latency (p50, p95, p99): This panel shows the distribution of request latencies for your webapp. You’ll see three lines: the 50th percentile (median), 95th percentile, and 99th percentile of how long requests are taking. A sudden spike in the p99 latency, even if the p50 is stable, indicates that a small but significant number of requests are experiencing severe delays.
  • Success Latency (p50, p95, p99): This is the same latency metric, restricted to requests that returned a success status code (typically 2xx or 3xx). Comparing it with overall latency shows whether the slow requests are also the failing ones (for example, timeouts that end in errors) or whether successful requests are slow too.
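
The percentile lines in these panels are estimated from histogram buckets rather than from individual request timings. The following is a simplified sketch of that estimation, in the spirit of Prometheus’s histogram_quantile function (linear interpolation inside the bucket that contains the target rank). The bucket boundaries and counts are made up for illustration.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound_ms, cumulative_count) pairs sorted by
    upper bound; the last bound may be float("inf").
    """
    if not buckets or buckets[-1][1] == 0:
        return float("nan")
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate inside the open-ended bucket.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket holding the target rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative data: 100 requests, most fast, a couple very slow.
buckets = [(10, 50), (50, 90), (100, 98), (1000, 100), (float("inf"), 100)]
for q in (0.50, 0.95, 0.99):
    print(f"p{int(q * 100)}: {histogram_quantile(q, buckets):.2f} ms")
```

With this data the median stays low while the p99 lands deep in the slowest bucket, which is the "stable p50, spiking p99" pattern described above.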

Traffic:

  • Request Volume: A simple counter showing the total number of requests being processed by the webapp over time. You can often see this broken down by success and failure, or by HTTP status code. This tells you if your service is seeing more or less load.
  • RPS (Requests Per Second): This is the rate of requests. It’s essentially the slope of the Request Volume graph. Seeing a sudden drop in RPS might indicate a problem upstream, while a steep rise could signal an overload.
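
To make the "slope of the counter" idea concrete, here is a minimal sketch that derives an average RPS from raw counter samples. It treats any drop in the counter as a reset (for example, after a proxy restart), which is roughly how Prometheus’s rate() handles resets, though the real function also extrapolates to the window edges. The sample values are invented.

```python
def counter_rate(samples):
    """Average per-second rate from (timestamp_s, counter_value) samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new requests.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Illustrative samples taken every 10 seconds; the counter resets at t=30.
samples = [(0, 1000), (10, 1200), (20, 1400), (30, 50), (40, 250)]
print(f"{counter_rate(samples):.2f} requests/second")
```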

Errors:

  • HTTP Status Codes: A breakdown of requests by their HTTP status code (e.g., 2xx, 4xx, 5xx). This is crucial for identifying application-level errors (4xx for client errors, 5xx for server errors). A rising trend in 5xx errors is a clear indicator of a failing webapp.
  • Success Rate: This panel often shows the percentage of successful requests (e.g., 2xx/3xx codes) out of the total. A sudden drop in the success rate is a major red flag.
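
As a rough sketch of how a success-rate figure falls out of the status-code breakdown, the function below divides successful responses by the total. The default definition of success follows the 2xx/3xx convention used here; the predicate is a parameter because other definitions exist (to my understanding, Linkerd’s own default HTTP classification counts only 5xx responses as failures, which is worth verifying against the docs). The counts are made up.

```python
def success_rate(counts_by_status, is_success=lambda code: 200 <= code < 400):
    """Fraction of responses considered successful, or None with no traffic."""
    total = sum(counts_by_status.values())
    if total == 0:
        return None
    ok = sum(n for code, n in counts_by_status.items() if is_success(code))
    return ok / total


# Illustrative response counts keyed by HTTP status code.
counts = {200: 940, 304: 20, 404: 15, 503: 25}

# 2xx/3xx counted as success.
print(success_rate(counts))

# Alternative definition (an assumption): everything except 5xx is a success.
print(success_rate(counts, is_success=lambda code: code < 500))
```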

Saturation:

  • (Less direct in default Linkerd dashboards, but often inferred): While Linkerd doesn’t directly expose CPU/memory saturation for your application pods, the golden signals collectively point to saturation. For example, if latency spikes and error rates climb without a corresponding increase in traffic volume, it strongly suggests the application pods are saturated and can no longer keep up. You’d then correlate this with standard Kubernetes metrics (CPU/memory usage per pod) in a separate Grafana dashboard.
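
The inference described above can be written down as a small heuristic: compare a current window of signals against a baseline and flag likely saturation when latency and errors rise while traffic stays roughly flat. The Window type, the function name, and the thresholds are all hypothetical choices for this sketch; a positive result is a prompt to check pod CPU and memory, not a diagnosis.

```python
from dataclasses import dataclass


@dataclass
class Window:
    rps: float         # requests per second
    p99_ms: float      # 99th percentile latency in milliseconds
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def looks_saturated(baseline, current, latency_factor=2.0,
                    error_delta=0.05, traffic_tolerance=0.2):
    """Heuristic: latency and errors are up without a matching traffic rise."""
    traffic_flat = current.rps <= baseline.rps * (1 + traffic_tolerance)
    latency_up = current.p99_ms >= baseline.p99_ms * latency_factor
    errors_up = (current.error_rate - baseline.error_rate) >= error_delta
    return traffic_flat and latency_up and errors_up


baseline = Window(rps=100, p99_ms=200, error_rate=0.01)
degraded = Window(rps=105, p99_ms=900, error_rate=0.08)
busy = Window(rps=300, p99_ms=900, error_rate=0.08)

print(looks_saturated(baseline, degraded))  # same load, worse results
print(looks_saturated(baseline, busy))      # load tripled, so not flagged here
```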

The mental model Linkerd’s dashboard builds is that your service’s health is a function of these four signals. When latency increases, traffic patterns shift unexpectedly, error rates climb, or your service starts to appear "saturated" (indicated by the other signals), you have a problem. The dashboard’s value is that it surfaces these signals from the perspective of the network and the proxy, meaning you see issues even before your application code might be aware of them or before they manifest as application-level exceptions. It’s the mesh’s way of telling you, "Something is slow," "Something is failing," or "Something is overloaded."

The most surprising thing is how much you can learn about upstream and downstream dependencies by observing the golden signals of a single service. If webapp’s latency spikes and you haven’t changed webapp itself, comparing its inbound and outbound metrics often shows where the time is going: a spike on outbound requests means user-service is slow, while a spike on inbound requests with healthy outbound latency points at webapp itself.
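
One hedged way to make that directional comparison concrete is to check each direction’s p99 latency against its own baseline and report which side moved. The function name, the factor of 2, and the sample numbers are assumptions for this sketch.

```python
def spiking_direction(inbound_p99_ms, outbound_p99_ms,
                      baseline_inbound_ms, baseline_outbound_ms, factor=2.0):
    """Report which direction's p99 latency rose past factor x its baseline."""
    inbound_up = inbound_p99_ms >= baseline_inbound_ms * factor
    outbound_up = outbound_p99_ms >= baseline_outbound_ms * factor
    if inbound_up and outbound_up:
        return "both"
    if outbound_up:
        return "outbound"
    if inbound_up:
        return "inbound"
    return "neither"


# Illustrative numbers for webapp: calls out to user-service got much slower.
print(spiking_direction(inbound_p99_ms=950, outbound_p99_ms=900,
                        baseline_inbound_ms=250, baseline_outbound_ms=150))
```

When both directions spike together, the outbound dependency is a reasonable first place to look, since time spent waiting on user-service is included in how long webapp takes to answer its own callers.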

Once you’ve addressed latency issues, you’ll likely turn your attention to improving the success rate, starting with the most frequent error codes.
