Network observability isn’t just about seeing that your network is "up" or "down"; it’s about understanding the why behind its behavior, even when everything appears to be functioning normally.
Imagine a busy city intersection. Monitoring tells you if the traffic lights are green or red, and if cars are moving. Observability lets you understand why there’s a traffic jam: maybe a delivery truck is double-parked, a pedestrian is crossing slowly, or a concert just let out. It’s about having the data to ask and answer questions you didn’t even know you needed to ask.
Let’s look at a simplified network scenario. We have two services, frontend-web and backend-api, communicating over HTTP.
# Example: Kubernetes Service definitions
apiVersion: v1
kind: Service
metadata:
  name: frontend-web
  labels:
    app: frontend-web
spec:
  selector:
    app: frontend-web
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: backend-api
  labels:
    app: backend-api
spec:
  selector:
    app: backend-api
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
In a monitoring setup, you might have alerts for frontend-web pod restarts or backend-api HTTP 5xx errors. But what if frontend-web is getting intermittently slow responses from backend-api that never produce a 5xx error but still degrade user experience? Alerts defined only on restarts and error codes would miss this.
Observability tools, however, can capture and analyze network flows, application-level metrics, and distributed traces. This means we can see:
- Network Flows: Which pods are talking to which, on what ports, and how much data is being transferred. Tools like Cilium (with Hubble) or other eBPF-based agents can provide this.
- Application Metrics: Request rates, latency distributions (p95, p99), error rates, and saturation from the perspective of the application itself. Prometheus and Grafana are common here.
- Distributed Traces: The path a request takes across multiple services, showing the time spent in each hop. Jaeger or Zipkin are typical for this.
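To make the case for latency distributions concrete, here is a minimal Python sketch. The samples and the 100 ms alert threshold are made-up assumptions, not measurements from any real system. It shows how a check on the average can stay quiet while the tail of the distribution is slow.

```python
import math
import statistics

# Hypothetical latency samples in milliseconds: 97 fast requests and 3 slow ones.
latencies_ms = [20.0] * 97 + [450.0, 500.0, 600.0]

# Assumed alert threshold, chosen only for illustration.
THRESHOLD_MS = 100.0


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

# An alert on the mean stays silent, while the p99 exposes the slow tail.
mean_alert = mean_ms > THRESHOLD_MS
p99_alert = p99_ms > THRESHOLD_MS

print(f"mean={mean_ms:.1f}ms p95={p95_ms:.1f}ms p99={p99_ms:.1f}ms")
print(f"mean alert fires: {mean_alert}, p99 alert fires: {p99_alert}")
```

With these samples the mean is about 34.9 ms and the p95 is still 20 ms, so only the p99 crosses the threshold. Which percentile to watch depends on how rare the slow requests are.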
Here’s how these pieces come together. A user accesses frontend-web.
1. frontend-web receives the request.
2. frontend-web makes a request to backend-api on port 8080.
3. backend-api processes the request and responds.
4. frontend-web renders the response to the user.
Monitoring might see:
- frontend-web pod CPU usage: 60%
- backend-api pod network traffic: 100 Mbps
- frontend-web HTTP 200 OK rate: 99.9%
Observability can reveal:
- Network Flow: frontend-web pod X is sending HTTP requests to backend-api pod Y on TCP port 8080. Latency of these network connections is averaging 5ms, but with occasional spikes to 50ms.
- Application Metrics: frontend-web to backend-api p95 latency is 150ms, with p99 at 400ms. Error rate is 0.01%.
- Distributed Trace: The trace for a slow request shows 350ms spent within the backend-api service processing the request, and 40ms on the network hop between frontend-web and backend-api.
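The split between processing time and network time in a trace can be sketched roughly as follows. This is an illustration under assumptions: the `Span` class and `hop_breakdown` function are hypothetical, not the data model of Jaeger or Zipkin, and the timestamps are chosen to reproduce the 350ms and 40ms figures from the slow request above.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span: one timed operation in one service."""
    service: str
    kind: str  # "client" (caller side) or "server" (callee side)
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def hop_breakdown(client_span: Span, server_span: Span) -> dict:
    """Split one outbound call into server processing time and the remainder.

    The remainder is what the caller waited for beyond server processing:
    network transit in both directions plus any queueing along the way.
    """
    server_ms = server_span.duration_ms
    return {
        "server_processing_ms": server_ms,
        "network_and_queueing_ms": client_span.duration_ms - server_ms,
    }


# Illustrative spans for the slow request described above.
client = Span("frontend-web", "client", start_ms=0.0, end_ms=390.0)
server = Span("backend-api", "server", start_ms=20.0, end_ms=370.0)

breakdown = hop_breakdown(client, server)
print(breakdown)
```

Because each duration is computed from timestamps taken on a single host, the subtraction does not depend on the two hosts having synchronized clocks.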
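This richer context allows us to pinpoint the problem. In the trace above, the network hop accounts for only 40ms while backend-api spends 350ms processing, so the network connection itself is fine and backend-api is the bottleneck. In another incident the same data might show the opposite: backend-api is fast, but the application logic within frontend-web that calls it is slow.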
The key difference is the ability to ask arbitrary questions. With monitoring, you define the questions beforehand (e.g., "Is latency > 100ms?"). With observability, you collect rich, granular data (logs, metrics, traces, network flows) and can analyze it to answer questions like: "For requests to backend-api that took longer than 300ms, what was the distribution of CPU utilization on the backend-api pod at that exact moment?"
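As a rough sketch of what answering that question could look like, the Python below joins request records with CPU samples after the fact. Everything here is assumed for illustration: the `Request` record, the pod names, the sample values, and the choice to use the most recent CPU sample at or before each request.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record as a tracing backend might export it."""
    trace_id: str
    pod: str
    timestamp_s: float
    duration_ms: float


def cpu_at(samples, timestamp_s):
    """Return the most recent CPU sample at or before timestamp_s, or None.

    samples is a list of (timestamp_s, cpu_percent) tuples sorted by time.
    """
    times = [t for t, _ in samples]
    idx = bisect_right(times, timestamp_s) - 1
    return samples[idx][1] if idx >= 0 else None


def cpu_for_slow_requests(requests, cpu_by_pod, threshold_ms=300.0):
    """Collect the CPU utilization seen by each request slower than threshold_ms."""
    values = []
    for req in requests:
        if req.duration_ms <= threshold_ms:
            continue
        cpu = cpu_at(cpu_by_pod.get(req.pod, []), req.timestamp_s)
        if cpu is not None:
            values.append(cpu)
    return sorted(values)


# Made-up telemetry for two backend-api pods.
requests = [
    Request("t1", "backend-api-1", 10.2, 120.0),
    Request("t2", "backend-api-1", 20.7, 410.0),
    Request("t3", "backend-api-2", 20.9, 350.0),
    Request("t4", "backend-api-2", 31.0, 90.0),
]
cpu_by_pod = {
    "backend-api-1": [(10.0, 35.0), (20.0, 92.0), (30.0, 40.0)],
    "backend-api-2": [(10.0, 30.0), (20.0, 88.0), (30.0, 33.0)],
}

slow_cpu = cpu_for_slow_requests(requests, cpu_by_pod)
print(slow_cpu)
```

The point of the sketch is that nobody had to define "CPU during slow requests" as a metric in advance. The question is answered by joining data that was already collected.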
This is enabled by collecting telemetry data in a structured and correlated way. When a network flow is observed, it’s tagged with Kubernetes pod names, service names, and trace IDs. When an application metric is emitted, it’s similarly tagged. This allows you to pivot from a slow network connection to the specific application process handling it, and then to the logs generated by that process for that particular request.
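The pivot from a slow connection to the logs for that request can be sketched in Python as a join on shared tags. The record shapes and field names (`dst_pod`, `trace_id`, `latency_ms`, and so on) are hypothetical and chosen only for illustration; real tools each have their own schemas.

```python
from collections import defaultdict

# Hypothetical flow records, tagged with pod names and trace IDs.
flows = [
    {"src_pod": "frontend-web-abc", "dst_pod": "backend-api-xyz",
     "dst_port": 8080, "latency_ms": 50.0, "trace_id": "trace-42"},
    {"src_pod": "frontend-web-abc", "dst_pod": "backend-api-xyz",
     "dst_port": 8080, "latency_ms": 5.0, "trace_id": "trace-7"},
]

# Hypothetical log entries carrying the same tags.
logs = [
    {"pod": "backend-api-xyz", "trace_id": "trace-42", "message": "db query took 310ms"},
    {"pod": "backend-api-xyz", "trace_id": "trace-7", "message": "request served from cache"},
    {"pod": "backend-api-xyz", "trace_id": "trace-42", "message": "response sent"},
]


def logs_for_slow_flows(flows, logs, min_latency_ms):
    """For each flow at or above min_latency_ms, return the receiving pod's logs for that trace."""
    logs_by_key = defaultdict(list)
    for entry in logs:
        logs_by_key[(entry["pod"], entry["trace_id"])].append(entry["message"])

    result = {}
    for flow in flows:
        if flow["latency_ms"] >= min_latency_ms:
            key = (flow["dst_pod"], flow["trace_id"])
            result[flow["trace_id"]] = logs_by_key.get(key, [])
    return result


slow = logs_for_slow_flows(flows, logs, min_latency_ms=25.0)
print(slow)
```

The join only works because both record types carry the same pod and trace identifiers, which is why consistent tagging at collection time matters more than any single tool.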
The most surprising thing is how much network behavior has little to do with packet loss or bandwidth. It comes from the subtle interplay of connection establishment, TLS handshake overhead, and application-level backpressure, all of which look like a healthy connection to a traditional network monitor.
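One way to see connection establishment and TLS overhead separately is to time each phase from the client side. The Python sketch below uses only the standard library; the function name `time_connection_phases` and its return shape are assumptions for illustration, and a production agent would more likely collect this from eBPF or a service mesh than from an active probe.

```python
import socket
import ssl
import time


def time_connection_phases(host, port, use_tls=True, timeout_s=5.0):
    """Time the TCP connect and (optionally) the TLS handshake, in milliseconds.

    The TCP figure includes name resolution, because create_connection
    resolves the host before connecting.
    """
    t0 = time.perf_counter()
    sock = socket.create_connection((host, port), timeout=timeout_s)
    t1 = time.perf_counter()

    tls_ms = None
    try:
        if use_tls:
            context = ssl.create_default_context()
            # By default wrap_socket completes the handshake before returning.
            sock = context.wrap_socket(sock, server_hostname=host)
            tls_ms = (time.perf_counter() - t1) * 1000
    finally:
        sock.close()

    return {"tcp_connect_ms": (t1 - t0) * 1000, "tls_handshake_ms": tls_ms}
```

Calling it against a TLS endpoint you control, for example `time_connection_phases("backend-api.example.internal", 443)` with a hypothetical hostname, returns both timings. A connection that is up but has a slow handshake shows up here even though a simple reachability check would report it as healthy.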
Understanding the full lifecycle of a request, from the network packets to the application code, is what unlocks true network observability. This allows you to debug not just outages, but also performance degradations and subtle anomalies that impact user experience.
The next frontier you’ll explore is how to automate the generation of meaningful queries from observed anomalies.