Observability vs. Monitoring: When to Alert, When to Explore

The most surprising thing about Kubernetes monitoring is that a default setup can leave you blind to the most critical failures.

Let’s see what that looks like. Imagine you’ve got a simple Nginx deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80

And you’ve deployed Prometheus and Grafana, perhaps using the kube-prometheus-stack Helm chart. You’ll see dashboards showing pod counts, CPU/memory usage, and network traffic. Great, right?

But what if nginx:latest suddenly points to a broken image, or a configuration change within the container causes it to crash loop?

Here’s how you would typically tell Prometheus, via the Prometheus Operator, what to scrape in your cluster:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-monitor
  labels:
    release: prometheus # Assuming this label is used by your Prometheus operator
spec:
  selector:
    matchLabels:
      app: nginx
  namespaceSelector:
    matchNames:
    - default # Or wherever your nginx deployment is
  endpoints:
  - port: http-metrics # This refers to a service port name
    interval: 15s

This ServiceMonitor tells Prometheus to find Services labeled app: nginx in the default namespace and scrape the pods behind them on the Service port named http-metrics. If your Nginx pod is running and exposes a metrics endpoint (the default Nginx image doesn’t, but assume for a moment that it does), Prometheus would pick it up.
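For the ServiceMonitor to match anything, a Service carrying the app: nginx label and a port named http-metrics has to exist. A minimal sketch of such a Service follows. The name nginx-metrics and the port number 9113 are placeholders chosen for illustration, not something the stock Nginx image provides.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-metrics        # hypothetical name
  namespace: default
  labels:
    app: nginx               # the ServiceMonitor's selector matches this label
spec:
  selector:
    app: nginx               # this selector picks the pods behind the Service
  ports:
  - name: http-metrics       # must equal the ServiceMonitor's endpoints[].port
    protocol: TCP
    port: 9113               # placeholder port
    targetPort: 9113
```

The two app: nginx entries do different jobs: the one under metadata.labels is what the ServiceMonitor selects on, while the one under spec.selector decides which pods receive the scrape.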

The problem is that this setup only reports what the process inside the container chooses to expose. It tells you little about whether the container keeps restarting, or whether it can actually serve traffic.

Let’s build a mental model for what’s actually happening. Prometheus scrapes HTTP endpoints that serve metrics in a format it understands. Anything that does not serve such an endpoint itself needs an exporter: a small agent that translates state into that format. In Kubernetes, one of the most common is kube-state-metrics, which reports the state of Kubernetes objects (Deployments, Pods, Nodes, etc.). Another is node-exporter, which runs on each node and exposes hardware and OS metrics.

Your Nginx pod itself might not be emitting detailed Prometheus metrics. So, even if the pod is technically "running" according to Kubernetes, it could be failing to serve requests. Prometheus, by default, might not know this.
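Kubernetes itself can be asked to check whether the container answers requests, which narrows the gap between "running" and "serving". The sketch below adds HTTP probes to the nginx container. It is a fragment of the pod template rather than a full manifest, and the path / and all timing values are assumptions to adjust for your workload.

```yaml
# Fragment: belongs under spec.template.spec in the Deployment
containers:
- name: nginx
  image: nginx:latest
  ports:
  - containerPort: 80
  readinessProbe:            # while failing, the pod is removed from Service endpoints
    httpGet:
      path: /                # assumed health path
      port: 80
    periodSeconds: 5
    failureThreshold: 3
  livenessProbe:             # after repeated failures, the container is restarted
    httpGet:
      path: /
      port: 80
    initialDelaySeconds: 10
    periodSeconds: 10
```

With probes in place, kube-state-metrics can surface the result through metrics such as kube_pod_status_ready, so readiness becomes something Prometheus can see.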

To get a true picture, you need to monitor two layers:

Kubernetes Object State: Are the pods, deployments, and services in the desired state?
Application Health: Is the application inside the pod actually working and responding to requests?
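The first layer can be covered with metrics that kube-state-metrics already exposes. The PrometheusRule below is a sketch, assuming the Deployment lives in the default namespace and that your Prometheus selects rules by the release: prometheus label. The rule names, thresholds, and severity labels are illustrative choices.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nginx-object-state       # hypothetical name
  namespace: default
  labels:
    release: prometheus          # assumption: label your Prometheus uses to select rules
spec:
  groups:
  - name: nginx-object-state
    rules:
    # Fires when the nginx container has been in CrashLoopBackOff recently
    - alert: NginxContainerCrashLooping
      expr: |
        max_over_time(kube_pod_container_status_waiting_reason{namespace="default", container="nginx", reason="CrashLoopBackOff"}[5m]) >= 1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "nginx container in pod {{ $labels.pod }} is crash looping"
    # Fires when fewer replicas are available than the Deployment asks for
    - alert: NginxReplicasUnavailable
      expr: |
        kube_deployment_status_replicas_available{namespace="default", deployment="nginx-deployment"}
          < kube_deployment_spec_replicas{namespace="default", deployment="nginx-deployment"}
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "nginx-deployment has fewer available replicas than desired"
```

These rules catch the broken-image and crash-loop cases, but they say nothing about the second layer: a pod can pass both checks and still serve no useful traffic.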

Here’s a concrete example of how you’d get application-level metrics. You’d modify your Nginx deployment to include a sidecar or use a library that exposes metrics. A common pattern is to use nginx-exporter (the NGINX Prometheus Exporter), which reads Nginx’s stub_status page and re-exposes it as Prometheus metrics.

First, deploy nginx-exporter as a sidecar:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
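      - name: nginx-exporter
        image: nginx/nginx-prometheus-exporter:0.9.0 # Example tag; the image is published as nginx-prometheus-exporter
        args:
        # Assumes Nginx serves stub_status at this address; the stock image does not enable it
        - --nginx.scrape-uri=http://127.0.0.1:8080/stub_status
        ports:
        - containerPort: 9113 # Default metrics port for nginx-exporter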
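
The exporter has nothing to report unless Nginx serves its stub_status page, which the stock image does not enable. One way to turn it on is a small extra server block supplied through a ConfigMap. This is a sketch: the ConfigMap name and file name are hypothetical, and port 8080 is chosen to line up with the exporter’s default scrape address of http://127.0.0.1:8080/stub_status.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-stub-status        # hypothetical name
  namespace: default
data:
  stub-status.conf: |
    # Extra server block used only by the exporter sidecar
    server {
      listen 8080;
      location /stub_status {
        stub_status;
        allow 127.0.0.1;         # the sidecar shares the pod's network namespace
        deny all;
      }
    }
```

To take effect, the file has to be mounted into the nginx container, for example at /etc/nginx/conf.d/stub-status.conf through a volume and volumeMount that reference this ConfigMap.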

Then, you’d create a Kubernetes Service that targets both the Nginx container and the exporter, or more commonly, a dedicated Service for the exporter:

apiVersion: v1
kind: Service
metadata:
  name: nginx-exporter-service
  labels:
    app: nginx
spec:
  selector:
    app: nginx # Selects pods with this label
  ports:
  - protocol: TCP
    port: 9113
    targetPort: 9113
    name: metrics

And finally, ensure your ServiceMonitor points to this exporter’s port:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-exporter-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: nginx # Matches the label in the Service's metadata, not its pod selector
  namespaceSelector:
    matchNames:
    - default
  endpoints:
  - port: metrics # Refers to the 'metrics' port name in the Service
    interval: 15s

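This setup allows Prometheus to scrape metrics like nginx_http_requests_total directly from the Nginx exporter. Grafana can then visualize this, showing you the actual request volume and connection counts. Per-status-code breakdowns are not part of the open-source stub_status data; they require NGINX Plus or a log-based exporter.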

The real power comes from correlating Kubernetes state with application metrics. Because nginx_http_requests_total is a counter, its raw value only ever increases, so the signal to watch is its rate. For instance, you can alert if that rate collapses for a deployment that should be receiving traffic, even if the Nginx pods are still in a Running state.
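Such a correlation could be sketched as the PrometheusRule below. It assumes the exporter targets carry a namespace label (the Prometheus Operator normally attaches one), that kube-state-metrics is being scraped, and that one request per second is a sensible floor. The rule names and that threshold are placeholders, not recommendations.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nginx-traffic            # hypothetical name
  namespace: default
  labels:
    release: prometheus          # assumption: label your Prometheus uses to select rules
spec:
  groups:
  - name: nginx-traffic
    rules:
    # Pods are available, yet the request rate has fallen below the floor
    - alert: NginxTrafficCollapsed
      expr: |
        sum(rate(nginx_http_requests_total{namespace="default"}[5m])) < 1
        and
        sum(kube_deployment_status_replicas_available{namespace="default", deployment="nginx-deployment"}) > 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "nginx-deployment has available pods but almost no traffic"
    # The exporter is running but cannot read Nginx's status page
    - alert: NginxExporterCannotReachNginx
      expr: nginx_up{namespace="default"} == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "nginx-exporter in pod {{ $labels.pod }} cannot scrape Nginx"
```

The rule compares against a threshold rather than exactly zero because the exporter’s own polls of the status page are themselves HTTP requests, so the measured rate is unlikely to reach zero while scraping still works.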

The one thing that trips up many users is the assumption that Prometheus automatically knows if your application is responding. It doesn’t. Prometheus scrapes exposed metrics. If your application isn’t emitting metrics about its own health (like request counts, error rates, or latency), Prometheus can only tell you if the container is up and if the Kubernetes objects are configured correctly.

To truly monitor, you need to ask: "Is my application doing its job?" This requires instrumenting your application or using exporters that understand your application’s protocol and can report on its operational status.

The next challenge you’ll face is configuring effective alerting rules that translate these metrics into actionable notifications.