Kubernetes observability isn’t just about collecting logs, metrics, and traces; it’s about making them tell a coherent story of distributed system behavior.

Let’s see this in action. Imagine a simple kubectl logs <pod-name> command. You get a stream of text. But what if that log line is a symptom of a network misconfiguration that only shows up when a specific service is under load? Or what if a metric spike traces back to one slow database query inside a single long-running request, something only a trace would reveal? The real power comes when these data types are linked.

Here’s a typical stack:

  • Logging: Fluentd or Fluent Bit as a DaemonSet to collect logs from all nodes, then forward them to a central store like Elasticsearch or Loki.

    • Config Example (Fluentd):
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: fluentd-config
      data:
        fluent.conf: |
          <source>
            @type tail
            path /var/log/containers/*.log
            pos_file /var/log/fluentd-containers.pos
            tag kubernetes.*
            <parse>
              @type json
            </parse>
          </source>
          <match kubernetes.**>
            @type elasticsearch
            host elasticsearch-master.logging.svc.cluster.local
            port 9200
            logstash_format true
            logstash_prefix kubernetes
            include_tag_key true
            tag_key kubernetes
            flush_interval 5s
          </match>
      
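      This configuration tells Fluentd to tail every container log under /var/log/containers/, parse each line as JSON, and ship the records to an Elasticsearch cluster. JSON parsing assumes the Docker json-file log format; nodes running containerd or CRI-O write a different line format and need a matching parser. The tag kubernetes.* matters because Fluentd expands * to the log file’s path, which lets the <match kubernetes.**> block route these records and lets later filters recover pod metadata from the tag.
  • Metrics: Prometheus, deployed as a Deployment or StatefulSet, scrapes metrics endpoints exposed by applications and Kubernetes components. Node Exporter supplies host-level metrics, and kube-state-metrics reports the state of cluster objects.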

    • Prometheus Configuration Snippet (scrape_configs):
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            # Combine the pod IP from __address__ with the annotated port.
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              target_label: __address__
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
            - source_labels: [__meta_kubernetes_namespace]
              action: replace
              target_label: namespace
            - source_labels: [__meta_kubernetes_pod_name]
              action: replace
              target_label: pod
      
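      This setup uses Kubernetes service discovery to find pods annotated with prometheus.io/scrape: "true", scrapes them on the port given in the prometheus.io/port annotation at the default /metrics path, and copies the namespace and pod name onto every scraped series as labels.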
  • Tracing: Jaeger or Zipkin, often deployed as a Deployment, with instrumentation added to application code (e.g., using OpenTelemetry SDKs). Sidecar proxies like Envoy can also generate trace data.

    • Example Application Code (Python with OpenTelemetry):
      from opentelemetry import trace
      from opentelemetry.sdk.trace import TracerProvider
      from opentelemetry.sdk.trace.export import BatchSpanProcessor
      from opentelemetry.exporter.jaeger.thrift import JaegerExporter
      
      provider = TracerProvider()
      trace.set_tracer_provider(provider)
      
      jaeger_exporter = JaegerExporter(
          agent_host_name='localhost', # Or your Jaeger agent/collector address
          agent_port=6831,
      )
      provider.add_span_processor(
          BatchSpanProcessor(jaeger_exporter)
      )
      
      tracer = trace.get_tracer(__name__)
      
      with tracer.start_as_current_span("my-operation"):
          # Your application logic here
          pass
      
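      This Python code registers a tracer provider, batches finished spans, and sends them to a Jaeger agent over UDP. Recent OpenTelemetry releases deprecate the Jaeger Thrift exporter in favor of OTLP, which Jaeger can ingest directly, so treat the exporter choice here as illustrative.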
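
Returning to the Prometheus snippet for a moment: a relabel rule with action: replace joins the source label values with ";", matches the result against a fully anchored regex, and writes the expanded replacement into the target label. The sketch below imitates that behavior in plain Python to show how a pod address and a port annotation combine into a scrape target. The function name relabel_replace and the sample label values are hypothetical, and this is an approximation of the semantics, not Prometheus code.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement,
                    target_label, separator=";"):
    """Approximate a Prometheus 'replace' relabel action (illustrative only)."""
    # Source label values are concatenated with the separator; missing labels are empty.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Prometheus anchors the regex on both ends, which fullmatch mirrors.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Translate $1 / ${1} references into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


# Hypothetical discovered pod: container port 8080, annotated metrics port 9102.
pod = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
rewritten = relabel_replace(
    pod,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",
    "__address__",
)
print(rewritten["__address__"])
```

A pod without the port annotation produces a joined value ending in ";", which the regex rejects, so its address is left as discovered.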

The problem this stack solves is the inherent complexity of distributed systems. A single user request might traverse dozens of services, each running in its own container, managed by Kubernetes. If that request fails or is slow, pinpointing the root cause is nearly impossible without correlating data from all these layers.
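
Correlation across services depends on each hop passing the same trace ID to the next. One widely used carrier is the W3C Trace Context traceparent header, formatted as version-traceid-parentid-flags in lowercase hex. Here is a simplified sketch of what an instrumented service does with it: keep the incoming trace ID, mint a new span ID, and forward the result. The function names are hypothetical, and real services would normally leave this to an OpenTelemetry propagator instead of hand-rolling it.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming=None):
    """Build the header for an outgoing call made while handling a request."""
    context = parse_traceparent(incoming) if incoming else None
    if context is None:
        # No usable upstream context: this service starts a new trace.
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        # Same trace ID, so every hop lands in the same trace.
        trace_id, flags = context["trace_id"], context["flags"]
    span_id = secrets.token_hex(8)  # fresh ID for this service's span
    return f"00-{trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

Because the trace ID survives every hop, a tracing backend can reassemble the full request path from spans reported independently by each service.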

Internally, the system relies on standardized protocols and formats. Logs are typically JSON or plain text. Metrics adhere to the Prometheus exposition format. Traces are typically exported over OTLP (the OpenTelemetry protocol) or a backend-specific format such as Jaeger’s or Zipkin’s, with span attributes named according to the OpenTelemetry semantic conventions. The "glue" is metadata: Kubernetes labels and pod names attached to logs and metrics, and trace IDs and span IDs that form the backbone of trace spans and can be embedded in log messages.
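
To make the trace-ID glue concrete, here is a minimal sketch of a JSON log formatter that stamps every record with the current trace ID. It uses only the standard library: the current_trace_id context variable and the "checkout" logger name are hypothetical stand-ins, and in an instrumented service the ID would come from the active OpenTelemetry span instead.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID (set per request).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the trace ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
        })


def build_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonTraceFormatter())
    logger = logging.getLogger("checkout")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output confined to our handler
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Simulate handling one request that belongs to a known trace.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("payment authorized")
current_trace_id.reset(token)

# Outside a request there is no trace to attach.
log.info("idle")

records = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(records)
```

Once the ID is a field in structured logs, a log store can filter on it, which is what makes the jump from a trace view to the matching log lines possible.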

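The most surprising aspect for many is how much of the "observability" comes from instrumentation and metadata, rather than just the raw data itself. Simply having logs doesn’t mean you know which pod generated them, or which request they belong to. You need to enrich them with Kubernetes metadata (namespace, pod name, labels) and, ideally, a trace ID. Similarly, metrics without context (e.g., http_requests_total without namespace, pod, and service labels) are far less useful. Trace IDs are the exception on the metrics side: they are too high-cardinality to use as labels, so they are attached to individual samples as exemplars instead. Much of this context can be injected and propagated automatically: log collectors can add Kubernetes metadata, Prometheus relabeling can copy it onto series, and OpenTelemetry instrumentation can carry trace context across service calls.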
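
As one example of enrichment, much of the Kubernetes metadata for a log line can be recovered from the file it came from. The sketch below assumes the common kubelet naming convention for files under /var/log/containers/, which is <pod>_<namespace>_<container>-<64-hex container ID>.log; the function names and the sample path are hypothetical. Log collectors usually do this through a metadata filter plugin, which also queries the API server for pod labels.

```python
import os
import re

# Assumed layout: <pod>_<namespace>_<container>-<64 hex container id>.log
CONTAINER_LOG_RE = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def kubernetes_metadata_from_path(path):
    """Extract pod, namespace, and container names from a container log path."""
    match = CONTAINER_LOG_RE.match(os.path.basename(path))
    return match.groupdict() if match else None


def enrich(record, path):
    """Return a copy of the log record with a 'kubernetes' metadata field."""
    meta = kubernetes_metadata_from_path(path)
    if meta is None:
        return dict(record)  # unknown naming scheme: leave the record as-is
    return {
        **record,
        "kubernetes": {
            "pod": meta["pod"],
            "namespace": meta["namespace"],
            "container": meta["container"],
        },
    }


# Hypothetical log file for container "app" of a pod in namespace "shop".
example_path = (
    "/var/log/containers/checkout-7d9f8b6c5d-x2k4q_shop_app-"
    + "a1" * 32 + ".log"
)
enriched = enrich({"message": "payment authorized"}, example_path)
print(enriched)
```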

The next hurdle is typically creating effective dashboards and alerts that leverage the correlated data, moving beyond simple monitoring to true incident analysis.
