The Linkerd control plane’s health check is itself a distributed system. When it fails, the cause is usually one component that can no longer talk to another, so the key is to find the single component that is actually failing.

Let’s say you’re seeing linkerd-proxy pods in a CrashLoopBackOff state or linkerd-identity reporting connection errors. The first thing you’d check is the health of the core control plane components.

1. linkerd-controller is Unhealthy

  • Diagnosis: kubectl get pods -n linkerd -l app.kubernetes.io/name=controller
  • Common Causes & Fixes:
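    • Insufficient Resources: The controller pod is OOMKilled or restarting because of CPU throttling. An OOM kill usually does not show up in the container’s own logs, so check the pod’s last state first: kubectl describe pod -n linkerd <linkerd-controller-pod-name> (look for Reason: OOMKilled), then the logs of the previous container: kubectl logs -n linkerd <linkerd-controller-pod-name> --previous. Increase CPU/memory limits in its deployment: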
      resources:
        limits:
          cpu: "500m" # Increase from 200m
          memory: "750Mi" # Increase from 500Mi
      
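      This gives the controller process more headroom. A container that exceeds its memory limit is killed by the kernel’s OOM killer (not by the Kubernetes scheduler), and a container that hits its CPU limit is throttled rather than killed, which can still cause probe timeouts and restarts.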
    • RBAC/Permissions Issues: The controller can’t access necessary Kubernetes API resources. Check its service account permissions: kubectl get sa -n linkerd linkerd-controller -o yaml and associated ClusterRoles/ClusterRoleBindings. Ensure it has get, list, watch permissions on pods, services, endpoints, nodes, namespaces, and endpointslices. Incorrect RBAC prevents the controller from discovering the cluster state it needs to manage.
    • Database Connectivity (if external): If you’re using an external database (like RDS for Prometheus), ensure the controller has network access and correct credentials. Check controller logs for database connection errors. Update the secret containing DB credentials if they’ve changed or are incorrect. This ensures the controller can persist its state and retrieve necessary configuration.
    • Configuration Errors: Invalid arguments in the linkerd-controller deployment. Inspect the args section of the deployment: kubectl get deploy -n linkerd linkerd-controller -o yaml. Look for typos or incorrect flags. Correcting these allows the controller to start with valid parameters.
    • Underlying Kubernetes API Server Issues: While less common, if the Kubernetes API server itself is unhealthy or overloaded, the controller will struggle to function. Check the health of your API server. This is a foundational dependency; if the API server isn’t responding, nothing that relies on it can work.
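
The resource and RBAC checks above can be scripted. The following is a minimal sketch, not an official Linkerd tool: the function names are hypothetical, and the namespace, service account name, and resource list are assumptions taken from this article, so adjust them to match your install.

```shell
# Sketch only: read-only checks for the resource and RBAC causes above.
# NS and SA are assumptions based on the names used in this article.
NS="${NS:-linkerd}"
SA="${SA:-linkerd-controller}"

# Print "<pod> <reason>" for pods matching label selector $1; the reason is
# the last termination reason of each container (for example OOMKilled).
last_termination_reasons() {
  kubectl get pods -n "$NS" -l "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
}

# Print one line per verb/resource pair the service account may NOT use.
missing_permissions() {
  for resource in pods services endpoints nodes namespaces endpointslices; do
    for verb in get list watch; do
      # "kubectl auth can-i" prints yes or no; it exits non-zero on "no".
      answer=$(kubectl auth can-i "$verb" "$resource" --all-namespaces \
        --as="system:serviceaccount:${NS}:${SA}" 2>/dev/null) || true
      [ "$answer" = "yes" ] || echo "missing: $verb $resource"
    done
  done
}
```

After sourcing the file, run last_termination_reasons app.kubernetes.io/name=controller to look for OOMKilled, and missing_permissions to list any denied verbs. Impersonating the service account with --as requires that your own user is allowed to impersonate.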

2. linkerd-identity is Unhealthy

  • Diagnosis: kubectl get pods -n linkerd -l app.kubernetes.io/name=identity
  • Common Causes & Fixes:
    • Certificate Issues: The identity component relies on TLS certificates for secure communication. If the CA certificate is expired or invalid, linkerd-identity will fail to issue new certificates. Check the validity of the linkerd-ca secret: openssl x509 -in <(kubectl get secret -n linkerd linkerd-ca -o jsonpath='{.data.tls\.crt}' | base64 -d) -noout -dates. If expired, you may need to perform a certificate rotation or a full reinstall. This is critical because all control plane components and proxies authenticate using these certificates.
    • linkerd-controller Unreachability: linkerd-identity needs to communicate with linkerd-controller to get cluster information and issue certificates. If linkerd-controller is down or unreachable (e.g., due to network policies), linkerd-identity will fail. Check logs: kubectl logs -n linkerd <linkerd-identity-pod-name>. Ensure the linkerd-controller service is accessible from the linkerd-identity pod. This ensures the identity system can integrate with the rest of the control plane.
    • RBAC/Permissions: Similar to the controller, the identity component needs specific RBAC permissions to interact with the Kubernetes API (e.g., to create Certificate and CertificateRequest custom resources). Verify its service account and associated roles. Incorrect permissions prevent it from managing its own certificate lifecycle.
    • Resource Constraints: linkerd-identity can also be subject to resource limits. Check its logs and resource utilization. Increase CPU/memory limits in its deployment if necessary. Insufficient resources can lead to timeouts and crashes.
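
Rather than reading the certificate dates by eye, you can ask openssl whether the certificate survives a given window. This is a sketch under assumptions: the function names are hypothetical, and the secret name (linkerd-ca) and key (tls.crt) are the ones used in this article, which may differ in your install.

```shell
# Return 0 if the PEM certificate on stdin is still valid $1 days from now.
# "openssl x509 -checkend N" exits non-zero if the cert expires within N seconds.
cert_valid_for_days() {
  openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}

# Fetch the certificate from the secret named in this article (an assumption)
# and check that it remains valid for at least $1 days (default 30).
check_identity_cert() {
  kubectl get secret -n linkerd linkerd-ca -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | cert_valid_for_days "${1:-30}"
}
```

For example, check_identity_cert 14 || echo "certificate expires within 14 days" gives an early warning that fits in a cron job or CI step.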

3. linkerd-proxy (System Namespace Pods) Issues

  • Diagnosis: kubectl get pods -n linkerd -l app.kubernetes.io/name=proxy (and check logs of your application pods for "linkerd-proxy connection refused" or similar).
  • Common Causes & Fixes:
    • Control Plane Unreachability: The linkerd-proxy (which runs as a sidecar or daemonset) needs to connect to the linkerd-controller and linkerd-identity services. If these are unavailable or blocked by network policies, the proxy will fail to initialize or report errors. Check the proxy logs within the application pod: kubectl logs <your-app-pod> -c linkerd-proxy. Ensure network policies in the linkerd namespace (and your application namespace) allow egress from the proxy pods to linkerd-controller and linkerd-identity services on their respective ports (e.g., 8080, 8443). This is the most common cause of proxy startup failures.
    • Incorrect LINKERD2_CONTROL_URL Environment Variable: The proxy needs to know where to find the control plane. This is usually set via an environment variable. Check the deployment/daemonset that injects the proxy (often your application pods themselves, or a linkerd-init container). Ensure LINKERD2_CONTROL_URL is correctly set, e.g., tcp://linkerd-controller.linkerd.svc.cluster.local:8080. An incorrect URL means the proxy can’t bootstrap.
    • Kubernetes DNS Issues: If the cluster’s DNS resolution is faulty, the proxy won’t be able to resolve the control plane service names. Test DNS resolution from within a pod: kubectl exec <some-pod> -- nslookup linkerd-controller.linkerd.svc.cluster.local. If the lookup fails, check that the cluster DNS pods are running (kubectl get pods -n kube-system -l k8s-app=kube-dns) and fix the cluster DNS configuration.
    • Resource Starvation on Node: If the node running the linkerd-proxy daemonset (or the application pod) is low on resources, the proxy container might be terminated. Check kubectl describe node <node-name> for resource pressure.
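
The DNS test above can be repeated for each control plane service in one pass. A minimal sketch, assuming the service names used in this article and a pod image that ships nslookup; the function name is hypothetical.

```shell
# Print each control plane service name that does NOT resolve from inside
# pod $1. Service names and the linkerd namespace are taken from this article.
unresolvable_services() {
  for svc in linkerd-controller linkerd-identity; do
    kubectl exec "$1" -- nslookup "${svc}.linkerd.svc.cluster.local" \
      >/dev/null 2>&1 || echo "$svc"
  done
}
```

Run it as unresolvable_services <some-pod>. Empty output means both names resolved; if every name fails, suspect cluster DNS or a network policy blocking DNS traffic rather than Linkerd itself.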

4. linkerd-destination or linkerd-sp Unhealthy

  • Diagnosis: kubectl get pods -n linkerd -l app.kubernetes.io/name=destination (and sp for linkerd-sp)
  • Common Causes & Fixes:
    • Dependency on linkerd-controller: These components also rely heavily on the linkerd-controller to provide service discovery information. If the controller is unhealthy, these will suffer. Follow the troubleshooting steps for linkerd-controller.
    • Network Policies: Ensure there are no network policies blocking communication between these control plane components. For example, linkerd-destination needs to talk to linkerd-controller.
    • Resource Limits: Similar to other control plane components, ensure adequate resources are allocated.
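
Because the components depend on one another, it helps to see every unready pod at once before digging into any single one. The sketch below loops over the label values used in the diagnosis commands in this article (an assumption about your install); the function name is hypothetical.

```shell
# Print "<component>/<pod>" for each pod that has a container that is not
# ready, or that reports no container statuses yet (for example Pending).
not_ready_pods() {
  for component in controller identity destination sp; do
    kubectl get pods -n linkerd -l "app.kubernetes.io/name=${component}" \
      -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].ready}{"\n"}{end}' \
      | awk -v c="$component" '{
          bad = (NF == 1)                       # pod name only, no statuses
          for (i = 2; i <= NF; i++) if ($i == "false") bad = 1
          if (bad) print c "/" $1
        }'
  done
}
```

Start troubleshooting with whichever component the output lists first in the dependency order described above, since the others may recover once it is healthy.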

When all these components are healthy, linkerd check should pass and you’ll typically see the diagnostic reports turn green. The next errors you hit after fixing these are usually application-level issues, such as a misconfigured ServiceProfile or a pod that receives no traffic because its Service selector does not match the pod’s labels.
