Linkerd upgrades are a common task, but the process of migrating to new versions can be fraught with subtle issues that break your service mesh.

The core problem is that during an upgrade, you have two versions of Linkerd running concurrently: the old version and the new version. This co-existence can lead to unexpected behavior if the control plane components or data plane proxies aren’t compatible with each other. The most common failure mode is the new control plane not being able to manage the old data plane proxies, or vice-versa, leading to traffic not being routed correctly, metrics disappearing, or the control plane itself becoming unstable.
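As a rough way to see how far apart the two concurrently running versions are, the sketch below compares the minor numbers of two version strings. The helper names minor_of and minor_gap are made up for this illustration, and the parsing assumes stable-style versions such as stable-2.13.0 or 2.13.0. Whether a particular gap is supported is something to confirm in the release notes.

```shell
# Hypothetical helpers, not part of the linkerd CLI.

# Extract the minor number from a version such as "stable-2.13.0" or "2.13.0".
minor_of() {
  printf '%s\n' "$1" | sed -E 's/^[^0-9]*[0-9]+\.([0-9]+).*$/\1/'
}

# Print how many minor releases separate the old version ($1) from the new one ($2).
minor_gap() {
  gap_old=$(minor_of "$1")
  gap_new=$(minor_of "$2")
  echo $(( gap_new - gap_old ))
}

# Example: minor_gap stable-2.11.1 stable-2.13.0   # prints 2
```

A larger gap means more protocol and configuration changes accumulated between the old proxies and the new control plane, which is a reason to read the intermediate release notes before starting.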

Here’s a breakdown of common pitfalls and how to navigate them:

Incompatible Control Plane and Data Plane Versions

What broke: The new Linkerd control plane components (such as the destination, identity, and proxy-injector services) cannot communicate with or manage the existing, older Linkerd data plane proxies (the linkerd-proxy container in your pods). This happens when the communication protocols or expected data formats have changed between versions.

Diagnosis: Check the logs of the new control plane pods in the linkerd namespace (in recent releases these are linkerd-destination, linkerd-identity, and linkerd-proxy-injector; older releases also ran a linkerd-controller deployment). Look for errors indicating a failure to reach the Kubernetes API server or an inability to reconcile resources related to the data plane, such as failed to list/watch/get pods or unrecognized API version. Then run linkerd check --proxy to see whether any data plane proxies report health issues or version mismatches.
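The log check described above can be sketched as a small filter. The function name filter_upgrade_errors is hypothetical, the patterns are just the two example messages mentioned in this article rather than an exhaustive list, and the deployment name in the usage comment is an assumption about your installation.

```shell
# Hypothetical helper: keep only log lines that look like
# control plane compatibility errors. Reads log text on stdin.
filter_upgrade_errors() {
  # "|| true" keeps the function from failing when nothing matches.
  grep -Ei 'failed to (list|watch|get)|unrecognized API version' || true
}

# Usage against a live cluster (deployment name assumed):
#   kubectl logs -n linkerd deploy/linkerd-destination --all-containers \
#     | filter_upgrade_errors
```

An empty result does not prove the control plane is healthy; it only means none of the sample patterns appeared, so linkerd check --proxy remains the more reliable signal.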

Common Causes & Fixes:

  1. Data Plane Not Upgraded: You upgraded the control plane but forgot to upgrade the data plane proxies in your workloads.

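    • Diagnosis: linkerd check --proxy will flag the pods whose proxy version does not match the control plane.
    • Fix: Restart the workloads so that the proxy injector adds the new proxy to the recreated pods (for example, kubectl rollout restart deployment/<name> -n <namespace>). If you inject from manifests instead, re-run the injection with the new version’s CLI:
      linkerd inject --proxy-version <new-version> your-app.yaml | kubectl apply -f -
      
      Replace <new-version> with the proxy image tag for the target release (e.g., stable-2.13.0). The changed pod template triggers a rolling restart, and the pods come back with the new proxy.
    • Why it works: linkerd inject adds the linkerd.io/inject annotation (plus any config.linkerd.io/* overrides, such as the proxy version) to the pod template. When the pods are recreated, the proxy injector admission webhook reads those annotations and adds the matching proxy container.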
  2. Control Plane Upgrade Interrupted: The control plane upgrade process was not fully completed, leaving some old components running alongside new ones in an inconsistent state.

    • Diagnosis: Observe the linkerd namespace using kubectl get pods -n linkerd. You might see a mix of old and new versions of components, or pods stuck in CrashLoopBackOff. Check the logs of the affected deployments (for example linkerd-destination and linkerd-proxy-injector).
    • Fix: Install the CLI for the target version, then re-run the upgrade so that the CRDs and the control plane manifests are applied again.
      curl -sL https://run.linkerd.io/install | sh
      linkerd upgrade --crds | kubectl apply -f -
      linkerd upgrade | kubectl apply -f -
      
      The install script fetches the latest release by default, so confirm with linkerd version that the CLI corresponds to your target control plane version. The separate --crds step applies to Linkerd 2.12 and later. If Linkerd was installed with Helm, run helm upgrade on the corresponding charts instead.
    • Why it works: Re-applying the full set of manifests brings every control plane resource to the state the new version expects, so any component still running the old version is rolled to the new one.
  3. CRD Version Mismatch: The Custom Resource Definitions (CRDs) for Linkerd were not updated to the new version before or during the control plane upgrade. The new control plane expects newer CRD schemas.

    • Diagnosis: Control plane pods will likely log errors about being unable to parse or write to CRDs, or about schema validation failures. kubectl get crd serviceprofiles.linkerd.io -o yaml shows which schema versions the cluster currently serves.
    • Fix: Apply the CRDs separately before upgrading the control plane (use linkerd install --crds instead for a fresh installation).
      linkerd upgrade --crds | kubectl apply -f -
      
      Then proceed with the control plane upgrade.
    • Why it works: Linkerd CRDs define the structure of its custom resources (such as ServiceProfile, Server, or ServerAuthorization). The control plane needs the versions of these definitions that it was built against in order to operate.
  4. Linkerd CLI Version Out of Sync: You are using an older version of the linkerd CLI to perform the upgrade, and it doesn’t understand the new control plane’s configuration or commands.

    • Diagnosis: Commands like linkerd upgrade or linkerd check might report unknown flag errors or behave unexpectedly.
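    • Fix: Download and install the linkerd CLI that matches the target version you are upgrading to.
      # Example for Linux on amd64; stable release tags and assets carry a "stable-" prefix
      curl -fsSL https://github.com/linkerd/linkerd2/releases/download/stable-<new-version>/linkerd2-cli-stable-<new-version>-linux-amd64 | sudo tee /usr/local/bin/linkerd > /dev/null
      sudo chmod +x /usr/local/bin/linkerd
      
      Replace <new-version> with the target version (e.g., 2.13.0). For macOS or other architectures, pick the matching asset from the release page, and check there for the exact asset name if the download fails.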
    • Why it works: The CLI is your primary interface for interacting with Linkerd. It must be compatible with the control plane version to correctly issue commands and interpret responses.
  5. Network Policies Blocking Communication: Kubernetes NetworkPolicies are preventing communication between the new control plane components and the data plane proxies, or between different control plane components.

    • Diagnosis: Check logs for connection refused or timeout errors that aren’t related to resource exhaustion. Use kubectl get networkpolicy -n linkerd and kubectl get networkpolicy -n <your-app-namespace> to inspect policies.
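    • Fix: Ensure that NetworkPolicies allow the traffic Linkerd needs. Proxies open connections to the control plane (by default the identity service on 8080, the destination service on 8086, and the policy controller on 8090), the Kubernetes API server must reach the admission webhooks, and control plane pods must reach each other. On the application side, the proxy’s inbound port (4143) and admin port (4191, used for metrics scraping) must stay reachable.
      # Example: Allow meshed pods in any namespace to reach the control plane
      # Port numbers are the assumed defaults; verify them for your version
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-proxies-to-control-plane
        namespace: linkerd
      spec:
        podSelector: {} # Apply to all pods in the linkerd namespace
        policyTypes:
        - Ingress
        ingress:
        - from:
          - podSelector: {} # Allow from any pod in any namespace
            namespaceSelector: {}
          ports:
          - protocol: TCP
            port: 8080 # identity
          - protocol: TCP
            port: 8086 # destination
          - protocol: TCP
            port: 8090 # policy
      
      Adjust namespaces, selectors, and ports as needed for your specific setup, and verify the port numbers against the manifests rendered by your Linkerd version. Once a policy selects a pod for ingress, all ingress that is not explicitly allowed is dropped, so also add rules for the webhook ports and for traffic between control plane pods.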
    • Why it works: NetworkPolicies enforce network segmentation. During an upgrade, new communication patterns might emerge that require explicit allowance.
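To find workloads that were missed after the control plane upgrade (the first cause above), one option is to compare each pod's recorded proxy version with the target. This is a minimal sketch: stale_proxies is a hypothetical helper, and it assumes injected pods carry a linkerd.io/proxy-version annotation whose value has the same form as the target you pass in. Confirm both on your cluster before relying on it.

```shell
# Hypothetical helper: read "namespace/pod<TAB>proxy-version" lines on stdin
# and print the pods whose proxy version differs from the target ($1).
# Lines with an empty version (pods that are not meshed) are skipped.
stale_proxies() {
  awk -F '\t' -v target="$1" \
    'NF == 2 && $2 != "" && $2 != target { print $1 " runs " $2 }'
}

# Usage against a live cluster (annotation name assumed):
#   kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.metadata.annotations.linkerd\.io/proxy-version}{"\n"}{end}' \
#     | stale_proxies stable-2.13.0
```

Every pod the helper prints still needs a restart or re-injection before the data plane can be considered upgraded.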

Unexpected Behavior During Rollout

What broke: Even if control plane and data plane are compatible, a phased rollout of the new data plane proxies can cause issues if services rely on specific behaviors or configurations that change between versions. For example, tracing propagation, retry logic, or TLS settings might differ.

Diagnosis: Monitor application-level metrics for errors, latency spikes, or missing traces after each batch of pods has been re-injected with the new proxy. Use linkerd viz tap (linkerd tap in releases before 2.10) on affected pods to observe traffic flow and identify anomalies.
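The batch-by-batch monitoring described above can be turned into a simple gate between batches. The sketch below is one possible shape, not an official procedure: rollout_healthy is a hypothetical helper, the 0.99 default threshold is an arbitrary assumption, and the success and failure counts are expected to come from whatever metrics source you already use.

```shell
# Hypothetical gate: succeed (exit 0) when the success rate is at or above
# the threshold, fail (exit 1) otherwise.
#   $1 = successful requests, $2 = failed requests, $3 = threshold (default 0.99)
rollout_healthy() {
  awk -v s="$1" -v f="$2" -v t="${3:-0.99}" 'BEGIN {
    total = s + f
    if (total == 0) exit 0          # no traffic observed: nothing to judge
    exit ((s / total) >= t ? 0 : 1)
  }'
}

# Usage sketch between batches (workload names are placeholders):
#   kubectl rollout restart deployment/web -n shop
#   kubectl rollout status deployment/web -n shop
#   rollout_healthy "$ok" "$failed" 0.99 || echo "pause the rollout and investigate"
```

Treating "no traffic" as healthy is a design choice that keeps idle services from blocking the rollout; for critical services you may prefer to wait until some traffic has been observed.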

Common Causes & Fixes:

  1. Tracing Configuration Drift: The new proxy version has a different default configuration for tracing, or your existing tracing configuration is incompatible with the new version.

    • Diagnosis: Traces disappear or become incomplete for services running the new proxy. Check the linkerd-proxy logs in the affected pods for tracing-related errors.
    • Fix: Tracing is configured through the tracing extension and pod annotations rather than through ServiceProfile resources, so upgrade the extension together with the control plane and confirm that the collector address your pods are annotated with is still valid. For example, if you use the linkerd-jaeger extension:
      linkerd jaeger install | kubectl apply -f -
      linkerd jaeger check
      
      Restart the affected workloads afterwards so they pick up the current tracing configuration, and consult the release notes for specific tracing configuration changes.
    • Why it works: Tracing relies on specific headers and configurations. The new proxy version might require updated settings to correctly propagate trace context.
  2. TLS Cipher Suite or Protocol Changes: The new proxy version might prefer or require different TLS cipher suites or protocols, causing handshake failures with older proxies or external services.

    • Diagnosis: Observe connection refused or TLS handshake errors in application logs or linkerd-proxy logs when communicating with other services or external endpoints.
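    • Fix: Linkerd negotiates mTLS between meshed proxies automatically and, as far as its documentation describes, does not expose a setting for choosing cipher suites. Start by confirming that both ends of the failing connection are meshed and that the identity certificates are valid (linkerd check --proxy reports certificate problems). TLS that the application itself originates toward an external service passes through the proxy as opaque TCP, so handshake failures there usually have to be resolved in the application’s or the external service’s TLS configuration (consult the Linkerd documentation for the TLS behavior of your version).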
    • Why it works: TLS security parameters can evolve. The proxy needs to be able to establish secure connections using compatible cryptographic algorithms.
  3. Retry or Timeout Logic Differences: Default retry counts or timeout durations in the proxy might have changed, leading to premature retries or unexpected connection closures.

    • Diagnosis: Applications might experience more frequent retries or timeouts than before the upgrade, even though the underlying network is stable.
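    • Fix: Re-apply your ServiceProfile resources with explicit timeout and retry settings that match your application’s expectations. Linkerd retries only routes marked isRetryable, within the limits of the profile’s retryBudget, and the profile must be named after the service’s fully qualified DNS name.
      apiVersion: linkerd.io/v1alpha2
      kind: ServiceProfile
      metadata:
        name: my-service.my-namespace.svc.cluster.local
        namespace: my-namespace
      spec:
        routes:
        - name: GET /some/path
          condition:
            method: GET
            pathRegex: /some/path
          timeout: 5s
          isRetryable: true
        retryBudget:
          retryRatio: 0.2
          minRetriesPerSecond: 10
          ttl: 10s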
      
    • Why it works: ServiceProfiles allow fine-grained control over how the proxy interacts with specific routes, overriding default behaviors.
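If several services need explicit timeout and retry settings re-applied, generating the profiles from a few parameters avoids hand-editing each one. The following is a sketch under stated assumptions: render_service_profile is a hypothetical helper, it assumes the default cluster.local cluster domain, it emits a single GET route, and the retry budget values are placeholders to tune per service.

```shell
# Hypothetical generator: print a ServiceProfile with one GET route.
#   $1 = service name, $2 = namespace, $3 = path regex, $4 = timeout (default 5s)
render_service_profile() {
  sp_svc="$1"; sp_ns="$2"; sp_path="$3"; sp_timeout="${4:-5s}"
  cat <<EOF
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: ${sp_svc}.${sp_ns}.svc.cluster.local
  namespace: ${sp_ns}
spec:
  routes:
  - name: GET ${sp_path}
    condition:
      method: GET
      pathRegex: ${sp_path}
    timeout: ${sp_timeout}
    isRetryable: true
  retryBudget:
    retryRatio: 0.2
    minRetriesPerSecond: 10
    ttl: 10s
EOF
}

# Usage: render_service_profile web shop /api/items 2s | kubectl apply -f -
```

Only mark a route as retryable when repeating the request is safe, which is why the sketch restricts itself to a GET route.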

The Final Step

After successfully upgrading both the control plane and data plane, and verifying all applications are healthy, the next potential hurdle is often related to observability: ensuring that metrics, traces, and logs are consistently aggregated and accessible from the new Linkerd version’s components, especially if you’re integrating with external monitoring systems.
