Linkerd upgrades are a common task, but the process of migrating to new versions can be fraught with subtle issues that break your service mesh.
The core problem is that during an upgrade, you have two versions of Linkerd running concurrently: the old version and the new version. This co-existence can lead to unexpected behavior if the control plane components or data plane proxies aren’t compatible with each other. The most common failure mode is the new control plane not being able to manage the old data plane proxies, or vice-versa, leading to traffic not being routed correctly, metrics disappearing, or the control plane itself becoming unstable.
Here’s a breakdown of common pitfalls and how to navigate them:
Incompatible Control Plane and Data Plane Versions
What broke: The new Linkerd control plane components (like the API server or controller) cannot communicate with or manage the existing, older Linkerd data plane proxies (the linkerd-proxy container in your pods). This is because the communication protocols or expected data formats have changed between versions.
Diagnosis:
Check the logs of your new linkerd-controller pods in the linkerd namespace. Look for errors indicating a failure to connect to the API server or an inability to reconcile resources related to the data plane. You might see messages like failed to list/watch/get pods or unrecognized API version.
Run linkerd check --proxy to see if any data plane proxies are reporting health issues or version mismatches.
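To list exactly which pods still run an old proxy, the mismatch check can be sketched as a small filter. This is a minimal sketch, not part of Linkerd: the function name find_stale_proxies is hypothetical, and the commented kubectl pipeline assumes injected pods carry a linkerd.io/proxy-version annotation, which you should confirm on your cluster.

```shell
#!/bin/sh
# Print the pods whose proxy version differs from the expected one.
# Input (stdin): one "<pod-name> <proxy-version>" pair per line.
# Argument:      the expected proxy version string.
# Lines with no version (unmeshed pods) are skipped.
find_stale_proxies() {
  expected="$1"
  awk -v want="$expected" 'NF >= 2 && $2 != want { print $1 " " $2 }'
}

# Possible usage against a cluster (assumes the annotation named above):
#   kubectl get pods -n <your-app-namespace> \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.annotations.linkerd\.io/proxy-version}{"\n"}{end}' \
#     | find_stale_proxies <new-version>
```

An empty result means every meshed pod in the namespace already reports the expected version.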
Common Causes & Fixes:
- Data Plane Not Upgraded: You upgraded the control plane but forgot to upgrade the data plane proxies in your workloads.
  - Diagnosis: linkerd check --proxy will show proxy version mismatch errors for your pods.
  - Fix: Re-inject the data plane into your workloads using the new version’s CLI. For example:
      linkerd inject --proxy-version <new-version> your-app.yaml | kubectl apply -f -
    Replace <new-version> with the target version’s proxy image tag (e.g., stable-2.13.0). This will restart your pods with the new proxy version.
  - Why it works: The linkerd inject command updates the annotations on your pod templates to specify the desired proxy version. The changed template triggers a rollout, and the proxy injector adds the matching proxy image to each new pod as it is created.
- Control Plane Upgrade Interrupted: The control plane upgrade process was not fully completed, leaving some old components running alongside new ones in an inconsistent state.
  - Diagnosis: Observe the linkerd namespace using kubectl get pods -n linkerd. You might see a mix of old and new versions of components, or pods stuck in CrashLoopBackOff. Check the logs of linkerd-controller and linkerd-admission-webhooks.
  - Fix: Re-apply the control plane installation manifest for the new version.
      curl -sL https://run.linkerd.io/install | sh
      linkerd install --crds | kubectl apply -f -
      linkerd install | kubectl apply -f -
    Ensure you are using the linkerd CLI version that corresponds to your target control plane version.
  - Why it works: This forces a complete re-deployment of the control plane components, ensuring all are running the intended new version and are in a consistent state.
- CRD Version Mismatch: The Custom Resource Definitions (CRDs) for Linkerd were not updated to the new version before or during the control plane upgrade. The new control plane expects newer CRD schemas.
  - Diagnosis: Control plane pods will likely log errors about being unable to parse or write to CRDs, or about schema validation failures. kubectl get crd serviceprofiles.linkerd.io -o yaml might show an older schema version.
  - Fix: Apply the CRDs separately before installing or upgrading the control plane.
      linkerd install --crds | kubectl apply -f -
    Then proceed with the control plane upgrade.
  - Why it works: Linkerd CRDs define the structure of its custom resources (like ServiceProfile or Server). The control plane needs to use the correct version of these definitions to operate.
- Linkerd CLI Version Out of Sync: You are using an older version of the linkerd CLI to perform the upgrade, and it doesn’t understand the new control plane’s configuration or commands.
  - Diagnosis: Commands like linkerd upgrade or linkerd check might report unknown flag errors or behave unexpectedly.
  - Fix: Download and install the linkerd CLI that matches the target version you are upgrading to.
      # Example for Linux (amd64)
      curl -sL https://github.com/linkerd/linkerd2/releases/download/<new-version>/linkerd-stable-<new-version>-linux-amd64 | sudo tee /usr/local/bin/linkerd > /dev/null
      sudo chmod +x /usr/local/bin/linkerd
    Replace <new-version> with the target version (e.g., 2.13.0). Check the release page for the exact asset name for your platform, since tag and file naming can differ between releases.
  - Why it works: The CLI is your primary interface for interacting with Linkerd. It must be compatible with the control plane version to correctly issue commands and interpret responses.
- Network Policies Blocking Communication: Kubernetes NetworkPolicies are preventing communication between the new control plane components and the data plane proxies, or between different control plane components.
  - Diagnosis: Check logs for connection refused or timeout errors that aren’t related to resource exhaustion. Use kubectl get networkpolicy -n linkerd and kubectl get networkpolicy -n <your-app-namespace> to inspect policies.
  - Fix: Ensure that NetworkPolicies allow traffic on the necessary ports (e.g., 8080, 4140, 4194) between the control plane pods and your application pods, and between control plane pods themselves. Confirm the exact port list against the documentation for your Linkerd version.
      # Example: Allow control plane to reach proxies
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-control-plane-to-proxies
        namespace: linkerd
      spec:
        podSelector: {}  # Apply to all pods in the linkerd namespace
        policyTypes:
          - Egress
        egress:
          - to:
              - podSelector: {}  # Allow to any pod in any namespace
                namespaceSelector: {}
            ports:
              - protocol: TCP
                port: 4140  # Example proxy port (the proxy's outbound listener by default); add the other ports you need
    Adjust namespaces and selectors as needed for your specific setup.
  - Why it works: NetworkPolicies enforce network segmentation. During an upgrade, new communication patterns might emerge that require explicit allowance.
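Several of the causes above come down to a version skew between the CLI and the control plane, which can be checked mechanically once the upgrade is finished. The sketch below parses text in the "Client version: ..." / "Server version: ..." form printed by linkerd version; the function name versions_in_sync is hypothetical, and the exact output layout should be confirmed for your release.

```shell
#!/bin/sh
# Compare the client and server versions found on stdin.
# Expects lines of the form "Client version: X" and "Server version: Y".
# Exit status: 0 = same version, 1 = mismatch, 2 = a version line is missing.
versions_in_sync() {
  awk -F': ' '
    /^Client version/ { client = $2 }
    /^Server version/ { server = $2 }
    END {
      if (client == "" || server == "") { print "unknown"; exit 2 }
      if (client == server) { print "in-sync " client; exit 0 }
      print "mismatch client=" client " server=" server
      exit 1
    }'
}

# Possible usage after the upgrade has completed:
#   linkerd version | versions_in_sync || echo "CLI and control plane differ"
```

Before the upgrade a mismatch is expected, because the CLI should already be at the target version while the control plane is still on the old one.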
Unexpected Behavior During Rollout
What broke: Even if control plane and data plane are compatible, a phased rollout of the new data plane proxies can cause issues if services rely on specific behaviors or configurations that change between versions. For example, tracing propagation, retry logic, or TLS settings might differ.
Diagnosis: Monitor application-level metrics for errors, latency spikes, or missing traces after a batch of pods has been re-injected with the new proxy. Use linkerd viz tap (linkerd tap in releases before 2.10) on affected pods to observe traffic flow and identify anomalies.
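One way to make this monitoring repeatable during a phased rollout is to flag workloads whose success rate drops below a threshold after each batch. The following is a rough sketch: the function name below_success_threshold is hypothetical, and it assumes whitespace-separated input with a header row and the success percentage in the third column (NAME MESHED SUCCESS ...), so adjust the field number to match the output of your stat command.

```shell
#!/bin/sh
# Print "<name> <success>" for every row whose success rate is below a threshold.
# Input (stdin): a table with a header row; column 1 is the workload name and
#                column 3 is the success rate written like "99.95%".
# Argument:      the minimum acceptable success rate, in percent.
# Rows without a percentage in column 3 (e.g. "-") are skipped.
below_success_threshold() {
  threshold="$1"
  awk -v min="$threshold" 'NR > 1 && $3 ~ /%$/ {
    rate = $3
    sub(/%$/, "", rate)
    if (rate + 0 < min + 0) print $1 " " $3
  }'
}

# Possible usage (assumes the viz extension is installed):
#   linkerd viz stat deploy -n <your-app-namespace> | below_success_threshold 99
```

Running it after each re-injected batch gives a quick signal on whether to continue the rollout or pause and inspect the affected workloads.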
Common Causes & Fixes:
- Tracing Configuration Drift: The new proxy version has a different default configuration for tracing, or your existing tracing configuration is incompatible with the new version.
  - Diagnosis: Traces disappear or become incomplete for services running the new proxy. Check the linkerd-proxy logs in the affected pods for tracing-related errors.
  - Fix: Re-apply your tracing configuration using linkerd CLI commands or by updating your ServiceProfile resources to match the new proxy’s expectations. For example, ensure that max_trace_context_header_size and trace_context_header_name are correctly set if they’ve changed.
      linkerd config trace --identity-threshold 100% --trace-sample-rate 100% --max-trace-context-header-size 8192 | kubectl apply -f -
    Confirm that this subcommand and its flags exist in your CLI version (linkerd --help) before relying on it, and consult the release notes for specific tracing configuration changes.
  - Why it works: Tracing relies on specific headers and configurations. The new proxy version might require updated settings to correctly propagate trace context.
- TLS Cipher Suite or Protocol Changes: The new proxy version might prefer or require different TLS cipher suites or protocols, causing handshake failures with older proxies or external services.
  - Diagnosis: Observe connection refused or TLS handshake errors in application logs or linkerd-proxy logs when communicating with other services or external endpoints.
  - Fix: If Linkerd’s mTLS is involved, ensure the control plane is configured to support the necessary cipher suites. For external mTLS, you may need to update your external service’s configuration or, in rare cases, configure the Linkerd proxy via linkerd-proxy.yaml annotations to use specific TLS settings (consult Linkerd documentation for advanced TLS configuration).
  - Why it works: TLS security parameters can evolve. The proxy needs to be able to establish secure connections using compatible cryptographic algorithms.
- Retry or Timeout Logic Differences: Default retry counts or timeout durations in the proxy might have changed, leading to premature retries or unexpected connection closures.
  - Diagnosis: Applications might experience more frequent retries or timeouts than before the upgrade, even though the underlying network is stable.
  - Fix: Re-apply your ServiceProfile resources with explicit retry and timeout configurations that match your application’s expectations. A ServiceProfile marks a route as retryable with isRetryable and bounds retries with a retryBudget, rather than with a per-route retry count.
      apiVersion: linkerd.io/v1alpha2
      kind: ServiceProfile
      metadata:
        # The name must be the fully qualified DNS name of the service
        name: my-service.my-namespace.svc.cluster.local
        namespace: my-namespace
      spec:
        routes:
          - name: GET /some/path
            condition:
              method: GET
              pathRegex: /some/path
            timeout: 5s
            isRetryable: true
        retryBudget:
          retryRatio: 0.2
          minRetriesPerSecond: 10
          ttl: 10s
  - Why it works: ServiceProfiles allow fine-grained control over how the proxy interacts with specific routes, overriding default behaviors.
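When comparing retry behavior before and after an upgrade, a back-of-the-envelope estimate of how many retries the configuration permits can help. The sketch below assumes retries are bounded by a ServiceProfile retry budget, where retryRatio caps retries as a fraction of original requests and minRetriesPerSecond is allowed on top of that; the function name allowed_retries_per_second is hypothetical, and the formula is an approximation of that budget rather than the proxy’s exact accounting.

```shell
#!/bin/sh
# Estimate the retries per second a retry budget permits at a given load.
# Arguments: <requests-per-second> <retryRatio> <minRetriesPerSecond>
# Estimate:  minRetriesPerSecond + requests-per-second * retryRatio
allowed_retries_per_second() {
  awk -v rps="$1" -v ratio="$2" -v floor="$3" \
    'BEGIN { printf "%.1f\n", floor + rps * ratio }'
}

# Example: 100 requests/s with retryRatio 0.2 and minRetriesPerSecond 10
allowed_retries_per_second 100 0.2 10   # prints 30.0
```

If the retry rate you observe after the upgrade is far from this estimate, the effective configuration probably differs from the one you intended to apply.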
The Final Step
After successfully upgrading both the control plane and data plane, and verifying all applications are healthy, the next potential hurdle is often related to observability: ensuring that metrics, traces, and logs are consistently aggregated and accessible from the new Linkerd version’s components, especially if you’re integrating with external monitoring systems.