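The Kubernetes API server is failing to connect to etcd, meaning the cluster’s control plane can’t read or write its state. Workloads that are already running generally keep running, but every operation that goes through the API (deployments, scaling, scheduling, kubectl commands) is effectively halted.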
The most common culprit is network connectivity. The API server needs to reach etcd’s client port (default 2379), and firewalls, security groups, or misconfigured network policies can block this.
Diagnosis: On a node running the API server, run nc -vz <etcd_node_ip> 2379.
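Fix: If nc times out, a firewall or security rule is most likely dropping the traffic, so check your network security rules. For example, in AWS, ensure the Security Group attached to the API server instances allows outbound traffic to the etcd nodes on port 2379, and that the etcd nodes’ Security Groups allow inbound traffic from the API server instances on port 2379. If nc reports "Connection refused" instead, the host was reached but nothing accepted the connection on that port. That usually means etcd is not running or is not listening on that address (both covered below), although a firewall that actively rejects traffic produces the same result. Opening the rules restores the necessary communication channel.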
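The nc check above can also be scripted, which helps when several etcd endpoints need to be probed from the same node. The following is a minimal sketch using only the Python standard library; the function name probe_tcp and the returned labels are illustrative choices, not part of any Kubernetes or etcd tooling.

```python
import socket


def probe_tcp(host: str, port: int = 2379, timeout: float = 3.0) -> str:
    """Try a TCP connection and report 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection performs the same TCP handshake that `nc -vz` does.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on this port.
        return "refused"
    except socket.timeout:
        # No answer at all: commonly a firewall or security group dropping packets.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host or a DNS failure.
        return f"error: {exc}"
```

For example, probe_tcp("10.0.0.5") run on an API server node returns "open" when the client port is reachable. The address 10.0.0.5 is a placeholder for one of your etcd node IPs.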
Another frequent cause is etcd not running or unhealthy. The API server can’t connect to something that isn’t there or is in a bad state.
Diagnosis: On an etcd node, run systemctl status etcd, or check the container logs if etcd runs as a container. Look for error messages or an "inactive (dead)" status.
Fix: If etcd is not running, start it with systemctl start etcd. If it’s unhealthy, check its logs for specific errors (e.g., disk full, configuration issues) and address those. Restarting etcd might be necessary: systemctl restart etcd. This ensures etcd is available and operational.
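A running process is not necessarily a healthy one, so it can help to ask etcd directly. etcd serves a /health endpoint on its client URL that returns a small JSON document such as {"health": "true"}. The sketch below assumes the endpoint is reachable without a client certificate; if etcd requires client certificate authentication, this plain request will fail and etcdctl endpoint health with the appropriate certificate flags is the more suitable tool. The function names are illustrative.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """Interpret the JSON body of etcd's /health endpoint."""
    try:
        data = json.loads(body)
    except ValueError:
        return False
    # etcd reports health as the string "true", not a JSON boolean.
    return isinstance(data, dict) and str(data.get("health")).lower() == "true"


def etcd_is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Fetch <base_url>/health and return True only for a healthy response."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # Connection errors, TLS errors, HTTP errors, and malformed URLs all count as unhealthy.
        return False
```

A call such as etcd_is_healthy("http://127.0.0.1:2379") on the etcd node returns False both when etcd is down and when it reports itself unhealthy, so the logs are still needed to find out why.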
Incorrect API server configuration pointing to the wrong etcd endpoint is also a common mistake, especially after changes or in complex setups.
Diagnosis: Inspect the API server’s static pod manifest (usually /etc/kubernetes/manifests/kube-apiserver.yaml on control plane nodes) and look for the --etcd-servers flag.
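Fix: Ensure the --etcd-servers flag lists the correct endpoints for your etcd cluster as a comma-separated list of URLs, each with a scheme (http or https, depending on TLS configuration), an IP address, and a port. For example, change https://192.168.1.10:2379,https://192.168.1.11:2379 to https://10.0.0.5:2379,https://10.0.0.6:2379 if your etcd nodes moved or were re-IP’d. Because the API server runs as a static pod, the kubelet restarts it when the manifest file changes. This tells the API server where to find etcd.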
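A quick way to catch a malformed --etcd-servers value is to pull it out of the manifest and check each entry. The kube-apiserver documentation describes the flag format as scheme://ip:port, comma separated. The sketch below treats the manifest as plain text rather than parsing YAML, and both function names are hypothetical helpers written for this example.

```python
import re
from typing import List
from urllib.parse import urlsplit


def extract_etcd_servers(manifest_text: str) -> List[str]:
    """Return the endpoints listed in the first --etcd-servers=... argument."""
    match = re.search(r"--etcd-servers=(\S+)", manifest_text)
    if not match:
        return []
    # Strip optional quoting, then split the comma-separated list.
    return [e for e in match.group(1).strip("\"'").split(",") if e]


def endpoint_problems(endpoint: str) -> List[str]:
    """List what is wrong with one endpoint; an empty list means it looks well-formed."""
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https"):
        # Without a scheme the rest cannot be parsed reliably, so stop here.
        return ["missing http:// or https:// scheme"]
    problems = []
    if not parts.hostname:
        problems.append("missing host")
    try:
        port = parts.port
    except ValueError:  # raised for a non-numeric or out-of-range port
        port = None
    if port is None:
        problems.append("missing or invalid port")
    return problems
```

Reading /etc/kubernetes/manifests/kube-apiserver.yaml into a string and passing it to extract_etcd_servers gives the list to check. This only validates the shape of each endpoint; whether the address actually belongs to an etcd member still has to be confirmed separately.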
TLS certificate issues between the API server and etcd are a frequent source of silent failures. If the certificates are expired, misconfigured, or the API server doesn’t trust etcd’s certificate, the connection will be rejected.
Diagnosis: Check the API server logs for TLS handshake errors, often mentioning "x509: certificate has expired or is not yet valid," "x509: certificate signed by unknown authority," or "remote error: tls: bad certificate."
Fix: Renew any expired certificates and ensure the API server trusts the CA that signed etcd’s certificates. In the API server manifest, the --etcd-cafile flag should point to that CA, and the --etcd-certfile and --etcd-keyfile flags should point to a valid client certificate and key that etcd accepts. This establishes a trusted TLS connection.
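Expiry is the easiest of these certificate problems to check ahead of time. The sketch below shells out to the openssl CLI for the certificate's notAfter date and converts it with ssl.cert_time_to_seconds from the standard library. It assumes openssl is installed on the node; the function names are illustrative, and the path in the usage note is the kubeadm default, which may differ in your cluster.

```python
import ssl
import subprocess
import time
from typing import Optional


def not_after_from_openssl(cert_path: str) -> str:
    """Return the certificate's notAfter date as printed by the openssl CLI."""
    out = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", cert_path],
        check=True, capture_output=True, text=True,
    ).stdout
    # The output has the form: notAfter=Jan  5 09:34:43 2018 GMT
    return out.strip().split("=", 1)[1]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining until the notAfter date; negative if already expired."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0
```

On a kubeadm control plane node, days_until_expiry(not_after_from_openssl("/etc/kubernetes/pki/apiserver-etcd-client.crt")) reports how long the API server's etcd client certificate remains valid. A negative result matches the "certificate has expired" errors in the logs.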
Resource exhaustion on etcd nodes, particularly of disk I/O or memory, can make etcd unresponsive, leading to connection timeouts for the API server.
Diagnosis: Monitor etcd node resource utilization using tools like htop, iostat, or cloud provider monitoring. Look for sustained high CPU, low free memory, or I/O wait (%iowait in iostat) exceeding 80%.
Fix: Increase the resources allocated to etcd nodes. This might involve migrating to larger instance types or optimizing disk performance (e.g., using faster SSDs). If contention remains severe, raising etcd’s heartbeat-interval and election-timeout makes the cluster more tolerant of slow disks and networks, although it does not add capacity. This ensures etcd can keep up with requests.
A misconfigured etcd peer URL (--listen-peer-urls) or client URL (--listen-client-urls) can prevent etcd members from forming a cluster or accepting client connections from the API server.
Diagnosis: On etcd nodes, check the etcd configuration file (often /etc/etcd/etcd.conf.yml or passed via command-line flags in the etcd systemd unit or pod manifest) for the listen-client-urls and listen-peer-urls settings.
Fix: Ensure listen-client-urls includes http://<node_ip>:2379 or https://<node_ip>:2379 (depending on TLS configuration) so the API server can connect. Ensure listen-peer-urls is correctly set for inter-etcd communication. For example, set listen-client-urls: http://0.0.0.0:2379 and listen-peer-urls: http://<node_ip>:2380 if etcd is not using TLS for client connections. This makes etcd listen on the correct interfaces and ports.
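Whether a given listen-client-urls value can accept the API server's connection comes down to matching scheme, port, and bind address. The sketch below encodes that check as a simplified model: it treats 0.0.0.0 and :: as "all interfaces" and otherwise requires an exact host match. It does not resolve hostnames or distinguish IPv4 from IPv6 wildcards, and the function name is a hypothetical helper for this example.

```python
from urllib.parse import urlsplit


def accepts_connection(listen_client_urls: str, target_url: str) -> bool:
    """Return True if any listen URL would accept a client dialing target_url."""
    target = urlsplit(target_url)
    for raw in listen_client_urls.split(","):
        listen = urlsplit(raw.strip())
        # A TLS listener will not accept plain HTTP clients, and vice versa.
        if listen.scheme != target.scheme or listen.port != target.port:
            continue
        # Wildcard addresses bind every interface; otherwise the host must match.
        if listen.hostname in ("0.0.0.0", "::") or listen.hostname == target.hostname:
            return True
    return False
```

For instance, accepts_connection("https://127.0.0.1:2379", "https://10.0.0.5:2379") is False, which reflects a common misconfiguration: etcd bound only to loopback while the API server dials the node IP.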
The next error you’ll likely encounter after fixing etcd connectivity is an API server failing to start due to issues with its own TLS certificates or configuration.