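etcd is the distributed key-value store that Kubernetes uses to store all of its cluster data. If etcd goes down, the API server can no longer read or change cluster state. Workloads that are already running may keep running, but the control plane is effectively dead.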

Let’s see etcd in action by taking a quick backup.

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save backup.db

This command uses etcdctl, the command-line client for etcd. We’re telling it to connect to the etcd endpoint (usually on port 2379), using the cluster’s CA certificate, the etcd server certificate, and its private key. The snapshot save backup.db part tells it to create a snapshot file named backup.db.
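
Before relying on a snapshot, it helps to confirm that the file is non-empty and that etcd can read it. The following is a minimal sketch rather than a definitive procedure. The function name check_snapshot is made up for illustration, and it assumes etcdctl is on the PATH. The snapshot status subcommand reads the local file, so it needs no endpoint or certificate flags.

```shell
#!/usr/bin/env bash
# check_snapshot is a hypothetical helper name, not part of etcdctl.
check_snapshot() {
  local snapshot="$1"

  # A missing or zero-byte file means the save did not complete.
  if [ ! -s "$snapshot" ]; then
    echo "snapshot '$snapshot' is missing or empty" >&2
    return 1
  fi

  # Prints the hash, revision, total keys, and size of the snapshot.
  # Newer etcd releases move this subcommand to etcdutl.
  ETCDCTL_API=3 etcdctl snapshot status "$snapshot" --write-out=table
}
```

Calling check_snapshot backup.db right after the save gives a quick sanity check before the file is copied off the node.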

Now, let’s imagine we’ve had a catastrophic failure and need to restore this backup.

ETCDCTL_API=3 etcdctl snapshot restore backup.db \
  --data-dir=/var/lib/etcd

This command reads backup.db and writes a fresh etcd data directory at /var/lib/etcd. Unlike snapshot save, restore is an offline operation: it works on local files and never contacts a running etcd server, so the endpoint and certificate flags are not needed. It also refuses to write into a data directory that already exists, so stop etcd and move the old directory aside before running it. In etcd 3.5 and later the same subcommand is available as etcdutl snapshot restore, and the etcdctl version is deprecated.

The core problem etcd backup and restore solves is data durability for your Kubernetes cluster. Without a backup, losing the disk of a single-member etcd, or losing quorum in a multi-member one, could mean losing all your deployment configurations, pod status, service definitions, and more. etcd is the single source of truth for your entire cluster.

Internally, etcd uses the Raft consensus algorithm to keep data consistent across multiple nodes. When you take a snapshot, you get a point-in-time copy of the etcd key-value store as seen by the member you connected to. The restore process builds a new data directory from that snapshot; it does not patch a running etcd. For a restore to succeed, stop the etcd service, move the old data directory aside (or delete it), run the restore, and then start etcd against the restored directory.
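
The stop, move aside, restore, restart sequence can be sketched as a small script. This is an illustration under stated assumptions, not the one correct procedure: it assumes a kubeadm-style node where etcd runs as a static pod with its manifest at /etc/kubernetes/manifests/etcd.yaml, and the names run, restore_single_node, MANIFEST, DATA_DIR, and DRY_RUN are invented for this example. It defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of a single-node restore on a kubeadm-style control plane.
# Assumptions: etcd runs as a static pod whose manifest is MANIFEST,
# and its data lives in DATA_DIR. Adjust both for your cluster.
MANIFEST="${MANIFEST:-/etc/kubernetes/manifests/etcd.yaml}"
DATA_DIR="${DATA_DIR:-/var/lib/etcd}"

# With DRY_RUN=1 (the default here) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restore_single_node() {
  local snapshot="$1"
  local stamp
  stamp="$(date +%Y%m%d%H%M%S)"

  # 1. Stop etcd: the kubelet stops a static pod once its manifest is gone.
  run mv "$MANIFEST" "/tmp/etcd.yaml.$stamp"
  # A real run must wait here until the etcd container has exited.

  # 2. Keep the old data directory rather than deleting it.
  run mv "$DATA_DIR" "$DATA_DIR.bak.$stamp"

  # 3. Build a fresh data directory from the snapshot (offline, no certs).
  run env ETCDCTL_API=3 etcdctl snapshot restore "$snapshot" --data-dir="$DATA_DIR"

  # 4. Start etcd again by putting the manifest back.
  run mv "/tmp/etcd.yaml.$stamp" "$MANIFEST"
}
```

Running restore_single_node backup.db prints the four steps for review. Setting DRY_RUN=0 would execute them, which only makes sense after adding the wait noted in step 1.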

You control the backup process through etcdctl commands, specifying the endpoint, certificates for secure communication, and the output file for the snapshot. For restores, you point etcdctl to the snapshot file and the desired data directory. The certificates matter only for the backup, where they must be ones your etcd server trusts; a restore works on local files and needs none.
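
Repeating the connection flags on every command gets tedious. etcdctl also accepts them as ETCDCTL_-prefixed environment variables, so they can be stated once. The sketch below uses that to write timestamped snapshots. The helper names snapshot_path and take_snapshot and the etcd-DATE.db naming scheme are assumptions made for this example; the default paths are the ones used earlier in this article.

```shell
#!/usr/bin/env bash
# Connection settings, read by etcdctl from the environment.
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="${ETCDCTL_ENDPOINTS:-https://127.0.0.1:2379}"
export ETCDCTL_CACERT="${ETCDCTL_CACERT:-/etc/kubernetes/pki/etcd/ca.crt}"
export ETCDCTL_CERT="${ETCDCTL_CERT:-/etc/kubernetes/pki/etcd/server.crt}"
export ETCDCTL_KEY="${ETCDCTL_KEY:-/etc/kubernetes/pki/etcd/server.key}"

# snapshot_path (hypothetical helper) builds a timestamped file name.
snapshot_path() {
  local dir="$1"
  echo "$dir/etcd-$(date +%Y%m%d-%H%M%S).db"
}

# take_snapshot (hypothetical helper) saves a snapshot into the given
# directory and prints the path of the file on success.
take_snapshot() {
  local dir="$1"
  local target
  target="$(snapshot_path "$dir")"
  mkdir -p "$dir" || return 1
  etcdctl snapshot save "$target" >&2 && echo "$target"
}
```

A call such as take_snapshot /var/backups/etcd (the directory is a placeholder) leaves one file per run, so an older snapshot is never overwritten by a newer, possibly bad one.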

The most surprising thing about etcd backups is that a snapshot alone doesn’t guarantee a recoverable cluster once etcd has more than one member. A snapshot is a point-in-time copy taken from a single member, so take it from a healthy member of a cluster that has quorum. On restore, every member must be restored from the same snapshot file, each into its own new data directory. The restore discards the old member and cluster IDs and starts a new logical cluster, so you cannot restore one member and have it rejoin peers that still hold the old data. Restoring into a different cluster configuration also means passing the new member names and peer URLs to the restore command.
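
To make the multi-member case concrete, here is a sketch that prints the restore command for each member of a hypothetical three-member cluster. The member names m1 to m3, the 10.0.0.x peer URLs, the token value, and the helper name restore_command_for are all placeholders invented for this example. The flags themselves (--name, --initial-cluster, --initial-cluster-token, --initial-advertise-peer-urls) are standard snapshot restore flags.

```shell
#!/usr/bin/env bash
# Hypothetical three-member cluster; names and peer URLs are placeholders.
INITIAL_CLUSTER="m1=https://10.0.0.1:2380,m2=https://10.0.0.2:2380,m3=https://10.0.0.3:2380"
CLUSTER_TOKEN="etcd-cluster-restored"   # made-up token for the new logical cluster
DATA_DIR="/var/lib/etcd"

# Prints the restore command to run on one member. Every member uses the
# same snapshot file, INITIAL_CLUSTER, and token; only the member name and
# the advertised peer URL differ.
restore_command_for() {
  local name="$1"
  local peer_url="$2"
  local snapshot="$3"
  printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --name=%s --initial-cluster=%s --initial-cluster-token=%s --initial-advertise-peer-urls=%s --data-dir=%s\n' \
    "$snapshot" "$name" "$INITIAL_CLUSTER" "$CLUSTER_TOKEN" "$peer_url" "$DATA_DIR"
}
```

For example, restore_command_for m1 https://10.0.0.1:2380 backup.db prints the command for the first member. Printing rather than executing keeps the sketch safe to run anywhere and makes it easy to compare the three commands side by side.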

The next concept you’ll want to understand is how to automate these backups to run regularly, preventing data loss in the first place.
