StatefulSets are Kubernetes’ way of making sure your stateful applications, like databases or message queues, behave predictably and reliably. The most surprising thing is that ordered startup and shutdown is not their core guarantee; stable identity and stable storage are. Ordering is a default behavior that can be switched off, while the identity and storage guarantees always hold.
Let’s see it in action. Imagine you need a simple distributed key-value store. We’ll use a basic etcd cluster, which is a perfect example of a stateful application.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: "etcd-headless"  # governing headless Service, created separately
  replicas: 3
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.4.13
          env:
            # $(POD_NAME) and $(NAMESPACE) in args are expanded from these
            # variables, which are populated through the downward API.
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /usr/local/bin/etcd
          args:
            - --name=$(POD_NAME)
            - --listen-client-urls=http://0.0.0.0:2379
            - --advertise-client-urls=http://$(POD_NAME).etcd-headless.$(NAMESPACE).svc.cluster.local:2379
            - --listen-peer-urls=http://0.0.0.0:2380
            - --initial-advertise-peer-urls=http://$(POD_NAME).etcd-headless.$(NAMESPACE).svc.cluster.local:2380
            # Assumed intent: compact automatically, keeping the last 10000 revisions.
            - --auto-compaction-mode=revision
            - --auto-compaction-retention=10000
            - --initial-cluster=etcd-0=http://etcd-0.etcd-headless.$(NAMESPACE).svc.cluster.local:2380,etcd-1=http://etcd-1.etcd-headless.$(NAMESPACE).svc.cluster.local:2380,etcd-2=http://etcd-2.etcd-headless.$(NAMESPACE).svc.cluster.local:2380
            - --initial-cluster-state=new
            - --data-dir=/var/run/etcd
          ports:
            - containerPort: 2379
              name: client
            - containerPort: 2380
              name: peer
          volumeMounts:
            - name: etcd-data
              mountPath: /var/run/etcd
  volumeClaimTemplates:
    - metadata:
        name: etcd-data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
```
Here’s what’s happening:
- serviceName: "etcd-headless": This is crucial. It names the headless Service that governs the StatefulSet; the Service itself has to be created separately. A headless Service doesn’t get a cluster IP. Instead, its DNS entry points directly to the Pod IPs, and each Pod also gets its own DNS record. This allows Pods to discover each other using stable DNS names like etcd-0.etcd-headless.default.svc.cluster.local.
- replicas: 3: We want three instances of our etcd node.
- selector: This tells Kubernetes which Pods belong to this StatefulSet.
- template: This is the standard Pod template, but notice the command and args. We’re using environment variables like $(POD_NAME) and $(NAMESPACE). Kubernetes expands $(VAR) references in command and args from the container’s environment, but it does not define these two variables on its own: they have to be declared under env, typically through the downward API (a fieldRef on metadata.name and metadata.namespace). $(POD_NAME) then resolves to etcd-0, etcd-1, etcd-2, etc., based on the Pod’s ordinal index.
- volumeClaimTemplates: This is where the "stateful" magic happens. For each replica, Kubernetes will create a PersistentVolumeClaim (PVC) based on this template. Each Pod gets its own PVC, named etcd-data-etcd-0, etcd-data-etcd-1, and so on. This PVC is then bound to a PersistentVolume (PV), ensuring that the data written to /var/run/etcd persists even if the Pod is rescheduled or restarted.
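The StatefulSet only refers to etcd-headless by name, so a matching Service has to exist for the per-Pod DNS names to resolve. A minimal sketch of such a Service might look like the following; the port names simply mirror the container ports in the StatefulSet, and the rest is an assumption about a reasonable default rather than the only valid layout.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd-headless     # must match spec.serviceName of the StatefulSet
spec:
  clusterIP: None         # "None" is what makes the Service headless
  selector:
    app: etcd             # same label as the Pod template
  ports:
    - name: client
      port: 2379
    - name: peer
      port: 2380
```

If a readiness probe is added to the etcd container later, setting publishNotReadyAddresses: true on this Service is one common way to keep peer DNS records resolvable before the Pods report Ready.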
When you apply this, you’ll see Pods named etcd-0, etcd-1, and etcd-2 being created. Under the default OrderedReady pod management policy they come up in order: etcd-0 first, then etcd-1, then etcd-2, each one waiting until its predecessor is Running and Ready. When you scale down, Pods are removed in reverse order: etcd-2, then etcd-1, then etcd-0. Note that deleting the StatefulSet itself does not guarantee any termination order. Independently of ordering, etcd-0 will always have the identity etcd-0 and its associated storage, etcd-1 will always be etcd-1 with its storage, and so on. This stable identity is what allows distributed systems to form clusters reliably.
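Ordering is controlled by the podManagementPolicy field of the StatefulSet spec. As a sketch, the fragment below shows how the etcd StatefulSet could opt out of ordered creation and removal; whether that is safe depends on the application, so treat it as an illustration of the knob rather than a recommendation for etcd.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: "etcd-headless"
  replicas: 3
  # Default is OrderedReady. Parallel creates and deletes Pods without
  # waiting for their neighbors; names and PVCs stay exactly the same.
  podManagementPolicy: Parallel
  # selector, template, and volumeClaimTemplates are unchanged and omitted here
```

This setting only affects scaling operations; rolling updates still follow the configured update strategy.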
The etcd configuration uses these stable DNS names to discover peers: $(POD_NAME).etcd-headless.$(NAMESPACE).svc.cluster.local. For etcd-0 running in the default namespace, this expands to etcd-0.etcd-headless.default.svc.cluster.local, which Kubernetes’ DNS maps to the IP address of the etcd-0 Pod.
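To make the substitution concrete, here is what the identity-dependent arguments would look like after expansion for one Pod. This assumes the Pod is etcd-1 and that the StatefulSet was deployed into the default namespace; in any other namespace the middle part of the hostname changes accordingly.

```yaml
# Effective arguments for Pod etcd-1 in the "default" namespace (assumed)
args:
  - --name=etcd-1
  - --advertise-client-urls=http://etcd-1.etcd-headless.default.svc.cluster.local:2379
  - --initial-advertise-peer-urls=http://etcd-1.etcd-headless.default.svc.cluster.local:2380
```

The --initial-cluster list is identical on every Pod, so each member can find the others by name even though their IP addresses may change across restarts.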
A common misconception is that ordering is the main thing StatefulSets offer. While they do provide ordered deployment and scaling by default, the primary benefit for stateful apps is the stable, unique network identifier and the stable, persistent storage associated with each Pod.
A crucial detail that trips many people up is how the volumeClaimTemplates work with ReadWriteOnce (RWO) access modes. RWO means the underlying PersistentVolume can only be mounted by a single node at a time. If a StatefulSet Pod is rescheduled to a different node, Kubernetes has to detach the volume from the old node before it can be attached to the new one. This process can take time and is a common bottleneck for stateful application scaling or recovery.
The next concept you’ll likely grapple with is managing upgrades and rollbacks for StatefulSets, especially when dealing with complex stateful applications.