Kubernetes Operators aren’t just about automating deployments; they’re about teaching Kubernetes entirely new concepts by encoding domain-specific knowledge into code that Kubernetes itself can understand and manage.
Let’s see this in action. Imagine we want to manage a distributed database like etcd. Normally, you’d deploy etcd pods, maybe a StatefulSet, and then manually configure clustering, backups, and scaling. An Operator changes this.
Here’s a simplified etcdcluster.yaml, a Custom Resource (CR) that an Operator might manage. The EtcdCluster kind itself is registered separately through a Custom Resource Definition (CRD):
apiVersion: mycompany.com/v1alpha1
kind: EtcdCluster
metadata:
  name: prod-etcd
spec:
  version: 3.5.9
  replicas: 3
  storageSize: 10Gi
  backupSchedule: "0 2 * * *"
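On the Operator side, a manifest like this is usually decoded into Go types. The following is a minimal sketch using only the standard library: the struct names, field types, and the parseEtcdCluster helper are assumptions for illustration, and ObjectMeta is a stand-in for the much larger Kubernetes type. It decodes the JSON equivalent of the YAML above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EtcdClusterSpec mirrors the spec fields of the example manifest.
// Field names and types are assumptions, not a published API.
type EtcdClusterSpec struct {
	Version        string `json:"version"`
	Replicas       int32  `json:"replicas"`
	StorageSize    string `json:"storageSize"`
	BackupSchedule string `json:"backupSchedule,omitempty"`
}

// ObjectMeta is a simplified stand-in for the real Kubernetes metadata type.
type ObjectMeta struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

// EtcdCluster is the hypothetical custom resource type.
type EtcdCluster struct {
	APIVersion string          `json:"apiVersion"`
	Kind       string          `json:"kind"`
	Metadata   ObjectMeta      `json:"metadata"`
	Spec       EtcdClusterSpec `json:"spec"`
}

// parseEtcdCluster decodes the JSON form of an EtcdCluster manifest.
func parseEtcdCluster(data []byte) (EtcdCluster, error) {
	var c EtcdCluster
	err := json.Unmarshal(data, &c)
	return c, err
}

func main() {
	manifest := []byte(`{
		"apiVersion": "mycompany.com/v1alpha1",
		"kind": "EtcdCluster",
		"metadata": {"name": "prod-etcd"},
		"spec": {
			"version": "3.5.9",
			"replicas": 3,
			"storageSize": "10Gi",
			"backupSchedule": "0 2 * * *"
		}
	}`)

	c, err := parseEtcdCluster(manifest)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %d replicas of etcd %s\n", c.Metadata.Name, c.Spec.Replicas, c.Spec.Version)
}
```

In a real project these types would normally embed the Kubernetes metadata types and be generated alongside the CRD schema rather than written by hand.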
When you kubectl apply -f etcdcluster.yaml, the Operator’s controller kicks in. It doesn’t just create pods; it understands that prod-etcd represents a cluster of etcd members.
Internally, the Operator’s controller watches for EtcdCluster objects. When it sees prod-etcd, it performs these actions:
- Creates StatefulSet: It generates a StatefulSet for etcd pods, ensuring stable network identities and ordered deployment.
- Configures etcd: It injects a ConfigMap with etcd’s configuration, including peer discovery URLs, data directory, and the specified version.
- Manages Backups: It might create a Kubernetes CronJob resource that runs a backup script (e.g., etcdctl snapshot save) on the schedule defined in backupSchedule.
- Handles Scaling: If you change spec.replicas to 5 and re-apply, the Operator will update the StatefulSet’s replicas field. It will then ensure new etcd members are added to the cluster’s peer list.
- Reconciliation Loop: The core of an Operator is its reconciliation loop. It constantly compares the desired state (defined in the EtcdCluster CR) with the actual state of the cluster (the pods, StatefulSets, CronJobs it has created). If there’s a drift (e.g., someone deleted the backup CronJob or edited the StatefulSet’s replica count by hand), the Operator corrects it.
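The compare-and-correct idea behind these steps can be sketched without any Kubernetes dependencies. In this toy model, DesiredState, ActualState, and reconcile are hypothetical names, and the "cluster" is just a struct in memory. The point is only that each pass looks at the current state rather than at what changed, and that a pass with nothing to do returns no actions.

```go
package main

import "fmt"

// DesiredState is what the EtcdCluster spec asks for.
type DesiredState struct {
	Replicas       int
	Version        string
	BackupSchedule string
}

// ActualState is a toy stand-in for what currently exists in the cluster.
type ActualState struct {
	StatefulSetExists bool
	Replicas          int
	Version           string
	CronJobSchedule   string // empty means no backup CronJob exists
}

// reconcile compares desired with actual, applies corrections to actual,
// and returns a description of each action taken. An empty result means
// the two states already match.
func reconcile(desired DesiredState, actual *ActualState) []string {
	var actions []string

	if !actual.StatefulSetExists {
		actual.StatefulSetExists = true
		actual.Replicas = desired.Replicas
		actual.Version = desired.Version
		actions = append(actions, "create StatefulSet")
	} else {
		if actual.Replicas != desired.Replicas {
			actions = append(actions, fmt.Sprintf("scale StatefulSet %d -> %d", actual.Replicas, desired.Replicas))
			actual.Replicas = desired.Replicas
		}
		if actual.Version != desired.Version {
			actions = append(actions, fmt.Sprintf("upgrade etcd %s -> %s", actual.Version, desired.Version))
			actual.Version = desired.Version
		}
	}

	if actual.CronJobSchedule != desired.BackupSchedule {
		actual.CronJobSchedule = desired.BackupSchedule
		actions = append(actions, "sync backup CronJob")
	}
	return actions
}

func main() {
	desired := DesiredState{Replicas: 3, Version: "3.5.9", BackupSchedule: "0 2 * * *"}
	actual := &ActualState{}

	fmt.Println(reconcile(desired, actual)) // first pass creates everything

	desired.Replicas = 5
	fmt.Println(reconcile(desired, actual)) // second pass only scales

	actual.CronJobSchedule = "" // simulate someone deleting the CronJob
	fmt.Println(reconcile(desired, actual)) // drift is repaired
}
```

Because reconcile is idempotent, it can be called as often as needed, which is what makes the pattern tolerant of missed or repeated events.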
The problem this solves is moving beyond declarative resource management to declarative application management. Instead of just saying "I want 3 pods," you’re saying "I want a production-ready etcd cluster with these specific operational characteristics."
The Operator is built using the Kubernetes client libraries (like client-go for Go, or equivalents for other languages), often through a framework such as controller-runtime. The controller watches for events on EtcdCluster resources (creation, update, deletion) and then uses the Kubernetes API to create, update, or delete the underlying standard Kubernetes resources (StatefulSets, Services, ConfigMaps, CronJobs, etc.) that constitute the etcd cluster.
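One way to picture the path from watch events to reconcile calls is a de-duplicating work queue keyed by namespace and name. The sketch below is a simplified, single-threaded illustration of that idea, not the client-go implementation; Event, Queue, Add, and Drain are hypothetical names. The reconciler receives only the object key, so several events for the same object collapse into one reconcile call that reads the latest state.

```go
package main

import "fmt"

// Event is a simplified watch event for an EtcdCluster object.
type Event struct {
	Type      string // "ADDED", "MODIFIED", or "DELETED"
	Namespace string
	Name      string
}

// Queue is a toy de-duplicating work queue keyed by "namespace/name".
type Queue struct {
	pending map[string]bool
	order   []string
}

// NewQueue returns an empty queue.
func NewQueue() *Queue {
	return &Queue{pending: map[string]bool{}}
}

// Add enqueues the key for an event. Repeated events for the same object
// collapse into a single pending entry.
func (q *Queue) Add(e Event) {
	key := e.Namespace + "/" + e.Name
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Drain calls reconcile once per pending key in arrival order and
// returns the number of calls made.
func (q *Queue) Drain(reconcile func(key string)) int {
	calls := 0
	for len(q.order) > 0 {
		key := q.order[0]
		q.order = q.order[1:]
		delete(q.pending, key)
		reconcile(key)
		calls++
	}
	return calls
}

func main() {
	q := NewQueue()
	q.Add(Event{Type: "ADDED", Namespace: "default", Name: "prod-etcd"})
	q.Add(Event{Type: "MODIFIED", Namespace: "default", Name: "prod-etcd"})
	q.Add(Event{Type: "ADDED", Namespace: "default", Name: "staging-etcd"})

	calls := q.Drain(func(key string) {
		fmt.Println("reconcile", key)
	})
	fmt.Println("reconcile calls:", calls)
}
```

Passing the key instead of the event keeps the reconciler level-triggered: it always acts on what exists now, not on a history of changes.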
Here’s a snippet of what the reconciliation logic might look like in Go:
// Imports assumed by this snippet. The mycompanyv1alpha1 module path is hypothetical.
import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	mycompanyv1alpha1 "mycompany.com/etcd-operator/api/v1alpha1"
)

// Inside the controller's Reconcile function
func (r *EtcdClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Fetch the EtcdCluster instance
	var etcdCluster mycompanyv1alpha1.EtcdCluster
	if err := r.Get(ctx, req.NamespacedName, &etcdCluster); err != nil {
		// The object may have been deleted after the request was queued; that is not an error.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// --- Desired State vs. Actual State Comparison ---
	// Check if the StatefulSet already exists
	var statefulSet appsv1.StatefulSet
	statefulSetKey := types.NamespacedName{Name: etcdCluster.Name, Namespace: etcdCluster.Namespace}
	if err := r.Get(ctx, statefulSetKey, &statefulSet); err != nil {
		if !apierrors.IsNotFound(err) {
			// Any other error: return it so the request is retried.
			return ctrl.Result{}, err
		}
		// Define and create the StatefulSet
		newStatefulSet := r.newStatefulSetForEtcdCluster(&etcdCluster)
		if err := r.Create(ctx, newStatefulSet); err != nil {
			return ctrl.Result{}, err
		}
		// Requeue to ensure everything is set up
		return ctrl.Result{Requeue: true}, nil
	}

	// If StatefulSet exists, check if it needs updates (e.g., replica count, image version).
	// Spec.Replicas on a StatefulSet is a *int32, so compare the value, not the pointer.
	// This assumes etcdCluster.Spec.Replicas is declared as int32.
	desiredReplicas := etcdCluster.Spec.Replicas
	desiredImage := fmt.Sprintf("quay.io/coreos/etcd:%s", etcdCluster.Spec.Version)
	if statefulSet.Spec.Replicas == nil || *statefulSet.Spec.Replicas != desiredReplicas ||
		statefulSet.Spec.Template.Spec.Containers[0].Image != desiredImage {
		// Update the StatefulSet
		statefulSet.Spec.Replicas = &desiredReplicas
		statefulSet.Spec.Template.Spec.Containers[0].Image = desiredImage
		if err := r.Update(ctx, &statefulSet); err != nil {
			return ctrl.Result{}, err
		}
	}

	// ... similar logic for ConfigMaps, CronJobs, etc. ...
	return ctrl.Result{}, nil
}
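The update check is worth isolating, because the replica count on a StatefulSet spec is a pointer (*int32) and comparing two pointers tests their addresses, not their values. Below is a small standalone sketch of that comparison. The needsUpdate and etcdImage helpers are hypothetical names, and the desired replica count is assumed to be a plain int32.

```go
package main

import "fmt"

// etcdImage builds the container image reference for a given etcd version,
// using the same registry path as the reconcile snippet.
func etcdImage(version string) string {
	return fmt.Sprintf("quay.io/coreos/etcd:%s", version)
}

// needsUpdate reports whether the live StatefulSet fields differ from the
// desired ones. currentReplicas mimics the *int32 field on a StatefulSet
// spec, where nil means the field is unset.
func needsUpdate(currentReplicas *int32, currentImage string, desiredReplicas int32, desiredVersion string) bool {
	// Dereference before comparing: two pointers to equal values are
	// still different pointers.
	if currentReplicas == nil || *currentReplicas != desiredReplicas {
		return true
	}
	return currentImage != etcdImage(desiredVersion)
}

func main() {
	three := int32(3)
	image := etcdImage("3.5.9")

	fmt.Println(needsUpdate(&three, image, 3, "3.5.9"))  // false: nothing changed
	fmt.Println(needsUpdate(&three, image, 5, "3.5.9"))  // true: replica count differs
	fmt.Println(needsUpdate(&three, image, 3, "3.5.10")) // true: version differs
	fmt.Println(needsUpdate(nil, image, 3, "3.5.9"))     // true: replicas unset
}
```

Keeping the comparison in a pure function like this also makes it easy to unit test without a cluster.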
The most surprising aspect is how the Operator pattern keeps the complex, stateful logic of managing a specific application out of the core Kubernetes scheduler and controller-manager. Kubernetes remains a general-purpose orchestrator, while each application gets its own "brain": a controller that runs as an ordinary workload in the cluster and extends the control plane through the API, making stateful systems nearly as manageable as stateless deployments.
The next step is understanding how to manage the lifecycle of the Operator itself, often through the Operator Lifecycle Manager (OLM), whose OperatorGroup resource defines which namespaces the Operator watches.