Kubernetes Operators aren’t just about automating deployments; they’re about teaching Kubernetes entirely new concepts by encoding domain-specific knowledge into code that Kubernetes itself can understand and manage.

Let’s see this in action. Imagine we want to manage a distributed database like etcd. Normally, you’d deploy etcd pods, maybe a StatefulSet, and then manually configure clustering, backups, and scaling. An Operator changes this.

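Here’s a simplified etcdcluster.yaml: a Custom Resource (CR) of kind EtcdCluster that an Operator might manage. The EtcdCluster kind itself is registered with the API server separately, through a Custom Resource Definition (CRD).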

apiVersion: mycompany.com/v1alpha1
kind: EtcdCluster
metadata:
  name: prod-etcd
spec:
  version: 3.5.9
  replicas: 3
  storageSize: 10Gi
  backupSchedule: "0 2 * * *"
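
On the Operator side, the spec above is usually mirrored by a Go struct. The following is a minimal sketch, assuming field names and JSON tags derived from the YAML; the Validate method and its rules (an odd replica count, a five-field cron expression) are illustrative assumptions, not part of any real etcd Operator. It uses only the standard library and decodes a JSON equivalent of the manifest.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// EtcdClusterSpec mirrors the spec block of the EtcdCluster resource.
// Field names and JSON tags are assumptions based on the YAML above.
type EtcdClusterSpec struct {
	Version        string `json:"version"`
	Replicas       int32  `json:"replicas"`
	StorageSize    string `json:"storageSize"`
	BackupSchedule string `json:"backupSchedule"`
}

// Validate applies illustrative sanity checks. A real Operator would more
// likely enforce rules like these in the CRD schema or an admission webhook.
func (s EtcdClusterSpec) Validate() error {
	if s.Version == "" {
		return errors.New("version must be set")
	}
	// Assumption: require an odd member count so the cluster has a clear quorum.
	if s.Replicas < 1 || s.Replicas%2 == 0 {
		return fmt.Errorf("replicas must be a positive odd number, got %d", s.Replicas)
	}
	// An empty schedule means "no backups"; otherwise expect five cron fields.
	if s.BackupSchedule != "" && len(strings.Fields(s.BackupSchedule)) != 5 {
		return fmt.Errorf("backupSchedule must have five cron fields, got %q", s.BackupSchedule)
	}
	return nil
}

func main() {
	// JSON equivalent of the spec section of etcdcluster.yaml.
	raw := `{"version":"3.5.9","replicas":3,"storageSize":"10Gi","backupSchedule":"0 2 * * *"}`

	var spec EtcdClusterSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		panic(err)
	}
	fmt.Println("valid:", spec.Validate() == nil)
}
```

Catching a malformed spec early keeps the reconciliation logic simpler, because it can assume the desired state is at least well-formed.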

When you run kubectl apply -f etcdcluster.yaml, the Operator’s controller kicks in. It doesn’t just create pods; it understands that prod-etcd represents a cluster of etcd members.

Internally, the Operator’s controller watches for EtcdCluster objects. When it sees prod-etcd, it performs these actions:

  1. Creates StatefulSet: It generates a StatefulSet for etcd pods, ensuring stable network identities and ordered deployment.
  2. Configures etcd: It injects a ConfigMap with etcd’s configuration, including peer discovery URLs, data directory, and the specified version.
  3. Manages Backups: It might create a Kubernetes CronJob resource that runs a backup script (e.g., etcdctl snapshot save) on the schedule defined in backupSchedule.
  4. Handles Scaling: If you change spec.replicas to 5 and re-apply, the Operator will update the StatefulSet’s replicas field. It will then register each new etcd member with the existing cluster, so the member joins the peer list instead of starting as an isolated node.
  5. Reconciliation Loop: The core of an Operator is its reconciliation loop. It repeatedly compares the desired state (defined in the EtcdCluster CR) with the actual state of the cluster (the pods, StatefulSets, and CronJobs it has created). If there’s drift (e.g., someone deleted the backup CronJob or edited the StatefulSet’s replica count by hand), the Operator corrects it.
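
The comparison at the heart of that loop can be sketched without any Kubernetes machinery. This is a minimal illustration, assuming a simplified State struct and a hypothetical Plan function that stand in for the real API objects; it only shows the "diff desired against actual, emit corrective actions" idea.

```go
package main

import "fmt"

// State is a simplified, hypothetical stand-in for what the Operator
// tracks for one etcd cluster.
type State struct {
	Replicas      int
	Version       string
	BackupCronJob bool
}

// Plan compares desired and actual state and returns the corrective
// actions needed to close the gap. An empty result means "in sync".
func Plan(desired, actual State) []string {
	var actions []string
	if actual.Replicas != desired.Replicas {
		actions = append(actions,
			fmt.Sprintf("scale StatefulSet from %d to %d replicas", actual.Replicas, desired.Replicas))
	}
	if actual.Version != desired.Version {
		actions = append(actions,
			fmt.Sprintf("update etcd image from %s to %s", actual.Version, desired.Version))
	}
	if desired.BackupCronJob && !actual.BackupCronJob {
		actions = append(actions, "create backup CronJob")
	}
	return actions
}

func main() {
	desired := State{Replicas: 5, Version: "3.5.9", BackupCronJob: true}
	actual := State{Replicas: 3, Version: "3.5.9", BackupCronJob: false}

	for _, action := range Plan(desired, actual) {
		fmt.Println(action)
	}

	// Once actual matches desired, there is nothing left to do, which is
	// what makes running the loop repeatedly safe.
	fmt.Println("actions when in sync:", len(Plan(desired, desired)))
}
```

Because the plan is derived from the current state each time, it does not matter how the drift happened; the same function handles a first install, a scale-up, and a manually deleted CronJob.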

The problem this solves is moving beyond declarative resource management to declarative application management. Instead of just saying "I want 3 pods," you’re saying "I want a production-ready etcd cluster with these specific operational characteristics."

The Operator is built using the Kubernetes client libraries (like client-go for Go, or equivalents for other languages), often through a framework such as controller-runtime. The controller watches for events on EtcdCluster custom resources (creation, update, deletion) and then uses the Kubernetes API to create, update, or delete the underlying standard Kubernetes resources (StatefulSets, Services, ConfigMaps, CronJobs, etc.) that constitute the etcd cluster.
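
One detail worth illustrating is how watch events turn into reconcile calls. In the usual controller design, events are reduced to object keys and queued, so several rapid changes to the same object lead to a single reconcile against the latest state. The sketch below is a simplified, standard-library model of that idea; the Event and Queue types are hypothetical and much simpler than the rate-limited work queues real controllers use.

```go
package main

import "fmt"

// Event is a simplified notification about an EtcdCluster object.
type Event struct {
	Type string // "create", "update", or "delete"
	Key  string // "namespace/name" of the affected object
}

// Queue holds keys awaiting reconciliation. Each key is queued at most
// once, no matter how many events arrive for it.
type Queue struct {
	order  []string
	queued map[string]bool
}

func NewQueue() *Queue {
	return &Queue{queued: map[string]bool{}}
}

// Add enqueues a key unless it is already waiting.
func (q *Queue) Add(key string) {
	if q.queued[key] {
		return
	}
	q.queued[key] = true
	q.order = append(q.order, key)
}

// Pop removes and returns the oldest key; ok is false when the queue is empty.
func (q *Queue) Pop() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.queued, key)
	return key, true
}

func main() {
	events := []Event{
		{"create", "default/prod-etcd"},
		{"update", "default/prod-etcd"},
		{"update", "default/staging-etcd"},
	}

	q := NewQueue()
	for _, e := range events {
		// The event type is dropped on purpose: the reconciler reads the
		// current state of the object instead of replaying each change.
		q.Add(e.Key)
	}

	for {
		key, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println("reconcile", key)
	}
}
```

Discarding the event type is the design choice to notice: the reconciler reacts to the state it finds, not to the history of how the object got there.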

Here’s a snippet of what the reconciliation logic might look like in Go:

// Sketch of the controller's Reconcile function, written against controller-runtime.
// The import path for the EtcdCluster API types is hypothetical.
import (
    "context"
    "fmt"

    appsv1 "k8s.io/api/apps/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    mycompanyv1alpha1 "mycompany.com/etcd-operator/api/v1alpha1"
)

func (r *EtcdClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Fetch the EtcdCluster instance; a NotFound error means it was deleted,
    // so there is nothing left to reconcile.
    var etcdCluster mycompanyv1alpha1.EtcdCluster
    if err := r.Get(ctx, req.NamespacedName, &etcdCluster); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // --- Desired State vs. Actual State Comparison ---

    // Check if the StatefulSet already exists
    var statefulSet appsv1.StatefulSet
    statefulSetKey := types.NamespacedName{Name: etcdCluster.Name, Namespace: etcdCluster.Namespace}
    if err := r.Get(ctx, statefulSetKey, &statefulSet); err != nil {
        if !apierrors.IsNotFound(err) {
            // Any other error is returned so the request is retried.
            return ctrl.Result{}, err
        }
        // Define and create the StatefulSet
        newStatefulSet := r.newStatefulSetForEtcdCluster(&etcdCluster)
        if err := r.Create(ctx, newStatefulSet); err != nil {
            return ctrl.Result{}, err
        }
        // Requeue to ensure everything is set up
        return ctrl.Result{Requeue: true}, nil
    }

    // If the StatefulSet exists, check if it needs updates (replica count, image version).
    // etcd image tags carry a "v" prefix (e.g. v3.5.9).
    desiredImage := fmt.Sprintf("quay.io/coreos/etcd:v%s", etcdCluster.Spec.Version)

    // Compare replica values, not pointers; a nil Replicas field defaults to 1.
    currentReplicas := int32(1)
    if statefulSet.Spec.Replicas != nil {
        currentReplicas = *statefulSet.Spec.Replicas
    }

    container := &statefulSet.Spec.Template.Spec.Containers[0]
    if currentReplicas != etcdCluster.Spec.Replicas || container.Image != desiredImage {
        // Update the StatefulSet
        replicas := etcdCluster.Spec.Replicas
        statefulSet.Spec.Replicas = &replicas
        container.Image = desiredImage
        if err := r.Update(ctx, &statefulSet); err != nil {
            return ctrl.Result{}, err
        }
    }

    // ... similar logic for ConfigMaps, CronJobs, etc. ...

    return ctrl.Result{}, nil
}
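
The ConfigMap logic elided above needs the peer URLs for each member. Because StatefulSet pods get stable names of the form name-ordinal, the Operator can compute the value of etcd’s --initial-cluster flag before the pods exist. The following is a rough sketch, assuming a headless Service named prod-etcd-peers, the default cluster.local DNS domain, and plain HTTP on the peer port 2380; the InitialCluster helper is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// InitialCluster builds the value for etcd's --initial-cluster flag from
// the predictable pod names of a StatefulSet.
// Assumptions: a headless Service provides per-pod DNS records, the cluster
// domain is cluster.local, and peers talk plain HTTP on port 2380.
func InitialCluster(name, service, namespace string, replicas int) string {
	members := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		pod := fmt.Sprintf("%s-%d", name, i)
		peerURL := fmt.Sprintf("http://%s.%s.%s.svc.cluster.local:2380", pod, service, namespace)
		members = append(members, pod+"="+peerURL)
	}
	return strings.Join(members, ",")
}

func main() {
	// Print one member per line for readability.
	value := InitialCluster("prod-etcd", "prod-etcd-peers", "default", 3)
	for _, member := range strings.Split(value, ",") {
		fmt.Println(member)
	}
}
```

A production setup would normally use TLS for peer traffic, so the scheme and any certificate configuration would differ from this sketch.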

The most surprising aspect is how the Operator pattern keeps complex, application-specific stateful logic out of core Kubernetes components such as the scheduler and controller-manager. Kubernetes remains a general-purpose orchestrator, while each application gets its own "brain": a controller that typically runs as an ordinary workload in the cluster and extends the control plane through the API server. This brings stateful systems much closer to the manageability of stateless deployments.

The next step is understanding how to manage the lifecycle of the Operator itself, often through the Operator Lifecycle Manager (OLM), whose OperatorGroup resource defines which namespaces the Operator watches.
