Flux CD, by default, will happily apply your Kubernetes manifests even if the resources they create are unhealthy. A Kustomization is reported as Ready as soon as the apply succeeds, so your Deployments could be stuck in ImagePullBackOff or your Pods could never pass their readiness probes, and Flux won’t tell you until you manually check.

Here’s how to make Flux wait for your resources to be healthy before declaring success:

Making Flux Wait for Healthy Resources

Flux’s reconciliation loop is designed to be idempotent: it applies your desired state and moves on. To introduce a health check, we need to leverage Flux’s ability to monitor the status of Kubernetes resources after they’ve been applied.

The core mechanism for this is built into the Kustomization API (kustomize.toolkit.fluxcd.io); Flux has no separate HealthCheck custom resource. A Kustomization can list specific Kubernetes resources (like a Deployment, StatefulSet, or DaemonSet) under spec.healthChecks, or set spec.wait: true to cover everything it applies. The kustomize-controller then waits, up to spec.timeout, for those resources to become healthy and uses the result as part of the Kustomization’s overall Ready status.

1. Checking the Kustomize Controller

Health checks are performed by the kustomize-controller, so there is no separate health controller to install. It’s part of the standard Flux installation. You can verify that it is running by checking for the kustomize-controller deployment in the flux-system namespace:

kubectl get deployment -n flux-system kustomize-controller

If it’s not there, you’ll need to install or upgrade your Flux components.

2. Declaring Health Checks on a Kustomization

For each critical resource you want Flux to monitor for health, add an entry to the spec.healthChecks list of the Kustomization that applies it.

Example: Monitoring a Deployment

Let’s say you have a Deployment named my-app-deployment in the default namespace, applied by a Kustomization named my-app. You’d declare the health check like this:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system # Namespace of the Kustomization
spec:
  # How often to reconcile, which includes re-running the health checks
  interval: 10m
  # How long to wait for the apply and health checks before failing (e.g., 5 minutes)
  timeout: 5m
  path: ./apps/my-app/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: my-git-repo
  # The resources to monitor
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-app-deployment
      namespace: default # Namespace of the resource to monitor

Explanation of Fields:

  • healthChecks: The resources you want to monitor. Each entry requires apiVersion, kind, and name. Set namespace explicitly when the resource lives in a different namespace than the Kustomization.
  • interval: How frequently Flux reconciles the Kustomization, which includes re-running the health checks. 10m is a common starting point.
  • timeout: The maximum time Flux will wait for the apply and the health checks to finish. If it is exceeded, the reconciliation is marked as failed. If omitted, it defaults to the value of interval. 5m is a reasonable default for most applications.

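Common Resource Types and Their Health Checks:

Flux delegates health assessment to the kstatus library, which understands the built-in Kubernetes kinds.

  • Deployments: Healthy once the rollout is complete, meaning the controller has observed the latest generation and the updated, ready, and available replica counts all match spec.replicas.
  • StatefulSets: Similar to Deployments, it checks that the ready and updated replicas match the desired count.
  • DaemonSets: Checks that status.numberReady matches status.desiredNumberScheduled.
  • Services: The assessment does not inspect endpoints, so a Service is usually reported healthy as soon as it exists, even if its backing Pods are not. Checking a Service is therefore rarely useful.
  • Custom Resources: Assessed through their status. A custom resource whose controller sets status.observedGeneration and a Ready condition under status.conditions can be health-checked like a built-in kind.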
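
As a rough sketch of how several of these kinds might be monitored together, the fragment below lists a Deployment, a StatefulSet, and a Flux HelmRelease (a custom resource that reports a Ready condition) under spec.healthChecks of a Flux Kustomization. Every name except my-app-deployment is hypothetical, and the HelmRelease apiVersion is an assumption that should be adjusted to whatever version your cluster serves.

```yaml
# Fragment of a Kustomization spec (kustomize.toolkit.fluxcd.io/v1).
spec:
  timeout: 5m                  # upper bound for all checks below
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: my-app-deployment
      namespace: default
    - apiVersion: apps/v1
      kind: StatefulSet
      name: my-app-db          # hypothetical name
      namespace: default
    - apiVersion: helm.toolkit.fluxcd.io/v2   # assumed version
      kind: HelmRelease
      name: my-app-chart       # hypothetical name
      namespace: default
```

The Kustomization only becomes Ready when every listed resource is healthy, so a single slow StatefulSet is enough to hold the whole reconciliation until the timeout.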

3. Integrating Health Checks with Kustomizations

Health checks are configured on the Kustomization itself, so there is nothing separate to reference. Two fields matter here. Setting spec.wait: true tells Flux to health-check every resource the Kustomization applies (when it is set, spec.healthChecks is ignored). The spec.dependsOn field lists other Kustomizations that must be Ready before this one is reconciled; it cannot point at an individual health check.

Example Kustomization:

Assume your Kustomization is defined in clusters/my-cluster/flux-system/kustomization.yaml and it applies the my-app-deployment.

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system # Namespace of the Kustomization
spec:
  interval: 10m
  timeout: 5m # Fail the reconciliation if resources are not healthy in time
  path: ./apps/my-app/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: my-git-repo
  # This is the key part: wait for every applied resource to become healthy
  wait: true
  # Optional: reconcile only after another Kustomization is Ready
  dependsOn:
    - name: infrastructure # Hypothetical Kustomization name
      namespace: flux-system # Namespace of that Kustomization

Explanation of dependsOn:

  • name: The name of the Kustomization this one depends on.
  • namespace: The namespace of that Kustomization. If omitted, the namespace of the current Kustomization is used.

When Flux reconciles this Kustomization, it applies the manifests and then waits for the resulting resources to become healthy. If they are not healthy within the timeout, the Kustomization’s Ready condition is set to False and Flux retries on a later reconciliation. Only when the health checks pass does Flux mark the Kustomization as Ready, and any Kustomization that lists it under dependsOn stays on hold until then.

4. Troubleshooting Health Checks

If your Kustomization isn’t becoming Ready, you can inspect its status:

kubectl get kustomization -n flux-system my-app -o yaml

Look for the .status.conditions field. When a health check fails, the Ready condition is False and carries a message explaining why, typically naming the resource that did not become healthy in time. Common reasons include:

  • Deployment not scaling up: Check the Deployment’s Pods for errors (ImagePullBackOff, CrashLoopBackOff).
  • Pods not becoming ready: Ensure containers are starting, passing readiness probes, and not crashing.
  • Timeout exceeded: The spec.timeout on the Kustomization might be too short for your application’s startup time. Increase it.
  • Incorrect healthChecks entry: Double-check the apiVersion, kind, name, and namespace of each entry to ensure they exactly match your resource.
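
For the timeout case in particular, a minimal sketch of the adjustment on a Flux Kustomization might look like the fragment below. The durations are illustrative assumptions rather than recommendations; retryInterval is the Kustomization field that controls how soon a failed reconciliation is retried.

```yaml
# Fragment of a Kustomization spec (kustomize.toolkit.fluxcd.io/v1).
spec:
  interval: 10m       # regular reconciliation interval
  retryInterval: 2m   # retry sooner after a failure (illustrative value)
  timeout: 10m        # raised to allow for slow startup (illustrative value)
```

A longer timeout delays failure reporting by the same amount, so it is worth measuring how long the application actually takes to become ready before raising it.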

5. The Next Problem: Service Availability

Once your Deployments are healthy, the next logical step is to ensure your Services can actually route traffic to those healthy Pods. A health check on the Service itself says little here, because a Service is usually reported healthy as soon as it exists, so you’ll more commonly rely on the health of the Pods behind it. If your application uses Ingress, you’ll also want to ensure the Ingress controller is healthy and the Ingress resource itself is correctly configured. Beyond that, the next challenge is confirming that external access to your services works as expected, which Flux’s health checks do not cover.
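
One way to approach the Ingress controller part is to put the controller in its own Flux Kustomization and make the application depend on it. The sketch below assumes a hypothetical ingress-controller Kustomization and a hypothetical ./infrastructure/ingress path in the same Git repository; only the my-app values come from the examples in this article.

```yaml
# Hypothetical Kustomization that installs the Ingress controller.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: ingress-controller        # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  timeout: 5m
  path: ./infrastructure/ingress  # hypothetical path
  prune: true
  wait: true                      # Ready only when the controller is healthy
  sourceRef:
    kind: GitRepository
    name: my-git-repo
---
# The application is reconciled only after the controller is Ready.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  timeout: 5m
  path: ./apps/my-app/overlays/production
  prune: true
  wait: true
  sourceRef:
    kind: GitRepository
    name: my-git-repo
  dependsOn:
    - name: ingress-controller
```

This only orders the rollouts and confirms that the workloads are healthy. It does not prove that traffic reaches the application from outside the cluster, which still needs an external probe or smoke test.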
