Flux is failing to apply changes to your cluster because the reconciliation loop is stuck trying to process an event, and it can’t get past it.

Here’s how to dig in and fix it:

1. The Event is Stuck in the Kubernetes API Server

Sometimes, the Kubernetes API server itself gets bogged down or has internal issues, preventing events from being processed or even sent. This isn’t directly a Flux problem, but Flux experiences it as a lack of events or a failure to get event notifications.

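  • Diagnosis: Check the health of your Kubernetes API server and look for errors in its logs. On self-managed clusters the API server usually runs as a static pod in the kube-system namespace. On managed control planes (for example EKS, GKE, or AKS) that pod is not visible, and the logs are exposed through the provider's logging service instead. Where the pod is visible: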
    kubectl logs -n kube-system kube-apiserver-<node-name>
    
    Look for high latency, repeated errors, or indications of resource starvation.
  • Fix: If the API server is unhealthy, you need to address the underlying Kubernetes control plane issue. This might involve scaling up control plane nodes, debugging etcd, or resolving network issues between control plane components.
  • Why it works: A healthy API server is the bedrock of Kubernetes. If it’s struggling, nothing else can function correctly, including Flux’s event-driven reconciliation.
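
As a complement to reading the logs, the API server's own readiness endpoint can be queried through kubectl. The sketch below assumes that your credentials are allowed to read /readyz and that failing checks are marked with "[-]" in the verbose output. The function name failed_readyz_checks is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the readiness checks that the API server
# reports as failing. In the verbose readyz output, passing checks are
# prefixed with "[+]" and failing ones with "[-]".
failed_readyz_checks() {
  # 2>&1 because kubectl reports a failing readyz as an error on stderr.
  # "|| true" keeps the exit status at 0 when nothing is failing.
  kubectl get --raw='/readyz?verbose' 2>&1 | grep -F '[-]' || true
}
```

Calling failed_readyz_checks with no arguments prints nothing when every check passes. Any line it does print names the component (for example etcd) to investigate first.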

2. Flux Controller Pod is Unhealthy or Restarting

The Flux controllers themselves (like kustomize-controller, helm-controller, source-controller) might be crashing or not running properly. This means they can’t pick up or process events, even if they are being sent.

  • Diagnosis: Check the status and logs of your Flux controller pods.
    kubectl get pods -n flux-system
    kubectl logs -n flux-system <flux-controller-pod-name>
    
    Look for CrashLoopBackOff, Error, or repeated restarts. The logs will often show panics, out-of-memory errors, or configuration issues.
  • Fix: If a controller pod is unhealthy, the most common fix is to restart it. If it’s a persistent issue, investigate the specific error in the logs. It could be an OOMKilled event (meaning it needs more memory), a misconfiguration in its deployment, or a bug in Flux itself. You might need to increase the resource limits for the pod or adjust its configuration.
    kubectl delete pod -n flux-system <flux-controller-pod-name>
    
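    If it’s an OOMKilled issue, edit the deployment to increase memory limits:
    kubectl edit deployment -n flux-system <flux-controller-deployment-name>
    
    Find the resources.limits.memory field and increase it, e.g., from 256Mi to 512Mi. If Flux was installed with flux bootstrap, the controller manifests are themselves reconciled from Git, so a manual edit is overwritten on the next sync. In that case, make the change permanent with a patch in the flux-system kustomization.yaml in your repository.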
  • Why it works: Flux controllers are the agents that watch for changes (events) and apply them. If they are down or unstable, they can’t do their job.
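
To spot restarts and their cause without reading each pod description, the status fields can be pulled out with a jsonpath query. This is a minimal sketch that assumes each Flux pod runs a single container (it only inspects the first container status). The function name restarted_flux_pods is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list Flux pods whose first container has restarted at
# least once, with the restart count and the reason recorded for the last
# termination (for example OOMKilled or Error).
restarted_flux_pods() {
  kubectl get pods -n flux-system \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}' \
    | awk -F '\t' '$2 + 0 > 0 { printf "%s restarts=%s last=%s\n", $1, $2, ($3 == "" ? "unknown" : $3) }'
}
```

A pod reported with last=OOMKilled points to the memory limit. Other reasons are better investigated through the logs of the previous container instance (kubectl logs --previous).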

3. Network Issues Preventing Event Propagation

Flux controllers rely on network connectivity to watch for events from the Kubernetes API server. If there are network policies, firewalls, or general network instability, these events might not reach the Flux controllers.

  • Diagnosis: Ensure that the Flux controller pods can reach the Kubernetes API server endpoint. Check network policies in the flux-system namespace and any cluster-wide network policies.
    kubectl get networkpolicy -n flux-system
    
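    You can also use kubectl exec to test connectivity from within a Flux pod to the API server’s in-cluster service address. This assumes the controller image ships curl, which minimal images may not; if the command is not found, run the same test from a temporary pod or an ephemeral debug container.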
    kubectl exec -it -n flux-system <flux-controller-pod-name> -- curl -k https://kubernetes.default.svc
    
  • Fix: Adjust network policies to allow egress traffic from the Flux controller pods to the Kubernetes API server (typically on port 443). Ensure there are no firewall rules blocking this communication.
  • Why it works: Events are delivered over the network. If the network path is blocked, Flux never knows when something has changed.
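
When reading the result of a connectivity probe, keep in mind that an unauthenticated request is normally rejected by the API server, and that a rejection still proves the network path works. The sketch below interprets the status code printed by curl's -w '%{http_code}' option, which is 000 when no HTTP response arrived at all. The function name classify_api_probe is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: interpret the HTTP status code from a probe such as
#   curl -sk -o /dev/null -w '%{http_code}' --max-time 5 https://kubernetes.default.svc/version
# run inside a pod in the flux-system namespace.
classify_api_probe() {
  case "$1" in
    # curl prints 000 when it never received an HTTP response
    # (connection refused, timeout, DNS failure).
    000|"")  echo "unreachable" ;;
    # The request reached the API server and was rejected there,
    # so the network path itself is fine.
    401|403) echo "reachable (rejected by authentication or authorization)" ;;
    *)       echo "reachable (HTTP $1)" ;;
  esac
}
```

Only the "unreachable" result points at network policies or firewalls. The other results mean the problem lies elsewhere.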

4. Flux Source Controller Cannot Fetch or Watch Sources

If Flux cannot fetch or watch the Git repository, Helm repository, or OCI registry that your Kustomization or HelmRelease points to, it will get stuck. The source-controller is responsible for this, and its inability to connect or process source data can halt reconciliation.

  • Diagnosis: Check the status of your GitRepository, HelmRepository, or OCIRepository resources.
    kubectl get gitrepository,helmrepository,ocirepository -n <your-namespace> -o yaml
    
    Look at the status.conditions for errors related to fetching, authentication, or network issues. Also, check the logs of the source-controller pod in the flux-system namespace.
  • Fix:
    • GitRepository: Verify the url, the ref (branch, tag, or commit), and any secretRef for SSH keys or tokens. The referenced secret must exist in the same namespace as the GitRepository, and the source-controller must be able to read it.
      # Example: Correcting a GitRepository URL (Flux expects the ssh:// form, not the scp-style git@host:org/repo)
      kubectl patch gitrepository <repo-name> -n <your-namespace> --type merge --patch '{"spec": {"url": "ssh://git@github.com/your-org/your-repo.git"}}'
      
    • HelmRepository: Ensure the url is correct and accessible. If it’s a private Helm repository, check the secretRef for credentials.
    • OCIRepository: Verify the url and ref (tag/digest). Check authentication secrets if it’s a private registry.
  • Why it works: Flux’s reconciliation is triggered by changes in its sources. If it can’t access or verify the source, it has no new state to apply, and the reconciliation loop for dependent resources will stall.
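
Reading the full YAML of every source gets tedious when there are many of them. As a rough sketch, the Ready condition of each source can be extracted and filtered so that only the failing ones remain. The function name not_ready_sources is hypothetical, and the filter assumes the condition message fits on one line.

```shell
#!/bin/sh
# Hypothetical helper: print the Flux sources in a namespace whose Ready
# condition is anything other than "True", together with the condition message.
# $1 is the namespace to inspect.
not_ready_sources() {
  kubectl get gitrepository,helmrepository,ocirepository -n "$1" \
    -o jsonpath='{range .items[*]}{.kind}{"/"}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}{end}' \
    | awk -F '\t' '$1 != "" && $2 != "True" { print $1 ": " ($3 == "" ? "no Ready condition reported yet" : $3) }'
}
```

For example, not_ready_sources flux-system lists only the sources that need attention. After correcting one, the flux CLI (if installed) can trigger an immediate retry with flux reconcile source git <repo-name> -n <your-namespace> instead of waiting for the next interval.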

5. Event Queue Backlog in Flux Controllers

While less common for a single stuck event, if Flux receives a very high volume of events in rapid succession, or if a controller is slow to process them, its internal event queue can become a bottleneck. This can manifest as a general sluggishness or an apparent "stuck" state.

  • Diagnosis: Monitor the resource utilization of your Flux controller pods (CPU, memory). If they are consistently maxed out, they might not be processing events fast enough. Check logs for any repetitive, non-crashing errors that might indicate slow processing.
  • Fix: Increase the resource requests and limits for the Flux controller deployments. This gives them more CPU and memory to process events more quickly.
    kubectl edit deployment -n flux-system <flux-controller-deployment-name>
    
    Adjust resources.requests and resources.limits for CPU and memory. For example:
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"
    
  • Why it works: By providing more resources, the controller can handle the incoming event stream and process them without falling behind, clearing any backlog.
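
To see which controllers are close to their memory limit before editing anything, current usage can be compared against a threshold. This sketch assumes metrics-server is installed (kubectl top depends on it) and that memory is reported in Mi. The function name pods_over_memory and the threshold are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: print Flux pods whose current memory usage exceeds a
# threshold given in Mi. $1 is the threshold, for example 400.
pods_over_memory() {
  kubectl top pods -n flux-system --no-headers \
    | awk -v limit="$1" '$3 ~ /Mi$/ {
        mem = $3
        sub(/Mi$/, "", mem)          # strip the unit so the value compares numerically
        if (mem + 0 > limit + 0) print $1, $3
      }'
}
```

With a 512Mi limit, something like pods_over_memory 400 shows the pods that are running with little headroom and are the first candidates for higher requests and limits.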

6. Informer Cache Desynchronization

Flux controllers use Kubernetes informers to efficiently watch for changes to resources. If the informer’s local cache becomes desynchronized with the actual state in the API server, the controller might miss events or act on stale information, leading to reconciliation loops that don’t advance.

  • Diagnosis: This is hard to diagnose directly without digging into Flux internals or enabling very verbose logging. If you’ve exhausted the other options and Flux intermittently or persistently seems to "forget" about resources or fails to react to changes, a stale cache is a plausible cause.
  • Fix: Delete the problematic Flux controller pod. This forces the controller to restart, re-establish its connection to the API server, and rebuild its informer caches from scratch.
    kubectl delete pod -n flux-system <flux-controller-pod-name>
    
  • Why it works: Restarting the controller causes it to re-list all relevant objects and rebuild its internal state, synchronizing its cache with the current state of the cluster.
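
An alternative to deleting the pod by name is to restart the controller's Deployment and wait for the replacement to become ready, which avoids looking up the generated pod name. This is a sketch rather than the only way to do it. The function name restart_flux_controller and the 120s timeout are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: restart one Flux controller and block until the new
# pod is ready. $1 is the Deployment name, for example kustomize-controller.
restart_flux_controller() {
  kubectl rollout restart "deployment/$1" -n flux-system &&
    kubectl rollout status "deployment/$1" -n flux-system --timeout=120s
}
```

For example, restart_flux_controller kustomize-controller returns a non-zero status if the new pod does not become ready within the timeout, which is itself a hint that the problem is not just a stale cache.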

Once you resolve whatever was blocking the event, Flux should resume reconciliation. If a different change is also failing, the next thing you are likely to see is a ReconciliationFailed condition on the affected resource.
