Flux, the GitOps tool for Kubernetes, doesn’t just apply your manifest files; it continuously reconciles your cluster state with your Git repository. Prometheus, the de facto standard for Kubernetes monitoring, can observe Flux’s inner workings, giving you insights into its health, performance, and the status of your deployments.

Here’s a GitRepository resource, which Flux’s source-controller reconciles. Notice the interval is set to 1m. This means source-controller will check the Git repository for new commits every minute.

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-repo
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/my-org/my-repo.git
  ref:
    branch: main

Prometheus scrapes metrics from the Flux controllers, each of which serves them over HTTP at /metrics on port 8080 by default. The notification controller also listens on port 9090, but that port receives events from the other controllers and does not serve metrics. For example, source-controller exposes metrics like source_reconciliation_duration_seconds, a histogram that records how long it takes for Flux to reconcile a source.
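One way to wire up the scraping is a PodMonitor. The following is a minimal sketch, assuming the Prometheus Operator is installed in the cluster, that Flux runs in the flux-system namespace with its default pod labels, and that the controllers' metrics port keeps its default name http-prom. The resource name flux-system is illustrative.

```yaml
# Sketch only: requires the Prometheus Operator CRDs (monitoring.coreos.com/v1).
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system            # illustrative name
  namespace: flux-system
spec:
  namespaceSelector:
    matchNames:
      - flux-system
  selector:
    matchExpressions:
      # Assumes the default "app" label that Flux sets on its controller pods.
      - key: app
        operator: In
        values:
          - source-controller
          - kustomize-controller
          - helm-controller
          - notification-controller
  podMetricsEndpoints:
    # Assumes the metrics container port is named http-prom (port 8080).
    - port: http-prom
```

If you run Prometheus without the operator, an equivalent scrape_configs entry with Kubernetes pod discovery would serve the same purpose.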

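Let’s look at a Prometheus query to see the average reconciliation duration for all GitRepository sources over the last hour. Because the metric is a histogram, the average is the rate of its _sum series divided by the rate of its _count series:

sum(rate(source_reconciliation_duration_seconds_sum[1h])) / sum(rate(source_reconciliation_duration_seconds_count[1h]))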
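If you chart this average often, a recording rule can precompute it. The sketch below assumes the metric name used in this article and relies on the fact that a Prometheus histogram's mean is its _sum series divided by its _count series. The group name and the recorded series name are hypothetical.

```yaml
groups:
- name: flux-recording        # hypothetical group name
  rules:
  # Mean reconciliation duration over the last hour, across all sources.
  # The metric name follows this article; verify it against /metrics.
  - record: flux:source_reconciliation_duration_seconds:avg1h
    expr: |
      sum(rate(source_reconciliation_duration_seconds_sum[1h]))
      /
      sum(rate(source_reconciliation_duration_seconds_count[1h]))
```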

Flux’s kustomize-controller is responsible for applying your manifests. It exposes metrics like kustomize_controller_reconciliation_errors_total, which counts reconciliation errors. A steadily increasing value here indicates that Flux is failing to apply your Kubernetes resources.

To alert on these errors, you’d add a Prometheus alerting rule; Prometheus evaluates the rule and hands firing alerts to Alertmanager for routing. Here’s an example rule that fires if the error rate for any Kustomization stays above zero for 5 minutes:

groups:
- name: flux
  rules:
  - alert: FluxKustomizeReconciliationErrors
    expr: sum(rate(kustomize_controller_reconciliation_errors_total[5m])) by (namespace, name) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Flux Kustomize controller reconciliation error on {{ $labels.name }}"
      description: "The kustomize controller has encountered errors reconciling {{ $labels.name }} in namespace {{ $labels.namespace }} for more than 5 minutes."

The kustomize_controller_reconciliation_errors_total metric is a counter. When you see it increasing, it means the controller is trying to apply your manifests but failing. This failure could be due to invalid Kubernetes YAML, insufficient permissions, or other cluster-level issues.

Flux’s helm-controller manages Helm releases. It exposes metrics like helm_controller_reconciliation_errors_total and helm_controller_release_info, which provides information about the status of your Helm releases.

A key metric for understanding the health of your Helm releases managed by Flux is helm_controller_release_health_status. This metric has a value of 1 if the release is healthy, and 0 otherwise.

helm_controller_release_health_status{release_name="my-app"}

If this metric is 0 for a specific release, it means Helm itself is reporting the release as unhealthy. This could be due to issues within the Helm chart, failed pods after deployment, or a misconfiguration in the Helm release custom resource.
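The health check above can be turned into an alert. This is a minimal sketch, assuming the metric and the release_name label exist as described in this article; the alert name and the 10-minute window are illustrative choices.

```yaml
groups:
- name: flux-helm
  rules:
  - alert: FluxHelmReleaseUnhealthy      # hypothetical alert name
    # Fires when a release has reported 0 (unhealthy) continuously for 10 minutes.
    # Metric and label names follow this article; verify them against /metrics.
    expr: helm_controller_release_health_status == 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Helm release {{ $labels.release_name }} is unhealthy"
```

The for: window keeps a release that is briefly unhealthy during an upgrade from paging anyone.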

Flux also provides metrics for its notification controller, which handles sending alerts to external systems like Slack or Microsoft Teams. Metrics like notification_controller_events_total can help you track if notifications are being sent as expected.

The notification_controller_sent_failures_total metric is critical for ensuring your alerts are reaching their destination. If this metric is increasing, it means Flux is attempting to send notifications but failing.

rate(notification_controller_sent_failures_total[5m])

Common causes of notification failures are incorrect webhook URLs, invalid authentication tokens, and network connectivity issues between the Flux notification controller and the target notification service.
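A failure-rate query like the one above fits naturally into an alerting rule. The sketch below assumes the metric name used in this article; the alert name and thresholds are illustrative.

```yaml
groups:
- name: flux-notifications
  rules:
  - alert: FluxNotificationSendFailures  # hypothetical alert name
    # Fires when notification delivery has been failing for 10 minutes.
    # The metric name follows this article; verify it against /metrics.
    expr: sum(rate(notification_controller_sent_failures_total[5m])) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Flux notification controller is failing to deliver notifications"
```

Route this alert through Alertmanager to a channel that does not depend on Flux’s own notification providers, since those are the very path that is failing.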

A useful property of Flux metrics is that they are labeled per object, so they tell you which resource is failing, not just that a failure occurred. For instance, kustomize_controller_reconciliation_errors_total increments on each reconciliation attempt that fails. The metric itself does not carry the reason, but by inspecting the Flux logs and Kubernetes events associated with the corresponding object, you can often pinpoint the exact YAML error or API server issue.

When you configure Flux, you specify reconciliation intervals on your sources (like Git repositories) and on the Kustomization and HelmRelease objects that consume them. These intervals determine how often the controllers reconcile, and therefore how often metrics like source_reconciliation_duration_seconds and kustomize_controller_reconciliation_errors_total change. A shorter interval surfaces failures sooner, but it also puts more load on the controllers, the Git host, and the Kubernetes API server. It does not change how often Prometheus collects samples; that is governed by the Prometheus scrape interval.
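As an illustration of where these intervals live, here is a sketch of a Kustomization that consumes the my-repo source shown earlier. The name my-app, the path ./deploy, and the specific interval values are hypothetical.

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app                 # hypothetical name
  namespace: flux-system
spec:
  interval: 10m                # how often kustomize-controller reconciles this object
  retryInterval: 2m            # how soon to retry after a failed reconciliation
  sourceRef:
    kind: GitRepository
    name: my-repo              # the GitRepository defined earlier
  path: ./deploy               # hypothetical path inside the repository
  prune: true                  # delete cluster objects that were removed from Git
```

With these values the source is polled every minute while the manifests are re-applied every ten, so the two sets of metrics update at different cadences.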

The controller_runtime_reconcile_total metric, exposed by the underlying controller-runtime library Flux uses, is a treasure trove for understanding controller activity. It counts reconciliations by controller name and result (success, error, requeue, or requeue_after).

sum(rate(controller_runtime_reconcile_total{result="error"}[5m])) by (controller)

This query tells you which specific controller within Flux (e.g., gitrepository, kustomization, helmrelease) is experiencing errors, and with what frequency.
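A per-controller error alert can be built on the same library-level metrics. The sketch below assumes the Flux controllers export controller-runtime’s standard controller_runtime_reconcile_errors_total counter with its controller label; the alert name and thresholds are illustrative.

```yaml
groups:
- name: flux-controller-runtime
  rules:
  - alert: FluxControllerReconcileErrors   # hypothetical alert name
    # One alert per controller (e.g. gitrepository, kustomization, helmrelease)
    # whose reconcile error rate stays above zero for 10 minutes.
    expr: sum(rate(controller_runtime_reconcile_errors_total[5m])) by (controller) > 0
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Flux controller {{ $labels.controller }} is reporting reconcile errors"
```

Because this rule aggregates by controller rather than by object, it works as a coarse catch-all alongside the per-object rules.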

The next thing you’ll want to monitor is the actual state of the Kubernetes resources Flux is managing, using metrics from the Kubernetes API server itself, often via kube-state-metrics.
