The most surprising thing about monitoring Keycloak with Grafana is that you’re not just watching login counts; you’re fundamentally observing the health and performance of your entire identity and access management layer.
Let’s see it in action. Imagine a typical scenario: users are complaining about slow logins. We’ll use Keycloak’s built-in metrics endpoint, which exposes Prometheus-compatible data, and then visualize it in Grafana.
First, ensure Keycloak is configured to expose metrics. In your standalone.xml or standalone-ha.xml (or equivalent for Kubernetes), you’ll need to add or modify the jaxrs subsystem to include the metrics endpoint. This usually looks something like this within the <subsystem xmlns="urn:jboss:domain:jaxrs:2.0"> section:
    <server name="default-server">
        <http-endpoint socket-binding="http" security-realm="ApplicationRealm"/>
        <http-endpoint socket-binding="https" security-realm="ApplicationRealm"/>
        <jaxrsApplication name="metrics" class="org.keycloak.measurement.MetricsEndpoint" path="/metrics"/>
    </server>
You’ll also need to ensure the ApplicationRealm is correctly configured and that the metrics JAX-RS application is enabled and accessible. The default metrics path is /auth/realms/master/metrics.
Once Keycloak is running with metrics enabled, you can scrape this endpoint with Prometheus. A basic Prometheus configuration (prometheus.yml) would include a scrape job:
    scrape_configs:
      - job_name: 'keycloak'
        metrics_path: '/auth/realms/master/metrics'
        scheme: 'http'  # or 'https' if you've configured TLS
        static_configs:
          - targets: ['your-keycloak-host:8080']  # Replace with your Keycloak host and port
After Prometheus is scraping, you can set up Grafana. You’ll add Prometheus as a data source. Then, you can build dashboards. A crucial metric to start with is keycloak_sessions_active_total. It reports the number of currently active user sessions, so it behaves like a gauge rather than a counter: the value falls as sessions expire or users log out, and you graph it directly instead of wrapping it in rate(). A sudden, sustained spike here, especially one that does not track real user activity, could indicate session leaks or problems with session termination.
Another vital metric is keycloak_login_requests_total. This counter tracks every login attempt. By looking at the rate of increase (rate(keycloak_login_requests_total[5m])), you can see the real-time login traffic. If this rate is high and keycloak_login_failures_total (another counter) is also increasing proportionally, you’re looking at a potential brute-force attack or a widespread authentication issue.
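The two rates described above can be precomputed as Prometheus recording rules, so dashboards and alerts share a single definition. This is a minimal sketch rather than a definitive setup: it assumes the metric names used in this article, and the file name, group name, and recorded series names are hypothetical.

```yaml
# keycloak-login.rules.yml (hypothetical file name); list it under rule_files in prometheus.yml
groups:
  - name: keycloak_login  # hypothetical group name
    rules:
      # Login attempts per second, averaged over the last 5 minutes.
      - record: keycloak:login_requests:rate5m
        expr: sum(rate(keycloak_login_requests_total[5m]))

      # Failed logins per second over the same window.
      - record: keycloak:login_failures:rate5m
        expr: sum(rate(keycloak_login_failures_total[5m]))

      # Fraction of login attempts that fail, between 0 and 1.
      - record: keycloak:login_failure_ratio:rate5m
        expr: |
          sum(rate(keycloak_login_failures_total[5m]))
            /
          sum(rate(keycloak_login_requests_total[5m]))
```

Plotting the ratio next to the raw request rate helps separate a traffic surge (rate up, ratio flat) from an attack or outage (ratio up). With no login traffic at all, the ratio evaluates to NaN, which never satisfies a threshold comparison.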
For performance, keycloak_tokens_issued_total is your friend. The rate of this counter (rate(keycloak_tokens_issued_total[5m])) tells you how many tokens Keycloak is generating per second. A high rate is expected under load, but if it’s accompanied by increased latency in your application’s response to token validation, Keycloak itself might be becoming a bottleneck.
To diagnose slow logins specifically, you’d look at metrics like keycloak_authentication_flow_duration_seconds_bucket and keycloak_authentication_flow_duration_seconds_count. These histogram metrics allow you to see the distribution of authentication flow durations. Histogram buckets are cumulative, so the +Inf bucket always equals the total count and says nothing about slowness on its own. What matters is the shape of the distribution: a growing share of observations that only fit in the high-le buckets, or a rise in the upper percentiles (e.g., 95th, 99th) computed with histogram_quantile(), indicates that certain authentication steps are taking too long. This could be due to complex authentication flows, slow external identity providers, or database performance issues.
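To turn the bucket series into percentiles you can graph, one option is a pair of recording rules built on histogram_quantile(). The sketch below assumes the histogram name used in this article. The group name and recorded series names are hypothetical, and the 5-minute window is an arbitrary starting point.

```yaml
groups:
  - name: keycloak_auth_latency  # hypothetical group name
    rules:
      # 95th percentile of authentication flow duration, in seconds.
      # Buckets are aggregated by the le label before the quantile is estimated.
      - record: keycloak:authentication_flow_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(keycloak_authentication_flow_duration_seconds_bucket[5m]))
          )

      # 99th percentile, useful for spotting a slow tail that the p95 hides.
      - record: keycloak:authentication_flow_duration_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(keycloak_authentication_flow_duration_seconds_bucket[5m]))
          )
```

The estimate is interpolated within a bucket, so its precision is limited by the bucket boundaries the endpoint exposes. If the series carries extra labels (for example, a realm label), adding them to the by clause would give one percentile per label value.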
The mental model is that Keycloak, at its core, is a stateful service managing sessions and processing authentication requests. Every metric exposed is a direct window into its internal operations: how many users are connected, how many requests it’s processing, how long those processes take, and where errors occur. By understanding these components – sessions, requests, tokens, and authentication flows – you can trace performance bottlenecks and security incidents directly to their source within Keycloak.
The keycloak_provider_lookup_duration_seconds_bucket metric, often overlooked, reveals the latency introduced by Keycloak when looking up various providers (like identity providers, user storage providers, etc.). If this metric shows high latency, it means Keycloak is spending excessive time finding the right component to handle a request, which can significantly slow down overall operations, especially in complex multi-provider setups.
When you start correlating these metrics with your application’s performance and user experience, you gain a holistic view of your IAM system’s health.
The next logical step is to build alerting rules in Prometheus on the same metrics that drive these Grafana dashboards, so you can catch issues before users report them.
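As a starting point, an alerting rule file could look like the sketch below. It assumes the metric names used in this article. The alert names, severity labels, thresholds (50% failures, 2 seconds, 1.5 times the daily average), and durations are all hypothetical placeholders to tune against your own baseline.

```yaml
groups:
  - name: keycloak_alerts  # hypothetical group name
    rules:
      # More than half of login attempts failing for 10 minutes.
      - alert: KeycloakHighLoginFailureRatio
        expr: |
          sum(rate(keycloak_login_failures_total[5m]))
            /
          sum(rate(keycloak_login_requests_total[5m]))
            > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than half of Keycloak login attempts are failing"

      # 95th percentile authentication flow duration above 2 seconds.
      - alert: KeycloakSlowAuthenticationFlows
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(keycloak_authentication_flow_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Keycloak p95 authentication flow duration is above 2s"

      # Active sessions well above their own 24-hour average for 30 minutes.
      - alert: KeycloakActiveSessionSpike
        expr: |
          sum(keycloak_sessions_active_total)
            > 1.5 * sum(avg_over_time(keycloak_sessions_active_total[1d]))
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Keycloak active sessions are well above the daily average"
```

The for clause keeps short bursts from paging anyone: the condition has to hold for the whole duration before the alert fires.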