Running Grafana in High Availability (HA) with a PostgreSQL backend is simpler than it sounds once you realize that Grafana’s HA story is less about replicating Grafana itself and more about making the data it relies on redundant.
Let’s see this in action. Imagine we have two Grafana instances, grafana-a and grafana-b, both pointing to the same PostgreSQL database.
# grafana-a/grafana.ini
[database]
type = postgres
host = db.example.com:5432
name = grafana
user = grafana_user
password = your_secure_password
ssl_mode = require
[server]
http_port = 3000
root_url = http://grafana-a.example.com:3000
# grafana-b/grafana.ini
[database]
type = postgres
host = db.example.com:5432
name = grafana
user = grafana_user
password = your_secure_password
ssl_mode = require
[server]
http_port = 3000
root_url = http://grafana-b.example.com:3000
If grafana-a goes down, grafana-b continues to serve dashboards, alerts, and user data because it’s all stored in PostgreSQL. The trick is that PostgreSQL itself needs to be highly available.
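Since everything hinges on both instances pointing at the same database, it can be worth checking that mechanically. Here is a minimal sketch using Python's standard `configparser`. The inline config strings stand in for the two files above (password left out on purpose), and the helper names `database_settings` and `shares_database` are made up for illustration.

```python
import configparser

# Illustrative stand-in for grafana-a/grafana.ini (password omitted on purpose).
GRAFANA_A = """
[database]
type = postgres
host = db.example.com:5432
name = grafana
user = grafana_user

[server]
http_port = 3000
root_url = http://grafana-a.example.com:3000
"""

# grafana-b differs only in its root_url.
GRAFANA_B = GRAFANA_A.replace("grafana-a", "grafana-b")


def database_settings(ini_text: str) -> dict:
    """Return the [database] section of a grafana.ini string as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["database"])


def shares_database(ini_a: str, ini_b: str) -> bool:
    """True when both configs have identical [database] settings."""
    return database_settings(ini_a) == database_settings(ini_b)


if __name__ == "__main__":
    print(shares_database(GRAFANA_A, GRAFANA_B))
```

In a real deployment you would read the files from disk instead of using inline strings, and you would probably generate both configs from one template so they cannot drift in the first place.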
The core problem Grafana HA solves is preventing a single point of failure for your monitoring interface and its configuration. If your Grafana instance dies, you lose access to all your dashboards, alert rules, and user management. By running multiple Grafana instances and pointing them to a highly available database, you eliminate this single point of failure. Your Grafana instances become stateless frontends, and all the state lives in PostgreSQL.
Internally, Grafana keeps all of its own state in PostgreSQL: user accounts, dashboard definitions (stored as JSON), alert rules, notification channels, and even login sessions. When you open Grafana, that state is read straight from this database, while the metrics shown in your panels still come from your data sources. The HA setup means that if one Grafana server process crashes or its underlying VM/container fails, another identical instance can pick up the load immediately, as it’s reading from and writing to the same persistent, redundant data store.
The levers you control are primarily:
- The number of Grafana instances: Run at least two for redundancy.
- The load balancing mechanism: Distribute incoming traffic across your Grafana instances. This could be an Nginx, HAProxy, or a cloud provider’s load balancer.
- The PostgreSQL backend’s HA: This is the most critical part. You need a PostgreSQL cluster that can withstand node failures.
Consider this PostgreSQL HA setup: a primary instance and one or more replicas. For automatic failover, you’d typically use tools like Patroni with etcd or Consul for coordination.
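# Example of a basic PostgreSQL primary/replica setup (conceptual)
# On primary:
pg_ctl start -D /var/lib/postgresql/14/main
# On replica: clone the primary first; -R writes standby.signal and primary_conninfo.
# "primary.example.com" and the "replicator" role are placeholders.
pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/14/main -R
# Server options go through -o (pg_ctl's own -c flag means "allow core dumps").
# hot_standby is already on by default since PostgreSQL 10.
pg_ctl start -D /var/lib/postgresql/14/main -o "-c hot_standby=on"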
A TCP load balancer or proxy in front of PostgreSQL (HAProxy is a common companion to Patroni) would then direct connections to the current primary instance. Grafana instances themselves don’t need to know about the primary/replica status; they just connect to the service name or IP address that the load balancer manages for PostgreSQL, which is the value you put in `host` in grafana.ini.
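To make that routing decision concrete, here is a rough Python sketch of what a health check does when it looks for the current primary. It assumes Patroni's REST API is reachable on each node (8008 is its default port) and that `GET /primary` answers 200 only on the leader, which is how recent Patroni versions behave as far as I know; older releases expose the same check under a different path. The node hostnames and the `find_primary` helper are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

# Hypothetical node addresses; 8008 is Patroni's default REST API port.
PATRONI_NODES = ["http://pg-1.example.com:8008", "http://pg-2.example.com:8008"]


def http_status(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # a non-2xx answer, e.g. from a replica
    except OSError:
        return None  # connection refused, DNS failure, timeout, ...


def find_primary(
    nodes: Iterable[str],
    probe: Callable[[str], Optional[int]] = http_status,
) -> Optional[str]:
    """Return the first node whose /primary endpoint answers 200, else None."""
    for node in nodes:
        if probe(f"{node}/primary") == 200:
            return node
    return None
```

In practice you would let HAProxy or your cloud load balancer run this check for you rather than script it; the `probe` parameter exists only so the logic can be exercised without a live cluster.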
The most surprising thing about Grafana’s HA is that dashboards, users, and the rest of its core state need no distributed consensus mechanism or internal clustering between the Grafana processes themselves. The one place where instances do talk to each other is alerting: to avoid sending duplicate notifications, Grafana’s unified alerting can be given a list of peers (the `ha_peers` setting under `[unified_alerting]`) that gossip notification state among themselves. If you try to set up Grafana with a shared filesystem for its data directory (e.g., for SQLite, which is not recommended for HA), you’ll run into race conditions and data corruption because Grafana is not designed to be run with a shared data directory across multiple instances. The data directory should be local to each Grafana instance, and the actual state (dashboards, users, etc.) must be in a robust, clustered external database like PostgreSQL.
The next challenge you’ll face is ensuring your load balancer has health checks configured for your Grafana instances, so it stops sending traffic to a dead Grafana server.
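As a starting point for those health checks, here is a small Python sketch built around Grafana's `/api/health` endpoint, which reports whether the instance can reach its database. The exact response shape (HTTP 200 with a JSON field `"database": "ok"`) matches what I have seen from Grafana, but treat it as an assumption and verify it against your version. The function names are made up for this example.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Tuple


def is_healthy(status_code: int, body: str) -> bool:
    """Healthy means HTTP 200 and a JSON body reporting database == "ok"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


def fetch_health(base_url: str, timeout: float = 2.0) -> Tuple[int, str]:
    """GET <base_url>/api/health; return (status, body), or (0, "") if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", errors="replace")
    except OSError:
        return 0, ""


def healthy_instances(
    base_urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]] = fetch_health,
) -> List[str]:
    """Return the instances a load balancer should keep sending traffic to."""
    return [url for url in base_urls if is_healthy(*fetch(url))]
```

Calling `healthy_instances(["http://grafana-a.example.com:3000", "http://grafana-b.example.com:3000"])` would return only the instances that currently pass the check. A real load balancer does the same thing natively, so this is mostly useful for smoke tests or for understanding what to configure.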