Grafana OnCall doesn’t just notify you when something breaks; it actively manages the human response to that break, ensuring the right people get bothered at the right time, and that nobody sleeps through a critical alert.

Let’s say we’ve got a service, web-frontend, that’s spitting out errors. A Prometheus alert rule fires, and the resulting alert is sent to Grafana OnCall. The first thing Grafana OnCall does is look at its routing rules. These rules are like a switchboard operator: "If the alert is from web-frontend and has severity critical, send it to the web-frontend-oncall team."

Here’s how that routing rule might look in Grafana OnCall:

- match:
    service: web-frontend
    severity: critical
  to:
    - team_id: web-frontend-oncall

Once the alert is routed to the web-frontend-oncall team, Grafana OnCall consults that team’s escalation policy. This policy defines who gets notified and when. A basic policy might look like this:

name: Web Frontend Critical Escalation
steps:
  -   send_to:
          - type: user
            id: alice@example.com
      timeout: 5m
  -   send_to:
          - type: user
            id: bob@example.com
      timeout: 10m
  -   send_to:
          - type: user
            id: charlie@example.com
      timeout: 15m
      notify_all: true # This is important!

In this example, when an alert hits the web-frontend-oncall team:

  1. Step 1: Alice gets a notification. If she doesn’t acknowledge the alert within 5 minutes, Grafana OnCall moves to the next step.
  2. Step 2: Bob gets a notification. If he also doesn’t acknowledge within his 10-minute timeout (so, 15 minutes after the alert fired), it moves on.
  3. Step 3: Charlie gets a notification. Crucially, notify_all: true means that both Alice and Bob will also be notified again at this stage, along with Charlie. This ensures that if the first two people missed it, everyone involved is now aware.

The timeout values are cumulative: each step starts once the timeouts of all preceding steps have elapsed without an acknowledgement. So, Bob is notified 5 minutes after the alert initially fired. Charlie (and Alice and Bob again) are notified 10 minutes after Bob was notified, which is 5 + 10 = 15 minutes after the alert fired.
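The timing above can be checked with a small Python sketch. It is not Grafana OnCall code: the `Step` class and `notification_timeline` function are hypothetical names used only to model the example policy, and it assumes nobody ever acknowledges the alert.

```python
from dataclasses import dataclass


@dataclass
class Step:
    recipients: list[str]
    timeout_min: int          # minutes to wait for an acknowledgement
    notify_all: bool = False  # also re-notify everyone from earlier steps


def notification_timeline(steps: list[Step]) -> list[tuple[int, list[str]]]:
    """Return (minutes after firing, people notified) for each step,
    assuming the alert is never acknowledged."""
    timeline = []
    elapsed = 0
    seen: list[str] = []
    for step in steps:
        notified = list(step.recipients)
        if step.notify_all:
            # Earlier recipients are notified again alongside the new ones.
            notified = seen + [r for r in notified if r not in seen]
        timeline.append((elapsed, notified))
        seen.extend(r for r in step.recipients if r not in seen)
        # The next step starts only after this step's timeout has elapsed.
        elapsed += step.timeout_min
    return timeline


policy = [
    Step(["alice"], 5),
    Step(["bob"], 10),
    Step(["charlie"], 15, notify_all=True),
]

for minute, people in notification_timeline(policy):
    print(f"t+{minute:>2} min: {', '.join(people)}")
# t+ 0 min: alice
# t+ 5 min: bob
# t+15 min: alice, bob, charlie
```

Charlie's own 15-minute timeout only matters if there is a further step after his.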

You can also use different notification targets within a step. For instance, you might want to notify a user and trigger a PagerDuty service simultaneously.

name: Web Frontend Critical Escalation
steps:
  -   send_to:
          - type: user
            id: alice@example.com
          - type: pagerduty_service
            id: pd_service_id_12345
      timeout: 5m
  # ... rest of the steps

This id for PagerDuty would correspond to a service integration you’ve set up within Grafana OnCall’s integrations.

The real power comes when you combine these policies with user schedules and overrides. Imagine Alice is on vacation. You can set up an override for her user account that reroutes her alerts to another user for a specific period.

# This would be part of the user configuration
overrides:
  - active_between:
      start: 2023-10-27T09:00:00Z
      end: 2023-10-30T17:00:00Z
    send_to:
      - type: user
        id: dave@example.com

This means during those dates, any alert that would have gone to Alice will instead go to Dave. If Dave is also part of an escalation policy, the system correctly follows his assigned steps.
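As a rough illustration of how such an override could be resolved, here is a minimal Python sketch. The `Override` class and `resolve_recipient` function are hypothetical and are not part of Grafana OnCall. The sketch also assumes that the window is inclusive at both ends and that all timestamps are UTC.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Override:
    start: datetime
    end: datetime
    send_to: str


def resolve_recipient(user: str, at: datetime,
                      overrides: dict[str, list[Override]]) -> str:
    """Return who should be notified in place of `user` at time `at`."""
    for override in overrides.get(user, []):
        # Assumption: both window boundaries are inclusive.
        if override.start <= at <= override.end:
            return override.send_to
    return user  # no active override: notify the original user


UTC = timezone.utc
overrides = {
    "alice@example.com": [
        Override(
            start=datetime(2023, 10, 27, 9, 0, tzinfo=UTC),
            end=datetime(2023, 10, 30, 17, 0, tzinfo=UTC),
            send_to="dave@example.com",
        )
    ]
}

during = datetime(2023, 10, 28, 3, 0, tzinfo=UTC)
after = datetime(2023, 10, 31, 3, 0, tzinfo=UTC)
print(resolve_recipient("alice@example.com", during, overrides))  # dave@example.com
print(resolve_recipient("alice@example.com", after, overrides))   # alice@example.com
```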

The most surprising thing about escalation policies is how granularly you can control who gets woken up based on a combination of alert labels and the time of day. You can have a completely different escalation path for a "critical" alert during business hours versus one that fires at 3 AM. This is configured by having multiple routing rules that match on different severity or custom labels, each pointing to a different escalation policy, and then using Grafana OnCall’s built-in time-based scheduling for users within those policies.

For example, your web-frontend-oncall team might have a policy that sends to Alice (primary), then Bob (secondary) during weekdays. But then, a different policy, also routed to the web-frontend-oncall team but perhaps with a nightly: true label on the alert, could first go to Charlie, then Dave, then loop back to Alice.

This allows you to implement "business hours" and "after-hours" rotations very cleanly, without needing separate teams for each. The key is that an alert can be matched by multiple routing rules, and Grafana OnCall will apply the first matching rule it finds. So, order matters in your routing rules if you have overlapping conditions.
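To see why rule order matters under first-match semantics, consider this small Python sketch. The `route` function, the rule dictionaries, and the policy names are all hypothetical; they model the behavior described above rather than Grafana OnCall's actual configuration format.

```python
from typing import Optional


def route(labels: dict[str, str], rules: list[dict]) -> Optional[str]:
    """Return the policy of the first rule whose labels all match."""
    for rule in rules:
        if all(labels.get(key) == value for key, value in rule["match"].items()):
            return rule["policy"]
    return None  # no rule matched


# The more specific rule comes first so it can win over the general one.
rules = [
    {"match": {"service": "web-frontend", "severity": "critical", "nightly": "true"},
     "policy": "after-hours"},
    {"match": {"service": "web-frontend", "severity": "critical"},
     "policy": "business-hours"},
]

night_alert = {"service": "web-frontend", "severity": "critical", "nightly": "true"}
day_alert = {"service": "web-frontend", "severity": "critical"}

print(route(night_alert, rules))        # after-hours
print(route(day_alert, rules))          # business-hours
# With the order reversed, the general rule shadows the specific one.
print(route(night_alert, rules[::-1]))  # business-hours
```

Putting the most specific rule first keeps the general rule as a fallback; the last call shows how the after-hours rule becomes unreachable when the order is reversed.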

Once you’ve got your escalation policies defined, the next logical step is to manage how users are grouped into these policies, which leads you to understanding Maintenance Windows.
