Loki alerts are a powerful way to get notified about events happening in your system, but routing those alerts to the right people at the right time can feel like a complex puzzle.

Imagine you’ve got a critical error message appearing in your logs, something like {"level": "error", "message": "Database connection failed"}. Loki’s alerting rules are designed to catch these patterns and trigger actions. The core idea is to define a query that identifies the problematic log lines and then specify what should happen when that query returns results.

Let’s see this in action. Suppose you’re running an e-commerce site and want to be alerted immediately if there are more than 10 failed payment attempts in a 5-minute window. Your Loki alert rule might look something like this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-alerts
  namespace: monitoring
spec:
  groups:
  - name: payment.rules
    rules:
    - alert: HighFailedPaymentAttempts
      expr: |
        sum(rate({job="ecommerce", level="error", message=~"Payment failed.*"} [5m])) by (customer_id)
        > 10
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High failed payment attempts for customer {{ $labels.customer_id }}"
        description: "Customer {{ $labels.customer_id }} has experienced more than 10 failed payment attempts in the last 5 minutes."

Here’s how this unfolds:

  1. alert: HighFailedPaymentAttempts: This is the name of your alert. It’s a human-readable identifier.

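  2. expr: sum(rate({job="ecommerce", level="error", message=~"Payment failed.*"} [5m])) by (customer_id) > 10: This is the heart of the alert.

    • {job="ecommerce", level="error", message=~"Payment failed.*"}: This is your Loki stream selector. It matches logs from the ecommerce job whose level label is error and whose message label starts with "Payment failed". A selector in braces only matches labels, so this assumes level and message are labels on the stream; if they are fields inside a JSON log line, they have to be extracted first with a parser stage such as | json.
    • rate(... [5m]): This calculates the per-second rate of log lines matching the query, averaged over the last 5 minutes. Averaging over the window smooths out brief spikes, so the alert responds to sustained failure rates.
    • sum(...) by (customer_id): This aggregates the rates by customer_id, producing one series per customer. If multiple payment failures occur for the same customer, they’re counted together. This requires customer_id to be available as a label.
    • > 10: This is the threshold. The alert condition is met when the rate for any customer_id exceeds 10 failed attempts per second, averaged over the 5 minutes. That works out to more than 3,000 failures in the window, a much higher bar than the "more than 10 failed payment attempts" goal stated earlier; to alert on a raw count, count_over_time(... [5m]) > 10 is the closer fit.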
  3. for: 5m: This is a "duration" clause. The condition in expr must be true for 5 consecutive minutes before the alert actually fires. This prevents flapping alerts from transient issues.

  4. labels and annotations: These provide metadata about the alert. severity: critical helps route it to high-priority channels. summary and description provide context for the person receiving the notification, often using Go templating ({{ $labels.customer_id }}) to include dynamic data from the log query.
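The stated goal of "more than 10 failed payment attempts in a 5-minute window" is a count rather than a per-second rate. A minimal sketch of a count-based variant follows, written as a plain rule group. It assumes that level, message, and customer_id are fields in a JSON log body (not stream labels), which is an assumption about your log format rather than something the example above guarantees.

```yaml
# Sketch only: a count-based variant of the rule above.
# Assumes level, message and customer_id are JSON fields in the log line.
groups:
  - name: payment.rules
    rules:
      - alert: HighFailedPaymentAttempts
        expr: |
          sum by (customer_id) (
            count_over_time(
              {job="ecommerce"}
                | json
                | level="error"
                | message=~"Payment failed.*"
              [5m]
            )
          ) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High failed payment attempts for customer {{ $labels.customer_id }}"
```

Here count_over_time counts matching lines in the window, so the comparison with 10 reads the same way as the description annotation. Grouping by customer_id creates one series per customer, which can become expensive if many customers fail at once.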

When this alert fires, Loki doesn’t directly send an email or Slack message. Instead, the Loki ruler, the component that evaluates alerting rules, forwards the alert to a Prometheus Alertmanager. Alertmanager is a separate component responsible for deduplicating, grouping, and routing alerts to various receivers like email, Slack, PagerDuty, Opsgenie, etc.
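For the alert to reach Alertmanager, Loki has to be told where Alertmanager lives. The fragment below is a rough sketch of the ruler section of a Loki configuration file, assuming a self-managed Loki that reads rule files from local disk. The directory paths and the Alertmanager address are placeholders, not values taken from this article.

```yaml
# Sketch of the ruler section of a Loki config file.
# Paths and the Alertmanager URL are hypothetical placeholders.
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules          # rule files are read from here
  rule_path: /tmp/loki/rules-temp     # scratch directory used by the ruler
  alertmanager_url: http://alertmanager:9093
  enable_api: true
```

If your rules are delivered another way, for example as Kubernetes resources synced by an agent, the storage settings will differ, but the ruler still needs an Alertmanager address.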

Your Alertmanager configuration would then define how to handle alerts with severity: critical. For example, you might configure it to:

  • Group alerts: If multiple payment failures happen for the same customer within a short period, they might be grouped into a single notification.
  • Route alerts: Send critical alerts to the on-call PagerDuty rotation, while lower severity alerts might go to a general Slack channel.
  • Silence alerts: Temporarily mute notifications for known incidents or maintenance windows.
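The grouping and routing behaviour described above could be sketched in an Alertmanager configuration like the one below. The receiver names, Slack channel, webhook URL, and PagerDuty key are all hypothetical placeholders, and the timing values are illustrative rather than recommendations.

```yaml
# Sketch of an Alertmanager config; all names and credentials are placeholders.
route:
  receiver: slack-general               # default for anything not matched below
  group_by: ['alertname', 'customer_id']
  group_wait: 30s                       # wait briefly to batch related alerts
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall        # critical alerts page the on-call rotation

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: slack-general
    slack_configs:
      - api_url: "https://hooks.slack.com/services/<placeholder>"
        channel: "#alerts"
```

Silences are not part of this file. They are usually created at runtime, through the Alertmanager UI or its command-line tooling, and expire on their own.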

The magic of Loki alerts lies in their ability to leverage the full power of LogQL for querying, combined with Prometheus-style alerting rules and Alertmanager’s sophisticated routing capabilities. You can build incredibly granular alerts based on log content, structured metadata, and even combinations of log patterns.

A common point of confusion is that the rate() function in Loki alerts, when used with a duration like [5m], calculates the average rate over that entire period: the number of matching lines in the window divided by the window length in seconds (300 for [5m]). So, rate(...[5m]) > 10 means the average rate over the last 5 minutes has been greater than 10 per second. It doesn’t mean there are 10 failures per second right now; the rate may have since dropped while the 5-minute average stays above the threshold.

The next step after configuring alerts for critical errors is often to set up alerts for unusual absence of logs, indicating a service might have stopped producing output.
