Log-based metrics in Google Cloud Platform let you turn unstructured log data into structured metrics, which you can then use to create alerts.
Let’s see it in action. Imagine you’re running a web service, and you want to know whenever your application logs an ERROR message.
Here’s a sample log entry:
2023-10-27T10:30:00Z app_name=my-web-server severity=ERROR msg="Database connection failed: timeout expired"
You can create a log-based metric that counts these ERROR messages.
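First, define the filter that identifies these logs. Assuming your logging library or agent populates the entry’s severity field (for plain text entries you would match on the payload instead, for example textPayload:"severity=ERROR"), a simple filter is enough: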
severity="ERROR"
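To make the matching concrete, here is a minimal local sketch in Python. It assumes the key=value layout of the sample line above. The helpers parse_log_line and matches_error_filter are hypothetical names used only for illustration; Cloud Logging evaluates the real filter on its side, and this code merely mimics that check.

```python
import shlex


def parse_log_line(line: str) -> dict:
    """Split a 'timestamp key=value ...' line into a dict (hypothetical helper)."""
    tokens = shlex.split(line)  # shlex keeps the quoted msg value in one token
    fields = {"timestamp": tokens[0]}
    for token in tokens[1:]:
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


def matches_error_filter(fields: dict) -> bool:
    """Local stand-in for the filter severity="ERROR"."""
    return fields.get("severity") == "ERROR"


sample = (
    '2023-10-27T10:30:00Z app_name=my-web-server severity=ERROR '
    'msg="Database connection failed: timeout expired"'
)
entry = parse_log_line(sample)
print(entry["severity"], matches_error_filter(entry))
```

Running the same predicate over a handful of real log lines is a cheap way to sanity-check a filter idea before creating the metric.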
In the Google Cloud Console, navigate to Logging > Log-based Metrics and click Create Metric.
For the metric type, choose Counter.
For the Name, enter something descriptive, like application_errors.
For the Description, enter something like: Counts occurrences of ERROR severity logs.
For the Filter, enter severity="ERROR".
Once you save this, GCP starts counting every log entry that matches your filter. Only entries received after the metric is created are counted; older logs are not backfilled. This count is exposed as a time-series metric.
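The same definition can be written down as data, which helps if you want to keep it in version control. The sketch below builds a dictionary whose fields (name, description, filter) mirror the LogMetric resource of the Cloud Logging API as I understand it. The function build_log_metric is a hypothetical helper, and actually sending the body to the API (through a client library or REST call) is deliberately left out.

```python
import json


def build_log_metric(name: str, description: str, log_filter: str) -> dict:
    """Return a dict shaped like a LogMetric resource (hypothetical helper)."""
    return {
        "name": name,
        "description": description,
        "filter": log_filter,
    }


metric = build_log_metric(
    name="application_errors",
    description="Counts occurrences of ERROR severity logs.",
    log_filter='severity="ERROR"',
)

# Print the JSON body for review; nothing is sent to any API here.
print(json.dumps(metric, indent=2))
```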
Now, you can go to Monitoring > Alerting and create a new alert policy.
Select the application_errors metric you just created.
Set the condition: Any time series violates.
For the Threshold, choose is above.
Enter 0.
Set the duration to 5 minutes. This means the metric must stay above 0 for 5 consecutive minutes before the alert fires, so a single isolated error typically won’t trigger it.
For the notification channel, you can select an existing one (like email or Slack) or create a new one.
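For reference, the console steps above correspond roughly to an alert policy body like the one sketched below. Treat it as an illustration under stated assumptions: the field names follow the Cloud Monitoring AlertPolicy resource as I understand it, while the resource type gce_instance, the 60s alignment period, the ALIGN_DELTA aligner, and the notification channel name are all placeholders I chose, not values from this walkthrough. The helper build_alert_policy is hypothetical.

```python
import json

# User-defined log-based metrics are exposed under this prefix in Cloud Monitoring.
METRIC_TYPE = "logging.googleapis.com/user/application_errors"


def build_alert_policy(resource_type: str, channel_names: list) -> dict:
    """Return a dict shaped like an AlertPolicy resource (hypothetical helper)."""
    condition = {
        "displayName": "application_errors above 0",
        "conditionThreshold": {
            "filter": f'metric.type="{METRIC_TYPE}" AND resource.type="{resource_type}"',
            "comparison": "COMPARISON_GT",  # "is above"
            "thresholdValue": 0,
            "duration": "300s",  # condition must hold for 5 minutes
            "aggregations": [
                # Assumed alignment settings; adjust to your needs.
                {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_DELTA"}
            ],
        },
    }
    return {
        "displayName": "Application errors",
        "combiner": "OR",
        "conditions": [condition],
        "notificationChannels": channel_names,
    }


policy = build_alert_policy(
    resource_type="gce_instance",  # assumption: logs come from Compute Engine VMs
    channel_names=["projects/my-project/notificationChannels/1234567890"],  # placeholder
)
print(json.dumps(policy, indent=2))
```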
The problem this solves is the inherent difficulty of getting actionable insights from raw log files. Log files are often verbose and unstructured, making it hard to spot trends or anomalies. By converting specific log patterns into metrics, you can leverage GCP’s powerful monitoring and alerting tools.
Internally, the matching happens in the Cloud Logging service itself, not in the logging agent running on your machines. As log entries are ingested, the service evaluates each one against the filters of your log-based metrics. When an entry matches, the counter associated with that metric is incremented. This counter is then exposed to Cloud Monitoring as a standard metric, allowing for time-series analysis, graphing, and alerting.
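The match-and-increment flow described above can be mimicked locally to build intuition. This is only a toy model, not how the service is implemented: the function count_matches_per_minute and the one-minute bucketing are my own assumptions, chosen to show how individual matching entries turn into points on a time series.

```python
from collections import Counter
from datetime import datetime


def count_matches_per_minute(entries, predicate):
    """Count entries whose severity satisfies predicate, bucketed by minute."""
    counts = Counter()
    for timestamp, severity in entries:
        if predicate(severity):
            moment = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
            counts[moment.strftime("%Y-%m-%dT%H:%M")] += 1
    return counts


# Made-up (timestamp, severity) pairs for illustration.
entries = [
    ("2023-10-27T10:30:00Z", "ERROR"),
    ("2023-10-27T10:30:42Z", "INFO"),
    ("2023-10-27T10:30:59Z", "ERROR"),
    ("2023-10-27T10:31:05Z", "ERROR"),
]
per_minute = count_matches_per_minute(entries, lambda severity: severity == "ERROR")
for minute, count in sorted(per_minute.items()):
    print(minute, count)
```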
The exact levers you control are the filter expression and the metric type. The filter is the most critical part: it dictates precisely which log entries contribute to your metric. You can use advanced filters involving specific strings, fields (if your logs are structured, like JSON), or even regular expressions. The metric type (typically Counter or Distribution) determines how the data is aggregated. For counting events like errors, a Counter is usually appropriate. For measuring latency, a Distribution is the better fit, because it extracts a numeric value from each matching entry and records it in histogram buckets.
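To illustrate how a Distribution differs from a Counter, the sketch below pulls a numeric value out of each matching line and drops it into a histogram bucket. Everything here is assumed for illustration: the latency_ms=... field, the bucket boundaries, and the helper latency_histogram are hypothetical and do not come from the walkthrough above.

```python
import bisect
import re

# Assumed log field carrying the latency in milliseconds.
LATENCY_PATTERN = re.compile(r"latency_ms=(\d+)")
# Inclusive upper bounds of the first three buckets; a fourth bucket catches the rest.
BUCKET_BOUNDS_MS = [100, 500, 1000]


def latency_histogram(lines):
    """Return per-bucket counts of latencies extracted from the given lines."""
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    for line in lines:
        match = LATENCY_PATTERN.search(line)
        if match is None:
            continue  # lines without a latency value contribute nothing
        value = int(match.group(1))
        counts[bisect.bisect_left(BUCKET_BOUNDS_MS, value)] += 1
    return counts


lines = [
    "2023-10-27T10:30:00Z path=/home latency_ms=42",
    "2023-10-27T10:30:01Z path=/cart latency_ms=250",
    "2023-10-27T10:30:02Z path=/cart latency_ms=480",
    "2023-10-27T10:30:03Z path=/checkout latency_ms=1800",
    "2023-10-27T10:30:04Z msg=heartbeat",
]
histogram = latency_histogram(lines)
print(histogram)
```

A Counter over the same lines would only report how many entries matched, while the histogram keeps enough shape to estimate percentiles.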
A common pitfall is not realizing that log-based metrics are aggregated after the logs are ingested. This means there’s a slight delay between a log occurring and the metric reflecting that change. For very high-volume logs, the aggregation interval can also impact the granularity of your metrics. If you need millisecond-level accuracy for critical events, log-based metrics might not be the primary tool; instead, consider structured logging with direct metric instrumentation.
The next concept you’ll likely explore is using log-based metrics for more complex anomaly detection, beyond simple threshold breaches.