The Loki Pattern Parser, often seen in the context of log aggregation, can seem like magic when it pulls structured data out of unstructured logs, but its real power lies in deterministic, explicitly configured parsing stages that you control precisely.
Let’s see it in action. Imagine you have logs like these, coming into Loki:
{"level":"info","ts":"2023-10-27T10:00:00Z","msg":"User logged in","user_id":"user-123","ip":"192.168.1.10"}
user_id=user-456 level=warn ts=2023-10-27T10:01:00Z msg="Failed login attempt" ip=192.168.1.11
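You want to query for all logs from user-123 or count failed logins. Without parsing, this is a substring search over raw lines. With parsing stages, you can extract user_id, level, msg, and ip as named fields and promote the ones you filter on to labels, so queries select on indexed labels instead of scanning text. Keep in mind that Loki's documentation recommends keeping label cardinality low, so fields such as user_id and ip deserve care.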
The core mechanism is a list of stages defined in the configuration of Promtail, the agent that ships logs to Loki, within the scrape_configs section under pipeline_stages. These stages tell Promtail how to break down each log line before it is sent. The two parsing stages used here are json and logfmt; others, such as regex, follow the same model.
For the JSON log, you’d use a json stage:
scrape_configs:
  - job_name: my-app
    static_configs:
      - targets:
          - localhost
        labels:
          job: my-app
    pipeline_stages:
      - json:
          expressions:
            level:
            ts:
            msg:
            user_id:
            ip:
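This configuration tells Promtail to look for the keys level, ts, msg, user_id, and ip in the JSON payload; an empty expression means the key name itself is used as the JMESPath expression. If a line is valid JSON, the matching values are placed in the pipeline's extracted map. The json stage does not create labels by itself: a labels stage later in the pipeline is what promotes extracted values to Loki labels.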
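Because the json stage only fills the extracted map, a labels stage is needed before a field can be used in a stream selector. The fragment below is a minimal sketch of that step, assuming the same my-app job; promoting only level is an illustrative choice, not something the configuration above specifies.

```yaml
# Sketch: belongs under a scrape config, like the pipeline_stages key above.
pipeline_stages:
  - json:
      expressions:
        level:      # empty value: the key name is used as the expression
        user_id:
  - labels:
      level:        # promote the extracted "level" to a label of the same name
```

With this in place, a selector such as {job="my-app", level="warn"} should match on the new label, while user_id stays in the extracted map and is not indexed.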
For the logfmt log, you’d use a logfmt stage:
scrape_configs:
  - job_name: my-app-logfmt
    static_configs:
      - targets:
          - localhost
        labels:
          job: my-app-logfmt
    pipeline_stages:
      - logfmt:
          mapping:
            level:
            ts:
            msg:
            user_id:
            ip:
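The logfmt stage works the same way. Its mapping option is a map, not a boolean: each key names an entry in the extracted map, and an empty value means the logfmt field of the same name is used. As with json, the values land in the extracted map and need a labels stage to become labels. This is convenient for logs that are already written in this common format.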
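The mapping can also rename a field on the way in. The sketch below assumes the logfmt line shown earlier; the extracted key name user is hypothetical and chosen only to show the renaming.

```yaml
# Sketch: logfmt extraction with one renamed field, then label promotion.
pipeline_stages:
  - logfmt:
      mapping:
        level:           # reads the logfmt field "level"
        user: user_id    # extracted key "user" reads the logfmt field "user_id"
  - labels:
      level:             # only "level" becomes a label; "user" stays extracted
```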
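What problem does this solve? It turns semi-structured log lines into fields you can select and aggregate on. Instead of grep "user_id=user-123" /var/log/myapp.log, you can run sum(rate({job="my-app", user_id="user-123"}[5m])), provided user_id has been promoted to a label. Because Loki indexes labels rather than log content, a label selector narrows the search to the matching streams before any line is read, which scales far better than scanning text across large log volumes. The trade-off is that every distinct label value creates a separate stream, so high-cardinality fields are often better filtered at query time with LogQL's own parsers, for example {job="my-app"} | json | user_id="user-123".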
Internally, Promtail applies these stages sequentially. A log line arrives, then the json or logfmt stage processes it. If a regex stage were added after the json stage, it would by default operate on the entire log line, not on the extracted JSON fields; setting its source option points it at an extracted value instead. If you needed to extract something before JSON parsing, you would place a regex stage earlier in the pipeline, and a match stage can restrict which logs a group of stages applies to. When the JSON is nested inside another field, the json stage's source option names the extracted value to parse, and a separate output stage can replace the log line with an extracted value such as msg.
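The ordering rules can be sketched as a three-stage pipeline. This is an illustration under assumptions rather than a recommended setup: the capture group name user_num is hypothetical, and the pattern assumes IDs shaped like user-123 as in the sample logs.

```yaml
pipeline_stages:
  # 1. Parse the whole line as JSON and pull out three fields.
  - json:
      expressions:
        level:
        msg:
        user_id:
  # 2. Without "source", regex would match against the full log line.
  #    With it, the expression runs only on the extracted user_id value.
  - regex:
      source: user_id
      expression: '^user-(?P<user_num>\d+)$'
  # 3. Replace the log line that is sent to Loki with the extracted msg.
  - output:
      source: msg
```

Swapping stages 1 and 2 would leave user_id undefined when the regex stage runs, which is why order matters here.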
The most surprising thing is how many ways a log line can fail to parse, even with the right stage. For instance, a trailing comma in a JSON object, or an unescaped double quote inside a quoted logfmt value, makes the stage fail for that line: nothing is extracted, and the line continues through the pipeline unchanged. Promtail doesn't try to guess or fix malformed input; it's a strict parser. If you would rather discard such lines than ship them unparsed, recent Promtail versions document a drop_malformed option on the json stage, and the drop stage's drop_counter_reason setting labels the dropped-line metric so you can tell why entries were discarded.
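For inputs that are only sometimes valid JSON, one option is to drop the lines that fail to parse. The fragment below is a sketch that assumes a Promtail version whose json stage supports drop_malformed; check the stage reference for your version before relying on it.

```yaml
pipeline_stages:
  - json:
      expressions:
        level:
      # Assumption: supported by the Promtail version in use. When true, lines
      # that are not valid JSON objects are dropped instead of being sent on.
      drop_malformed: true
```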
The next frontier is using the template stage to reformat extracted values or even construct new log lines from parsed data.
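As a rough preview of that, the sketch below rewrites one extracted value and then assembles a new log line. The key name line_out is hypothetical, and the output format is only an example.

```yaml
pipeline_stages:
  - json:
      expressions:
        level:
        msg:
        user_id:
  # Rewrite the extracted level in place, e.g. "warn" becomes "WARN".
  - template:
      source: level
      template: '{{ ToUpper .Value }}'
  # Build a new value from several extracted fields.
  - template:
      source: line_out   # hypothetical key, created if it does not exist
      template: '{{ .level }} user={{ .user_id }} {{ .msg }}'
  # Make that value the log line that is sent to Loki.
  - output:
      source: line_out
```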