Fluentd’s regex and grok parsers are your best friends for turning unstructured log data into structured events.

Let’s say you’ve got a log line like this from your application:

2023-10-27 10:30:15 INFO [user-service] User 'alice' logged in from 192.168.1.100

Right now, it’s just a string. Fluentd can’t easily query or filter based on "alice" or "192.168.1.100". We need to break it down.

Here’s how you’d configure Fluentd to parse that line with the regexp parser (note that the type name is regexp, not regex). You’d put this in your fluentd.conf:

<source>
  @type tail
  path /var/log/myapp/app.log
  pos_file /var/log/myapp/app.log.pos
  tag myapp.log
  <parse>
    @type regexp
    expression /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+\[(?<service>\w+-service)\]\s+User\s+'(?<user>\w+)'\s+logged\s+in\s+from\s+(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})$/
  </parse>
</source>

Let’s break down the expression part:

  • ^: Matches the beginning of the line.
  • (?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): This is a named capture group. (?<name>...) captures the matched text and assigns it the name name. Here, it captures the date and time. \d{4} matches exactly four digits, and so on.
  • \s+: Matches one or more whitespace characters.
  • (?<level>\w+): Captures the log level (INFO, ERROR, etc.) as level. \w+ matches one or more word characters.
  • \s+\[(?<service>\w+-service)\]: Captures the service name (e.g., user-service) as service. The brackets [ and ] are escaped with \ because they have special meaning in regex.
  • \s+User\s+'(?<user>\w+)': Captures the username as user.
  • \s+logged\s+in\s+from\s+: Matches the literal string " logged in from ".
  • (?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}): Captures the IP address. \d{1,3} matches one to three digits, and \. matches a literal dot.
  • $: Matches the end of the line.

After this configuration, if Fluentd reads that log line, it will transform it into an event like this (in JSON format):

{
  "timestamp": "2023-10-27 10:30:15",
  "level": "INFO",
  "service": "user-service",
  "user": "alice",
  "ip_address": "192.168.1.100"
}

Now you can easily filter by user or ip_address.
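Before deploying an expression, it can help to run it against a sample line outside Fluentd. The sketch below does that in Python, as an approximation only: Fluentd uses Ruby’s regex engine, and Python spells named groups (?P<name>...) rather than (?<name>...), so the script rewrites that prefix first. The function name parse_line is made up for this example.

```python
import re

# The expression from the <parse> section, without the surrounding slashes.
FLUENTD_EXPRESSION = (
    r"^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?<level>\w+)\s+"
    r"\[(?<service>\w+-service)\]\s+User\s+'(?<user>\w+)'\s+logged\s+in\s+from\s+"
    r"(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})$"
)

# Ruby writes named groups as (?<name>...); Python's re module needs (?P<name>...).
# This blunt rewrite is only safe because the expression contains no lookbehinds.
PYTHON_PATTERN = re.compile(FLUENTD_EXPRESSION.replace("(?<", "(?P<"))


def parse_line(line):
    """Return the named captures as a dict, or None when the line does not match."""
    match = PYTHON_PATTERN.match(line)
    return match.groupdict() if match else None


sample = "2023-10-27 10:30:15 INFO [user-service] User 'alice' logged in from 192.168.1.100"
record = parse_line(sample)
print(record)
```

A line that deviates from the format, such as one with a different message after the service name, makes parse_line return None, which mirrors how a non-matching line fails to parse in Fluentd.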

The grok parser, which is installed separately as the fluent-plugin-grok-parser plugin, builds on regular expressions but lets you refer to predefined, named patterns for common log formats. It’s often more readable and less error-prone for standard structures.

Consider a slightly different log line:

Oct 27 10:30:15 myapp user-service: User 'alice' logged in from 192.168.1.100

Here’s how you’d use grok:

<source>
  @type tail
  path /var/log/myapp/app.log
  pos_file /var/log/myapp/app.log.pos
  tag myapp.log
  <parse>
    @type grok
    grok_pattern %{SYSLOGTIMESTAMP:timestamp}\s+%{WORD:service_name}\s+%{NOTSPACE:service_subname}:\s+User\s+'%{WORD:user}'\s+logged\s+in\s+from\s+%{IPORHOST:ip_address}
  </parse>
</source>

Let’s look at the pattern:

  • %{SYSLOGTIMESTAMP:timestamp}: This is a grok pattern. SYSLOGTIMESTAMP is a built-in pattern that matches common syslog timestamp formats (like Oct 27 10:30:15). It’s captured as timestamp.
  • %{WORD:service_name}: Matches a single word and captures it as service_name (here, myapp).
  • %{NOTSPACE:service_subname}: Matches a run of non-whitespace characters and captures it as service_subname. WORD would not work here, because it matches only word characters and user-service contains a hyphen.
  • :\s+: Matches the colon and the whitespace that follows it.
  • User\s+'%{WORD:user}': Matches the literal "User '", captures a word as user, then matches "'".
  • \s+logged\s+in\s+from\s+: Matches the literal text " logged in from ".
  • %{IPORHOST:ip_address}: This grok pattern matches either an IP address or a hostname and captures it as ip_address.

The grok parser comes with many built-in patterns like NUMBER, IP, GREEDYDATA, TIMESTAMP_ISO8601, and more. You can find a comprehensive list in the grok documentation.

A parser only produces the record and its time; it never sees or changes the event’s tag. If you want to route events based on parsed content, such as the value of service_name, re-emit them under a new tag in a later step, for example with the fluent-plugin-rewrite-tag-filter plugin.
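It may help to see what a grok pattern amounts to: each %{NAME:field} reference is replaced by the regular expression registered under NAME, wrapped in a named capture group. The Python sketch below illustrates that expansion. It is not the plugin’s implementation, the four pattern definitions are simplified stand-ins for the real grok library, IPV4 is used in place of IPORHOST to keep the table short, and the expand helper is a hypothetical name.

```python
import re

# A tiny, simplified pattern library. The real grok definitions are more
# thorough; these regexes are stand-ins chosen for this example only.
PATTERNS = {
    "WORD": r"\w+",
    "NOTSPACE": r"\S+",
    "IPV4": r"\d{1,3}(?:\.\d{1,3}){3}",
    "SYSLOGTIMESTAMP": r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}",
}

# Matches %{NAME} and %{NAME:field}.
GROK_REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(grok_pattern):
    """Turn %{NAME:field} into a named group and %{NAME} into a plain group."""
    def substitute(match):
        name, field = match.group(1), match.group(2)
        body = PATTERNS[name]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return GROK_REFERENCE.sub(substitute, grok_pattern)


grok = (
    r"%{SYSLOGTIMESTAMP:timestamp}\s+%{WORD:service_name}\s+"
    r"%{NOTSPACE:service_subname}:\s+User\s+'%{WORD:user}'\s+"
    r"logged\s+in\s+from\s+%{IPV4:ip_address}"
)
regex = re.compile(expand(grok))
line = "Oct 27 10:30:15 myapp user-service: User 'alice' logged in from 192.168.1.100"
record = regex.match(line).groupdict()
print(record)
```

Swapping NOTSPACE for WORD in the service_subname reference makes the match fail on this line, because \w+ stops at the hyphen in user-service.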

The real power comes from combining grok patterns or defining your own. Custom patterns go in a separate file, one definition per line in the form NAME regex, and the parser loads that file through custom_pattern_path. To apply grok to a field of events that are already in the pipeline, put the parser inside Fluentd’s parser filter. In the sketch below, the field name message, the file path, and the HTTPSTATUS pattern are illustrative; the file is assumed to contain the single line HTTPSTATUS [1-5]\d{2}.

<filter myapp.log>
  @type parser
  key_name message
  <parse>
    @type grok
    custom_pattern_path /etc/fluent/patterns/myapp
    grok_pattern %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}" %{HTTPSTATUS:response} %{NUMBER:bytes}
  </parse>
</filter>

The grok_pattern here describes a common HTTP access log format.

  • custom_pattern_path lets you define your own grok patterns when the built-in ones aren’t sufficient. Here it supplies HTTPSTATUS; the other names, such as IPORHOST, HTTPDATE, and NUMBER, are built in.
  • To combine or rename parsed fields, for example joining separately captured date and time fields into a single timestamp, add a record_transformer filter after the parser.

The key takeaway is that regex gives you raw power with explicit control, while grok offers a more abstract, pattern-based approach that can simplify common parsing tasks. Understanding both allows you to handle virtually any log format Fluentd might encounter.

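You can also try several patterns in turn. The grok parser accepts multiple <grok> sections and applies them in order until one matches; for core parsers such as regexp, the fluent-plugin-multi-format-parser plugin provides similar fallback behavior. This is useful when logs in the same file vary in format.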
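The fallback idea, trying one format after another until something matches, can be sketched in a few lines of Python. This only models the behavior and is not how Fluentd implements it; both expressions are simplified versions of the two formats used earlier, and parse_with_fallback is a hypothetical helper name.

```python
import re

# Candidate formats, tried in order. Both expressions are simplified examples.
CANDIDATES = [
    ("app", re.compile(
        r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<level>\w+)\s+"
        r"\[(?P<service>[\w-]+)\]\s+(?P<message>.*)$")),
    ("syslog", re.compile(
        r"^(?P<timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
        r"(?P<service>[\w-]+):\s+(?P<message>.*)$")),
]


def parse_with_fallback(line):
    """Return (format_name, record) for the first pattern that matches."""
    for name, pattern in CANDIDATES:
        match = pattern.match(line)
        if match:
            return name, match.groupdict()
    # Nothing matched: keep the raw line so it is not silently lost.
    return "unparsed", {"message": line}


for line in [
    "2023-10-27 10:30:15 INFO [user-service] User 'alice' logged in from 192.168.1.100",
    "Oct 27 10:30:15 myapp user-service: User 'alice' logged in from 192.168.1.100",
    "something else entirely",
]:
    print(parse_with_fallback(line))
```

Order matters in a setup like this: put the most specific pattern first, since a broad pattern placed early would shadow the ones after it.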

The next step is often dealing with multi-line logs, like stack traces, which require a different configuration strategy.
