Shipping logs with Fluentd is less about indexing them into Elasticsearch and more about transforming them into a structured format that Elasticsearch can query efficiently.

Let’s watch a real log entry and see how it morphs.

Imagine this raw log line from an application:

2023-10-27 10:30:05.123 INFO [com.example.MyApp] User 'alice' logged in from 192.168.1.100. Request ID: abc-123

If you just shove that into Elasticsearch as a single string, good luck searching for all logins from a specific IP or all INFO messages. It’s a black box.

Fluentd’s job here is to act as the translator. It’s a data collector, but more importantly, a data processor. It sits between your application (the log source) and Elasticsearch (the destination).

Here’s a simplified Fluentd configuration (fluentd.conf) that does the heavy lifting:

<source>
  @type tail
  path /var/log/myapp/app.log
  pos_file /var/log/td-agent/myapp.pos
  tag myapp.log
  <parse>
    @type regexp
    # Fluentd builds the record from named captures, so every group needs a name
    expression /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?<level>INFO|WARN|ERROR)\s+\[(?<logger_name>.*?)\]\s+(?<message>.*?)(?: Request ID: (?<request_id>.*))?$/
    time_key timestamp
    time_format %Y-%m-%d %H:%M:%S.%L
    # Keep the parsed timestamp in the record as well as using it as the event time
    keep_time_key true
  </parse>
</source>

<match myapp.log>
  @type elasticsearch
  host localhost
  port 9200
  logstash_format true
  logstash_prefix myapp-logs
  include_tag_key true
  tag_key @log_name
  # Chunk by tag only; a time chunk key would also require a timekey setting
  <buffer tag>
    @type file
    path /var/log/td-agent/buffer/myapp
    flush_interval 10s
  </buffer>
</match>

Let’s break down what’s happening:

  1. <source> section: This tells Fluentd where to get the logs.
    • @type tail: It’s going to "tail" a file, just like the tail -f command.
    • path /var/log/myapp/app.log: The actual log file.
    • pos_file /var/log/td-agent/myapp.pos: Where Fluentd records how far it has read, so it can resume from the same position after a restart.
    • tag myapp.log: This is an internal identifier for this stream of logs.
    • <parse> section: This is where the magic happens.
      • @type regexp: We’re using a regular expression to break down the line.
      • expression /^(?<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?<level>INFO|WARN|ERROR)\s+\[(?<logger_name>.*?)\]\s+(?<message>.*?)(?: Request ID: (?<request_id>.*))?$/: This is the core parser. Each named capture group, written (?<name>...), becomes a field in the record.
        • timestamp: The timestamp (2023-10-27 10:30:05.123)
        • level: The log level (INFO)
        • logger_name: The logger name (com.example.MyApp)
        • message: The message itself (User 'alice' logged in from 192.168.1.100.)
        • request_id: The Request ID (abc-123). It is optional because the non-capturing group around it, (?: ... )?, may match zero times.
      • time_key timestamp: Tells Fluentd which named capture to use as the event timestamp.
      • time_format %Y-%m-%d %H:%M:%S.%L: How to interpret that timestamp string (%L is milliseconds).
      • keep_time_key true: Keeps the timestamp field in the record; by default Fluentd removes it once it has been used as the event time.

After Fluentd processes that line, it’s no longer a single string. It becomes a JSON object that looks something like this before it hits Elasticsearch:

{
  "timestamp": "2023-10-27 10:30:05.123",
  "level": "INFO",
  "logger_name": "com.example.MyApp",
  "message": "User 'alice' logged in from 192.168.1.100.",
  "request_id": "abc-123"
}
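To make the transformation concrete, here is a minimal Python sketch of the same parsing step. It is not how Fluentd is implemented: the function name parse_line is hypothetical, Python spells named groups (?P<name>...) where Fluentd's Ruby regex uses (?<name>...), and %f stands in for Ruby's %L.

```python
import re
from datetime import datetime

# Same structure as the Fluentd expression, in Python's named-group syntax.
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"\[(?P<logger_name>.*?)\]\s+"
    r"(?P<message>.*?)"
    r"(?: Request ID: (?P<request_id>.*))?$"
)


def parse_line(line):
    """Return (event_time, record), or None when the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # This sketch drops groups that did not participate (a missing Request ID).
    record = {k: v for k, v in match.groupdict().items() if v is not None}
    # %f accepts 1-6 fractional digits, so it can read the millisecond part.
    event_time = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S.%f")
    return event_time, record


raw = ("2023-10-27 10:30:05.123 INFO [com.example.MyApp] "
       "User 'alice' logged in from 192.168.1.100. Request ID: abc-123")
event_time, record = parse_line(raw)
print(record)
```

Running a few real lines from your own log file through a sketch like this is a quick way to check a pattern before putting it into fluentd.conf.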

  2. <match> section: This tells Fluentd what to do with logs that have the myapp.log tag.
    • @type elasticsearch: Send it to Elasticsearch.
    • host localhost, port 9200: Where Elasticsearch is running.
    • logstash_format true: Writes to date-stamped indices and adds an @timestamp field, following the convention Logstash uses, which is what Kibana’s time-based views expect.
    • logstash_prefix myapp-logs: This will create an index like myapp-logs-YYYY.MM.DD.
    • include_tag_key true, tag_key @log_name: Adds a field named @log_name with the value myapp.log to the JSON document.
    • <buffer> section: This is crucial for reliability. Fluentd writes events to a buffer file on disk (/var/log/td-agent/buffer/myapp) and flushes them to Elasticsearch every 10 seconds. If Elasticsearch is slow or down, the buffered chunks stay on disk and Fluentd retries them instead of dropping them.
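The effect of logstash_format, logstash_prefix, and tag_key on what reaches Elasticsearch can be approximated in a few lines of Python. This is a rough sketch only: to_es_document is a hypothetical helper, and the exact @timestamp formatting produced by the real plugin may differ.

```python
from datetime import datetime, timezone


def to_es_document(record, tag, event_time,
                   prefix="myapp-logs", tag_key="@log_name"):
    """Hypothetical helper: approximate the index name and document body."""
    # logstash_format: the index is "<prefix>-YYYY.MM.DD" from the event time.
    index = f"{prefix}-{event_time.strftime('%Y.%m.%d')}"
    doc = dict(record)
    # logstash_format also gives each document an @timestamp field.
    doc.setdefault("@timestamp", event_time.isoformat())
    # include_tag_key / tag_key: store the Fluentd tag on the document.
    doc[tag_key] = tag
    return index, doc


index, doc = to_es_document(
    {"level": "INFO", "logger_name": "com.example.MyApp"},
    tag="myapp.log",
    event_time=datetime(2023, 10, 27, 10, 30, 5, 123000, tzinfo=timezone.utc),
)
print(index, doc)
```

Because the index name comes from each event's own time, a log line written late in the evening and one written just after midnight end up in different daily indices.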

Now, when this structured JSON hits Elasticsearch, you can do powerful queries:

  • Find all INFO logs from com.example.MyApp: GET /myapp-logs-*/_search { "query": { "bool": { "filter": [ { "term": { "level.keyword": "INFO" } }, { "term": { "logger_name.keyword": "com.example.MyApp" } } ] } } }
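  • Find logins from a specific IP: GET /myapp-logs-*/_search { "query": { "match_phrase": { "message": "from 192.168.1.100" } } } (a wildcard query on an analyzed text field is matched against individual terms, so a pattern containing spaces will not match; better still, parse the IP into its own field in Fluentd and filter on that)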
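Parsing the IP out separately could look something like the following Python sketch. The field name client_ip, the pattern, and both helper functions are assumptions made for illustration; in a real pipeline the extraction would happen in the Fluentd parser itself.

```python
import json
import re

# Hypothetical pattern that only covers this message shape ("... from <IPv4>").
IP_PATTERN = re.compile(r"from (?P<client_ip>\d{1,3}(?:\.\d{1,3}){3})")


def add_client_ip(record):
    """Return a copy of record with client_ip added when the message has one."""
    enriched = dict(record)
    found = IP_PATTERN.search(record.get("message", ""))
    if found:
        enriched["client_ip"] = found.group("client_ip")
    return enriched


def logins_from(ip):
    """Build a search body that filters on the dedicated field."""
    # Assumes client_ip is mapped as keyword or ip, so term is an exact match.
    # With default dynamic mapping you would query client_ip.keyword instead.
    return {"query": {"bool": {"filter": [{"term": {"client_ip": ip}}]}}}


record = add_client_ip({"message": "User 'alice' logged in from 192.168.1.100."})
body = json.dumps(logins_from(record["client_ip"]))
print(body)
```

An exact filter on a dedicated field avoids scanning message text and does not depend on how the message happens to be tokenized.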

The most surprising thing is how fragile parsing is: if your log format changes even slightly, a regexp parser stops matching. Fluentd also offers many parsers beyond regexp. There are built-in parsers for JSON, CSV, Apache logs, and Nginx logs, and grok patterns, which are named, reusable regular expressions designed specifically for log parsing, are available through the separate fluent-plugin-grok-parser plugin.
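A small Python sketch shows how little it takes to break a pattern. The sample lines are invented for illustration, and the pattern is a shortened version of the one in the config above that checks only the timestamp and level.

```python
import re

# Shortened pattern: timestamp with milliseconds, then one of three levels.
LEVEL_PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}\s+(?P<level>INFO|WARN|ERROR)\s+\["
)

samples = [
    "2023-10-27 10:30:05.123 INFO [com.example.MyApp] ok",   # original format
    "2023-10-27 10:30:05.123 DEBUG [com.example.MyApp] ok",  # new log level
    "2023-10-27T10:30:05.123 INFO [com.example.MyApp] ok",   # ISO 'T' separator
    "2023-10-27 10:30:05 INFO [com.example.MyApp] ok",       # no milliseconds
]


def split_matched(lines):
    """Separate lines the pattern accepts from lines it rejects."""
    matched, unmatched = [], []
    for line in lines:
        (matched if LEVEL_PATTERN.match(line) else unmatched).append(line)
    return matched, unmatched


matched, unmatched = split_matched(samples)
print(len(matched), "matched;", len(unmatched), "unmatched")
```

Keeping a file of representative log lines and checking that none of them land in the unmatched list is one way to catch format drift before it reaches production.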

The next thing you’ll grapple with is handling logs from multiple applications, each with its own unique format, and managing those distinct Fluentd parsers and Elasticsearch index patterns.
