Fluentd deduplication isn’t about throwing away logs; it’s about making sure the important stuff doesn’t get buried under a mountain of repetitive noise.
Let’s walk through a concrete example. Imagine a web server that logs the same "health check" 200 OK every second. Without deduplication, those entries would flood your log aggregation system.
Here’s a small Fluentd configuration that handles this:
<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/td-agent/nginx.pos
  tag nginx.access
  <parse>
    @type nginx
  </parse>
</source>
<filter nginx.access>
  @type record_transformer
  enable_ruby true
  <record>
    # Build a key from the stable fields that define a duplicate. The timestamp is
    # left out on purpose: including it would give every request a different key.
    dedupe_key ${Digest::MD5.hexdigest([record["remote"], record["method"], record["path"], record["code"]].join("|"))}
  </record>
</filter>
<filter nginx.access>
  @type dedupe
  hash_key dedupe_key
  # Drop events whose key was already seen within the last 60 seconds
  delay 60
</filter>
<match nginx.access>
  @type stdout
</match>
When Fluentd tails /var/log/nginx/access.log, it tags each event nginx.access and parses it with the nginx parser, which exposes the status code as the code field. The record_transformer filter then adds a dedupe_key to each record: an MD5 hash of the remote IP, HTTP method, path, and status code, joined with a separator. The expression uses the ${...} placeholder so it is evaluated per record; a quoted "#{...}" string would be evaluated once when the config is loaded, where record does not exist. Entries that agree on all four fields get the same dedupe_key no matter when they arrive. The timestamp is deliberately excluded, because with it, health checks one second apart would never match.
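To check a key expression before putting it in a config, it can help to reproduce it in plain Ruby. The following is a minimal sketch, assuming the key is built from the remote, method, path, and code fields; the helper name dedupe_key, the KEY_FIELDS constant, and the sample records are made up for illustration. It also shows why joining fields with a separator is safer than bare concatenation.

```ruby
require 'digest'

# Assumed field names for this sketch; adjust to whatever your parser emits.
KEY_FIELDS = %w[remote method path code].freeze

# Hypothetical helper: hash only the fields that define a duplicate.
def dedupe_key(record, separator: '|')
  Digest::MD5.hexdigest(record.values_at(*KEY_FIELDS).join(separator))
end

check_a = { 'remote' => '10.0.0.5', 'method' => 'GET', 'path' => '/health',
            'code' => '200', 'size' => '2' }
check_b = check_a.merge('size' => '3')      # differs only outside the key fields
other   = check_a.merge('path' => '/login') # differs in a key field

puts dedupe_key(check_a) == dedupe_key(check_b) # true: treated as duplicates
puts dedupe_key(check_a) == dedupe_key(other)   # false: distinct events

# Contrived collision: without a separator, "/v1" + "200" and "/v12" + "00"
# concatenate to the same string and therefore hash to the same key.
p1 = { 'remote' => '10.0.0.1', 'method' => 'GET', 'path' => '/v1',  'code' => '200' }
p2 = { 'remote' => '10.0.0.1', 'method' => 'GET', 'path' => '/v12', 'code' => '00' }

puts dedupe_key(p1, separator: '') == dedupe_key(p2, separator: '') # true
puts dedupe_key(p1) == dedupe_key(p2)                               # false
```

The collision case is artificial, but it costs nothing to rule out, and the separator makes the pre-hash string easier to read when debugging.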
The dedupe filter then acts on this dedupe_key. It remembers each key for the configured delay (60 seconds in this case). If another event with the same dedupe_key arrives within that 60-second window, it discards the new one, keeping only the first (or, depending on configuration, the most recent) occurrence. The <match> block simply sends the processed (and potentially deduplicated) logs to standard output so we can see what’s happening.
The core problem this solves is signal-to-noise ratio. When you have thousands of identical, low-value log entries (like repetitive health checks, successful but uninteresting operations, or routine system status updates), they can overwhelm your monitoring and analysis tools. This makes it harder to spot actual errors or critical events. By deduplicating, you filter out the redundant chatter, leaving a cleaner, more actionable log stream.
Internally, the dedupe plugin maintains an in-memory cache of seen hash_key values and their associated timestamps. When a new event arrives, the plugin reads the value of the field named by hash_key. If that value is already in the cache and the event’s timestamp is within the delay period of the cached event’s timestamp, the new event is dropped. Otherwise, it’s added to the cache and forwarded. The cache is periodically pruned to remove stale entries.
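The cache logic described above can be sketched in a few lines of Ruby. This is an illustration of the idea only, not the plugin’s actual code: the class name DedupeWindow and its methods are hypothetical, and it assumes "within the delay" means strictly less than delay seconds since the last forwarded event.

```ruby
# Hypothetical sketch of a time-windowed dedupe cache.
class DedupeWindow
  def initialize(delay)
    @delay = delay # window length in seconds
    @seen = {}     # key => timestamp of the last forwarded event
  end

  # Returns true if the event should be forwarded, false if it is a duplicate.
  def forward?(key, time)
    last = @seen[key]
    return false if last && (time - last) < @delay

    @seen[key] = time
    true
  end

  # Drop entries older than the window so the cache does not grow without bound.
  def prune!(now)
    @seen.delete_if { |_key, ts| (now - ts) >= @delay }
  end

  def size
    @seen.size
  end
end

window = DedupeWindow.new(60)
results = [0, 1, 30, 59, 60, 61].map { |t| window.forward?('health', t) }
p results # => [true, false, false, false, true, false]

window.prune!(200)
p window.size # => 0
```

Note that in this sketch a dropped event does not refresh the cached timestamp, so a steady stream of duplicates still lets one event through per window rather than being silenced forever.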
The hash_key is the most important lever. A common shortcut is to hash the whole message field, but that rarely works. If any part of the message changes (a timestamp one second later, a different session ID), the hash changes and nothing is deduplicated. Build the key from the semantically important, stable fields that define a "duplicate" event for your use case. For health checks, that is often just the path and status. For application errors, it might be the error code and a few key context fields.
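A quick experiment makes the difference visible. The sketch below uses two made-up access-log lines that differ only in their timestamp; the stable_key helper and its regular expression are illustrative assumptions, not part of any plugin.

```ruby
require 'digest'

# Two illustrative health-check lines that differ only in the timestamp.
line_1 = '10.0.0.5 - - [12/Mar/2024:10:00:01 +0000] "GET /health HTTP/1.1" 200 2'
line_2 = '10.0.0.5 - - [12/Mar/2024:10:00:02 +0000] "GET /health HTTP/1.1" 200 2'

# Hashing the whole message: the timestamp makes every key unique.
whole_message_keys = [line_1, line_2].map { |line| Digest::MD5.hexdigest(line) }

# Hypothetical helper: hash only the path and status code.
def stable_key(line)
  match = line.match(/"(?<method>\S+) (?<path>\S+)[^"]*" (?<code>\d{3})/)
  Digest::MD5.hexdigest([match[:path], match[:code]].join('|'))
end

stable_keys = [line_1, line_2].map { |line| stable_key(line) }

puts whole_message_keys.uniq.size # => 2 (no deduplication possible)
puts stable_keys.uniq.size        # => 1 (both lines share one key)
```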
The delay parameter matters just as much. Set it too low and repeats that arrive just outside the window slip through. Set it too high and the cache holds more keys in memory, and a legitimately new event that shares a key with an earlier one may be suppressed for the whole window. Tune it to how often your logs repeat and to your definition of "noise".
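To get a feel for the trade-off, you can simulate a single repeating key against different window lengths. This is a rough sketch under the same assumption as the cache description above (an event passes only if at least delay seconds have elapsed since the last forwarded one); forwarded_count is a hypothetical helper, and the timestamps are synthetic.

```ruby
# Hypothetical helper: count how many events with the same key get forwarded.
def forwarded_count(timestamps, delay)
  last_forwarded = nil
  timestamps.count do |t|
    pass = last_forwarded.nil? || (t - last_forwarded) >= delay
    last_forwarded = t if pass
    pass
  end
end

# One health check per second for five minutes.
health_checks = (0...300).to_a

[5, 60, 600].each do |delay|
  puts "delay=#{delay}s forwarded=#{forwarded_count(health_checks, delay)}"
end
# delay=5s forwarded=60
# delay=60s forwarded=5
# delay=600s forwarded=1

# Two occurrences of the same error, 400 seconds apart: arguably separate incidents.
incidents = [0, 400]
puts forwarded_count(incidents, 60)  # => 2 (both reported)
puts forwarded_count(incidents, 600) # => 1 (the second one is suppressed)
```

The last two lines show the cost of an oversized window: the second incident never reaches your output.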
If you’re seeing logs you thought were deduplicated but aren’t, it’s almost certainly because your hash_key is too specific: it includes a field that changes between events that look identical to you but are different to the hash function.
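One way to track down the offending field is to take two records you expected to collapse and list the fields whose values differ. A small sketch, with a hypothetical differing_fields helper and invented sample records:

```ruby
# Hypothetical debugging helper: names of fields whose values differ.
def differing_fields(a, b)
  (a.keys | b.keys).reject { |key| a[key] == b[key] }
end

first  = { 'path' => '/health', 'code' => '200', 'request_id' => 'a1f3' }
second = { 'path' => '/health', 'code' => '200', 'request_id' => 'b7c9' }

p differing_fields(first, second) # => ["request_id"]
```

Any field this reports that also feeds your key is a candidate for removal from the hash input.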
The next thing you’ll likely want to tackle is enriching your logs with context before or after deduplication to make the remaining logs even more valuable.