Fluentd’s forward protocol lets you send logs from many sources to a central aggregator, but getting it right means understanding how the protocol handles retries and buffering.
Let’s watch Fluentd in action. Imagine we have two application servers, app1 and app2, and a central log aggregator server, aggregator.
app1.conf (on application server 1):
<source>
  @type tail
  path /var/log/myapp/app.log
  pos_file /var/log/td-agent/myapp_app1.pos
  tag app.log.app1
  <parse>
    @type json
  </parse>
</source>
<match app.log.app1>
  @type forward
  flush_interval 10s
  <server>
    host aggregator.example.com
    port 24224
  </server>
</match>
app2.conf (on application server 2):
<source>
  @type tail
  path /var/log/myapp/app.log
  pos_file /var/log/td-agent/myapp_app2.pos
  tag app.log.app2
  <parse>
    @type json
  </parse>
</source>
<match app.log.app2>
  @type forward
  flush_interval 10s
  <server>
    host aggregator.example.com
    port 24224
  </server>
</match>
aggregator.conf (on the aggregator server):
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
<match app.log.**>
  @type stdout
</match>
When app1 and app2 start processing their log files and the tail input plugin detects new entries, the forward output plugin on each application server buffers the records into chunks. With the configuration above, it flushes those chunks to the aggregator every 10 seconds (flush_interval 10s). The aggregator listens on port 24224 and, upon receiving data, matches each tag against app.log.** and prints the records to standard output.
The core problem Fluentd’s forward protocol solves is reliable, high-throughput log aggregation. Instead of writing directly to a central system (which can be fragile and complex to manage), each application hands its logs to a local Fluentd instance; in the example above, it does so by writing a file that Fluentd tails. This local instance then uses the forward protocol to push logs to a dedicated aggregator Fluentd. This decouples the application from the backend logging infrastructure, which makes the log processing pipeline easier to scale and manage.
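The forward protocol is a TCP-based protocol that carries MessagePack-encoded events, and the forward output sends each buffer chunk as a unit for efficiency. Acknowledgments are built in but opt-in. By default (require_ack_response false) the client treats a chunk as delivered once it has been written to the socket, and only connection or write errors trigger a retry. With require_ack_response true, the client (the application server’s Fluentd) waits for the server (the aggregator Fluentd) to acknowledge each chunk. If no acknowledgment arrives within ack_response_timeout (190 seconds by default), the client assumes the chunk may have been lost and retries it. This gives at-least-once delivery, so the aggregator can occasionally see duplicates. Connections are likewise not persistent unless keepalive is enabled on the forward output.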
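Acknowledgments are controlled by the require_ack_response parameter of the forward output, which is off unless you set it. A minimal sketch of enabling them on app1 follows; the 30-second timeout is an illustrative assumption, not a recommendation:
<match app.log.app1>
  @type forward
  # wait for the aggregator to confirm each chunk before dropping it from the buffer
  require_ack_response true
  # treat the chunk as failed (and retry it) if no ack arrives in time; 30s is an assumed value
  ack_response_timeout 30s
  flush_interval 10s
  <server>
    host aggregator.example.com
    port 24224
  </server>
</match>
The trade-off is that a chunk whose ack is lost will be sent again, so anything downstream of the aggregator should tolerate duplicate records.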
The flush_interval on the client side controls how often the client tries to send buffered data. A smaller interval means lower latency but more frequent, smaller network requests. A larger interval improves efficiency by sending larger batches but increases latency. The aggregator’s port and bind directives determine where it listens for incoming connections. The match pattern app.log.** on the aggregator catches every event whose tag starts with app.log: ** matches zero or more tag parts, so it covers app.log.app1 and app.log.app2 as well as a bare app.log.
The retry_max_times and retry_wait parameters in the <buffer> section of the client’s forward output are critical for reliability. Fluentd retries failed chunks by default: retry_wait starts at 1 second and backs off exponentially, and a chunk is given up on after retry_timeout (72 hours by default). For example, to retry a chunk up to 10 times, starting with a 5-second wait, you’d configure:
<match app.log.app1>
  @type forward
  # chunk by tag only; adding time as a chunk key would also require timekey
  <buffer tag>
    flush_interval 10s
    retry_max_times 10
    retry_wait 5s
  </buffer>
  <server>
    host aggregator.example.com
    port 24224
  </server>
</match>
This keeps transient network issues or temporary unavailability of the aggregator from causing permanent log loss. The retry_wait value is only the base wait time. Fluentd uses exponential backoff by default, so the wait roughly doubles after each failure (5s, 10s, 20s, and so on, with some randomization), and ten retries span roughly 85 minutes in total.
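If a fixed delay between retries is what you actually want, the buffer’s retry_type can be switched from exponential backoff to periodic. A sketch reusing the values from the example above:
<match app.log.app1>
  @type forward
  <buffer tag>
    flush_interval 10s
    # retry at a constant interval instead of backing off exponentially
    retry_type periodic
    retry_wait 5s
    retry_max_times 10
  </buffer>
  <server>
    host aggregator.example.com
    port 24224
  </server>
</match>
With these values a chunk is abandoned after about 50 seconds of failures, which is much less tolerant of an outage than the backoff schedule. A middle ground is to keep exponential backoff and cap the wait with retry_max_interval.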
A common misunderstanding is that flush_interval is the only thing that triggers a send. It is a buffer parameter, whether written at the top level of <match> (the older style) or inside <buffer>, and it only sets how often pending chunks are flushed on a timer. Chunk sizes and retries are governed separately by other <buffer> parameters. If a chunk reaches chunk_limit_size before flush_interval elapses, it is queued for flushing without waiting for the timer.
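Size limits sit next to the timer in the same <buffer> section. A sketch of setting both; the sizes here are illustrative assumptions chosen to make size-based flushing easy to observe, not tuned values:
<match app.log.app1>
  @type forward
  <buffer tag>
    # timer-based flush
    flush_interval 10s
    # a chunk that reaches this size is queued for flushing even if 10s have not passed
    chunk_limit_size 1m
    # upper bound on all buffered chunks for this output
    total_limit_size 64m
  </buffer>
  <server>
    host aggregator.example.com
    port 24224
  </server>
</match>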
If you’re seeing data loss, it’s often due to retry settings that are too tight or to network saturation. The client’s buffer is the last line of defense before data is truly lost. If the aggregator stays unreachable long enough for retry_max_times to be exhausted, the failing chunk is discarded, unless a <secondary> output is configured to catch it.
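Two client-side settings reduce the chance of losing a chunk outright: a file buffer, so queued chunks survive a Fluentd restart, and a <secondary> output that receives chunks the forward output could not deliver. The following is a sketch only; both directory paths and the basename are assumptions to adapt to your hosts:
<match app.log.app1>
  @type forward
  <buffer tag>
    # file buffer instead of the default memory buffer (path is an assumed location)
    @type file
    path /var/log/td-agent/buffer/forward_app1
    flush_interval 10s
    retry_wait 5s
    retry_max_times 10
  </buffer>
  <server>
    host aggregator.example.com
    port 24224
  </server>
  <secondary>
    # undeliverable chunks are written to local files instead of being dropped
    @type secondary_file
    directory /var/log/td-agent/forward-failed
    basename app1
  </secondary>
</match>
Chunks dumped by the secondary output are not replayed automatically, so this is a safety net for later recovery rather than a substitute for sensible retry settings.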
The next step in building a robust logging pipeline is often handling failures at the aggregator itself, perhaps by buffering to disk before sending to a final destination like Elasticsearch or S3.