Fluentd, a popular log collector, can archive the logs it collects to Amazon S3, and that archive becomes far more useful when you partition it by time. This isn’t just about dumping files; it’s about structuring your data for efficient retrieval and analysis later.

Let’s see this in action. Imagine Fluentd is collecting web server logs. Without partitioning, you’d have a massive S3 bucket with thousands of files, all mixed together. Finding logs from a specific hour on a specific day would be a nightmare.

Here’s a simplified Fluentd configuration (fluentd.conf) that sends logs to S3 with time-based partitioning:

<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/td-agent/nginx-access.log.pos
  tag nginx.access
  <parse>
    @type nginx
  </parse>
</source>

<match nginx.access>
  @type s3
  # Replace with your actual S3 bucket name
  s3_bucket your-log-bucket-name
  # Region the bucket lives in
  s3_region ap-southeast-1
  # Key prefix with time-based partitioning
  # %Y/%m/%d/%H/ corresponds to Year/Month/Day/Hour
  path logs/nginx/%Y/%m/%d/%H/
  # Compression for smaller objects and faster uploads
  store_as gzip
  # Format of the records inside each object
  <format>
    @type json
  </format>
  # Chunk by time so the placeholders in path can be resolved
  <buffer time>
    @type file
    # Local buffer directory (an example location; adjust to your installation)
    path /var/log/td-agent/buffer/s3
    # One time slice per hour, matching the hourly prefix
    timekey 3600
    # Wait 10 seconds for late events before flushing a closed hour
    timekey_wait 10s
    # Evaluate the time placeholders in UTC
    timekey_use_utc true
    # Maximum size of a single chunk
    chunk_limit_size 8m
    # Maximum total size of the buffer
    total_limit_size 1g
    # Number of threads that upload chunks in parallel
    flush_thread_count 4
  </buffer>
</match>

This configuration tells Fluentd to tail Nginx access logs, parse them, and send them to an S3 bucket named your-log-bucket-name. The path parameter is where the time-based partitioning happens.

Let’s break down the core components and the mental model:

  • Source (<source>): This is where Fluentd starts collecting data. In our example, it’s tailing a file (/var/log/nginx/access.log). You could also use sources for Kafka, Syslog, TCP, UDP, and many others.
  • Tag (tag): A logical name assigned to the data stream. This is used to route data to specific outputs. nginx.access is our tag here.
  • Match (<match>): This is the destination for the tagged data. Our s3 plugin is configured here.
  • S3 Plugin (@type s3): This is the workhorse. It handles buffering, flushing, and uploading data to S3.
  • Bucket (s3_bucket): The S3 bucket where your logs will be stored.
  • Path (path): This is the key to partitioning. S3 has no real directories, but the value logs/nginx/%Y/%m/%d/%H/ becomes the prefix of every object key, which most tools display as a folder hierarchy.
    • %Y: Four-digit year (e.g., 2023)
    • %m: Two-digit month (e.g., 10)
    • %d: Two-digit day (e.g., 26)
    • %H: Two-digit hour (e.g., 14)
    So, logs from October 26, 2023, between 14:00 and 15:00 UTC, land under s3://your-log-bucket-name/logs/nginx/2023/10/26/14/. Tools that read by prefix can then fetch a single hour or day without listing the rest of the bucket.
  • Format (<format>): The structure of the log data within each object (e.g., json, ltsv, msgpack). json is common for structured logs.
  • store_as: Compression such as gzip significantly reduces storage costs and upload times.
  • Buffer (<buffer time>), timekey and timekey_wait: These control how Fluentd groups data into time slices before uploading. The time chunk key is what allows %Y, %m, %d, and %H in path to be resolved. timekey 3600 creates one slice per hour, and timekey_wait 10s gives late events 10 seconds to arrive after the hour closes. A chunk is uploaded once its hour has closed and the wait has elapsed, or earlier if it reaches chunk_limit_size, so one hour can hold several objects. By default they are named after the time slice and an index, for example 2023102614_0.gz.
  • timekey_use_utc: This is crucial for consistent time-based partitioning across different time zones. It ensures that the %Y, %m, %d, and %H placeholders are evaluated in Coordinated Universal Time rather than each host’s local time.
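
To make the mapping from timestamp to partition concrete, here is a minimal Python sketch. It is not part of Fluentd; it only expands the same Year/Month/Day/Hour pattern for a single event time. The helper name partition_prefix is made up for illustration, and the sketch assumes timestamps are evaluated in UTC.

```python
from datetime import datetime, timezone

# Pattern and bucket name copied from the configuration above (both placeholders).
KEY_PATTERN = "logs/nginx/%Y/%m/%d/%H/"
BUCKET = "your-log-bucket-name"


def partition_prefix(event_time: datetime) -> str:
    """Return the key prefix an event would fall under, evaluated in UTC.

    Hypothetical helper for illustration only.
    """
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    return event_time.astimezone(timezone.utc).strftime(KEY_PATTERN)


if __name__ == "__main__":
    event = datetime(2023, 10, 26, 14, 30, 5, tzinfo=timezone.utc)
    print(f"s3://{BUCKET}/{partition_prefix(event)}")
```

Every event between 14:00:00 and 14:59:59 UTC on that day produces the same prefix, which is what makes the hour a partition.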

The payoff comes at query time. S3 has a flat key namespace, but keys that share a /-delimited prefix behave much like a directory tree. When you read the archive with tools like Athena or Redshift Spectrum, restricting a query to s3://your-log-bucket-name/logs/nginx/2023/10/26/ (for example, by registering each day or hour as a table partition) means only the objects under that prefix are listed and scanned, which cuts both query time and the amount of data you pay to scan.
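
As a rough illustration of how a reader of the archive can narrow its work, the Python sketch below turns a time range into the list of hourly prefixes that cover it. The function name hourly_prefixes is hypothetical, and the pattern is assumed to be the Year/Month/Day/Hour layout used in this article. Each returned prefix could then be passed to a prefix-based listing call, such as the Prefix argument of boto3’s list_objects_v2.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout: the hourly pattern used throughout this article.
KEY_PATTERN = "logs/nginx/%Y/%m/%d/%H/"


def hourly_prefixes(start: datetime, end: datetime) -> list[str]:
    """Return the prefix of every hourly partition overlapping [start, end).

    Both arguments must be timezone-aware; they are converted to UTC first.
    """
    current = start.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    end_utc = end.astimezone(timezone.utc)
    prefixes = []
    while current < end_utc:
        prefixes.append(current.strftime(KEY_PATTERN))
        current += timedelta(hours=1)
    return prefixes


if __name__ == "__main__":
    start = datetime(2023, 10, 26, 22, 15, tzinfo=timezone.utc)
    end = datetime(2023, 10, 27, 1, 0, tzinfo=timezone.utc)
    for prefix in hourly_prefixes(start, end):
        print(prefix)
```

A range of under three hours touches three partitions here, no matter how many other days of logs the bucket holds.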

A common pitfall is leaving out the UTC setting when logs originate from or are processed on hosts in different time zones. Each host then evaluates %Y/%m/%d/%H in its own local time, so events from the same instant land in different partitions.
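
The effect is easy to reproduce outside Fluentd. This Python sketch formats one instant twice, once in UTC and once in a hypothetical host zone of UTC+8, using the article’s Year/Month/Day/Hour pattern. The helper name prefix_in_zone and the choice of zone are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout: the hourly pattern used throughout this article.
KEY_PATTERN = "logs/nginx/%Y/%m/%d/%H/"


def prefix_in_zone(event_time: datetime, zone: timezone) -> str:
    """Format the partition prefix as a host in the given zone would see it."""
    return event_time.astimezone(zone).strftime(KEY_PATTERN)


if __name__ == "__main__":
    event = datetime(2023, 10, 26, 23, 30, tzinfo=timezone.utc)
    host_zone = timezone(timedelta(hours=8))  # hypothetical host running at UTC+8

    print("UTC host:  ", prefix_in_zone(event, timezone.utc))
    print("UTC+8 host:", prefix_in_zone(event, host_zone))
```

The same event ends up under 2023/10/26/23/ on one host and 2023/10/27/07/ on the other, so a query for October 26 would miss part of the data.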

Once you have your logs partitioned, the next logical step is to analyze them efficiently, often using AWS Athena to query the S3 data directly.
