Fluentd, a log collector, is sending logs to Google BigQuery, a data warehouse, for analysis.

Let’s see this in action. Imagine a web server generating access logs.

{"remotehost": "192.168.1.100", "user": "-", "method": "GET", "path": "/index.html", "code": 200, "size": 1024, "referer": "http://example.com/", "ua": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
{"remotehost": "192.168.1.101", "user": "-", "method": "POST", "path": "/submit", "code": 201, "size": 50, "referer": "-", "ua": "curl/7.54.0"}

Fluentd, running on the web server, picks these up. Its configuration tells it to parse these into structured fields. Then, it sends them to BigQuery.

Here’s a simplified Fluentd configuration (fluentd.conf):

<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/log/td-agent/nginx.pos
  tag nginx.access
  <parse>
    @type nginx
  </parse>
</source>

<match nginx.access>
  @type google_bigquery
  method insert
  auth_method json_key
  json_key /etc/td-agent/google_cloud.json
  project_id your-gcp-project-id
  dataset_id nginx_logs
  table_id access_logs
  auto_create_table true
  time_key time
  flush_interval 10s
  <buffer tag, time>
    @type file
    path /var/log/td-agent/buffer/nginx
    flush_interval 10s
  </buffer>
</match>

When Fluentd receives a log record, it applies the nginx parser. This transforms the raw text into a JSON object with keys like remotehost, method, code, etc. The <match> directive then takes these structured records. It uses the google_bigquery output plugin to send them to a BigQuery table. The insert method means it’s appending new rows. auto_create_table true is handy; if the table doesn’t exist, Fluentd will create it based on the schema of the first record it sends. The time_key time tells Fluentd which field in the record to use for BigQuery’s time-partitioning (if enabled). The <buffer> section ensures that logs aren’t lost if BigQuery is temporarily unavailable, by writing them to disk first.

The problem this solves is turning raw, often unstructured, log files into a queryable, structured dataset. Instead of greping through gigabytes of text files, you can run SQL queries like SELECT COUNT(*) FROM nginx_logs.access_logs WHERE code = 500 directly in BigQuery. This makes log analysis vastly more efficient and powerful, enabling real-time dashboards, anomaly detection, and complex trend analysis.

The google_bigquery plugin handles schema detection and table creation, but it’s crucial to understand how it infers the schema. It inspects the fields within the first few records that pass through it. If your logs have varying field structures, or if critical fields are sometimes missing, BigQuery might create a schema that doesn’t perfectly capture all your data or might even fail if a required field for a specific data type is missing. For instance, if a log record has a field duration as a string like "1.23s" and another record has it as an integer 1230, the plugin will likely default to a string type to accommodate both, potentially requiring casting later for numerical analysis.

The google_bigquery output plugin by default uses the INSERT streaming API for writing data. This means each batch of logs is sent as individual rows to BigQuery, making them available for querying almost immediately. However, this also means you are subject to BigQuery’s streaming insert quotas and costs. For high-volume scenarios, you might consider switching to a batch load method, which involves writing logs to Google Cloud Storage and then loading them into BigQuery periodically. This can be more cost-effective and bypasses streaming quotas, but introduces a delay in data availability.

The next step is often to optimize BigQuery’s schema and partitioning for performance and cost.

Want structured learning?

Take the full Fluentd course →