InfluxDB’s schema-on-read flexibility is a double-edged sword, often leading to inconsistent data and complex queries until you realize you can, and absolutely should, enforce schemas.

Let’s watch InfluxDB in action with a typical dynamic schema scenario. Imagine we’re collecting IoT sensor data from a fleet of devices.

# First, send some data with a 'temperature' field
curl -XPOST 'http://localhost:8086/write?db=mydb' -d 'sensor,device=dev001 temperature=25.5'

# Now, send data for the same device, but with a 'humidity' field instead
curl -XPOST 'http://localhost:8086/write?db=mydb' -d 'sensor,device=dev001 humidity=60.2'

If you query this data with SELECT * FROM sensor, InfluxDB will happily return both records, each with an empty value for the field it lacks. However, if you query SELECT temperature FROM sensor WHERE device='dev001', you’ll only get the first record. The second record has no temperature field, so that query skips it. This implicit NULL behavior is where the pain begins. Over time, with dozens of fields appearing and disappearing across thousands of devices, your queries become a minefield of special cases and client-side null handling (InfluxQL has no COALESCE function to paper over the gaps), all trying to account for every possible permutation.

The problem InfluxDB solves here is the need for a highly scalable time-series database that can ingest data rapidly without requiring upfront schema definitions. This is ideal for scenarios with unpredictable data sources or rapidly evolving metrics. Internally, InfluxDB uses a columnar storage format optimized for time-series data. When you write data, it’s appended to existing files or new ones are created. The set of fields isn’t enforced at write time; it’s inferred when you query. The main exception is field types: InfluxDB rejects a write whose field type conflicts with the type already stored for that field. This "schema-on-read" approach allows for high ingest rates because InfluxDB doesn’t need to validate every incoming point against a predefined structure.

The levers you control are your application’s data ingestion logic and, more importantly for enforcement, InfluxDB’s Continuous Queries (an InfluxDB 1.x feature), optionally backed by an external validation layer.
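As one illustration of an external validation layer, here is a minimal shell sketch that refuses to send a point unless it carries a temperature field. The helper names has_field and write_if_valid are hypothetical, and the parsing assumes simple line protocol with no escaped spaces or commas and no string field values.

```shell
# Hypothetical helper: succeed only if a line-protocol point carries the given field.
# Assumption: "<measurement>[,tags] <fields> [timestamp]" with no escaped spaces
# or commas and no string field values.
has_field() (
  point=$1
  required=$2
  # The field set is the second whitespace-separated token.
  fields=$(printf '%s\n' "$point" | awk '{print $2}')
  IFS=,
  set -f  # disable globbing while splitting on commas
  for kv in $fields; do
    # Compare the key (everything before the first "=") with the required name.
    if [ "${kv%%=*}" = "$required" ]; then
      exit 0
    fi
  done
  exit 1
)

# Hypothetical wrapper: write the point only if it has a temperature field.
write_if_valid() (
  point=$1
  if has_field "$point" temperature; then
    curl -sS -XPOST 'http://localhost:8086/write?db=mydb' --data-binary "$point"
  else
    echo "rejected (no temperature field): $point" >&2
    exit 1
  fi
)
```

With these definitions, calling write_if_valid 'sensor,device=dev001 humidity=60.2' would print a rejection message instead of writing the point.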

To enforce explicit schemas, you’ll leverage InfluxDB’s Continuous Queries (CQs) to transform and standardize data as it arrives. The core idea is to have a CQ that runs periodically, selects data that matches a desired schema, and writes it to a new measurement (or overwrites the old one, though creating new ones is generally safer for historical integrity).

Let’s say we want to enforce that the sensor measurement must have a temperature field, and we want to store it in a new measurement called sensor_temperature. We’ll also want to handle cases where temperature might be missing.

First, ensure you have a database created:

curl -XPOST 'http://localhost:8086/query' --data-urlencode 'q=CREATE DATABASE mydb'

Now, let’s create a Continuous Query that runs every minute, selects temperature from sensor, and writes it to sensor_temperature. This CQ will automatically handle missing temperature fields by not including those points in the output for sensor_temperature.

curl -XPOST 'http://localhost:8086/query?db=mydb' \
  --data-urlencode 'q=CREATE CONTINUOUS QUERY sensor_temp_cq ON mydb BEGIN SELECT last(temperature) AS temperature INTO sensor_temperature FROM sensor GROUP BY time(1m), device END'

This query does a few things:

  1. CREATE CONTINUOUS QUERY sensor_temp_cq ON mydb: Defines a CQ named sensor_temp_cq within the mydb database.
  2. BEGIN ... END: Encloses the query logic.
  3. SELECT last(temperature) AS temperature: Specifies that we only care about the temperature field. The AS clause keeps the output field named temperature; without it, the field would be named last.
  4. INTO sensor_temperature: Directs the output of this query to a new measurement called sensor_temperature.
  5. FROM sensor: Indicates that the source of data is the sensor measurement.
  6. GROUP BY time(1m), device: This is crucial. It groups data into 1-minute intervals for each unique device and keeps device as a tag in the output. InfluxQL requires an aggregate or selector function whenever you group by time, which is why the query uses last(temperature): if a device sends multiple temperature readings within a minute, the most recent one wins (mean(), max(), min(), and first() are alternatives). This enforces a single temperature value per device per minute in sensor_temperature.

Once this CQ is active, it runs every minute and processes the points written to sensor during the most recent interval; it does not backfill data written before the CQ was created. If a device reported a temperature in that interval, sensor_temperature gets one point for it. If it didn’t, nothing is written.

To query your now-schema-enforced data:

curl -G 'http://localhost:8086/query?db=mydb' --data-urlencode "q=SELECT temperature FROM sensor_temperature WHERE device='dev001'"

This query will only return records that definitely have a temperature field, as they were explicitly selected and written to sensor_temperature.

The most powerful aspect of this approach is that the CQ acts as a data pipeline. You can chain CQs to create complex transformations and enforce multiple schemas. For instance, you could have one CQ for temperature, another for humidity, and a third that combines them into a single "environment" measurement if both exist.
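To make the one-CQ-per-field idea concrete, here is a minimal sketch of a shell helper that prints a CQ statement for any field. The function name cq_statement, the sensor_<field>_cq naming pattern, and the choice of last() are assumptions for illustration, not InfluxDB conventions.

```shell
# Hypothetical helper: print a CREATE CONTINUOUS QUERY statement that copies one
# field from the sensor measurement into its own sensor_<field> measurement.
# last() is an assumed choice; mean(), max(), min(), or first() also work.
cq_statement() (
  field=$1
  interval=${2:-1m}  # grouping interval, defaults to one minute
  printf 'CREATE CONTINUOUS QUERY sensor_%s_cq ON mydb BEGIN SELECT last(%s) AS %s INTO sensor_%s FROM sensor GROUP BY time(%s), device END\n' \
    "$field" "$field" "$field" "$field" "$interval"
)
```

Each statement could then be sent the same way as before, for example with curl -XPOST 'http://localhost:8086/query?db=mydb' --data-urlencode "q=$(cq_statement humidity)".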

The GROUP BY in a Continuous Query can be a source of confusion if you expect individual data points. When you use GROUP BY time(1m), InfluxDB emits one value per group, and InfluxQL will not guess which one: a query that groups by time without an aggregate or selector function such as mean(), last(), or first() is rejected with an error. Always specify the function explicitly, and alias the result (for example, last(temperature) AS temperature) so the output field keeps its original name.

The next step is usually to consider how to handle missing required fields at the application level, or to use InfluxDB’s retention policies to prune raw data that you no longer need after it’s been processed by your CQs.
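A minimal sketch of the retention-policy idea, assuming raw points are written to a short-lived policy while the CQ output stays in autogen (the default policy name in InfluxDB 1.x). The helper names, the policy name raw_7d, the CQ name sensor_temp_keep_cq, and the 7-day duration are all hypothetical.

```shell
# Hypothetical helper: print the statement that creates a short-lived retention
# policy for raw sensor data. Policy name and duration are illustrative.
raw_rp_statement() (
  name=${1:-raw_7d}
  duration=${2:-7d}
  printf 'CREATE RETENTION POLICY %s ON mydb DURATION %s REPLICATION 1\n' "$name" "$duration"
)

# Hypothetical helper: print a CQ that reads from the raw policy and writes the
# standardized measurement into autogen, so it outlives the raw points.
rp_cq_statement() (
  name=${1:-raw_7d}
  printf 'CREATE CONTINUOUS QUERY sensor_temp_keep_cq ON mydb BEGIN SELECT last(temperature) AS temperature INTO mydb.autogen.sensor_temperature FROM mydb.%s.sensor GROUP BY time(1m), device END\n' "$name"
)
```

Under this setup, raw writes would target the short-lived policy by adding rp=raw_7d to the write URL (/write?db=mydb&rp=raw_7d), and InfluxDB would expire them after the policy’s duration.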
