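InfluxDB 1.x and 2.x don’t store data as Parquet; they store it in TSM (Time-Structured Merge Tree) files, InfluxDB’s own highly optimized, time-series-specific columnar format. (InfluxDB 3 is the exception: its storage engine persists data as Parquet files.)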
Let’s see it in action. Imagine you have a simple InfluxDB setup and you want to export some metrics.
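# First, set up some sample data (InfluxDB 1.x CLI)
influx -execute 'CREATE DATABASE my_metrics'
# Each -execute call is its own session, so a separate 'USE my_metrics' has no effect; pass -database instead.
# thirty_days is not created as the DEFAULT policy, so the writes below still land in autogen.
influx -execute 'CREATE RETENTION POLICY thirty_days ON my_metrics DURATION 30d REPLICATION 1'
influx -database my_metrics -execute 'INSERT cpu,host=server01,region=us-east-1 usage=0.5,idle=0.5' # No timestamp: the server assigns the current time
# -precision s makes the CLI read the trailing timestamps as epoch seconds (the default is nanoseconds)
influx -database my_metrics -precision s -execute 'INSERT cpu,host=server01,region=us-east-1 usage=0.6,idle=0.4 1678881600' # March 15, 2023 12:00:00 UTC
influx -database my_metrics -precision s -execute 'INSERT cpu,host=server01,region=us-east-1 usage=0.7,idle=0.3 1678881660' # March 15, 2023 12:01:00 UTC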
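Epoch timestamps are easy to get wrong, either by a few hours or by a unit (seconds versus nanoseconds), so it can be worth checking them before inserting. A minimal sketch using only `java.time` from the JDK; the names `EpochCheck`, `toEpochSeconds`, and `fromEpochSeconds` are made up for illustration:

```scala
import java.time.{Instant, LocalDateTime, ZoneOffset}

object EpochCheck {
  // Epoch seconds for a wall-clock time interpreted as UTC.
  def toEpochSeconds(utc: LocalDateTime): Long = utc.toEpochSecond(ZoneOffset.UTC)

  // Inverse direction: epoch seconds back to an ISO-8601 UTC string for eyeballing.
  def fromEpochSeconds(seconds: Long): String = Instant.ofEpochSecond(seconds).toString

  def main(args: Array[String]): Unit = {
    val noon = toEpochSeconds(LocalDateTime.of(2023, 3, 15, 12, 0, 0))
    println(noon)                        // 1678881600
    println(fromEpochSeconds(noon + 60)) // 2023-03-15T12:01:00Z
    // The same instant in nanoseconds, the unit the influx CLI assumes by default for writes.
    println(noon * 1000000000L)          // 1678881600000000000
  }
}
```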
Now, to get this data out into Parquet, you need something that can query InfluxDB and write Parquet. A common pattern is to use a data processing framework like Apache Spark, or a dedicated export tool. Here’s a conceptual Spark job snippet. The `org.influxdata.spark` data source name and its option keys stand in for whichever InfluxDB connector you actually use, and the details will vary with your Spark setup and object storage connector:
import org.apache.spark.sql.SparkSession
// Note: "org.influxdata.spark" below is a placeholder data source name.
// Put your connector's JAR on the classpath and use the format name and option keys it documents.
val spark = SparkSession.builder()
  .appName("InfluxDBtoParquet")
  .master("local[*]") // Or your cluster manager
  .getOrCreate()
// InfluxDB connection properties
val influxdbOptions = Map(
  "host" -> "localhost",
  "port" -> "8086",
  "database" -> "my_metrics",
  "username" -> "admin", // If authentication is enabled
  "password" -> "password" // If authentication is enabled
)
// Read data from InfluxDB
val df = spark.read
  .format("org.influxdata.spark")
  .options(influxdbOptions)
  .option("measurement", "cpu") // Specify the measurement to read
  .load()
// Show the schema and some data
df.printSchema()
df.show(5, truncate = false)
/*
Output might look like this. The first INSERT carried no timestamp, so its row
shows the server's write time rather than a 2023-03-15 value:
root
 |-- time: timestamp (nullable = true)
 |-- host: string (nullable = true)
 |-- region: string (nullable = true)
 |-- usage: double (nullable = true)
 |-- idle: double (nullable = true)

+-------------------+--------+---------+-----+----+
|time               |host    |region   |usage|idle|
+-------------------+--------+---------+-----+----+
|2023-03-15 12:00:00|server01|us-east-1|0.6  |0.4 |
|2023-03-15 12:01:00|server01|us-east-1|0.7  |0.3 |
|<write time>       |server01|us-east-1|0.5  |0.5 |
+-------------------+--------+---------+-----+----+
*/
// Write to Parquet in Object Storage (e.g., S3)
val parquetOutputPath = "s3a://my-bucket/influxdb-exports/cpu/"
df.write.mode("overwrite").parquet(parquetOutputPath)
spark.stop()
This process addresses a fundamental challenge: InfluxDB is optimized for fast writes and range queries on time-series data, but it’s not designed for large-scale analytical processing or integration with traditional big data ecosystems that favor formats like Parquet. Parquet is a columnar storage format that offers excellent compression and encoding schemes, making it efficient for analytical workloads and compatible with tools like Spark, Presto, and Hive. By exporting to Parquet, you’re essentially translating InfluxDB’s specialized time-series structure into a format that’s more amenable to broader data analysis and long-term archival.
The core problem this solves is bridging the gap between a specialized time-series database and a general-purpose data lake or analytical platform. When you need to run complex aggregations, join time-series data with other relational or semi-structured sources, or feed machine learning frameworks that operate on tabular data, InfluxDB’s native format becomes a bottleneck, and Parquet, with its efficient columnar layout and support for schema evolution, serves as the lingua franca. The export itself has two steps: query InfluxDB for the measurements and time range you want, then write the result as Parquet, typically with a query engine or data processing framework that can do both.
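The querying half does not require a Spark connector. As a rough sketch, assuming an InfluxDB 1.x server and its HTTP `/query` endpoint (with the `db`, `q`, and `epoch` parameters), the following builds a request URL for a bounded export window. The object and method names are illustrative, and sending the request and parsing the response are left out:

```scala
import java.net.URLEncoder

object ExportQuery {
  // InfluxQL for one measurement over a half-open time window [from, to).
  def influxQl(measurement: String, fromRfc3339: String, toRfc3339: String): String =
    s"""SELECT * FROM "$measurement" WHERE time >= '$fromRfc3339' AND time < '$toRfc3339'"""

  // URL for the 1.x /query endpoint; epoch=s asks for timestamps as epoch seconds.
  def queryUrl(baseUrl: String, database: String, query: String): String = {
    def enc(s: String): String = URLEncoder.encode(s, "UTF-8")
    s"$baseUrl/query?db=${enc(database)}&epoch=s&q=${enc(query)}"
  }

  def main(args: Array[String]): Unit = {
    val q = influxQl("cpu", "2023-03-15T00:00:00Z", "2023-03-16T00:00:00Z")
    println(queryUrl("http://localhost:8086", "my_metrics", q))
  }
}
```

Exporting one bounded window at a time keeps each response small and makes a failed export easy to retry.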
When you export measurements with many tags or fields, a Spark connector for InfluxDB will typically flatten every tag key and field key into its own top-level column. This is generally what you want for Parquet, since it yields a standard tabular structure. The width of the table is set by the number of distinct tag keys and field keys, so a measurement with many keys produces a very wide table; high tag-value cardinality, by contrast, adds rows (more series), not columns. The time field is always preserved as its own timestamp column.
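To make the flattening concrete, here is a small connector-independent sketch in plain Scala. It parses one line-protocol record, in the simple unescaped form used by the INSERT statements earlier, and flattens its tags and fields into a single row keyed by column name. `Point`, `parse`, and `flatten` are hypothetical names, not part of any InfluxDB or Spark API, and the parser deliberately ignores escaping, quoted string fields, and typed suffixes such as `42i`:

```scala
object FlattenPoint {
  // One line-protocol record: measurement, tag set, field set, optional timestamp.
  final case class Point(
      measurement: String,
      tags: Map[String, String],
      fields: Map[String, Double],
      time: Option[Long])

  // Split "k1=v1,k2=v2" into pairs. Assumes no escaped commas or equals signs.
  private def pairs(s: String): Map[String, String] =
    s.split(",").map { kv =>
      val Array(k, v) = kv.split("=", 2)
      k -> v
    }.toMap

  def parse(line: String): Point = {
    val parts = line.trim.split(" ")
    val head = parts(0).split(",", 2) // measurement, then the tag set if present
    val tags = if (head.length > 1) pairs(head(1)) else Map.empty[String, String]
    val fields = pairs(parts(1)).map { case (k, v) => k -> v.toDouble }
    val time = if (parts.length > 2) Some(parts(2).toLong) else None
    Point(head(0), tags, fields, time)
  }

  // Tags and fields become sibling top-level columns, next to "time" when it is present.
  def flatten(p: Point): Map[String, Any] = {
    val base: Map[String, Any] = p.tags
    base ++ p.fields ++ p.time.map(t => "time" -> t)
  }

  def main(args: Array[String]): Unit = {
    val row = flatten(parse("cpu,host=server01,region=us-east-1 usage=0.6,idle=0.4 1678881600"))
    println(row.keys.toSeq.sorted.mkString(", ")) // host, idle, region, time, usage
  }
}
```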
The most surprising thing about this export is how the schema is handled. InfluxDB’s internal representation is not a fixed table: different series in the same measurement can carry different tag keys and field keys. When you query it (especially through a tool like a Spark connector), the schema is resolved from the data the query actually returns. Every tag key and field key found becomes a column, and if a tag or field is missing for a given data point, that column is null for that row in the resulting DataFrame and Parquet file. The Parquet schema is therefore the union of all tags and fields encountered for the queried measurement, so downstream queries should expect nulls and a schema that can differ between exports.
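The union-of-keys behaviour can be illustrated without InfluxDB or Spark. The sketch below takes already-flattened rows with different key sets, derives the union schema, and pads missing values with `None`, which corresponds to a null in a DataFrame or Parquet file. `unionSchema` and `align` are illustrative names, and the `server02` row with its `rack` tag is invented sample data:

```scala
object SchemaUnion {
  type Row = Map[String, Any]

  // The resolved schema: every column seen in any row, "time" first, the rest sorted.
  def unionSchema(rows: Seq[Row]): Seq[String] = {
    val keys = rows.flatMap(_.keys).distinct
    keys.filter(_ == "time") ++ keys.filterNot(_ == "time").sorted
  }

  // Align every row to the schema; absent columns become None (null downstream).
  def align(rows: Seq[Row], schema: Seq[String]): Seq[Seq[Option[Any]]] =
    rows.map(row => schema.map(row.get))

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      Map("time" -> 1678881600L, "host" -> "server01", "usage" -> 0.6, "idle" -> 0.4),
      Map("time" -> 1678881660L, "host" -> "server02", "rack" -> "r12", "usage" -> 0.7)
    )
    val schema = unionSchema(rows)
    println(schema.mkString(" | "))
    align(rows, schema).foreach(cells => println(cells.map(_.getOrElse("null")).mkString(" | ")))
    // time | host | idle | rack | usage
    // 1678881600 | server01 | 0.4 | null | 0.6
    // 1678881660 | server02 | null | r12 | 0.7
  }
}
```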
The next logical step after archiving your data in Parquet is often to query it using a distributed SQL engine like Presto or Trino, or to load it into a data warehouse.