InfluxDB can grind to a halt not because it’s slow, but because it’s too eager to index every distinct tag value it is given.
Let’s see this in action. Imagine we’re collecting metrics about user activity.
# Start InfluxDB and create a database
influxd &
influx
> CREATE DATABASE metrics
# Write some data with a high-cardinality tag ('user_id')
> USE metrics
> INSERT cpu,host=server1,user_id=user_abc value=0.5
> INSERT cpu,host=server1,user_id=user_xyz value=0.6
> INSERT cpu,host=server1,user_id=user_123 value=0.7
> INSERT cpu,host=server1,user_id=user_abc value=0.55
> INSERT cpu,host=server1,user_id=user_xyz value=0.65
> INSERT cpu,host=server1,user_id=user_123 value=0.75
Now suppose we have millions of unique user_id values. This is where the trouble starts. InfluxDB indexes every series, meaning every unique combination of measurement and tag set, so each new user_id value creates at least one new series. With millions of them, the index itself becomes massive and consumes large amounts of RAM and CPU. Even a query that looks simple, such as `SELECT value FROM cpu WHERE host='server1'`, has to resolve and read every one of those series, which can lead to extreme slowdowns or outright crashes. The system isn’t failing to write data; it’s failing to read it efficiently because the index for that data has become unmanageable.
The core problem is that InfluxDB, by default, treats every distinct tag value as a separate entry in its inverted index. This is fantastic for filtering on a small, known set of tags (like host, region, environment). But when a tag can have a near-infinite number of unique values (like user_id, request_id, session_id, transaction_id), this index explodes. Memory usage skyrockets, cache hit rates plummet, and query performance degrades to unusable levels. The system becomes choked on its own metadata.
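The effect on series count can be illustrated offline, without a running server. The following is a minimal Python sketch, not InfluxDB's own logic: `series_key` and `series_cardinality` are hypothetical helpers, and the parser assumes line protocol with no escaped spaces or commas.

```python
def series_key(line: str) -> str:
    """Return the series key (measurement plus sorted tag set) of one line.

    Simplifying assumption: the line contains no escaped spaces or commas.
    """
    head = line.split(" ", 1)[0]  # e.g. "cpu,host=server1,user_id=user_abc"
    measurement, *tags = head.split(",")
    return ",".join([measurement, *sorted(tags)])


def series_cardinality(lines) -> int:
    """Count the distinct series keys across the given line-protocol lines."""
    return len({series_key(line) for line in lines})


users = [f"user_{i}" for i in range(1000)]

# user_id as a tag: every user creates a new series.
tagged = [f"cpu,host=server1,user_id={u} value=0.5" for u in users]

# user_id as a field: all points land in a single series.
fielded = [f'cpu,host=server1 user_id="{u}",value=0.5' for u in users]

print(series_cardinality(tagged))   # 1000
print(series_cardinality(fielded))  # 1
```

The same thousand points produce a thousand index entries in one layout and a single entry in the other, which is the whole difference between the two schemas.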
To combat this, you need to identify which tags are causing this cardinality explosion and adjust your data model accordingly.
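A sample of the data you are about to write can also be audited before it reaches the database. The sketch below counts distinct values per tag key and flags keys above a threshold. The helper names and the threshold of 1000 are assumptions chosen for illustration, and the parser again assumes no escaped spaces or commas.

```python
from collections import defaultdict


def tag_value_counts(lines):
    """Map each tag key to the number of distinct values seen for it."""
    values = defaultdict(set)
    for line in lines:
        head = line.split(" ", 1)[0]          # measurement and tag set
        for pair in head.split(",")[1:]:      # skip the measurement name
            key, _, value = pair.partition("=")
            values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


def suspicious_tags(lines, threshold=1000):
    """Return tag keys whose distinct-value count exceeds the threshold."""
    counts = tag_value_counts(lines)
    return sorted(key for key, n in counts.items() if n > threshold)


sample = [
    f"cpu,host=server{i % 3},user_id=user_{i} value=0.5" for i in range(5000)
]
print(tag_value_counts(sample))  # {'host': 3, 'user_id': 5000}
print(suspicious_tags(sample))   # ['user_id']
```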
Here are the common culprits and how to address them:
1. User IDs or Session IDs as Tags:
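- Diagnosis: Run `SHOW TAG VALUES FROM <measurement> WITH KEY = "user_id"` (replace `<measurement>` and `"user_id"` with your own names). If you get millions of distinct values, this is a problem. On InfluxDB 1.x versions that support it, `SHOW TAG VALUES CARDINALITY FROM <measurement> WITH KEY = "user_id"` returns only the count, which is easier to handle on a large database.
- Fix: Stop tagging with these high-cardinality values. Instead, write them as fields. For example, write `cpu,host=server1 user_id="user_abc",value=0.5` instead of `cpu,host=server1,user_id=user_abc value=0.5`. Note that string field values are quoted and fields are separated by commas, while tag values are unquoted.
- Why it works: Fields are not indexed. They are stored with the point data rather than in the series index, so they add no new series and the index stays small. The trade-off is that filtering on a field means scanning points instead of using an index lookup.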
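If the points are already being produced as line-protocol text, the tag-to-field move can be applied as a rewrite step before writing. This is a rough sketch under stated assumptions: `tag_to_field` is a hypothetical helper, and it assumes no escaped characters in the tag set and no spaces inside field values.

```python
def tag_to_field(line: str, key: str) -> str:
    """Rewrite one line-protocol line so that tag `key` becomes a string field.

    Assumptions: no escaped spaces or commas in the tag set, and no spaces
    inside field values. An optional trailing timestamp is preserved.
    """
    head, rest = line.split(" ", 1)
    measurement, *tags = head.split(",")

    kept, moved = [], None
    for pair in tags:
        k, _, v = pair.partition("=")
        if k == key:
            moved = v
        else:
            kept.append(pair)
    if moved is None:
        return line  # the tag is not present, nothing to do

    fields, _, timestamp = rest.partition(" ")
    # String field values are double-quoted; escape backslashes and quotes.
    escaped = moved.replace("\\", "\\\\").replace('"', '\\"')
    out = f'{",".join([measurement, *kept])} {key}="{escaped}",{fields}'
    return f"{out} {timestamp}" if timestamp else out


print(tag_to_field("cpu,host=server1,user_id=user_abc value=0.5", "user_id"))
# cpu,host=server1 user_id="user_abc",value=0.5
```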
2. Request/Transaction IDs as Tags:
- Diagnosis: As with user IDs, run `SHOW TAG VALUES FROM <measurement> WITH KEY = "request_id"`.
- Fix: Again, move these to fields. `http_requests,host=server1 request_id="abc123xyz",status=200,duration_ms=50` is better than `http_requests,host=server1,request_id=abc123xyz status=200,duration_ms=50`.
- Why it works: Individual requests no longer create their own series, so the index only has to cover the general dimensions you actually filter on.
3. Extremely Granular Event Identifiers:
- Diagnosis: You’re tagging on values like `error_code="ERR_12345_USER_SPECIFIC_MSG"`, where the "USER_SPECIFIC_MSG" part changes per event.
- Fix: Tag on the type of event or a generalized identifier, not the specific instance. Tag `error_type="USER_LOGIN_FAILED"` and put the specific error message in a field.
- Why it works: It groups similar events under a single tag value, drastically reducing cardinality while still allowing for filtering by event category.
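A point that follows this split might be built as in the sketch below. The measurement name `errors`, the `count` field, and the helper functions are all assumptions made for illustration. The escaping follows the line-protocol rules: commas, equals signs, and spaces are escaped in tag values, and backslashes and double quotes are escaped in string fields.

```python
def escape_tag(value: str) -> str:
    """Escape the characters that are special in tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def escape_string_field(value: str) -> str:
    """Escape backslashes and double quotes inside a string field value."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def error_point(host: str, error_type: str, message: str) -> str:
    """Build a line with a low-cardinality tag and the free text as a field."""
    return (
        f"errors,host={escape_tag(host)},error_type={escape_tag(error_type)} "
        f'message="{escape_string_field(message)}",count=1i'
    )


print(error_point("server1", "USER_LOGIN_FAILED", 'bad password for "alice"'))
```

Only `host` and `error_type` reach the index here. The message can vary per event without creating new series.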
4. Dynamic Hostnames or Service Names in Tags:
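- Diagnosis: Run `SHOW TAG VALUES FROM <measurement> WITH KEY = "service_name"`. The warning sign is thousands of ephemeral containers or microservice instances, each with a unique name.
- Fix: If your infrastructure exposes a stable identifier (such as a deployment name, a logical service name, or a long-lived VM ID), tag on that instead of the per-instance name. Be careful with Kubernetes pod names: pods created by a Deployment get a new generated name every time they are replaced, so they are not stable. If no stable identifier exists, consider whether the name is truly needed for filtering or whether it belongs in a field.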
- Why it works: Promotes the use of stable, less-frequently changing identifiers for indexing, which InfluxDB can handle efficiently.
5. Incorrect Schema Design (e.g., Tagging Timestamp-like Data):
- Diagnosis: You have a tag like `event_timestamp="2023-10-27T10:00:01Z"` and you’re trying to filter by it.
- Fix: Timestamps should never be stored as tags. Every point already carries a timestamp as part of the data structure. If you need to filter by a specific time, use a time range on the built-in `time` column in the `WHERE` clause.
- Why it works: Avoids creating a massive index for data that InfluxDB already manages intrinsically.
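Filtering on the built-in `time` column looks roughly like the query generated below. This is a sketch only: `time_window_query` is a hypothetical helper, the measurement and field names are placeholders, and the identifiers are interpolated directly into the string, so it should only be used with trusted input.

```python
from datetime import datetime, timedelta, timezone


def time_window_query(measurement: str, field: str,
                      start: datetime, window: timedelta) -> str:
    """Build an InfluxQL query covering the half-open range [start, start + window).

    `start` must be a timezone-aware datetime; it is rendered in UTC as an
    RFC3339 timestamp, which InfluxQL accepts in single quotes.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    lo = start.astimezone(timezone.utc).strftime(fmt)
    hi = (start + window).astimezone(timezone.utc).strftime(fmt)
    return (
        f'SELECT "{field}" FROM "{measurement}" '
        f"WHERE time >= '{lo}' AND time < '{hi}'"
    )


start = datetime(2023, 10, 27, 10, 0, 1, tzinfo=timezone.utc)
print(time_window_query("events", "value", start, timedelta(seconds=1)))
# SELECT "value" FROM "events" WHERE time >= '2023-10-27T10:00:01Z' AND time < '2023-10-27T10:00:02Z'
```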
6. Using Geographic Coordinates Directly as Tags:
- Diagnosis: You’re tagging with `latitude="34.0522"` and `longitude="-118.2437"`.
- Fix: If you need to query by location, consider a geohash truncated to a coarse precision, or broader region tags (e.g., `city="Los Angeles"`, `country="USA"`). Keep the exact coordinates as fields.
- Why it works: A truncated geohash maps every coordinate inside a grid cell to the same short string, so cardinality is bounded by the number of cells rather than the number of distinct coordinates. Broader region tags group many individual points in the same way.
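For reference, the standard geohash encoding is short enough to sketch in Python. The function below is a plain implementation of the public algorithm, not an InfluxDB feature. The `precision` parameter (number of characters) is what controls cardinality, and the default of 5 is an arbitrary choice for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a coordinate as a geohash with `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0

    return "".join(chars)


print(geohash(57.64911, 10.40744, precision=5))  # u4pru
print(geohash(34.0522, -118.2437, precision=4))  # coarse cell covering the Los Angeles point
```

A shorter hash is always a prefix of a longer one for the same point, so precision can be lowered later without re-deriving anything from the raw coordinates.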
7. Over-Tagging Generic Attributes:
- Diagnosis: You’re tagging every possible attribute of an object, like `device_model="XYZ-1000"`, `device_version="v2.1.3"`, `device_serial="SN123456789"`.
- Fix: Prioritize which attributes are truly used for filtering. If `device_model` and `device_version` are common filter criteria, keep them. If `device_serial` is unique to each device and rarely filtered on, move it to a field.
- Why it works: Focuses the index on the most frequently queried dimensions, making lookups faster.
The key principle is to reserve tags for dimensions that are used to group and filter sets of measurements, not for unique identifiers or attributes whose values change frequently.
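Correcting your schema only affects new writes. Series created under the old schema stay in the index until you drop them (for example with `DROP SERIES`) or their data ages out under a retention policy, so plan to clean up or re-ingest the old data. Depending on your version and index type, you may also need to restart InfluxDB before memory usage drops. The next problem you might encounter is disk space exhaustion if you haven’t configured retention policies.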