Kapacitor can detect anomalies in your InfluxDB data by comparing current data points against historical patterns or defined thresholds.
Suppose you’re monitoring CPU usage for a fleet of servers and want to be alerted when a server’s CPU spikes unexpectedly: not only when it crosses a static 90% threshold, but whenever usage is unusually high for that specific server.
Here’s a simplified InfluxDB query and a Kapacitor TICK script to achieve this:
InfluxDB Query (Conceptual):
SELECT mean("usage_system")
FROM "cpu"
WHERE time >= now() - 1h GROUP BY "host", time(1m)
This query fetches the average system CPU usage per host, grouped into 1-minute intervals over the last hour.
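One way to hand a query like this to Kapacitor is a batch task, which runs the query on a schedule. The following is a minimal sketch, not a definitive setup: it assumes the database is `telegraf` and the retention policy is `autogen`. In a batch `query` node the time range and grouping are normally supplied through `.period()`, `.every()`, and `.groupBy()` rather than written into the query string, so the `WHERE` and `GROUP BY` clauses move out of the SQL.

```tickscript
// Sketch of a batch task that runs the query above once a minute.
// Assumptions: database "telegraf", retention policy "autogen".
batch
    |query('SELECT mean("usage_system") AS "value" FROM "telegraf"."autogen"."cpu"')
        .period(1h)                 // each run covers the last hour (replaces the WHERE time clause)
        .every(1m)                  // run the query every minute
        .groupBy(time(1m), 'host')  // replaces the GROUP BY clause
    |log()                          // write each result to Kapacitor's log while testing
```

The script in the next section reads from a stream instead, so it sees points as they arrive rather than re-querying InfluxDB.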
Kapacitor TICK Script:
// Read points from InfluxDB as they are written
stream
    |from()
        .database('telegraf')
        .measurement('cpu')
        .groupBy('host') // Analyze each server independently
    // Collect points into 1-minute windows, emitted every minute
    |window()
        .period(1m)
        .every(1m)
    // Reduce each window to its mean system CPU usage
    |mean('usage_system')
        .as('value')
    // sigma() keeps a running mean and standard deviation per host and returns
    // how many standard deviations the current value is from that mean
    |eval(lambda: sigma("value"))
        .as('sigma')
        .keep('value', 'sigma')
    // Define the alert and its properties
    |alert()
        .id('{{ index .Tags "host" }}-cpu-anomaly') // Unique ID per host
        .message('High CPU anomaly detected on {{ index .Tags "host" }}: {{ index .Fields "value" | printf "%.2f" }}%')
        // Critical when the value is more than 2 standard deviations from the host's mean
        .crit(lambda: "sigma" > 2.0)
        // For demonstration, write alerts to a file (example path).
        // In a real scenario, you'd use .slack(), .pagerDuty(), etc.
        .log('/tmp/cpu_anomaly.log')
Explanation of the TICK Script:
- `stream|from()`: This is the entry point. It subscribes the task to points as they are written to InfluxDB, limited to the database (`telegraf`) and measurement (`cpu`) you’re interested in. `.groupBy('host')` is crucial here; it ensures that Kapacitor processes each server’s data independently, allowing for host-specific anomaly detection.
- `window()`: `.period(1m)` and `.every(1m)` define how Kapacitor chunks the incoming data for processing. Each window covers one minute of data, and a new window is emitted every minute.
- `mean('usage_system')`: This reduces each window to the average of the field we want to analyze, in this case `usage_system`, and stores it as `value`.
- `eval(lambda: sigma("value"))`: This is the core of the anomaly detection. `sigma()` is a stateful function: for each group (here, each host) it maintains a running mean and standard deviation of the values it has seen, and returns how many standard deviations the current value is from that mean. `.as('sigma')` names the result, and `.keep('value', 'sigma')` keeps both fields on the point so the alert can use them.
  - The baseline is a running one, built from the points the task has processed, rather than a fixed window of the last N points. Expect the first readings after the task starts to be unreliable until some history has accumulated.
  - `sigma()` measures distance from the mean, so on its own it flags values that are unusually low as well as unusually high. If only spikes matter, add a second condition to the alert’s lambda.
- `alert()`: This defines the alert itself.
  - `.crit(lambda: "sigma" > 2.0)`: A critical alert is triggered when the current value is more than 2 standard deviations from the host’s running mean. This number sets the sensitivity. 2.0 is a common starting point; raise it to get fewer alerts.
  - `.message(...)`: This customizes the alert message, including the hostname and the anomalous value.
  - `.id(...)`: Assigns a unique identifier to the alert. Because the ID includes the hostname, each server’s alert state is tracked separately, which is useful for managing alerts and preventing duplicates.
  - `.log(...)`: In this example, we’re just writing alert events to a file; the path is only an example. In a production setup, you would replace this with `.slack()`, `.pagerDuty()`, `.email()`, or other notification integrations.
The real strength of this approach is that it doesn’t rely on a static threshold. It learns what "normal" looks like for each host over time and alerts you when a reading deviates significantly from that specific host’s learned behavior. This is far more robust than a simple threshold alert, which treats a server that normally idles at 70% CPU and one that normally idles at 10% identically: it stays silent for both until they hit 90%.
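The standard-deviation rule described above can be written out with a small worked example. The host statistics here are hypothetical numbers chosen for illustration, with a sensitivity of $k = 2$:

$$
z = \frac{|x - \mu|}{\sigma}, \qquad \text{flag an anomaly when } z > k
$$

$$
\text{Host A } (\mu = 70,\ \sigma = 8,\ x = 80):\quad z = \frac{|80 - 70|}{8} = 1.25 < 2 \quad \text{(not flagged)}
$$

$$
\text{Host B } (\mu = 10,\ \sigma = 3,\ x = 25):\quad z = \frac{|25 - 10|}{3} = 5 > 2 \quad \text{(flagged)}
$$

Neither reading would cross a static 90% threshold, yet the 25% reading on Host B is far outside that host’s normal range, while the 80% reading on Host A is not.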
Once you’ve set up Kapacitor to send alerts to a notification endpoint like Slack, the next step is often to configure routing rules within Slack or a dedicated alert management tool to ensure critical alerts reach the right people immediately.
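On the Kapacitor side, sending these alerts to Slack might look like the sketch below. It is an illustration under stated assumptions rather than a drop-in configuration: it assumes the Slack integration is already enabled in `kapacitor.conf`, and `#ops-alerts` is a hypothetical channel name. The anomaly condition uses Kapacitor’s built-in `sigma()` function directly in the alert’s lambda.

```tickscript
// Sketch: post critical CPU anomalies to a Slack channel.
// Assumptions: Slack is enabled in kapacitor.conf; '#ops-alerts' is a hypothetical channel.
stream
    |from()
        .database('telegraf')
        .measurement('cpu')
        .groupBy('host')
    |window()
        .period(1m)
        .every(1m)
    |mean('usage_system')
        .as('value')
    |alert()
        .id('{{ index .Tags "host" }}-cpu-anomaly')
        .message('{{ .Level }}: CPU anomaly on {{ index .Tags "host" }}')
        // Critical when the value is more than 2 standard deviations from the running mean
        .crit(lambda: sigma("value") > 2.0)
        // Notify only when the alert level changes, not on every window
        .stateChangesOnly()
        .slack()
        .channel('#ops-alerts')
```

`.stateChangesOnly()` keeps a sustained anomaly from posting a new message every minute; Slack then receives one message when a host enters the critical state and one when it recovers.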