Bloom filters are Loki’s secret weapon for making your log queries blaze, but they’re not an index in the traditional sense.
Let’s watch a query unfold with bloom filters. Imagine we have logs from two distinct applications, app-a and app-b, flowing into Loki. Each log line has a job label indicating its origin.
// Log line from app-a
{ "job": "app-a", "level": "info", "message": "User logged in", "ts": 1678886400000000000 }
// Log line from app-b
{ "job": "app-b", "level": "warn", "message": "Database connection failed", "ts": 1678886401000000000 }
When these logs arrive, Loki doesn’t just dump them into a giant file. It processes them, and for each unique combination of label values (like job="app-a" or job="app-b"), it builds a bloom filter. This filter is a compact, probabilistic data structure that answers one question about a value: is it definitely absent, or possibly present?
Now, you want to query for all logs where job="app-a" and level="info". Your query looks like this:
{job="app-a", level="info"}
Loki first consults its index. This index contains metadata about where the data for each label combination is stored. Crucially, it also has a pointer to the bloom filter associated with the job="app-a" label set.
Loki then takes the rest of your query (level="info") and hashes the value "info" multiple times. Each hash result points to a specific bit in the bloom filter for job="app-a". If any of those bits is not set, Loki knows with 100% certainty that no log entry in that job="app-a" chunk has level="info", so nothing in it can match your query. It can immediately discard the entire chunk of data associated with that job="app-a" set without even looking at the actual log lines. This is the acceleration.
If all the bits are set, it means there’s a possibility that logs matching your query exist within that chunk. Loki then proceeds to check the actual log data for that chunk. This is the probabilistic part – bloom filters can have false positives (saying something might be there when it’s not), but never false negatives (saying something isn’t there when it is).
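To make the "definitely not" versus "maybe" behavior concrete, here is a minimal bloom filter sketch in Python. It is a toy for illustration, not Loki's implementation: the class name, the default sizes, and the salted SHA-256 hashing are all assumptions chosen to keep the example short.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a bit array plus k hash positions per item."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        # Sizes are arbitrary illustration values, not Loki defaults.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions by salting a single hash function with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # A single unset bit proves the item was never added (no false negatives).
        # All bits set only means "maybe": other items may have set them.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter()
bf.add("info")
bf.add("warn")
print(bf.might_contain("info"))   # True: an added item is always found
print(bf.might_contain("debug"))  # almost always False; True would be a false positive
```

Note that there is no way to remove an item or list what was added. The filter only remembers which bits were touched, which is exactly why it stays so small.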
The problem Loki solves is the sheer volume of log data. Without bloom filters, Loki would have to scan every single log line in every chunk that the label index selects for a query, which is prohibitively slow at scale. Bloom filters let Loki rule chunks out very, very quickly, eliminating vast swathes of data that cannot possibly contain the desired logs.
Internally, Loki stores these bloom filters alongside its index files. When Loki ingests data, it maintains these filters. For a given label set (e.g., job="app-a"), it calculates hash values for the values of other labels present in logs with that label set. For instance, if logs with job="app-a" also have level="info" and level="warn", the bloom filter for job="app-a" will have bits set corresponding to hashes of "info" and "warn".
When you query {job="app-a", level="info"}, Loki checks the bloom filter for job="app-a" using the hashes of "info". If all of those bits are set, it proceeds. If your query was {job="app-a", level="debug"}, and "debug" was never seen with job="app-a", at least one of the corresponding bits in the bloom filter would very likely be unset, and Loki would skip that chunk.
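The lookup described above can be sketched in a few lines of Python. This follows the article's model of one filter per label set, holding the values of the other labels seen with it. The function names, the filter size, and the hashing scheme are assumptions made for the sketch, not Loki internals.

```python
import hashlib

NUM_BITS = 1024   # assumed size, for illustration only
NUM_HASHES = 4    # assumed hash count, for illustration only


def positions(value: str):
    """Return the bit positions that a value maps to."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{value}".encode()).digest()[:8], "big")
        % NUM_BITS
        for i in range(NUM_HASHES)
    ]


def build_filter(values) -> int:
    """Build a filter (stored as a Python int used as a bitmask) from label values."""
    bits = 0
    for value in values:
        for pos in positions(value):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, value: str) -> bool:
    """True only if every bit for the value is set."""
    return all((bits >> pos) & 1 for pos in positions(value))


# One filter per label set, holding the level values seen with that label set.
filters = {
    'job="app-a"': build_filter(["info", "warn"]),
    'job="app-b"': build_filter(["warn"]),
}


def should_scan(label_set: str, level: str) -> bool:
    """False means the chunk can be skipped without reading any log lines."""
    return might_contain(filters[label_set], level)


print(should_scan('job="app-a"', "info"))   # True: "info" was added to this filter
print(should_scan('job="app-a"', "debug"))  # very likely False, so the chunk is skipped
```

The interesting case is the second call: a False answer costs a handful of hash computations and bit checks, while the alternative is reading the whole chunk.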
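The actual bloom filter implementation in Loki is typically based on the bloom package, using multiple hash functions and a bit array. The size of the bit array and the number of hash functions are configurable parameters that balance memory usage against the false positive rate: a larger bit array lowers the false positive rate but costs more memory, and the best number of hash functions depends on how many bits are available per stored item.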
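To get a feel for that trade-off, the textbook sizing formulas for a standard bloom filter can be worked through in Python. These are the general formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n items and target false positive rate p. The function names are made up for this sketch, and nothing here reflects Loki's actual configuration options.

```python
import math


def bloom_parameters(expected_items: int, false_positive_rate: float):
    """Return (num_bits, num_hashes) for a standard bloom filter."""
    num_bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    num_hashes = max(1, round(num_bits / expected_items * math.log(2)))
    return num_bits, num_hashes


def estimated_false_positive_rate(num_bits: int, num_hashes: int, items: int) -> float:
    """Approximate false positive rate once `items` values have been added."""
    return (1 - math.exp(-num_hashes * items / num_bits)) ** num_hashes


num_bits, num_hashes = bloom_parameters(expected_items=1000, false_positive_rate=0.01)
print(num_bits, num_hashes)
print(estimated_false_positive_rate(num_bits, num_hashes, items=1000))
```

For 1,000 items at a 1% target this works out to a little under 10 bits per item and 7 hash functions. Overfilling the same filter pushes the false positive rate up quickly, which is why sizing matters.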
The most surprising aspect is that Loki gets exact results out of a probabilistic structure, because it always falls back to scanning the actual data. The bloom filter is purely an optimization; if it says "maybe," Loki still verifies. It never relies solely on the bloom filter for a definitive "yes." This is why you’ll sometimes see queries that seem slow even with bloom filters: if the filter says "maybe" for a large chunk of data, Loki still has to do the work of examining that data.
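Here is a small Python sketch of that filter-then-verify flow. The filter is deliberately tiny so that false positives are likely, which makes the point that they cost extra scanning but never produce wrong results. Everything in it is hypothetical: the chunk layout, the function names, and the "Slow response" sample line are invented for the example and are not Loki's data model.

```python
import hashlib

NUM_BITS = 16    # deliberately tiny so false positives are likely
NUM_HASHES = 2


def positions(value: str):
    """Return the bit positions that a value maps to."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{value}".encode()).digest()[:8], "big")
        % NUM_BITS
        for i in range(NUM_HASHES)
    ]


def build_filter(values) -> int:
    """Build a filter stored as an int bitmask."""
    bits = 0
    for value in values:
        for pos in positions(value):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, value: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(value))


# Hypothetical chunks: each holds (level, message) pairs for one label set.
chunks = {
    "app-a": [("info", "User logged in"), ("warn", "Slow response")],
    "app-b": [("warn", "Database connection failed")],
}
filters = {
    name: build_filter(level for level, _ in lines) for name, lines in chunks.items()
}


def query(level: str):
    """Return (matching messages, number of chunks actually scanned)."""
    matches, scanned = [], 0
    for name, lines in chunks.items():
        if not might_contain(filters[name], level):
            continue  # definite "no": skip without reading the chunk
        scanned += 1  # "maybe": the chunk must still be read and verified
        matches.extend(msg for lvl, msg in lines if lvl == level)
    return matches, scanned


print(query("info"))
print(query("debug"))
```

Whatever the filter answers for "debug", the returned list of matches is empty, because the scan is the source of truth. A false positive only shows up as a higher scanned count.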
The next logical step is understanding how Loki’s querier component optimizes reads after the bloom filter has passed a chunk, specifically how it uses min/max index data to prune even further.