Loki’s topk and sort expressions are your go-to tools for wrangling massive streams of logs, letting you zoom in on the most frequent errors or the slowest requests without drowning in data.

Let’s see this in action. Imagine you’ve got a Kubernetes cluster spitting out logs, and you want to find the 5 log streams from your my-app deployment that are producing the most error lines.

topk(5, count_over_time({app="my-app"} |= "error" [5m]))

This query does a few things:

  1. {app="my-app"}: This is your basic label selector, filtering logs to only those originating from your my-app service.
  2. |= "error": This filters those logs further, only keeping lines that contain the string "error". This is a content-based filter.
  3. [5m]: This is the range. At each evaluation point, Loki looks back over the preceding 5 minutes of logs.
  4. count_over_time(...): This function counts the log lines that matched the selector and filter within that 5-minute window, separately for each log stream (each unique combination of labels).
  5. topk(5, ...): Finally, this takes the per-stream counts from count_over_time and returns the 5 streams with the highest counts. Loki does not group by the content of the log line here: without an explicit by (...) clause, each resulting series is one stream, identified by its labels.
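The steps above can be sketched as a toy model in Python. This is not Loki code, just an illustration of the semantics: the entries, the pod label, and the helper names are all made up, and every entry is assumed to fall inside a single 5-minute window.

```python
from collections import Counter

# Hypothetical log entries: (stream labels, log line).
# A stream is one unique label set.
entries = [
    ({"app": "my-app", "pod": "a"}, "error: db timeout"),
    ({"app": "my-app", "pod": "a"}, "error: db timeout"),
    ({"app": "my-app", "pod": "a"}, "info: started"),
    ({"app": "my-app", "pod": "b"}, "error: bad input"),
    ({"app": "other", "pod": "c"}, "error: ignored"),
]

def count_over_time(entries, selector, needle):
    """Count, per stream, the lines that match the selector and contain needle."""
    counts = Counter()
    for labels, line in entries:
        matches = all(labels.get(k) == v for k, v in selector.items())
        if matches and needle in line:
            # The key is the stream's label set, never the line text.
            counts[tuple(sorted(labels.items()))] += 1
    return counts

def topk(k, series):
    """Keep the k series with the largest values."""
    return sorted(series.items(), key=lambda kv: kv[1], reverse=True)[:k]

result = topk(5, count_over_time(entries, {"app": "my-app"}, "error"))
for labels, value in result:
    print(dict(labels), value)
```

Only two series come back, one per matching stream, even though topk asked for five: the ranking is over streams, so two different error messages from the same pod land in the same count.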

The output might look something like this:

{app="my-app", level="error"} 150
{app="my-app", level="error"} 120
{app="my-app", level="error"} 95
{app="my-app", level="error"} 88
{app="my-app", level="error"} 70

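Here, each number is the count of lines containing "error" in the 5-minute window for one log stream, and the keys are that stream's labels. The sample is abbreviated: in real output every row carries a distinct label set (differing in, say, pod or container), because the labels are what distinguish one stream from another.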

Now, what if you want to find the slowest requests? Let’s say your logs have a latency field and you want the top 3 requests by average latency over the last hour.

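topk(3, avg by (request) (sum by (request) (latency_seconds_sum) / sum by (request) (latency_seconds_count)))

This is a bit more involved, and it is written in the Prometheus style: it refers to metrics by name, so it runs against a Prometheus-compatible store that holds metrics derived from your logs. LogQL itself has no bare metric names; every LogQL metric query starts from a log stream selector. If you would rather compute the average directly in Loki, a query along the lines of topk(3, avg_over_time({job="my-api"} | logfmt | unwrap latency_seconds [1h]) by (request)) does the same job, assuming logfmt lines like the one shown in step 2.

  1. {job="my-api"}: The stream selector for your API job's logs. It is not part of the metric-name query above; it identifies the logs the metrics are derived from.
  2. latency_seconds_count and latency_seconds_sum: These are hypothetical metric names for series derived from your logs, for example by Loki ruler recording rules. A typical log line might look like level=info request=/users latency_seconds=0.123. From lines like this you can derive a count of measurements (latency_seconds_count) and a running total of the measured latencies (latency_seconds_sum) for each request label.
  3. sum by (request) (latency_seconds_sum) and sum by (request) (latency_seconds_count): These aggregate the sums and counts for each unique request path.
  4. sum by (request) (latency_seconds_sum) / sum by (request) (latency_seconds_count): This calculates the average latency for each request path by dividing the total of the latency measurements by the number of measurements.
  5. avg by (request) (...): This part is a bit subtle. Both sides of the division are already aggregated by request, so the division yields exactly one series per request and the outer avg leaves the values unchanged. It is redundant here and can be dropped.
  6. topk(3, ...): Finally, this picks the 3 request paths with the highest average latencies.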
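As a rough illustration of the arithmetic, here is a small Python sketch that derives a per-request sum and count from logfmt-style lines and then takes the top 3 averages. The sample lines and the parse_logfmt helper are hypothetical, and the parser only handles simple unquoted key=value pairs.

```python
from collections import defaultdict

# Hypothetical log lines in the shape the text describes.
lines = [
    "level=info request=/users latency_seconds=0.25",
    "level=info request=/users latency_seconds=0.75",
    "level=info request=/admin/users latency_seconds=1.5",
    "level=info request=/api/v1/orders latency_seconds=1.0",
    "level=info request=/public/status latency_seconds=0.125",
]

def parse_logfmt(line):
    """Split simple space-separated key=value pairs (no quoted values)."""
    return dict(pair.split("=", 1) for pair in line.split())

latency_sum = defaultdict(float)   # plays the role of latency_seconds_sum
latency_count = defaultdict(int)   # plays the role of latency_seconds_count
for line in lines:
    fields = parse_logfmt(line)
    latency_sum[fields["request"]] += float(fields["latency_seconds"])
    latency_count[fields["request"]] += 1

# Average latency per request is sum divided by count, not the reverse.
avg_latency = {req: latency_sum[req] / latency_count[req] for req in latency_sum}

# topk(3, ...): keep the three requests with the highest averages.
top3 = sorted(avg_latency.items(), key=lambda kv: kv[1], reverse=True)[:3]
for request, value in top3:
    print(request, value)
```

Flipping the division to count over sum would rank the fastest endpoints first, which is why the order of the operands matters.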

The output might look like:

{request="/admin/users"} 1.52
{request="/api/v1/orders"} 0.98
{request="/public/status"} 0.75

This tells you that /admin/users is experiencing the highest average latency.

The sort expression works similarly but orders all results based on a given metric, rather than just picking the top N. For instance, to see all requests sorted by their average latency in descending order:

sort_desc(
  avg by (request) (
    sum by (request) (latency_seconds_sum) / sum by (request) (latency_seconds_count)
  )
)

Here, sort_desc takes the results of the average latency calculation and orders them from highest to lowest; sort does the same in ascending order. There is no desc keyword, by (...) clause, or alias: both functions take a single expression and always order by sample value.
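To make the contrast with topk concrete, here is a toy Python model (not Loki code). The latency values and the /users entry are hypothetical, and sort_desc and topk are just local helper names.

```python
# Hypothetical per-request average latencies in seconds.
avg_latency = {
    "/public/status": 0.75,
    "/admin/users": 1.52,
    "/api/v1/orders": 0.98,
    "/users": 0.2,
}

def sort_desc(series):
    """Return every (request, value) pair, highest value first."""
    return sorted(series.items(), key=lambda kv: kv[1], reverse=True)

def topk(k, series):
    """Return only the k pairs with the highest values."""
    return sort_desc(series)[:k]

everything = sort_desc(avg_latency)  # all four requests, ordered
top3 = topk(3, avg_latency)          # only the three slowest
```

Sorting keeps every series and only changes the order, while topk discards everything below the cut.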

A crucial detail often missed is what happens when an aggregation has no explicit by (...) clause. count_over_time({app="my-app"} |= "error" [5m]) returns one series per log stream, that is, per unique label set. It never groups by the text of the log line. To count occurrences of each distinct error message, first extract the message into a label with a parser such as logfmt, json, or pattern, then aggregate on that label, for example sum by (msg) (count_over_time({app="my-app"} |= "error" | logfmt [5m])), assuming logfmt lines with a msg key. In the same way, adding by (request) to the average latency calculation explicitly tells the query engine to produce one series per request label.
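The difference between counting per stream and counting per extracted label can be sketched in a few lines of Python. This is a toy model under stated assumptions: the pod names, the msg key, and the extract_msg helper are hypothetical, and the lines are simple unquoted logfmt.

```python
from collections import Counter

# Hypothetical error lines from two streams, identified here by pod name.
entries = [
    ("pod-a", "level=error msg=timeout"),
    ("pod-b", "level=error msg=timeout"),
    ("pod-b", "level=error msg=refused"),
]

# No grouping clause: one count per stream, whatever the message text says.
per_stream = Counter(stream for stream, _ in entries)

def extract_msg(line):
    """Pull the msg field out of a simple logfmt line."""
    fields = dict(pair.split("=", 1) for pair in line.split())
    return fields["msg"]

# Grouping by an extracted label: parse msg first, then count per message.
per_message = Counter(extract_msg(line) for _, line in entries)

print(dict(per_stream))
print(dict(per_message))
```

The same three lines give a different ranking depending on the grouping key, so it pays to decide which question you are asking before wrapping the result in topk.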

The next logical step after identifying your slowest requests is to dive into the actual log lines for those specific requests to understand why they are slow, perhaps by filtering logs for a particular request and looking for specific error messages or unusually long processing times within those logs.
