docs: Update approx_topk documentation #16223

Merged
merged 4 commits into from
Mar 11, 2025
Changes from 2 commits
20 changes: 17 additions & 3 deletions docs/sources/query/metric_queries.md
@@ -156,11 +156,25 @@ Examples:

## Probabilistic aggregation

The `topk` keyword lets you find the largest 1,000 elements in a data stream by sample size. When `topk` hits the maximum series limit, LogQL also supports using a probable approximation; `approx_topk` is a drop-in replacement when `topk` hits the maximum series limit.
LogQL's `approx_topk` function provides a probabilistic approximation of `topk`. It is a drop-in replacement for `topk` when `topk` queries time out or hit the maximum series limit. This tends to happen when the query has to sort through a very large list of values to find the most frequent ones. `approx_topk` is also useful when a faster, approximate answer is preferable to a slower, more accurate one.

The function is of the form:

```logql
approx_topk(k, <vector expression>)
```

It is only supported for instant queries and does not support grouping. It is useful when the cardinality of the inner
vector is too high, for example, when it uses an aggregation by a structured metadata label.
`approx_topk` is only supported for instant queries. Grouping is not supported either; handle it with an inner `sum by` or `sum without`, although this might not behave the same way as `topk by`.
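
The difference between the two grouping styles can be illustrated outside LogQL. The following is a minimal Python sketch, not Loki code: the sample data is made up and the helper names `topk_by` and `topk_of_sum_by` are hypothetical. It assumes the usual semantics in which `topk by` keeps the `k` largest series within each group, whereas `topk` over an inner `sum by` keeps the `k` largest group totals.

```python
from collections import defaultdict
import heapq

# Made-up samples for illustration: (namespace, pod) -> value.
samples = {
    ("prod", "a"): 50, ("prod", "b"): 40, ("prod", "c"): 5,
    ("dev", "x"): 30, ("dev", "y"): 2,
    ("test", "z"): 1,
}

def topk_by(k, samples):
    """Mimic `topk by (namespace)`: the k largest series inside each namespace."""
    groups = defaultdict(list)
    for (namespace, pod), value in samples.items():
        groups[namespace].append((value, pod))
    return {ns: heapq.nlargest(k, vals) for ns, vals in groups.items()}

def topk_of_sum_by(k, samples):
    """Mimic `topk(k, sum by (namespace) (...))`: the k largest namespace totals."""
    totals = defaultdict(int)
    for (namespace, _pod), value in samples.items():
        totals[namespace] += value
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

per_group = topk_by(1, samples)        # one winner per namespace
overall = topk_of_sum_by(1, samples)   # one winning namespace overall
```

With `k = 1`, the first form returns one series for every namespace, while the second returns a single namespace total, so the two result sets are not interchangeable.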

Under the hood, `approx_topk` is implemented using sharding. The expression `approx_topk(k, inner)` is rewritten as:

```
topk(
k,
eval_cms(
__count_min_sketch__(inner, shard=1) ++ __count_min_sketch__(inner, shard=2)...
)
)
```

`__count_min_sketch__` is calculated for each shard, and the per-shard sketches are merged on the frontend. `eval_cms` then iterates through the list of labels and determines the estimated count for each one. Finally, `topk` selects the top `k` items.
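
The shard, merge, and evaluate flow described above can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not Loki's implementation: the sketch dimensions, the SHA-256-based hashing, the function names, and the explicit `labels` list are all hypothetical. It also assumes that merging means cell-wise addition of sketches built with the same hash functions, which is the standard way to combine count-min sketches.

```python
import hashlib
import heapq

WIDTH, DEPTH = 64, 4  # hypothetical sketch dimensions, chosen for this example

def _index(row, key):
    """Column for `key` in a given row (one independent-looking hash per row)."""
    digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def count_min_sketch(items):
    """Build a sketch from (key, count) pairs, as one shard would."""
    table = [[0] * WIDTH for _ in range(DEPTH)]
    for key, count in items:
        for row in range(DEPTH):
            table[row][_index(row, key)] += count
    return table

def merge(a, b):
    """Merge two sketches of equal shape by adding them cell by cell."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def estimate(table, key):
    """Estimated count: the minimum over rows, which never undercounts."""
    return min(table[row][_index(row, key)] for row in range(DEPTH))

# Two shards see different slices of the same (made-up) stream.
shard1 = [("a", 10), ("b", 3)]
shard2 = [("a", 7), ("c", 5), ("b", 1)]

merged = merge(count_min_sketch(shard1), count_min_sketch(shard2))
labels = ["a", "b", "c"]  # assumed to be tracked separately from the sketch
estimates = {label: estimate(merged, label) for label in labels}
top2 = heapq.nlargest(2, estimates.items(), key=lambda kv: kv[1])
```

Because hash collisions can only add to a cell, each estimate is at least the true count, so the approximation errs toward overcounting. That is the trade-off accepted in exchange for a fixed-size structure that can be computed per shard and combined cheaply.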