[Traceql Metrics] PR 4 - Sampling #3275
Merged
What this PR does:
This PR introduces subsampling hints for TraceQL metrics queries. For example, inspect a random 50% of the data and then scale the resulting rates/counts by 2x. This lets us trade a customizable amount of accuracy for speed. In internal testing it can be surprisingly accurate even at sampling rates as low as ~10%, although this depends on the query.
It is controlled through a new system of query hints which can be applied to any query. The new hint is named sample and takes a float:
Sample 100% (default behavior):
{ } | rate()
Sample 10%:
{ } | rate() with(sample=0.1)
Overall design
The sampling rate is enacted by manipulating the existing job shards to cover less data. For example, take the job for shard 1 of 10. This covers 10% of trace IDs. To sample this by 50%, we convert it to shard 2 of 20. Now it covers only 5% of trace IDs (exactly the latter half of the previous range). The results are then multiplied back up to get the final metrics. I like this approach because it maintains uniform inspection of data across the entire range, and it happens purely in the query-frontend layer. A drawback is that the total number of jobs doesn't change, so this will continue to be a bottleneck for large requests.
Hints
The new hints system is added generically. You can add with(key=val, key2=val2...) to any query. There is no validation, so unsupported hints, or using sample=... with a non-metrics query, are simply ignored. It supports all TraceQL value types, so we can have hints with strings, ints, etc. I think this could be useful and I already have a few ideas for future hints.
Other Changes
mode
so we can do sampling rates on the generators too.
Alternatives
There are other ways to inspect only 50% of the data (for example).
Notes
This is one entry in a set of chained PRs. The full body of work has been split into separate buckets to make reviews and updates more manageable.
Which issue(s) this PR fixes:
n/a
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]