[Traceql Metrics] PR 4 - Sampling #3275

mdisibio · 2024-01-09T19:53:19Z

What this PR does:
This PR introduces subsampling hints for TraceQL metrics queries: for example, inspect a random 50% of the data and then scale the resulting rates/counts by 2x. This lets us trade a configurable amount of accuracy for speed. In internal testing it can be surprisingly accurate even at sampling rates as low as ~10%, though of course this depends on the query.

It is controlled through a new system of query hints which can be applied to any query. The new hint is named sample and takes a float:

Sample 100% (default behavior):
{ } | rate()

Sample 10%:
{ } | rate() with(sample=0.1)
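
To make the scaling step concrete, here is a minimal sketch of how a count observed in the sampled data could be scaled back up. The helper names (scaleForSample, estimateTotal) are hypothetical and are not part of this PR; the valid range check on sample is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// scaleForSample returns the multiplier applied to counts/rates that were
// computed from only a fraction of the data.
// Hypothetical helper for illustration, not Tempo code.
func scaleForSample(sample float64) (float64, error) {
	if sample <= 0 || sample > 1 {
		return 0, errors.New("sample must be in (0, 1]")
	}
	return 1 / sample, nil
}

// estimateTotal scales a value observed in the sampled data back up to an
// estimate for the full data set.
func estimateTotal(observed, sample float64) (float64, error) {
	scale, err := scaleForSample(sample)
	if err != nil {
		return 0, err
	}
	return observed * scale, nil
}

func main() {
	// 1234 spans counted while inspecting roughly 10% of the data.
	est, err := estimateTotal(1234, 0.1)
	if err != nil {
		panic(err)
	}
	fmt.Printf("estimated total: %.0f\n", est)
}
```

With sample=0.5 the multiplier is 2x, matching the example in the description above.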

Overall design
Sampling is implemented by narrowing the existing job shards so that they cover less data. For example, take the job for shard 1 of 10, which covers 10% of trace IDs. To sample it at 50%, we convert it to shard 2 of 20. It now covers only 5% of trace IDs (exactly the latter half of the previous range). The results are then multiplied back up (by 2x in this example) to get the final metrics. I like this approach because it keeps the inspection of data uniform across the entire range, and it happens purely in the query-frontend layer. A drawback is that the total number of jobs doesn't change, so this will continue to be a bottleneck for large requests.

Hints
The new hints system is added generically. You can add with(key=val, key2=val2...) to any query. There is no validation, so unsupported hints, or sample=... on a non-metrics query, are simply ignored. It supports all TraceQL value types, so we can have hints with strings, ints, etc. I think this could be useful and I already have a few ideas for future hints.
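
As a rough illustration of the "ignore what you don't understand" behavior, a consumer of hints might look like the sketch below. The Hints type and its methods are hypothetical and do not reflect the actual types in this PR; falling back to 1.0 for out-of-range or wrongly typed values is an assumption.

```go
package main

import "fmt"

// Hints is a hypothetical container for with(key=val, ...) pairs.
// Values may be of any supported type (string, int, float, ...).
type Hints map[string]any

// Float returns the named hint as a float64 if it is present and numeric.
func (h Hints) Float(name string) (float64, bool) {
	switch v := h[name].(type) {
	case float64:
		return v, true
	case int:
		return float64(v), true
	}
	return 0, false
}

// sampleRate returns the requested sampling rate. It falls back to 1.0
// (inspect everything) when the hint is missing, has the wrong type, or is
// outside (0, 1]. The fallback is an assumption of this sketch.
func sampleRate(h Hints) float64 {
	if v, ok := h.Float("sample"); ok && v > 0 && v <= 1 {
		return v
	}
	return 1
}

func main() {
	// Unknown hints are never looked up, so they are silently ignored.
	h := Hints{"sample": 0.1, "someFutureHint": "ignored"}
	fmt.Println(sampleRate(h))
	fmt.Println(sampleRate(Hints{}))
}
```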

Other Changes

Discovered that 8-byte trace IDs may contain only 63 bits of randomness. Fixing the trace ID sharding to account for this gives a huge accuracy boost when sampling because shards are correctly weighted. This is very important to the usefulness of sampling.
The generator request has a new param, mode, so we can apply sampling rates on the generators too.
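
To illustrate why the 63-bit detail matters for shard weighting, the sketch below compares shard boundaries computed over a 64-bit range with boundaries computed over a 63-bit range. It assumes the affected IDs fall in the lower half of the 64-bit space (values below 2^63); that assumption, and the names shardBounds and shareOf63BitIDs, are illustrative and not taken from the PR.

```go
package main

import "fmt"

const (
	max63 = uint64(1)<<63 - 1 // highest ID when only 63 bits are random (assumption)
	max64 = ^uint64(0)        // highest ID if all 64 bits were random
)

// shardBounds returns the inclusive [lo, hi] ID range of a 1-indexed shard
// when the space [0, maxID] is split into total equal shards.
// Requires total >= 1 and 1 <= shard <= total.
func shardBounds(shard, total, maxID uint64) (lo, hi uint64) {
	width := maxID/total + 1
	lo = (shard - 1) * width
	if shard == total {
		// The last shard absorbs any remainder and avoids overflow.
		return lo, maxID
	}
	return lo, shard*width - 1
}

// shareOf63BitIDs returns the fraction of uniformly distributed 63-bit IDs
// that land inside [lo, hi].
func shareOf63BitIDs(lo, hi uint64) float64 {
	if lo > max63 {
		return 0
	}
	if hi > max63 {
		hi = max63
	}
	return float64(hi-lo+1) / float64(max63+1)
}

func main() {
	const total = 4
	for shard := uint64(1); shard <= total; shard++ {
		lo64, hi64 := shardBounds(shard, total, max64)
		lo63, hi63 := shardBounds(shard, total, max63)
		fmt.Printf("shard %d/%d: share with 64-bit bounds=%.2f, with 63-bit bounds=%.2f\n",
			shard, total, shareOf63BitIDs(lo64, hi64), shareOf63BitIDs(lo63, hi63))
	}
}
```

With 4 shards, the 64-bit boundaries put half of the IDs in each of the first two shards and none in the last two, while the 63-bit boundaries give each shard a quarter. Sampling a sub-range of a mis-weighted shard and then scaling by a fixed factor would skew the result, which is consistent with the accuracy boost described above.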

Alternatives
There are other ways to inspect only part of the data (50%, for example).

We could drop half of the jobs. Instead of executing all 10 shards, we only execute 1 through 5. The drawbacks to this approach are: (a) it is statistically less accurate because we wouldn't be sampling the trace ID space uniformly, and (b) the performance gains don't materialize as well because the minimum job size remains the same. Increased parallelization and scale won't necessarily make it faster.
modulus() or rand(). This would read every other trace on average. This sounds great in theory, but it gives no reduction in I/O and would end up being both less accurate and no faster.

Notes
This is one entry in a set of chained PRs. The full body of work has been split into separate buckets to make reviews and updates more manageable.

Which issue(s) this PR fixes:
n/a

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]


zalegrala

This looks good to me. The lint has a few things to say and I left a comment about the query mode string.

modules/frontend/query_range_sharding.go

mdisibio requested review from joe-elliott, annanay25, mapno, yvrhdn, zalegrala, electron0zero, ie-pham and stoewer as code owners January 9, 2024 19:53

mdisibio changed the title ~~Traceql metrics 4 sampling~~ [Traceql Metrics] PR 4 - Sampling Jan 9, 2024

mdisibio force-pushed the traceql-metrics-3-sharding branch from f79e36b to 08cb7a8 Compare January 12, 2024 14:47

Base automatically changed from traceql-metrics-3-sharding to main January 12, 2024 17:35

mdisibio added 4 commits January 12, 2024 12:37

Add general purpose with(hints) to traceql

161c685

draft changes to support sampling rate hint

bd07f8d

Add query mode parameter so we can apply sharding and sampling rules …

bf1de06

…to the generator. Fix sampling rate rounding

Switch to last shard

afcd24c

mdisibio force-pushed the traceql-metrics-4-sampling branch from 1ae2ac2 to afcd24c Compare January 12, 2024 18:03

zalegrala approved these changes Jan 12, 2024

modules/frontend/query_range_sharding.go (outdated, resolved)

mdisibio added 2 commits January 12, 2024 14:08

Make query recent mode a const

69afb9d

lint

2b50f39

mdisibio merged commit 2ca7265 into main Jan 12, 2024
15 checks passed

mdisibio deleted the traceql-metrics-4-sampling branch January 12, 2024 19:35
