Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Traceql Metrics] PR 4 - Sampling #3275

Merged
merged 6 commits into from
Jan 12, 2024
Merged

[Traceql Metrics] PR 4 - Sampling #3275

merged 6 commits into from
Jan 12, 2024

Conversation

mdisibio
Copy link
Contributor

@mdisibio mdisibio commented Jan 9, 2024

What this PR does:
This PR introduces subsampling hints for TraceQL metrics queries. I.e. inspect a random 50% of data and then scale the resulting rates/counts by 2x. This allows us to trade a customizable amount of accuracy for speed. In internal testing it can be surprisingly accurate down to even ~10%. Of course it depends on the query.

It is controlled through a new system of query hints which can be applied to any query. The new hint is named sample and takes a float:

Sample 100% (default behavior):
{ } | rate()

Sample 10%:
{ } | rate() with(sample=0.1)

Overall design
The sampling rate is enacted by manipulating the existing job shards to cover less data. For example, take the job for shard 1 of 10. This covers 10% of trace IDs. To sample this by 50%, we convert it to be shard 2 of 20. Now it only covers 5% of traces IDs (exactly the latter half of previous range). Then the results are multiplied back to get the final metrics. I like this approach because it maintains the uniform inspection of data across the entire range, and it happens purely through the query-frontend layer. A drawback is that the total number of jobs doesn't change. This will continue to be a bottleneck for large requests.

Hints
The new hints system is added generically. You can add with(key=val, key2=val2...) to any query. There is no validation, so unsupported hints or using sample=... with a non-metrics query is simply ignored. It supports all TraceQL value types, so we can have hints with strings, ints, etc. I think this could be useful and I already have a few ideas for future hints.

Other Changes

  1. Discovered that 8-byte trace IDs may be 63-bits of randomness only. Fixing the trace ID sharding to account for this gives a huge accuracy boost when sampling because shards are correctly weighted. Very important to the usefulness of it.
  2. Generator request has a new param mode so we can do sampling rates on the generators too.

Alternatives
There are other ways to inspect only 50% of the data (for example).

  1. We could drop half of the jobs. Instead of executing all 10 shards, we only execute 1 through 5. The drawbacks to the approach are: (a) statistically less accurate because we wouldn't be sampling the traceID space uniformly. (b) the performance gains don't materialize as well because the minimum job size remains the same. Increased parallelization and scale won't necessarily make it faster.
  2. modulus() or rand(). This would read every-other-trace on average. This sounds great in practice but it has no reduction in I/O, and would end up being both less accurate and no faster.

Notes
This is one entry in a set of chained PRs. The full body of work has been split into separate buckets to make reviews and updates more manageable.

  1. [Traceql Metrics] PR 1 - Engine #3251
  2. [Traceql Metrics] PR 2 - API #3252
  3. [Traceql Metrics] PR 3 - Trace ID sharding #3258
  4. [Traceql Metrics] PR 4 - Sampling #3275

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mdisibio mdisibio changed the title Traceql metrics 4 sampling [Traceql Metrics] PR 4 - Sampling Jan 9, 2024
@mdisibio mdisibio force-pushed the traceql-metrics-3-sharding branch from f79e36b to 08cb7a8 Compare January 12, 2024 14:47
Base automatically changed from traceql-metrics-3-sharding to main January 12, 2024 17:35
@mdisibio mdisibio force-pushed the traceql-metrics-4-sampling branch from 1ae2ac2 to afcd24c Compare January 12, 2024 18:03
Copy link
Contributor

@zalegrala zalegrala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. The lint has a few things to say and I left a comment about the query mode string.

@mdisibio mdisibio merged commit 2ca7265 into main Jan 12, 2024
15 checks passed
@mdisibio mdisibio deleted the traceql-metrics-4-sampling branch January 12, 2024 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants