[TraceQL Metrics] New baseline comparison function #3695

mdisibio · 2024-05-21T03:44:48Z

What this PR does:
This adds a new metrics function, compare, which splits the stream of spans into two groups: a selection and a baseline. It then returns time-series for all attributes found on the spans to highlight the differences between the two groups. This is hard to describe in the abstract, so there are some example outputs below.

Function signature:
The function is used like other metrics functions: it is placed after any search query and converts it into a metrics query:
...any spanset pipeline... | compare({subset filters}, <topN>, <start timestamp>, <end timestamp>)

Example:
{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})

Parameters:

Required. The first parameter is a spanset filter for choosing the subset of spans. This filter is executed against the incoming spans. If a span matches, it is considered part of the selection; otherwise it is part of the baseline. Common filters are expected to be things like {status=error} (what is different about errors?) or {duration>1s} (what is different about slow spans?).
Optional. The second parameter is the top N values to return per attribute. If an attribute exceeds this limit in either the selection group or the baseline group, then only the top N values (based on frequency) are returned, and an error indicator for the attribute is included in the output (see below). Defaults to 10.
Optional. Start and end timestamps in unix nanoseconds, which can be used to additionally subset the spans in time. Either both timestamps must be given, or neither. These parameters are unlike any others in TraceQL and are therefore somewhat clunky. In the future we may be able to fix this by adding the ability to check span:startTime directly in the language, so that it could be part of the filter.
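The split described by these parameters can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the code in this PR: the Span type, Classify, and CountValues are hypothetical names, the filter is modelled as a plain Go predicate, the time window is assumed to be half-open ([start, end)), and spans outside the window are assumed to fall into the baseline.

```go
package main

import "fmt"

// Span is a simplified, hypothetical stand-in for a trace span.
// The real Tempo span types are different.
type Span struct {
	StartUnixNano uint64
	Attrs         map[string]string
}

// Group names mirror the __meta_type label values used in the output.
const (
	Baseline  = "baseline"
	Selection = "selection"
)

// Classify puts a span into the selection only if it matches the filter and,
// when both timestamps are given (non-zero), starts inside [start, end).
// Treating out-of-window spans as baseline is an assumption of this sketch.
func Classify(s Span, filter func(Span) bool, start, end uint64) string {
	if start > 0 && end > 0 && (s.StartUnixNano < start || s.StartUnixNano >= end) {
		return Baseline
	}
	if filter(s) {
		return Selection
	}
	return Baseline
}

// Counts maps group -> attribute name -> attribute value -> span count.
type Counts map[string]map[string]map[string]int

// CountValues classifies every span and counts each attribute value per group.
func CountValues(spans []Span, filter func(Span) bool, start, end uint64) Counts {
	out := Counts{
		Baseline:  map[string]map[string]int{},
		Selection: map[string]map[string]int{},
	}
	for _, s := range spans {
		group := out[Classify(s, filter, start, end)]
		for name, value := range s.Attrs {
			if group[name] == nil {
				group[name] = map[string]int{}
			}
			group[name][value]++
		}
	}
	return out
}

func main() {
	spans := []Span{
		{StartUnixNano: 100, Attrs: map[string]string{"status": "error", "resource.cluster": "prod"}},
		{StartUnixNano: 200, Attrs: map[string]string{"status": "ok", "resource.cluster": "prod"}},
		{StartUnixNano: 300, Attrs: map[string]string{"status": "error", "resource.cluster": "dev"}},
	}
	// Stand-in for the {status=error} subset filter.
	isError := func(s Span) bool { return s.Attrs["status"] == "error" }

	counts := CountValues(spans, isError, 0, 0)
	fmt.Println("selection:", counts[Selection]["resource.cluster"])
	fmt.Println("baseline: ", counts[Baseline]["resource.cluster"])
}
```

Passing 0 for both timestamps stands in for omitting the optional time range.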

Output:
The outputs are flat time-series for each attribute/value found in the spans. This function has a built-in select(*) so there can be a lot. Each series has a label __meta_type which denotes which group it is in, either selection or baseline.

Example output series:

{ __meta_type="baseline", resource.cluster="prod" } 123
{ __meta_type="baseline", resource.cluster="qa" } 124
{ __meta_type="selection", resource.cluster="prod" } 456   <--- significant difference detected
{ __meta_type="selection", resource.cluster="qa" } 125
{ __meta_type="selection", resource.cluster="dev" } 126  <--- cluster=dev was found in the highlighted spans but not in the baseline

When an attribute reaches the cardinality limit, an error indicator is also present. The following example means the attribute resource.cluster had too many values.

{ __meta_error="__too_many_values__", resource.cluster=<nil> }
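To make the top N cut-off and the error indicator concrete, here is a minimal Go sketch. It assumes the per-attribute value counts for one group are already available as a map; ValueCount and TopN are hypothetical names used only for illustration, and the alphabetical tie-break is an assumption added to keep the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// ValueCount pairs an attribute value with how often it was seen.
type ValueCount struct {
	Value string
	Count int
}

// TopN returns the n most frequent values and reports whether any values
// were cut off. Ties are broken alphabetically (an assumption of this sketch).
func TopN(counts map[string]int, n int) (top []ValueCount, truncated bool) {
	if n < 0 {
		n = 0
	}
	all := make([]ValueCount, 0, len(counts))
	for v, c := range counts {
		all = append(all, ValueCount{Value: v, Count: c})
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Count != all[j].Count {
			return all[i].Count > all[j].Count
		}
		return all[i].Value < all[j].Value
	})
	if len(all) > n {
		return all[:n], true
	}
	return all, false
}

func main() {
	// Counts for resource.cluster in the selection group, from the example above.
	counts := map[string]int{"prod": 456, "qa": 125, "dev": 126}

	top, truncated := TopN(counts, 2)
	for _, vc := range top {
		fmt.Printf("{ __meta_type=%q, resource.cluster=%q } %d\n", "selection", vc.Value, vc.Count)
	}
	if truncated {
		// Emit the error indicator alongside the values that were kept.
		fmt.Printf("{ __meta_error=%q, resource.cluster=<nil> }\n", "__too_many_values__")
	}
}
```

With a limit of 2, the sketch keeps prod and dev, drops qa, and emits the indicator series.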

Remaining Work

Not 100% settled on the meta labels and indication of attributes that reached max cardinality. Would appreciate feedback.
This function has a built-in select(*) to select all attributes of all spans (yes, all). So it is likely to exceed gRPC payload limits when run as a range query. We don't have official support for instant queries, but you can emulate one by setting step equal to end-start, so that it is effectively a range query that returns a single datapoint.
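The step trick can be sketched as follows. This is a hedged example, not a documented client: the helper name InstantishParams is hypothetical, and the endpoint path /api/metrics/query_range, the parameter names q, start, end and step, and the use of unix seconds are assumptions that should be checked against the Tempo API documentation for your version.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// InstantishParams builds query parameters for a range query whose step
// covers the whole range, so the result holds a single datapoint per series.
// Parameter names and units are assumptions of this sketch.
func InstantishParams(query string, startUnix, endUnix int64) (url.Values, error) {
	if endUnix <= startUnix {
		return nil, errors.New("end must be after start")
	}
	step := endUnix - startUnix // step == end - start

	v := url.Values{}
	v.Set("q", query)
	v.Set("start", strconv.FormatInt(startUnix, 10))
	v.Set("end", strconv.FormatInt(endUnix, 10))
	v.Set("step", strconv.FormatInt(step, 10)+"s")
	return v, nil
}

func main() {
	q := `{ resource.service.name="a" && span.http.path="/myapi" } | compare({status=error})`

	// One hour window expressed in unix seconds.
	params, err := InstantishParams(q, 1716249600, 1716253200)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The path is an assumption; adjust it to your deployment.
	fmt.Println("/api/metrics/query_range?" + params.Encode())
}
```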

Which issue(s) this PR fixes:
Fixes #

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

adrapereira · 2024-05-21T12:42:09Z

Instead of requiring max cardinality what if we also make it optional and default to a sensible value?

I'm thinking it would be better to return the values Tempo got until max and an error instead of returning an error and nil when it reaches max cardinality. This way that attribute would still show some value, instead of none. Think of a graph with no data and an error label vs a graph with some data and an error/warning label, what would you prefer?

I agree that the timestamps in the function are ugly, would be more TraceQL-y to have it as span:startTime as you mention.

mdisibio · 2024-05-21T13:20:16Z

> Instead of requiring max cardinality what if we also make it optional and default to a sensible value?

Yep can do that. I think 10 is a sensible default.

> I'm thinking it would be better to return the values Tempo got until max and an error instead of returning an error and nil when it reaches max cardinality. This way that attribute would still show some value, instead of none. Think of a graph with no data and an error label vs a graph with some data and an error/warning label, what would you prefer?

The main rationale was to avoid computing the exact topN values, which requires continuing to count and pass all values up to the query-frontend. There are two alternatives that are lossy but should be workable:

Lossy topN - each job performs topN, and then the query-frontend performs topN again. This is lossy because low-rate but omnipresent values that might be the actual topN get overshadowed by bursty values.
FirstN - also good performance, but I'm not sure how useful it would be.
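The loss in the lossy topN approach is easiest to see with a small worked example. The Go sketch below is illustrative only: the job data is made up, and topN, merge, ExactTopN, and LossyTopN are hypothetical helpers, not functions from this PR. Each job sees a steady value at a low rate plus one bursty value, so the steady value is the true top value overall but never survives a per-job cut.

```go
package main

import (
	"fmt"
	"sort"
)

// topN keeps the n most frequent values; ties are broken alphabetically.
func topN(counts map[string]int, n int) map[string]int {
	if n < 0 {
		n = 0
	}
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	out := make(map[string]int, len(keys))
	for _, k := range keys {
		out[k] = counts[k]
	}
	return out
}

// merge sums the counts from several partial results.
func merge(parts []map[string]int) map[string]int {
	out := map[string]int{}
	for _, p := range parts {
		for k, c := range p {
			out[k] += c
		}
	}
	return out
}

// ExactTopN passes every value up and cuts only once at the end.
func ExactTopN(jobs []map[string]int, n int) map[string]int {
	return topN(merge(jobs), n)
}

// LossyTopN cuts in every job and then again after merging.
func LossyTopN(jobs []map[string]int, n int) map[string]int {
	trimmed := make([]map[string]int, len(jobs))
	for i, j := range jobs {
		trimmed[i] = topN(j, n)
	}
	return topN(merge(trimmed), n)
}

func main() {
	jobs := []map[string]int{
		{"steady": 6, "burstA": 10},
		{"steady": 6, "burstB": 10},
		{"steady": 6, "burstC": 10},
	}
	fmt.Println("exact:", ExactTopN(jobs, 1)) // steady totals 18 across jobs
	fmt.Println("lossy:", LossyTopN(jobs, 1)) // steady is dropped by every job
}
```

The exact variant requires every job to ship all of its values to the merge step, which is the cost the lossy variant avoids.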

adrapereira · 2024-05-21T13:30:24Z

My proposal was in line with your FirstN idea, so either of your options would work for me, but I'm curious about other opinions.

mapno

LGTM

mdisibio added 9 commits May 20, 2024 16:39

Initial working version of compare

a0eb86b

Clean/rename

f8e3830

Redo meta labels for type and error. Add required parameter for max values

fb0f7ee

Merge branch 'main' into baseline-compare

cd04ee9

Add selectAll support to vParquet4

4b9f50d

vp2 unsupported

4809721

comment out test for now

0afea30

Rename select all field

c78ecbc

lint

fd51d40

mdisibio added 10 commits May 21, 2024 13:40

compare() return topN and make it optional

5989ad4

Add callback version of AllAttributes to avoid map alloc

498fdc7

Fix lookup table

f33dc49

Hideous but working version with totals per attribute and classification

f295662

less hideous, and restore full time series processing

8aa1f37

instant-ish query

494feaf

add selectall attributes

201c93f

Merge branch 'main' into baseline-compare

29c72ac

Adding partial finished test for selectAll, blocked for now

56c86e5

lint

c814e50

mdisibio marked this pull request as ready for review June 17, 2024 13:17

mdisibio requested review from joe-elliott, annanay25, mapno, yvrhdn, zalegrala, electron0zero and ie-pham as code owners June 17, 2024 13:17

mdisibio requested a review from stoewer as a code owner June 17, 2024 13:17

mapno reviewed Jun 18, 2024

mapno reviewed Jun 20, 2024

mdisibio added 4 commits June 20, 2024 10:50

Review feedback, reenable using traceid and traceDuration in the filter

fe0cd9e

changelog

629ce92

Finish vp4 selectall test, refactor some methods to share with test

27ded60

Fix comment and remove test hacks

7a49497

mapno approved these changes Jun 21, 2024

mdisibio merged commit 6b2c0b1 into grafana:main Jun 21, 2024
14 checks passed

knylander-grafana mentioned this pull request Aug 28, 2024

[DOC] Add doc for compare function for metrics doc #4024

Merged

3 tasks

github-actions bot mentioned this pull request Aug 29, 2024

[release-v2.6] [DOC] Add doc for compare function for metrics doc #4035

Merged

3 tasks
