Query Frontend: Job weights #4076

Merged
merged 23 commits into from
Oct 11, 2024

Conversation

joe-elliott
Member

@joe-elliott joe-elliott commented Sep 12, 2024

What this PR does:
The query frontend treats all jobs as the same size when it farms them out to the queriers. This can cause querier instability b/c some jobs actually require quite a bit more resources to execute. By assigning weights to jobs we can reduce the amount each querier is asked to do, which will hopefully:

  1. reduce querier OOMs/timeouts/retries
  2. reduce querier latency
  3. increase total throughput

Other changes

  • Removed the roundtripper httpgrpc bridge and pushed the concept of pipeline.Request all the way down into the cortex frontend code. This can be a nice perf improvement b/c translating http -> httpgrpc is costly and we are pushing it to the last moment. Currently for some queries we are translating thousands of jobs and then throwing them away.
  • Removed redundant parseQuery and createFetchSpansRequest to consolidate on the Compile function in pkg/traceql
  • Check for context error before going through retry logic in retryWare. This causes retry metrics to be more accurate in the event of many cancelled jobs.

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

}
totalWeight += weight

if totalWeight >= requestedCount {
Contributor


I think this makes sense. I suppose what we're saying here is that we request of this batch a certain high water mark of work that we're willing to take, and the weight increases the notion of complexity for a single item above this threshold. Implicitly here I suppose is that weight and requestedCount are of the same unit of measure.

Member Author


Implicitly here I suppose is that weight and requestedCount are of the same unit of measure.

yes! currently all jobs fill a single "slot" in the batch. the "weight" is basically just making it fill more slots.
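A rough sketch of the "slots" idea, for readers following along. This is not the queue code from this PR: the `job` type and `takeBatch` function are hypothetical, and only the `totalWeight >= requestedCount` check mirrors the snippet under review.

```go
package main

import "fmt"

// job is a hypothetical stand-in for a queued request. weight is the number
// of batch slots the job fills.
type job struct {
	name   string
	weight int
}

// takeBatch pulls jobs from the front of queue until their combined weight
// reaches requestedCount or the queue is empty. It returns the batch and the
// jobs left in the queue.
func takeBatch(queue []job, requestedCount int) (batch, rest []job) {
	totalWeight := 0
	for i, j := range queue {
		weight := j.weight
		if weight < 1 {
			weight = 1 // assumption: an unweighted job still fills one slot
		}
		batch = append(batch, j)
		totalWeight += weight

		if totalWeight >= requestedCount {
			return batch, queue[i+1:]
		}
	}
	return batch, nil
}

func main() {
	queue := []job{{"a", 1}, {"b", 3}, {"c", 1}, {"d", 1}}
	batch, rest := takeBatch(queue, 4)
	fmt.Println(len(batch), len(rest)) // prints "2 2"
}
```

With every weight at 1 the same call returns all four jobs, which matches the old behavior where each job fills a single slot.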

}
}

if conditions > 4 { // yay, magic!
Contributor


A fine starting point. I was wondering if each condition is weight++, and maybe each regex is weight+2 or some such. It means for the queue logic that if any condition is present, we'll never consume the entire requested batch. 🤔
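A sketch of what that additive suggestion could look like. This is not code from the PR: the `condition` type, `jobWeight`, and the base weight of 1 are assumptions made for illustration.

```go
package main

import "fmt"

// condition is a hypothetical, simplified view of one query condition.
type condition struct {
	attribute string
	isRegex   bool
}

// jobWeight sketches the additive idea: start from an assumed base weight of
// 1, add 1 per plain condition and 2 per regex condition.
func jobWeight(conds []condition) int {
	weight := 1
	for _, c := range conds {
		if c.isRegex {
			weight += 2
		} else {
			weight++
		}
	}
	return weight
}

func main() {
	conds := []condition{
		{attribute: "service.name", isRegex: false},
		{attribute: "http.url", isRegex: true},
	}
	fmt.Println(jobWeight(conds)) // prints "4"
}
```

Under this scheme any job with at least one condition weighs 2 or more, so a batch requested for N slots would never hold N such jobs, which is the queue effect noted above.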

@javiermolinar javiermolinar marked this pull request as ready for review October 9, 2024 15:04
@javiermolinar javiermolinar self-requested a review October 11, 2024 14:47
@javiermolinar javiermolinar merged commit 5aef523 into grafana:main Oct 11, 2024
16 checks passed
knylander-grafana pushed a commit to knylander-grafana/tempo-doc-work that referenced this pull request Oct 11, 2024
3 participants