ui: surface flow control metrics in overload dashboard #110135
Conversation
Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @irfansharif)
-- commits, line 11 at r1:
Are requests_admitted or tokens_deducted useful when aggregated across all destination stores?
I can see some value in wait_duration, requests_waiting, and blocked_stream_count, since as aggregated gauges or deltas they provide, at a glance, an upper bound on how bad things are.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 163 at r1 (raw file):
<Metric key={nid} name="cr.node.kvadmission.flow_controller.regular_requests_admitted"
Not having these metrics per destination store is going to make it hard to understand what is happening when we see queuing at a node.
I realize this is not in scope for this PR, but can we support Prometheus metrics with the destination store label? If yes, it should be a priority to add them.
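For orientation, here is a minimal sketch of how such a per-node metric is typically declared in a db-console dashboard. It is not the PR's exact code: the import paths, the regularRequestsAdmittedGraph helper and its parameters, and props like nonNegativeRate are assumptions based on how other dashboards in db-console are written.

```tsx
// Rough sketch, not the PR's exact code: charting a per-node, store-aggregated
// flow-controller counter in a db-console dashboard. Import paths and props
// mirror how other dashboards are written; the helper is made up for illustration.
import React from "react";
import LineGraph from "src/views/cluster/components/linegraph";
import { Metric, Axis } from "src/views/shared/components/metricQuery";

export function regularRequestsAdmittedGraph(
  nodeIDs: string[],
  nodeSources: string[],
) {
  return (
    <LineGraph
      title="Flow Controller: Regular Requests Admitted"
      sources={nodeSources}
    >
      <Axis label="requests">
        {nodeIDs.map(nid => (
          // One series per node; the counter is aggregated across all
          // destination stores that node replicates to. A per-store breakdown
          // would need the metric exported with a destination-store label,
          // which, per the discussion above, does not exist today.
          <Metric
            key={nid}
            name="cr.node.kvadmission.flow_controller.regular_requests_admitted"
            title={`n${nid}`}
            sources={[nid]}
            nonNegativeRate
          />
        ))}
      </Axis>
    </LineGraph>
  );
}
```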
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @sumeerbhola)
Previously, sumeerbhola wrote…
Are requests_admitted or tokens_deducted useful when aggregated across all destination stores?
I can see some value in wait_duration, requests_waiting, and blocked_stream_count, since as aggregated gauges or deltas they provide, at a glance, an upper bound on how bad things are.
No strong opinions; removed.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 163 at r1 (raw file):
Previously, sumeerbhola wrote…
Not having these metrics per destination store is going to make it hard to understand what is happening when we see queuing at a node.
I realize this is not in scope for this PR, but can we support Prometheus metrics with the destination store label? If yes, it should be a priority to add them.
Force-pushed from 472deb2 to cee842e.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @irfansharif)
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 161 at r2 (raw file):
<Metric key={nid} name="cr.node.kvadmission.flow_controller.regular_requests_admitted"
did you forget to remove these *requests_admitted and *tokens_deducted metrics here?
Force-pushed from cee842e to 425d59b.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi and @sumeerbhola)
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 161 at r2 (raw file):
Previously, sumeerbhola wrote…
did you forget to remove these *requests_admitted and *tokens_deducted metrics here?
I'd forgotten to push out the SHA.
Build failed (retrying...)
bors r-
Canceled.
Some of this new flow control machinery changes the game for IO admission control. This commit surfaces relevant metrics to the overload dashboard:
- kvadmission.flow_controller.{regular,elastic}_wait_duration-p75
- kvadmission.flow_controller.{regular,elastic}_requests_waiting
- kvadmission.flow_controller.{regular,elastic}_blocked_stream_count

While here, we replace the storage.l0-{sublevels,num-files} metrics with admission.io.overload. The former showed raw counts instead of normalizing them against AC target thresholds, and the y-axis scales for sublevels vs. files are an order of magnitude apart, which makes them slightly harder to distinguish.

Release note: None
Force-pushed from 425d59b to d2184f9.
Some eslint-ey thing. Trying again. bors r+
Build succeeded.
Updated release note in description and added to docs in cockroachdb/docs#18193
Some of this new flow control machinery changes the game for IO admission control. This commit surfaces relevant metrics to the overload dashboard:
- kvadmission.flow_controller.{regular,elastic}_wait_duration-p75
- kvadmission.flow_controller.{regular,elastic}_requests_waiting
- kvadmission.flow_controller.{regular,elastic}_blocked_stream_count

While here, we replace the storage.l0-{sublevels,num-files} metrics with admission.io.overload. The former showed raw counts instead of normalizing them against AC target thresholds, and the y-axis scales for sublevels vs. files are an order of magnitude apart, which makes them slightly harder to distinguish.
Part of #82743.
Release note: The Overload Dashboard page now includes the following graphs to monitor admission control:
- IO Overload - Charts a normalized metric based on admission control target thresholds. Replaces the LSM L0 Health graph, which used raw metrics.
- KV Admission Slots Exhausted - Replaces the KV Admission Slots graph.
- Flow Tokens Wait Time: 75th percentile - Use to monitor the new replication admission control feature.
- Requests Waiting For Flow Tokens - Use to monitor the new replication admission control feature.
- Blocked Replication Streams - Use to monitor the new replication admission control feature.
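To make the graphs listed above concrete, a hedged sketch follows of how two of them might be declared in overload.tsx. It is not the PR's exact code: the metric prefixes (cr.node. vs cr.store.), axis units, import paths, and the overloadGraphsSketch helper are assumptions based on db-console conventions.

```tsx
// Hedged sketch of two of the graphs described above, assuming db-console's
// LineGraph/Axis/Metric components. Metric prefixes, units, and the helper's
// shape are assumptions, not the PR's exact code.
import React from "react";
import LineGraph from "src/views/cluster/components/linegraph";
import { Metric, Axis, AxisUnits } from "src/views/shared/components/metricQuery";

export function overloadGraphsSketch(
  nodeIDs: string[],
  nodeSources: string[],
  storeSources: string[],
) {
  return [
    // "Flow Tokens Wait Time: 75th percentile": per-node p75 of time spent
    // waiting for replication admission control flow tokens.
    <LineGraph
      key="flow-token-wait"
      title="Flow Tokens Wait Time: 75th percentile"
      sources={nodeSources}
    >
      <Axis units={AxisUnits.Duration} label="wait duration">
        {nodeIDs.map(nid => (
          <Metric
            key={nid}
            name="cr.node.kvadmission.flow_controller.regular_wait_duration-p75"
            title={`regular, n${nid}`}
            sources={[nid]}
            downsampleMax
          />
        ))}
      </Axis>
    </LineGraph>,

    // "IO Overload": normalized score against AC target thresholds, replacing
    // the raw storage.l0-{sublevels,num-files} charts. In the real dashboard
    // the per-series sources would be the node's store IDs; using [nid] here
    // is a simplification for the sketch.
    <LineGraph key="io-overload" title="IO Overload" sources={storeSources}>
      <Axis label="score">
        {nodeIDs.map(nid => (
          <Metric
            key={nid}
            name="cr.store.admission.io.overload"
            title={`n${nid}`}
            sources={[nid]}
            downsampleMax
          />
        ))}
      </Axis>
    </LineGraph>,
  ];
}
```

The downsampleMax prop is used for both series on the assumption that, when the chart window is downsampled, keeping the maximum value preserves short waiting or overload spikes that an average would smooth away.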