
ui: surface flow control metrics in overload dashboard #110135

Merged: 1 commit into cockroachdb:master on Sep 21, 2023

Conversation

@irfansharif (Contributor) commented on Sep 6, 2023

Some of this new flow control machinery changes the game for IO admission control. This commit surfaces the relevant metrics on the overload dashboard:

  • kvadmission.flow_controller.{regular,elastic}_wait_duration-p75
  • kvadmission.flow_controller.{regular,elastic}_requests_waiting
  • kvadmission.flow_controller.{regular,elastic}_blocked_stream_count

While here, we replace the storage.l0-{sublevels,num-files} metrics with admission.io.overload. The former showed raw counts rather than values normalized against the AC target thresholds, and the y-axis scales for sublevels vs. files are an order of magnitude apart, which makes them harder to read side by side.
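(For context, a conceptual sketch of the normalization behind admission.io.overload; this is illustrative only, and the threshold values below are assumed admission-control defaults that may differ across versions or cluster settings.)

```ts
// Illustrative only: admission.io.overload reports the raw L0 counts divided
// by the admission-control target thresholds, so ~1.0 means "at the
// threshold". The thresholds here (20 sublevels, 1000 files) are assumed
// defaults, not values taken from this PR.
const ioOverloadScore = (l0Sublevels: number, l0Files: number): number =>
  Math.max(l0Sublevels / 20, l0Files / 1000);
```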

Part of #82743.

Release note: The Overload Dashboard page now includes the following graphs to monitor admission control:
- IO Overload - Charts a metric normalized against admission control target thresholds. Replaces the LSM L0 Health graph, which used raw metrics.
- KV Admission Slots Exhausted - Replaces the KV Admission Slots graph.
- Flow Tokens Wait Time: 75th percentile - Use to monitor the new replication admission control feature.
- Requests Waiting For Flow Tokens - Use to monitor the new replication admission control feature.
- Blocked Replication Streams - Use to monitor the new replication admission control feature.
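For reference, a minimal sketch of how one such graph could be wired into overload.tsx, assuming the dashboard's existing LineGraph/Axis/Metric components and a per-node mapping like the surrounding graphs use; the props and titles are illustrative, not copied from this PR's diff:

```tsx
// Illustrative sketch: charts the p75 regular flow-token wait duration per
// node. Component and prop names follow the pattern of other overload.tsx
// graphs but are assumptions, not this PR's actual change.
<LineGraph
  title="Flow Tokens Wait Time: 75th percentile"
  sources={nodeSources}
  tooltip="Time requests spent waiting for flow tokens, by node."
>
  <Axis units={AxisUnits.Duration} label="wait duration">
    {nodeIDs.map(nid => (
      <Metric
        key={nid}
        name="cr.node.kvadmission.flow_controller.regular_wait_duration-p75"
        title={`n${nid}`}
        sources={[nid]}
      />
    ))}
  </Axis>
</LineGraph>
```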

@irfansharif irfansharif requested review from aadityasondhi, sumeerbhola, and a team on September 6, 2023 at 19:51
@cockroach-teamcity (Member) commented: This change is Reviewable

@irfansharif (Contributor, Author) commented:

Looks like this. I might be crowding the Overload dashboard a bit much, so I'm happy to cull things out.

[Four screenshots of the updated Overload dashboard graphs]

@sumeerbhola (Collaborator) left a comment


Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @irfansharif)


-- commits line 11 at r1:
Are requests_admitted or tokens_deducted useful when aggregated across all destination stores?

I can see some value provided by wait_duration, requests_waiting, blocked_stream_count since as aggregated gauges or deltas they provide at a glance an upper bound on how bad things are.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 163 at r1 (raw file):

            <Metric
              key={nid}
              name="cr.node.kvadmission.flow_controller.regular_requests_admitted"

Not having these metrics per destination store is going to make it hard to understand what is happening when we see queuing at a node.
I realize this is not in scope for this PR, but can we support Prometheus metrics with the destination store label? If yes, it should be a priority to add them.
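(For illustration, a per-destination-store breakdown would look roughly like the sample below in Prometheus exposition format; the destination_store label name is hypothetical, not something this PR adds.)

```
# Hypothetical: the same gauge, broken out by destination store.
kvadmission_flow_controller_regular_requests_waiting{destination_store="3"} 12
kvadmission_flow_controller_regular_requests_waiting{destination_store="7"} 0
```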

@irfansharif (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @sumeerbhola)


-- commits line 11 at r1:

Previously, sumeerbhola wrote…

Are requests_admitted or tokens_deducted useful when aggregated across all destination stores?

I can see some value provided by wait_duration, requests_waiting, blocked_stream_count since as aggregated gauges or deltas they provide at a glance an upper bound on how bad things are.

No strong opinions; removed.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 163 at r1 (raw file):

Previously, sumeerbhola wrote…

Not having these metrics per destination store is going to make it hard to understand what is happening when we see queuing at a node.
I realize this is not in scope for this PR, but can we support Prometheus metrics with the destination store label? If yes, it should be a priority to add them.

#111011.

@sumeerbhola (Collaborator) left a comment


:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @irfansharif)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 161 at r2 (raw file):

            <Metric
              key={nid}
              name="cr.node.kvadmission.flow_controller.regular_requests_admitted"

did you forget to remove these *requests_admitted and *tokens_deducted metrics here?

@irfansharif (Contributor, Author) left a comment


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi and @sumeerbhola)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 161 at r2 (raw file):

Previously, sumeerbhola wrote…

did you forget to remove these *requests_admitted and *tokens_deducted metrics here?

I'd forgotten to push out the SHA.

@craig (craig bot) commented Sep 21, 2023

Build failed (retrying...):

@irfansharif (Contributor, Author) commented:

bors r-

@craig (craig bot) commented Sep 21, 2023

Canceled.

@irfansharif (Contributor, Author) commented:

Some eslint-y thing. Trying again.

bors r+

@craig (craig bot) commented Sep 21, 2023

Build succeeded:

@craig craig bot merged commit f4269d4 into cockroachdb:master Sep 21, 2023
@irfansharif irfansharif deleted the 230905.flowcontrol-ui branch September 21, 2023 17:27
@florence-crl commented:
Updated the release note in the description and added it to the docs in cockroachdb/docs#18193.
