
ui: surface flow control metrics in overload dashboard #110135

Merged: 1 commit into cockroachdb:master on Sep 21, 2023

Conversation

@irfansharif (Contributor) commented on Sep 6, 2023

Some of this new flow control machinery changes the game for IO admission control. This commit surfaces the relevant metrics on the overload dashboard:

  • kvadmission.flow_controller.{regular,elastic}_wait_duration-p75
  • kvadmission.flow_controller.{regular,elastic}_requests_waiting
  • kvadmission.flow_controller.{regular,elastic}_blocked_stream_count

While here, we replace the storage.l0-{sublevels,num-files} metrics with admission.io.overload. The former showed raw counts rather than values normalized against the AC target thresholds, and the y-axis scales for sublevels vs. files are an order of magnitude apart, which makes them harder to read side by side.
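(For context, a conceptual sketch of the normalization behind admission.io.overload; this is illustrative only, and the threshold values below are assumed admission-control defaults that may differ across versions or cluster settings.)

```ts
// Illustrative only: admission.io.overload reports the raw L0 counts divided
// by the admission-control target thresholds, so ~1.0 means "at the
// threshold". The thresholds here (20 sublevels, 1000 files) are assumed
// defaults, not values taken from this PR.
const ioOverloadScore = (l0Sublevels: number, l0Files: number): number =>
  Math.max(l0Sublevels / 20, l0Files / 1000);
```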

Part of #82743.

Release note: The Overload Dashboard page now includes the following graphs to monitor admission control:
- IO Overload - Charts a metric normalized against admission control target thresholds. Replaces the LSM L0 Health graph, which used raw metrics.
- KV Admission Slots Exhausted - Replaces the KV Admission Slots graph.
- Flow Tokens Wait Time: 75th percentile - Use to monitor the new replication admission control feature.
- Requests Waiting For Flow Tokens - Use to monitor the new replication admission control feature.
- Blocked Replication Streams - Use to monitor the new replication admission control feature.
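For reference, a minimal sketch of how one such graph could be wired into overload.tsx, assuming the dashboard's existing LineGraph/Axis/Metric components and a per-node mapping like the surrounding graphs use; the props and titles are illustrative, not copied from this PR's diff:

```tsx
// Illustrative sketch: charts the p75 regular flow-token wait duration per
// node. Component and prop names follow the pattern of other overload.tsx
// graphs but are assumptions, not this PR's actual change.
<LineGraph
  title="Flow Tokens Wait Time: 75th percentile"
  sources={nodeSources}
  tooltip="Time requests spent waiting for flow tokens, by node."
>
  <Axis units={AxisUnits.Duration} label="wait duration">
    {nodeIDs.map(nid => (
      <Metric
        key={nid}
        name="cr.node.kvadmission.flow_controller.regular_wait_duration-p75"
        title={`n${nid}`}
        sources={[nid]}
      />
    ))}
  </Axis>
</LineGraph>
```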

@irfansharif irfansharif requested review from aadityasondhi, sumeerbhola, and a team on September 6, 2023 at 19:51
@cockroach-teamcity (Member) commented: This change is Reviewable

@irfansharif (Contributor, Author) commented:

Looks like this. I might be crowding the Overload dashboard a bit much, so I'm happy to cull things out.

[Four screenshots of the updated Overload dashboard graphs]

@sumeerbhola (Collaborator) left a comment


Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @irfansharif)


-- commits line 11 at r1:
Are requests_admitted or tokens_deducted useful when aggregated across all destination stores?

I can see some value provided by wait_duration, requests_waiting, blocked_stream_count since as aggregated gauges or deltas they provide at a glance an upper bound on how bad things are.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 163 at r1 (raw file):

            <Metric
              key={nid}
              name="cr.node.kvadmission.flow_controller.regular_requests_admitted"

Not having these metrics per destination store is going to make it hard to understand what is happening when we see queuing at a node.
I realize this is not in scope for this PR, but can we support Prometheus metrics with the destination store label? If yes, it should be a priority to add them.
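(For illustration, a per-destination-store breakdown would look roughly like the sample below in Prometheus exposition format; the destination_store label name is hypothetical, not something this PR adds.)

```
# Hypothetical: the same gauge, broken out by destination store.
kvadmission_flow_controller_regular_requests_waiting{destination_store="3"} 12
kvadmission_flow_controller_regular_requests_waiting{destination_store="7"} 0
```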

@irfansharif (Contributor, Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @sumeerbhola)


-- commits line 11 at r1:

Previously, sumeerbhola wrote…

Are requests_admitted or tokens_deducted useful when aggregated across all destination stores?

I can see some value provided by wait_duration, requests_waiting, blocked_stream_count since as aggregated gauges or deltas they provide at a glance an upper bound on how bad things are.

No strong opinions; removed.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 163 at r1 (raw file):

Previously, sumeerbhola wrote…

Not having these metrics per destination store is going to make it hard to understand what is happening when we see queuing at a node.
I realize this is not in scope for this PR, but can we support Prometheus metrics with the destination store label? If yes, it should be a priority to add them.

#111011.

@sumeerbhola (Collaborator) left a comment


:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @irfansharif)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 161 at r2 (raw file):

            <Metric
              key={nid}
              name="cr.node.kvadmission.flow_controller.regular_requests_admitted"

did you forget to remove these *requests_admitted and *tokens_deducted metrics here?

@irfansharif (Contributor, Author) left a comment


bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi and @sumeerbhola)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 161 at r2 (raw file):

Previously, sumeerbhola wrote…

did you forget to remove these *requests_admitted and *tokens_deducted metrics here?

I'd forgotten to push out the SHA.

@craig (craig bot) commented Sep 21, 2023

Build failed (retrying...):

@irfansharif (Contributor, Author) commented:

bors r-

@craig (craig bot) commented Sep 21, 2023

Canceled.

@irfansharif (Contributor, Author) commented:

Some eslint-y thing. Trying again.

bors r+

@craig (craig bot) commented Sep 21, 2023

Build succeeded:

@craig craig bot merged commit f4269d4 into cockroachdb:master Sep 21, 2023
@irfansharif irfansharif deleted the 230905.flowcontrol-ui branch September 21, 2023 17:27
@florence-crl commented:
Updated the release note in the description and added it to the docs in cockroachdb/docs#18193.
