kvflowcontrol: surface per-replication stream metrics #111011

Open
irfansharif opened this issue Sep 21, 2023 · 3 comments
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-premortem Issues identified during premortem exercise. T-admission-control Admission Control

Comments

@irfansharif
Contributor

irfansharif commented Sep 21, 2023

Is your feature request related to a problem? Please describe.

kvflowcontrol lacks metrics for each <source-node, dest-store> pair, which can make it hard to diagnose exactly how writes are being shaped. Log statements printed every 30s can miss brief windows of token exhaustion.

Some notes from internal discussions:

  • If we have a 10-node cluster, we're adding at most 10*10 per-stream metrics, since most cluster setups are small. (See the sketch after this list.)
  • Look into adding more to the logs themselves: e.g., capture all streams that were blocked at any point in the last 30s, the throughput achieved in that 30s delta, etc., so we have something historical.
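As a rough illustration of what a per-stream metric keyed on the <source-node, dest-store> pair might look like, here is a minimal sketch using prometheus/client_golang's labeled gauges. This is not CockroachDB's internal metric framework, and the metric name, label names, and token value are all hypothetical.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// tokensAvailable tracks available flow tokens per replication stream. Each
// <source-node, dest-store> pair becomes one labeled child, so a 10-node
// cluster yields on the order of 10*10 series.
var tokensAvailable = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kvflowcontrol_stream_tokens_available", // hypothetical name
		Help: "Flow tokens available per replication stream.",
	},
	[]string{"source_node", "dest_store"},
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(tokensAvailable)

	// Example update for the stream n1 -> s3 (hypothetical value).
	tokensAvailable.WithLabelValues("n1", "s3").Set(16 << 20)

	fmt.Println("registered per-stream gauge vector")
}
```

With labels like these, the series count grows with cluster size, which is the cardinality concern discussed later in this thread.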

Jira issue: CRDB-31711

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control O-premortem Issues identified during premortem exercise. labels Sep 21, 2023
@sumeerbhola
Collaborator

We shouldn't forget tenant-id, since that's also part of the key.

@aadityasondhi
Collaborator

Summarizing some offline discussions.

These metrics are too high-cardinality to be always on for all clusters. See https://cockroachlabs.slack.com/archives/C01CNRP6TSN/p1695138592720659.

We essentially have three options:

  1. Export these separately on a different Prometheus endpoint. This can be problematic for customers, since they would need to set up Prometheus to scrape multiple endpoints, and the metrics would not be included in tsdump/debug zip.
  2. Add an opt-in cluster setting, defaulting to off, that triggers registering/unregistering of these metrics.
  3. Store snapshots of these metrics in a system table with a TTL.

After discussing with AC folks, I will try implementing Option 2 (sketched below). It is opt-in, so not always on, and it lets us get this data through our regular metrics workflow (i.e. it will be in TSDB). Turning the cluster setting off would stop us from recording any further data for these metrics into TSDB.

Option 1 is tedious from a customer standpoint and would not let us access these metrics for self-hosted clusters. The system table approach seems like an awkward fit for this kind of data.
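To make Option 2 concrete, here is a rough sketch of the register/unregister-on-toggle idea, again using prometheus/client_golang purely for illustration rather than CockroachDB's cluster settings or metric registry; all names are hypothetical. The tenant_id label reflects the point above that tenant-id is also part of the stream key.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

type perStreamMetrics struct {
	tokensAvailable *prometheus.GaugeVec
}

func newPerStreamMetrics() *perStreamMetrics {
	return &perStreamMetrics{
		tokensAvailable: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "kvflowcontrol_stream_tokens_available", // hypothetical name
				Help: "Flow tokens available per replication stream.",
			},
			[]string{"tenant_id", "source_node", "dest_store"},
		),
	}
}

// setEnabled mimics a cluster-setting change callback: it registers the
// per-stream metrics when the setting is flipped on and unregisters them
// when it is flipped off, which stops any further recording.
func (m *perStreamMetrics) setEnabled(reg *prometheus.Registry, enabled bool) {
	if enabled {
		_ = reg.Register(m.tokensAvailable) // ignore AlreadyRegisteredError in this sketch
	} else {
		reg.Unregister(m.tokensAvailable)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	m := newPerStreamMetrics()

	m.setEnabled(reg, true)  // operator opts in: metrics start being scraped
	m.setEnabled(reg, false) // setting flipped off: metrics drop out of the scrape
}
```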

@aadityasondhi
Collaborator

Deprioritized for now. We have logs that surface this information; it's unclear whether metrics are required.
