kvflowcontrol: surface per-replication stream metrics #111011

Open
irfansharif opened this issue Sep 21, 2023 · 3 comments
Labels
A-admission-control C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-premortem Issues identified during premortem exercise. T-admission-control Admission Control

Comments

@irfansharif
Contributor

irfansharif commented Sep 21, 2023

Is your feature request related to a problem? Please describe.

kvflowcontrol lacks metrics for each <source-node, dest-store> pair, which can make it hard to diagnose exactly how writes are being shaped. Log statements printed every 30s can miss brief windows of token exhaustion.

Some notes from internal discussions:

  • If we have a 10-node cluster, we're adding at most 10*10 per-stream metrics, since most cluster setups are small. (See the sketch after this list.)
  • Look into adding more to the logs themselves: e.g., capture all streams that were blocked at any point in the last 30s, the throughput achieved in that 30s delta, etc., so we have something historical.
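As a rough illustration of what a per-stream metric keyed on the <source-node, dest-store> pair might look like, here is a minimal sketch using prometheus/client_golang's labeled gauges. This is not CockroachDB's internal metric framework, and the metric name, label names, and token value are all hypothetical.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// tokensAvailable tracks available flow tokens per replication stream. Each
// <source-node, dest-store> pair becomes one labeled child, so a 10-node
// cluster yields on the order of 10*10 series.
var tokensAvailable = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kvflowcontrol_stream_tokens_available", // hypothetical name
		Help: "Flow tokens available per replication stream.",
	},
	[]string{"source_node", "dest_store"},
)

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(tokensAvailable)

	// Example update for the stream n1 -> s3 (hypothetical value).
	tokensAvailable.WithLabelValues("n1", "s3").Set(16 << 20)

	fmt.Println("registered per-stream gauge vector")
}
```

With labels like these, the series count grows with cluster size, which is the cardinality concern discussed later in this thread.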

Jira issue: CRDB-31711

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-admission-control O-premortem Issues identified during premortem exercise. labels Sep 21, 2023
@sumeerbhola
Collaborator

We shouldn't forget tenant-id, since that's also part of the key.

@aadityasondhi
Collaborator

Summarizing some offline discussions.

These metrics are too high-cardinality to be always on for all clusters. See https://cockroachlabs.slack.com/archives/C01CNRP6TSN/p1695138592720659.

We essentially have three options:

  1. Export these separately on a different Prometheus endpoint. This can be problematic for customers, since they would need to set up Prometheus to scrape multiple endpoints, and the metrics would not be included in tsdump/debug zip.
  2. Add an opt-in cluster setting, defaulting to off, that triggers registering/unregistering of these metrics.
  3. Store snapshots of these metrics in a system table with a TTL.

After discussing with AC folks, I will try implementing Option 2 (sketched below). It is opt-in, so not always on, and it lets us get this data through our regular metrics workflow (i.e. it will be in TSDB). Turning the cluster setting off would stop us from recording any further data for these metrics into TSDB.

Option 1 is tedious from a customer standpoint and would not let us access these metrics for self-hosted clusters. The system table approach seems like an awkward fit for this kind of data.
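To make Option 2 concrete, here is a rough sketch of the register/unregister-on-toggle idea, again using prometheus/client_golang purely for illustration rather than CockroachDB's cluster settings or metric registry; all names are hypothetical. The tenant_id label reflects the point above that tenant-id is also part of the stream key.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

type perStreamMetrics struct {
	tokensAvailable *prometheus.GaugeVec
}

func newPerStreamMetrics() *perStreamMetrics {
	return &perStreamMetrics{
		tokensAvailable: prometheus.NewGaugeVec(
			prometheus.GaugeOpts{
				Name: "kvflowcontrol_stream_tokens_available", // hypothetical name
				Help: "Flow tokens available per replication stream.",
			},
			[]string{"tenant_id", "source_node", "dest_store"},
		),
	}
}

// setEnabled mimics a cluster-setting change callback: it registers the
// per-stream metrics when the setting is flipped on and unregisters them
// when it is flipped off, which stops any further recording.
func (m *perStreamMetrics) setEnabled(reg *prometheus.Registry, enabled bool) {
	if enabled {
		_ = reg.Register(m.tokensAvailable) // ignore AlreadyRegisteredError in this sketch
	} else {
		reg.Unregister(m.tokensAvailable)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	m := newPerStreamMetrics()

	m.setEnabled(reg, true)  // operator opts in: metrics start being scraped
	m.setEnabled(reg, false) // setting flipped off: metrics drop out of the scrape
}
```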

@aadityasondhi
Collaborator

Deprioritized for now. We have logs that surface this information; it's unclear whether metrics are required.
