
kvserver: add queue size metric to RaftTransport #85885

Merged: 2 commits into cockroachdb:master from the per_node_metrics branch, Aug 10, 2022

Conversation

@pav-kv pav-kv (Collaborator) commented Aug 10, 2022

Currently, there is a RaftEnqueuePending metric in StoreMetrics which exposes
the RaftTransport outgoing queue size. However, this is a per-Store metric, so
it duplicates the same variable across all the stores. The metric should be
tracked on a per-node/transport basis.

This PR introduces a per-transport SendQueueSize metric to replace the old
RaftEnqueuePending. It also does the necessary plumbing to include it in
the exported metrics, since it's the first metric in RaftTransport.

[Screenshot 2022-08-10 at 15 37 47: the new metric shown under its full per-node path]

Touches #83917

Release note: None
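For readers skimming the thread, here is a minimal sketch of the kind of per-transport functional gauge being added, assuming the util/metric package's NewFunctionalGauge API; the constructor name and exact Help text are illustrative, not necessarily the code that merged:

package kvserver

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// RaftTransportMetrics is tracked per RaftTransport (i.e. per node) rather
// than per Store, so the value is no longer duplicated across stores.
type RaftTransportMetrics struct {
	SendQueueSize *metric.Gauge
}

// newRaftTransportMetrics builds the metrics struct. The queueSize callback
// is evaluated lazily whenever the gauge is scraped.
func newRaftTransportMetrics(queueSize func() int64) *RaftTransportMetrics {
	return &RaftTransportMetrics{
		SendQueueSize: metric.NewFunctionalGauge(metric.Metadata{
			Name:        "raft.transport.send-queue-size",
			Help:        "Number of pending outgoing messages in the Raft Transport queue.",
			Measurement: "Messages",
			Unit:        metric.Unit_COUNT,
		}, queueSize),
	}
}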

@pav-kv pav-kv requested a review from tbg August 10, 2022 13:19
@cockroach-teamcity (Member) commented

This change is Reviewable

@pav-kv pav-kv marked this pull request as ready for review August 10, 2022 14:13
@pav-kv pav-kv requested a review from a team August 10, 2022 14:13
@pav-kv pav-kv requested review from a team as code owners August 10, 2022 14:13
@pav-kv pav-kv (Collaborator, Author) commented Aug 10, 2022

@tbg PTAL

@tbg tbg (Member) left a comment

Reviewed 4 of 4 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)


pkg/kv/kvserver/raft_transport.go line 305 at r2 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

Alternatively, we could add a Register(r *metric.Registry) method, to avoid giving the caller access to the individual metrics.

I think as-is is fine, since it's the pattern we use across the board.
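For reference, the alternative pattern mentioned in the quoted comment could look roughly like this within the kvserver package, assuming the existing metric.Registry.AddMetricStruct API; this is a hypothetical sketch, not what the PR ends up doing:

// Hypothetical alternative: the transport registers its own metrics, so the
// caller never sees the individual gauges.
func (t *RaftTransport) Register(r *metric.Registry) {
	r.AddMetricStruct(t.metrics)
}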


pkg/kv/kvserver/raft_transport_metrics.go line 23 at r1 (raw file):

	t.metrics = &RaftTransportMetrics{
		SendQueueSize: metric.NewFunctionalGauge(metric.Metadata{
			Name: "raft.transport.send-queue-size",

I was thinking we'd add this metric under the same name, to avoid churn and to let any Grafana dashboards that are out there probably continue to work.
I would also avoid the suffix "size" since it has a good chance of being misinterpreted as "size in bytes". Unfortunately we don't have the best naming conventions in place, so if you look at metric names across the board you will notice that they're not principled.

@pav-kv pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/raft_transport_metrics.go line 23 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I was thinking we'd add this metric under the same name, to avoid churn and to let any Grafana dashboards that are out there probably continue to work.
I would also avoid the suffix "size" since it has a good chance of being misinterpreted as "size in bytes". Unfortunately we don't have the best naming conventions in place, so if you look at metric names across the board you will notice that they're not principled.

Technically, this metric still moves from the per-store to the per-node "keyspace" (see the screenshot in the PR description, which shows the full metric path), so whoever relies on it currently may still find it missing even if the sub-name stays the same?

@pav-kv pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/raft_transport_metrics.go line 23 at r1 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

Technically, this metric still moves from the per-store to the per-node "keyspace" (see the screenshot in the PR description, which shows the full metric path), so whoever relies on it currently may still find it missing even if the sub-name stays the same?

I anticipate having a byte-size metric next to this one. How would you suggest naming them?

I was thinking of:

  • "raft.transport.send-queue-size" for the size in messages
  • "raft.transport.send-queue-bytes" for the size in bytes.

Also, the metric metadata contains the unit COUNT and the measurement "Messages", which for the bytes metric would be BYTES and "Bytes" respectively.
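To make the proposal concrete, here is a sketch of the two metadata entries under discussion; the bytes metric is hypothetical follow-up work, not part of this PR, and the variable names are illustrative:

var (
	metaSendQueueSize = metric.Metadata{
		Name:        "raft.transport.send-queue-size",
		Help:        "Number of pending outgoing messages in the Raft Transport queue",
		Measurement: "Messages",
		Unit:        metric.Unit_COUNT,
	}
	// Hypothetical companion metric for the queue's byte size.
	metaSendQueueBytes = metric.Metadata{
		Name:        "raft.transport.send-queue-bytes",
		Help:        "Total byte size of pending outgoing messages in the Raft Transport queue",
		Measurement: "Bytes",
		Unit:        metric.Unit_BYTES,
	}
)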

@tbg tbg (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)


pkg/kv/kvserver/raft_transport_metrics.go line 23 at r1 (raw file):
Sort of the only thing that matters is what it looks like to Prometheus. The source node or store becomes a label. So if you re-use the old name, the metric goes from foo{store=X} to foo{node=X} and all your aggregation rules still work (as long as they didn't particularly depend on the store label).

How would you suggest naming them?

We don't have a convention, so sort of anything goes. But looking at the metrics we have right now, I think we tend to name the "counts" without a suffix (i.e. like raft.enqueued.pending) and then give the corresponding bytes metric a -bytes suffix, for example here:

metaRangeSnapshotsAppliedByNonVoter = metric.Metadata{
	Name:        "range.snapshots.applied-non-voter",
	Help:        "Number of snapshots applied by non-voter replicas",
	Measurement: "Snapshots",
	Unit:        metric.Unit_COUNT,
}
metaRangeSnapshotRcvdBytes = metric.Metadata{
	Name:        "range.snapshots.rcvd-bytes",
	Help:        "Number of snapshot bytes received",
	Measurement: "Bytes",
	Unit:        metric.Unit_COUNT,
}

I think send-queue-{size,bytes} is fine, and I don't think it matters much that we rename this metric, as there should be very little depending on it. So we can use the name you suggested.


pkg/ui/workspaces/cluster-ui/src/store/nodes/nodes.fixtures.ts line 809 at r2 (raw file):

          "queue.tsmaintenance.processingnanos": 174301000,
          "raft.commandsapplied": 0,
          "raft.enqueued.pending": 0,

Unclear to me if we also need to add the new metric here. I don't think we need to, but I'm giving a courtesy tag to @cockroachdb/kv-obs-prs as they'll hopefully know if something needs to be done. (No need to wait if CI passes, I think.)

@pav-kv pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)


pkg/kv/kvserver/raft_transport_metrics.go line 26 at r2 (raw file):

			Help: `Number of pending outgoing messages in the Raft Transport queue.

The queue is bounded in size, so instead of unbounded growth one would observe a

BTW, this seems not to be true. This metric accumulates all per-peer channel sizes, so it is bounded by totalNodes * 10k. It can be hard to analyse: e.g. if this metric is 10k on a cluster of 3 nodes, it could be fine (the load is uniform, i.e. 5k per peer) or bad (the whole 10k backlog is on one particular peer).

Code quote:

The queue is bounded in size, so instead of unbounded growth one would observe a
ceiling value in the tens of thousands.`,
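A simplified, self-contained illustration of the aggregation being described: the gauge sums the lengths of all per-peer bounded channels, so its ceiling is numPeers * perQueueCapacity rather than a single fixed bound. The types below are stand-ins for the real RaftTransport internals, not the actual implementation:

package main

import (
	"fmt"
	"sync"
)

// raftMsg stands in for the real *RaftMessageRequest payload.
type raftMsg struct{}

// sendQueues is a simplified stand-in for the transport's per-peer queues.
type sendQueues struct {
	mu     sync.Mutex
	queues map[int32]chan raftMsg // keyed by peer node ID
}

// totalSize is the quantity the functional gauge would report: the sum of
// all per-peer buffered-channel lengths.
func (s *sendQueues) totalSize() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	var n int64
	for _, q := range s.queues {
		n += int64(len(q))
	}
	return n
}

func main() {
	s := &sendQueues{queues: map[int32]chan raftMsg{}}
	for id := int32(1); id <= 2; id++ {
		s.queues[id] = make(chan raftMsg, 10000) // per-peer bound of 10k, as noted above
	}
	s.queues[2] <- raftMsg{}   // one pending message to peer 2
	fmt.Println(s.totalSize()) // prints 1; 10000 could mean one full queue or two half-full ones
}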

@tbg tbg (Member) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov)


pkg/kv/kvserver/raft_transport_metrics.go line 26 at r2 (raw file):

Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…

BTW, this seems not to be true. This metric accumulates all per-peer channel sizes, so it is bounded by totalNodes * 10k. It can be hard to analyse: e.g. if this metric is 10k on a cluster of 3 nodes, it could be fine (the load is uniform, i.e. 5k per peer) or bad (the whole 10k backlog is on one particular peer).

Good point. Everything generally over-indexes on the one-store-per-node case, but that is not always the setup (we have customers running with 8-12 stores per node). Mind updating the verbiage here, maybe saying that values in the thousands or even tens of thousands could indicate that there are issues streaming messages to at least one peer?

@pav-kv pav-kv (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)


pkg/kv/kvserver/raft_transport_metrics.go line 26 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Good point. Everything generally over-indexes on the one-store-per-node case, but that is not always the setup (we have customers running with 8-12 stores per node). Mind updating the verbiage here, maybe saying that values in the thousands or even tens of thousands could indicate that there are issues streaming messages to at least one peer?

Done.
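One possible wording along the lines of the feedback above; this is illustrative only, and the help text that actually merged may differ:

var metaSendQueueSize = metric.Metadata{
	Name: "raft.transport.send-queue-size",
	Help: `Number of pending outgoing messages in the Raft Transport queue.

The queue is composed of multiple bounded channels associated with different
peers. A size in the tens of thousands could indicate issues streaming
messages to at least one peer.`,
	Measurement: "Messages",
	Unit:        metric.Unit_COUNT,
}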

pav-kv added 2 commits August 10, 2022 18:00
Currently, there is a RaftEnqueuePending metric in StoreMetrics which exposes
the RaftTransport outgoing queue size. However, this is a per-Store metric, so
it duplicates the same variable across all the stores. The metric should be
tracked on a per-node/transport basis.

This commit introduces such a metric, and does the necessary plumbing to include
it in the exported metrics, since it's the first metric in RaftTransport.

Release note: None
This metric was replaced by the SendQueueSize metric of RaftTransportMetrics.

Release note: None
@pav-kv pav-kv (Collaborator, Author) commented Aug 10, 2022

bors r=tbg

@craig craig bot (Contributor) commented Aug 10, 2022

Build succeeded.

@craig craig bot merged commit 0ad97e4 into cockroachdb:master Aug 10, 2022
@pav-kv pav-kv deleted the per_node_metrics branch August 10, 2022 20:43