Add prometheus metrics to CSI external-provisioner #386

saad-ali · 2019-12-16T11:41:47Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds 3 metrics to the external-provisioner for the CreateVolume and DeleteVolume CSI calls:

csi_external_provisioner_operations_count counter
csi_external_provisioner_operations_seconds histogram
csi_external_provisioner_errors_count counter

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add prometheus metrics to CSI external-provisioner

k8s-ci-robot · 2019-12-16T11:41:59Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [saad-ali]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

saad-ali · 2019-12-16T11:42:19Z

/assign @msau42
CC @verult

msau42 · 2019-12-16T17:50:24Z

pkg/controller/metrics/metrics.go

+		pmm.csiErrorMetric.WithLabelValues(
+			driverName, operationName).Inc()
+	} else {
+		// Observe duration for successful operations


I think we should record duration for failed operations too. Would it be strange to combine error code and latency into one metric?

Done. Will keep latency separate from count, but will also add error codes as a label.

msau42 · 2019-12-16T17:53:17Z

pkg/controller/metrics/metrics.go

+		driverName, operationName).Inc()
+
+	if operationFailed {
+		// Observe error count metric in case of error


We should log the error code too so that we can distinguish between configuration errors and system errors. Check out kubernetes/kubernetes#75750

Made the gRPC error code a dimension and dropped the separate error counter.

msau42 · 2019-12-16T17:54:34Z

pkg/controller/metrics/metrics.go

+	CSIDeleteVolumeOperationName = "DeleteVolume"
+
+	// Common metric strings
+	subsystem             = "csi_external_provisioner"


Thinking about some potential state in the future where we combine all sidecars into one process, should we have a more generic name like "csi_sidecar" as the component? We can figure out which sidecar it is based on the operation name.

msau42 · 2019-12-16T17:59:13Z

cmd/csi-provisioner/csi-provisioner.go

@@ -71,6 +74,8 @@ var (
 	featureGates        map[string]bool
 	provisionController *controller.ProvisionController
 	version             = "unknown"
+	metricsEndpoint     = "/metrics"
+	metricsPort         = "8080"


Should we make the port configurable, considering that we will eventually want all sidecars to export metrics?

+1. Can't this clash with existing, used ports?

Made endpoint (including port) and path both configurable parameters.

msau42 · 2019-12-16T19:03:30Z

cc @logicalhan

logicalhan

How does versioning work for this component? Will it follow along the kubernetes versions?

logicalhan · 2019-12-16T20:58:03Z

cmd/csi-provisioner/csi-provisioner.go

@@ -71,6 +74,8 @@ var (
 	featureGates        map[string]bool
 	provisionController *controller.ProvisionController
 	version             = "unknown"
+	metricsEndpoint     = "/metrics"
+	metricsPort         = "8080"


+1. Can't this clash with existing, used ports?

logicalhan · 2019-12-16T20:59:36Z

pkg/controller/metrics/metrics.go

+	labelCSIOperationName = "operation_name"
+
+	// CSI Operation Total - Count Metric
+	operationsMetricName = "operations_count"


Use the _total suffix, as per Prometheus naming practices.

@logicalhan wdyt of combining both latency and error metrics into one histogram? The dimensions would be:

plugin name

operation

status

I think that sounds okay to me. What does the cardinality of plugin name and operation look like, respectively?

In a single cluster, plugin_name would be 1 for the most part. Across an entire fleet, there could be 5-10.

Operation name would probably be around 10-15.

Status would probably be around 5.

That sounds reasonable.
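As a rough worked example of the estimate above, the upper bound on label combinations is the product of the per-label cardinalities. The sketch below uses the high end of each range quoted in this thread. The bucket count of 12 (including the `+Inf` bucket) is an assumption for illustration, not a value taken from this PR.

```go
package main

import "fmt"

// labelCombinations returns the upper bound on distinct label value
// combinations for a metric with these three dimensions.
func labelCombinations(plugins, operations, statuses int) int {
	return plugins * operations * statuses
}

// histogramSeries estimates the series a histogram exposes: one per
// bucket (including +Inf) plus _sum and _count, for each combination.
func histogramSeries(combos, bucketsIncludingInf int) int {
	return combos * (bucketsIncludingInf + 2)
}

func main() {
	perCluster := labelCombinations(1, 15, 5) // one plugin in a typical cluster
	fleet := labelCombinations(10, 15, 5)     // high end across a fleet
	fmt.Println("label combinations per cluster:", perCluster)
	fmt.Println("label combinations across fleet:", fleet)
	// 12 bucket series per combination is an assumed value.
	fmt.Println("histogram series across fleet:", histogramSeries(fleet, 12))
}
```

These are upper bounds; in practice not every operation produces every status value.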

Renamed _count to _total. Reduced from 3 metrics to 2: one counter and one histogram. Both have status as a dimension.

@logicalhan do you see a benefit to having both a counter and a histogram? The histogram dimensions contain everything the counter has.

Histograms actually have a counter metric embedded into them, so having another counter is probably unnecessary.
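To make that point concrete, here is a simplified model of a Prometheus-style histogram: cumulative buckets, a sum, and a count that increments on every observation. It is a sketch using only the standard library, not the client library's implementation, and the bucket boundaries are arbitrary.

```go
package main

import (
	"fmt"
	"math"
)

// histogram is a simplified model of a Prometheus histogram.
type histogram struct {
	upperBounds []float64 // sorted bucket upper bounds, excluding +Inf
	buckets     []uint64  // cumulative counts; the last entry is the +Inf bucket
	sum         float64
	count       uint64
}

func newHistogram(upperBounds []float64) *histogram {
	return &histogram{
		upperBounds: upperBounds,
		buckets:     make([]uint64, len(upperBounds)+1),
	}
}

// observe adds v to every bucket whose upper bound is >= v, then updates
// the sum and the embedded count.
func (h *histogram) observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.buckets[i]++
		}
	}
	h.buckets[len(h.upperBounds)]++ // the +Inf bucket sees every observation
	h.sum += v
	h.count++
}

func main() {
	h := newHistogram([]float64{0.1, 1, 10})
	for _, v := range []float64{0.05, 0.5, 5, 50} {
		h.observe(v)
	}
	bounds := append(append([]float64{}, h.upperBounds...), math.Inf(1))
	for i, ub := range bounds {
		fmt.Printf("le=%v: %d\n", ub, h.buckets[i])
	}
	fmt.Printf("sum=%v count=%d\n", h.sum, h.count)
}
```

Because the count always equals the number of observations for a given label set, a separate counter with the same labels would carry no additional information.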

msau42 · 2019-12-16T22:49:43Z

pkg/controller/controller.go

 	rep, err = p.csiClient.CreateVolume(ctx, &req)

+	p.metricsManager.RecordMetrics(


What do you think of adding the metrics in the common grpc handler, so that we automatically record metrics for every single csi call?

https://github.com/kubernetes-csi/external-provisioner/blob/master/vendor/github.com/kubernetes-csi/csi-lib-utils/connection/connection.go#L179

That seems like a good idea.

I'll work on another PR to do this. If I can finish that in the next couple of days, we can abandon this PR. Otherwise, we can merge this to unblock the larger effort.

saad-ali · 2019-12-17T02:38:52Z

How does versioning work for this component? Will it follow along the kubernetes versions?

Versioning for side-car containers is independent of Kubernetes versions. See https://kubernetes-csi.github.io/docs/external-provisioner.html for example.

saad-ali · 2019-12-17T02:42:01Z

Comments addressed. PTAL. I'll work on moving this to kubernetes-csi/csi-lib-utils/connection/connection.go so we can reuse it from all sidecars with minimal work.

msau42 · 2019-12-17T02:56:16Z

cmd/csi-provisioner/csi-provisioner.go

@@ -68,6 +71,9 @@ var (
 	leaderElectionNamespace = flag.String("leader-election-namespace", "", "Namespace where the leader election resource lives. Defaults to the pod namespace if not set.")
 	strictTopology          = flag.Bool("strict-topology", false, "Passes only selected node topology to CreateVolume Request, unlike default behavior of passing aggregated cluster topologies that match with topology keys of the selected node.")

+	metricsAddress = flag.String("metrics-address", ":8080", "The TCP network address where the prometheus metrics endpoint will listen. Default is ':8080'.")


I wonder if we should make this opt-in instead of default, in case there are other containers in the pod listening on the same address?

Probably good to make it opt-in in general. Changing.

msau42 · 2019-12-17T02:58:06Z

pkg/controller/metrics/metrics.go

+	// Common metric strings
+	subsystem             = "csi_sidecar"
+	labelCSIDriverName    = "driver_name"
+	labelCSIOperationName = "operation_name"


operation name or csi method? (ie ControllerPublish, etc)

I prefer operation over method. But let me know if you feel strongly and I can change it.

msau42 · 2019-12-17T03:00:29Z

pkg/controller/metrics/metrics.go

+	labelCSIOperationName = "operation_name"
+
+	// CSI Operation Total - Count Metric
+	operationsMetricName = "operations_count"


@logicalhan do you see a benefit to having both a counter and a histogram? The histogram dimensions contain everything the counter has.

saad-ali · 2019-12-18T01:46:58Z

Feedback addressed PTAL

k8s-ci-robot · 2019-12-18T01:49:38Z

@saad-ali: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-csi-external-provisioner-unit | `7266237` | link | `/test pull-kubernetes-csi-external-provisioner-unit` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

saad-ali · 2019-12-24T01:15:18Z

This PR is superseded by #388 which uses the new CSI metrics library kubernetes-csi/csi-lib-utils#35

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 16, 2019

k8s-ci-robot requested review from lpabon and sbezverk December 16, 2019 11:41

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 16, 2019

k8s-ci-robot assigned msau42 Dec 16, 2019

saad-ali force-pushed the addMetrics branch from 71619c0 to ab7ed10 Compare December 16, 2019 11:44

msau42 reviewed Dec 16, 2019

logicalhan reviewed Dec 16, 2019

msau42 reviewed Dec 16, 2019

saad-ali force-pushed the addMetrics branch from ab7ed10 to 84bf6f4 Compare December 17, 2019 02:31

msau42 reviewed Dec 17, 2019

saad-ali added 2 commits December 17, 2019 17:45

Add metrics to CSI external provisioner

53737df

Update dependencies for metrics

7266237

saad-ali force-pushed the addMetrics branch from 84bf6f4 to 7266237 Compare December 18, 2019 01:46

msau42 mentioned this pull request Dec 23, 2019

Introduce a CSI Metrics Library kubernetes-csi/csi-lib-utils#35

Merged

saad-ali closed this Dec 24, 2019

saad-ali mentioned this pull request Mar 20, 2020

Add Metrics to the snapshot controller kubernetes-csi/external-snapshotter#142

Closed
