-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add prometheus metrics to CSI external-provisioner #386
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: saad-ali The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
pkg/controller/metrics/metrics.go
Outdated
pmm.csiErrorMetric.WithLabelValues( | ||
driverName, operationName).Inc() | ||
} else { | ||
// Observe duration for successful operations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should record duration for failed operations too. Would it be strange to combine error code and latency into one metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Will keep latency separate from count, but will also add error codes as a label.
pkg/controller/metrics/metrics.go
Outdated
driverName, operationName).Inc() | ||
|
||
if operationFailed { | ||
// Observe error count metric in case of error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should log error code too so that we can distinguish between configuration errors vs system errors. Check out kubernetes/kubernetes#75750
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made grpc error code a dimension. And dropped separate error counter
pkg/controller/metrics/metrics.go
Outdated
CSIDeleteVolumeOperationName = "DeleteVolume" | ||
|
||
// Common metric strings | ||
subsystem = "csi_external_provisioner" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about some potential state in the future where we combine all sidecars into one process, should we have a more generic name like "csi_sidecar" as the component? We can figure out which sidecar it is based on the operation name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
@@ -71,6 +74,8 @@ var ( | |||
featureGates map[string]bool | |||
provisionController *controller.ProvisionController | |||
version = "unknown" | |||
metricsEndpoint = "/metrics" | |||
metricsPort = "8080" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the port configurable, considering that we will eventually want all sidecars to export metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Can't this clash with existing, used ports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made endpoint (including port) and path both configurable parameters.
cc @logicalhan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does versioning work for this component? Will it follow along the kubernetes versions?
@@ -71,6 +74,8 @@ var ( | |||
featureGates map[string]bool | |||
provisionController *controller.ProvisionController | |||
version = "unknown" | |||
metricsEndpoint = "/metrics" | |||
metricsPort = "8080" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Can't this clash with existing, used ports?
pkg/controller/metrics/metrics.go
Outdated
labelCSIOperationName = "operation_name" | ||
|
||
// CSI Operation Total - Count Metric | ||
operationsMetricName = "operations_count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_total
as per prometheus naming practices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@logicalhan wdyt of combining both latency and error metrics into one histogram? The dimensions would be:
- plugin name
- operation
- status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds okay to me. What is the cardinality of plugin name and operation look like, respectively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a single cluster, plugin_name would be 1 for the most part. Across an entire fleet, there could be 5-10.
Operation name would probably be around 10-15.
Status would probably be around 5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed _count
to _total
. Reduced from 3 metrics to 2: one counter and one histogram. Both have status as a dimension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@logicalhan do you see benefit to having both a counter and a histogram? The histogram dimensions contains everything the counter has
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Histograms actually have a counter metric embedded into them, so having another counter is probably unnecessary.
rep, err = p.csiClient.CreateVolume(ctx, &req) | ||
|
||
p.metricsManager.RecordMetrics( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of adding the metrics in the common grpc handler, so that we automatically record metrics for every single csi call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll work on another PR to do this. If I can finish that in the next couple of days, we can abandon this PR. Otherwise, we can merge this to unblock the larger effort.
Versioning for side-car containers is independent of Kubernetes versions. See https://kubernetes-csi.github.io/docs/external-provisioner.html for example. |
Comments addressed. PTAL. I'll work on moving this to |
@@ -68,6 +71,9 @@ var ( | |||
leaderElectionNamespace = flag.String("leader-election-namespace", "", "Namespace where the leader election resource lives. Defaults to the pod namespace if not set.") | |||
strictTopology = flag.Bool("strict-topology", false, "Passes only selected node topology to CreateVolume Request, unlike default behavior of passing aggregated cluster topologies that match with topology keys of the selected node.") | |||
|
|||
metricsAddress = flag.String("metrics-address", ":8080", "The TCP network address address where the prometheus metrics endpoint will listen. Default is ':8080'.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should make this opt-in instead of default, in case there are other containers in the pod listening on the same address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably good to make it opt-in in general. Changing.
// Common metric strings | ||
subsystem = "csi_sidecar" | ||
labelCSIDriverName = "driver_name" | ||
labelCSIOperationName = "operation_name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
operation name or csi method? (ie ControllerPublish, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer operation over method. But let me know if you feel strongly can change.
pkg/controller/metrics/metrics.go
Outdated
labelCSIOperationName = "operation_name" | ||
|
||
// CSI Operation Total - Count Metric | ||
operationsMetricName = "operations_count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@logicalhan do you see benefit to having both a counter and a histogram? The histogram dimensions contains everything the counter has
Feedback addressed PTAL |
@saad-ali: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
This PR is superseded by #388 which uses the new CSI metrics library kubernetes-csi/csi-lib-utils#35 |
add dependabot github action for auto dependency update
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds 3 metrics to the external-provisioner for the CreateVolume and DeleteVolume CSI calls:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: