stacks_failing metric may not be documented or implemented correctly #399
The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series is of type `gauge`, so the documentation is misleading in referring to them as `gaugevec` (which is not a kind of metric). I've simplified the verbiage a little, in passing. Addresses #399.
#402 fixes the type of the stacks_failing metric in the documentation, which should prevent some confusion. I can see one pretty obvious problem with how stacks_failing is recorded. The scheme is this:
This works at all because the labels identify an individual stack, so it doesn't have to keep track of a count, just to say whether the single stack in question qualifies as a failed stack or not. A query of the
Except: the

```
$ kubectl get stacks
No resources found in default namespace.
$ curl http://localhost:8383/metrics | grep stacks_failing
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="podinfo-autoapi",namespace="default"} 1
```
* Clarify type and meaning of stacks_* metrics

  The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series is of type `gauge`, so the documentation is misleading in referring to them as `gaugevec` (which is not a kind of metric). I've simplified the verbiage a little, in passing. Addresses #399.

* Reset stacks_failing gauge when stack deleted

  The stacks_failing metric is a set of gauges, each labelled with the namespace and name of a Stack object. The controller sets a gauge to `1` when its Stack object is given a state of "failed", and `0` for "succeeded". A query aggregating over the labels will get the count of failed stacks.

  However: once a Stack is deleted, the gauge remains with the last value, and if it was failing, it will still be included in the count.

  So, this commit resets the gauge to `0` when a Stack is deleted (if it had a state at all).

  Signed-off-by: Michael Bridgen <[email protected]>
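The behaviour described in the second commit can be sketched as follows. This is a standard-library model under stated assumptions, not the controller's actual code: `failingGauges`, `observe`, `onDelete`, and `failingCount` are hypothetical names, and `failingCount` merely stands in for a query that sums the gauge over its labels.

```go
package main

import "fmt"

// stackKey carries the two labels that identify a Stack (hypothetical type).
type stackKey struct {
	Namespace string
	Name      string
}

// failingGauges models the set of stacks_failing gauges: one value per stack.
type failingGauges map[stackKey]float64

// observe records the outcome of a reconcile:
// 1 for "failed", 0 for "succeeded"; other states leave the gauge untouched.
func (g failingGauges) observe(k stackKey, state string) {
	switch state {
	case "failed":
		g[k] = 1
	case "succeeded":
		g[k] = 0
	}
}

// onDelete resets the gauge to 0 when a stack is deleted, but only if the
// stack ever had a state (that is, a gauge already exists for it).
func (g failingGauges) onDelete(k stackKey) {
	if _, ok := g[k]; ok {
		g[k] = 0
	}
}

// failingCount mimics a query that aggregates (sums) over all label sets.
func (g failingGauges) failingCount() float64 {
	var total float64
	for _, v := range g {
		total += v
	}
	return total
}

func main() {
	g := failingGauges{}
	k := stackKey{Namespace: "default", Name: "podinfo-autoapi"}

	g.observe(k, "failed")
	fmt.Println("failing after failed reconcile:", g.failingCount())

	// Without the reset, the deleted stack would still count as failing.
	g.onDelete(k)
	fmt.Println("failing after delete with reset:", g.failingCount())
}
```

Resetting to `0` keeps the series present in the output while removing it from the failed count; deleting the series outright would be a different design and is not what the commit message describes.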
I've had word that fixing this has removed false positives for a production user. So, on the basis that the documentation is corrected, and the reported problem with it is fixed, I'm going to close this.
What happened?
If one sets up port-forwarding from the pulumi operator pod on 8383/metrics one sees something like:
The `# TYPE stacks_failing gauge` line implies it's a `gauge`, while the documentation here:
https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/docs/metrics.md#metrics-overview
indicates it's a `gaugevec` metric.
Steps to reproduce
The code here should be able to be used to set up an environment to test what is emitted by the operator metrics:
https://github.com/MitchellGerdisch/pulumi-work/tree/master/pulumi-operator
Expected Behavior
The docs and the output from metrics should be in sync.
Actual Behavior
One says stacks_failing is a `gauge` and one says it's a `gaugevec`.
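One way to check the mismatch mechanically is to read the declared type out of the scraped text and compare it with what the docs claim. The sketch below is only an illustration using the Go standard library: `metricType` is a hypothetical helper, and the sample text is the endpoint output quoted in this thread rather than a live scrape.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// metricType returns the type declared for the given metric in a Prometheus
// text exposition, taken from its "# TYPE <name> <type>" line.
// The boolean is false if no such line is found.
func metricType(exposition, metric string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 4 && fields[0] == "#" && fields[1] == "TYPE" && fields[2] == metric {
			return fields[3], true
		}
	}
	return "", false
}

func main() {
	// Sample taken from the output quoted in this issue.
	sample := `# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="podinfo-autoapi",namespace="default"} 1
`
	documented := "gaugevec" // the type the docs listed when this issue was filed

	got, ok := metricType(sample, "stacks_failing")
	if !ok {
		fmt.Println("no TYPE line found for stacks_failing")
		return
	}
	fmt.Printf("endpoint says %q, docs say %q, in sync: %v\n", got, documented, got == documented)
}
```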
Output of `pulumi about`
No response
Additional context
No response
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).