Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stacks_failing metric may not be documented or implemented correctly #399

Closed
MitchellGerdisch opened this issue Jan 11, 2023 · 3 comments
Closed
Assignees
Labels
customer/feedback Feedback from customers kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed
Milestone

Comments

@MitchellGerdisch
Copy link

What happened?

If one sets up port-forwarding from the pulumi operator pod on 8383/metrics one sees something like:

# HELP stacks_active Number of stacks currently tracked by the Pulumi Kubernetes Operator
# TYPE stacks_active gauge
stacks_active 1
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="core-vault",namespace="pulumi-operator"} 0

The # TYPE stacks_failing gauge line implies it's a gauge

While the documentation here:
https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/docs/metrics.md#metrics-overview
indicates it's a gaugevec metric.

Steps to reproduce

The code here should be able to be used to set up an environment to test what is emitted by the operator metrics:
https://github.com/MitchellGerdisch/pulumi-work/tree/master/pulumi-operator

Expected Behavior

The docs and the output from metrics should be in sync.

Actual Behavior

One says stacks_failing is a gauge and one says it's gaugevec

Output of pulumi about

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@MitchellGerdisch MitchellGerdisch added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Jan 11, 2023
@squaremo
Copy link
Contributor

squaremo commented Jan 11, 2023

  • gaugevec is not a kind of metric -- instead, there are a set of stacks_failing gauges, differentiated by name and namespace labels. This should just be changed in the docs to say "gauge".
  • stacks_active should really be a set of gauges in the same way. In fact, arguably, there should just be one gauge, with failure and success as a label. This makes it easier to aggregate over all stacks.

@guineveresaenger guineveresaenger removed the needs-triage Needs attention from the triage team label Jan 13, 2023
squaremo added a commit that referenced this issue Jan 18, 2023
The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series are of type `gauge`, so the documentation is misleading in referring to them as `gaugevec` (which is not a kind of metric).

I've simplified the verbiage a little, in passing.

Addresses #399.
@squaremo
Copy link
Contributor

#402 fixes the type of the stacks_failing metric in the documentation, which should prevent some confusion.

I can see one pretty obvious problem with how stacks_failing is recorded. The scheme is this:

  • there's a set of gauge metrics using the labels namespace and name, meaning each gauge in the set is particular to an individual stack;
  • when a stack is updated, if its new end state is "failed", set its gauge to 1; if its new end state is "succeeded", set the gauge to 0

This works at all because the labels identify an individual stack, so it doesn't have to keep track of a count, just to say whether the single stack in question qualifies as a failed stack or not. A query of the stacks_failing metric will usually aggregate over the set of time series to get a count. This is a bit of an abuse of labels, because you're generally not supposed to name individual things (since then it can generate arbitrary numbers of bitty time series). But it does what you expect, at least.

Except: the stacks_failing gauge isn't set to 0 when the stack is deleted. So, if you delete the stack while it's failing, there will be a stacks_failing{namespace=..., name=...} 1 left behind:

$ kubectl get stacks
No resources found in default namespace.

$ curl http://localhost:8383/metrics | grep stacks_failing
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="podinfo-autoapi",namespace="default"} 1

squaremo added a commit that referenced this issue Jan 24, 2023
* Clarify type and meaning of stacks_* metrics

The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series are of type `gauge`, so the documentation is misleading in referring to them as `gaugevec` (which is not a kind of metric).

I've simplified the verbiage a little, in passing.

Addresses #399.

* Reset stacks_failed gauge when stack deleted

The stacks_failed metric is a set of gauges, each labelled with the
namespace and name of a Stack object. The controller sets a gauge to `1`
when its Stack object is given a state of "failed", and `0` for
"succeeded". A query aggregating over the labels will get the count of
failed stacks.

However: once a Stack is deleted, the gauge remains with the last value
-- and if it was failing, it will still be included in the count. So,
this commit resets the gauge to `0` when a Stack is deleted (if it had a
state at all).

Signed-off-by: Michael Bridgen <[email protected]>
@mikhailshilkov mikhailshilkov added the customer/feedback Feedback from customers label Feb 14, 2023
@squaremo squaremo self-assigned this Mar 1, 2023
@squaremo squaremo added the resolution/fixed This issue was fixed label Mar 1, 2023
@squaremo
Copy link
Contributor

squaremo commented Mar 1, 2023

the stacks_failing gauge isn't set to 0 when the stack is deleted.

I've had word that fixing this has removed false positives for a production user. So, on the basis that the documentation is corrected, and the reported problem with it is fixed, I'm going to close this.

@squaremo squaremo closed this as completed Mar 1, 2023
@mikhailshilkov mikhailshilkov added this to the 0.85 milestone Mar 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
customer/feedback Feedback from customers kind/bug Some behavior is incorrect or out of spec resolution/fixed This issue was fixed
Projects
None yet
Development

No branches or pull requests

4 participants