stacks_failing metric may not be documented or implemented correctly #399

MitchellGerdisch · 2023-01-11T16:13:46Z

What happened?

If one sets up port-forwarding from the pulumi operator pod on 8383/metrics one sees something like:

# HELP stacks_active Number of stacks currently tracked by the Pulumi Kubernetes Operator
# TYPE stacks_active gauge
stacks_active 1
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="core-vault",namespace="pulumi-operator"} 0

The # TYPE stacks_failing gauge line implies it's a gauge

While the documentation here:
https://github.com/pulumi/pulumi-kubernetes-operator/blob/master/docs/metrics.md#metrics-overview
indicates it's a gaugevec metric.

Steps to reproduce

The code here should be able to be used to set up an environment to test what is emitted by the operator metrics:
https://github.com/MitchellGerdisch/pulumi-work/tree/master/pulumi-operator

Expected Behavior

The docs and the output from metrics should be in sync.

Actual Behavior

One says stacks_failing is a gauge and one says it's gaugevec

Output of `pulumi about`

No response

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

The text was updated successfully, but these errors were encountered:

squaremo · 2023-01-11T18:02:12Z

gaugevec is not a kind of metric -- instead, there are a set of stacks_failing gauges, differentiated by name and namespace labels. This should just be changed in the docs to say "gauge".
stacks_active should really be a set of gauges in the same way. In fact, arguably, there should just be one gauge, with failure and success as a label. This makes it easier to aggregate over all stacks.

The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series are of type `gauge`, so the documentation is misleading in referring to them as `gaugevec` (which is not a kind of metric). I've simplified the verbiage a little, in passing. Addresses #399.

squaremo · 2023-01-20T15:52:27Z

#402 fixes the type of the stacks_failing metric in the documentation, which should prevent some confusion.

I can see one pretty obvious problem with how stacks_failing is recorded. The scheme is this:

there's a set of gauge metrics using the labels namespace and name, meaning each gauge in the set is particular to an individual stack;
when a stack is updated, if its new end state is "failed", set its gauge to 1; if its new end state is "succeeded", set the gauge to 0

This works at all because the labels identify an individual stack, so it doesn't have to keep track of a count, just to say whether the single stack in question qualifies as a failed stack or not. A query of the stacks_failing metric will usually aggregate over the set of time series to get a count. This is a bit of an abuse of labels, because you're generally not supposed to name individual things (since then it can generate arbitrary numbers of bitty time series). But it does what you expect, at least.

Except: the stacks_failing gauge isn't set to 0 when the stack is deleted. So, if you delete the stack while it's failing, there will be a stacks_failing{namespace=..., name=...} 1 left behind:

$ kubectl get stacks
No resources found in default namespace.

$ curl http://localhost:8383/metrics | grep stacks_failing
# HELP stacks_failing Number of stacks currently registered where the last reconcile failed
# TYPE stacks_failing gauge
stacks_failing{name="podinfo-autoapi",namespace="default"} 1

* Clarify type and meaning of stacks_* metrics The stacks_failing metric is created as a GaugeVec in the Go code, which represents a set of time series distinguished by labels (in this case, "namespace" and "name"). But each of these time series are of type `gauge`, so the documentation is misleading in referring to them as `gaugevec` (which is not a kind of metric). I've simplified the verbiage a little, in passing. Addresses #399. * Reset stacks_failed gauge when stack deleted The stacks_failed metric is a set of gauges, each labelled with the namespace and name of a Stack object. The controller sets a gauge to `1` when its Stack object is given a state of "failed", and `0` for "succeeded". A query aggregating over the labels will get the count of failed stacks. However: once a Stack is deleted, the gauge remains with the last value -- and if it was failing, it will still be included in the count. So, this commit resets the gauge to `0` when a Stack is deleted (if it had a state at all). Signed-off-by: Michael Bridgen <[email protected]>

squaremo · 2023-03-01T10:33:03Z

the stacks_failing gauge isn't set to 0 when the stack is deleted.

I've had word that fixing this has removed false positives for a production user. So, on the basis that the documentation is corrected, and the reported problem with it is fixed, I'm going to close this.

MitchellGerdisch added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Jan 11, 2023

guineveresaenger removed the needs-triage Needs attention from the triage team label Jan 13, 2023

squaremo mentioned this issue Jan 18, 2023

Clarify type and meaning of stacks_* metrics #402

Merged

mikhailshilkov added the customer/feedback Feedback from customers label Feb 14, 2023

squaremo self-assigned this Mar 1, 2023

squaremo added the resolution/fixed This issue was fixed label Mar 1, 2023

squaremo closed this as completed Mar 1, 2023

mikhailshilkov added this to the 0.85 milestone Mar 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stacks_failing metric may not be documented or implemented correctly #399

stacks_failing metric may not be documented or implemented correctly #399

MitchellGerdisch commented Jan 11, 2023

squaremo commented Jan 11, 2023 •

edited

Loading

squaremo commented Jan 20, 2023

squaremo commented Mar 1, 2023

stacks_failing metric may not be documented or implemented correctly #399

stacks_failing metric may not be documented or implemented correctly #399

Comments

MitchellGerdisch commented Jan 11, 2023

What happened?

Steps to reproduce

Expected Behavior

Actual Behavior

Output of pulumi about

Additional context

Contributing

squaremo commented Jan 11, 2023 • edited Loading

squaremo commented Jan 20, 2023

squaremo commented Mar 1, 2023

Output of `pulumi about`

squaremo commented Jan 11, 2023 •

edited

Loading