
agent: Add runner panics metrics #180

Merged: 2 commits into main from sharnoff/agent-runner-panics-metrics, Apr 19, 2023
Conversation

sharnoff (Member):

Typically, panics would be visible in other ways, like K8s events. But because the autoscaler-agent isolates panics just to the threads handling a single VM, these can go unnoticed unless we do something about it.
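A minimal sketch of that pattern, using a hypothetical `runnerPanics` counter and `runVM` helper (neither is the PR's actual code; the PR itself exposes the count through a `GaugeFunc` over the agent's state, as the diff excerpt further down shows):

```go
package agent

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter; the metric name is illustrative, not the PR's.
var runnerPanics = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "autoscaling_agent_runner_panics_total",
	Help: "Panics recovered in per-VM runner goroutines",
})

// runVM launches the per-VM runner in its own goroutine and recovers any
// panic there, so a crash affects only that VM's handling. Recording the
// recovery in a metric is what keeps such panics from going unnoticed.
func runVM(vmName string, run func() error) {
	go func() {
		defer func() {
			if v := recover(); v != nil {
				runnerPanics.Inc()
				log.Printf("runner for VM %s panicked: %v", vmName, v)
			}
		}()
		if err := run(); err != nil {
			log.Printf("runner for VM %s exited with error: %v", vmName, err)
		}
	}()
}
```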

sharnoff requested a review from tychoish, April 15, 2023 09:15
Review comment on this part of the diff (the metrics callback that takes the agent's global lock):

```go
	},
	func() float64 {
		globalstate.lock.Lock()
		defer globalstate.lock.Unlock()
```
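For readers without the full diff, a rough sketch of the shape under discussion, using hypothetical `agentState`/`runnerState` types (the real field and type names may differ): the callback runs on every scrape, holding the global lock while also taking each runner's lock to read its `panicked` flag.

```go
package agent

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical state types standing in for the agent's real ones.
type runnerState struct {
	lock     sync.Mutex
	panicked bool
}

type agentState struct {
	lock    sync.Mutex
	runners map[string]*runnerState
}

func registerPanickedMetric(reg prometheus.Registerer, globalstate *agentState) {
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "autoscaling_agent_runners_panicked", // illustrative name
			Help: "Number of per-VM runners that have panicked",
		},
		// Runs on every metrics scrape: the global lock is held while each
		// runner's lock is taken in turn, which is the contention concern
		// raised in the review comment below.
		func() float64 {
			globalstate.lock.Lock()
			defer globalstate.lock.Unlock()

			count := 0
			for _, r := range globalstate.runners {
				r.lock.Lock()
				if r.panicked {
					count++
				}
				r.lock.Unlock()
			}
			return float64(count)
		},
	))
}
```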
tychoish (Contributor):

It looks like these functions are called pretty often (the frequency wasn't super clear to me, but maybe once per metrics scrape? maybe there's another interval?), and I'm somewhat worried about the lock contention, particularly given the global lock and the fact that we need to take and release a bunch of per-runner locks while holding it.

In the short term this is probably fine; in the longer term:

  • maybe panicked can be an atomic, so we don't need the inner lock? (see the sketch after this list)
  • maybe we can have metrics for the number started, number gracefully shut down, and number currently running, and use subtraction to infer the panicked count.
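A minimal sketch of the first idea, under the assumption (hypothetical, not from the PR) that `panicked` becomes an `atomic.Bool` on each runner's state: the scrape-time walk still takes the global lock, but no longer touches the per-runner locks.

```go
package agent

import (
	"sync"
	"sync/atomic"
)

// Hypothetical runner state: panicked is an atomic, set once by the
// runner's recover() handler and read lock-free by the metrics callback.
type runnerState struct {
	lock     sync.Mutex // still guards the rest of the runner's state
	panicked atomic.Bool
}

type agentState struct {
	lock    sync.Mutex
	runners map[string]*runnerState
}

// countPanicked is what the GaugeFunc callback would call on each scrape;
// only the global lock is held, and the inner per-runner locks are untouched.
func (s *agentState) countPanicked() float64 {
	s.lock.Lock()
	defer s.lock.Unlock()

	count := 0
	for _, r := range s.runners {
		if r.panicked.Load() {
			count++
		}
	}
	return float64(count)
}
```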

sharnoff (Member, Author):

Called once per metrics scrape, yep. Every 10s, currently (although maybe that's way too often).

I also considered the lock contention issue. I think realistically it's not a problem, although it would be nice if it weren't a concern at all.

I didn't want metrics for number started / number shut down / etc. because that creates a "multiple sources of truth" situation, which realistically is fine but felt kinda meh.

tychoish (Contributor), Apr 19, 2023:

I don't think that using subtraction really counts as multiple sources of truth: this only ends up being inaccurate if there are lots of updates during metrics collection and the atomics in prom are slow (which they aren't, really), or if we end up incorrectly recording data (which can always happen?).

Faced with "maybe we'll contend on a global lock" vs. "there's a small chance of transient off-by-one errors", I'd pick the second one every time.
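A sketch of that direction, with hypothetical metric names: everything is updated at event time, so the scrape path takes no locks at all, and the panicked count falls out of subtraction with at worst a transient off-by-one.

```go
package agent

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metrics matching the suggestion above; all three are
// updated where the event happens, never during a scrape.
var (
	runnersStarted = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_agent_runners_started_total",
		Help: "Total runners started",
	})
	runnersShutdown = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_agent_runners_shutdown_total",
		Help: "Total runners that shut down gracefully",
	})
	runnersRunning = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "autoscaling_agent_runners_running",
		Help: "Runners currently running",
	})
)

// Usage at event time (no locks needed on scrape):
//
//	start:             runnersStarted.Inc(); runnersRunning.Inc()
//	graceful shutdown: runnersShutdown.Inc(); runnersRunning.Dec()
//	recovered panic:   runnersRunning.Dec()
//
// The panicked count is then inferred by subtraction, e.g. in PromQL:
//
//	autoscaling_agent_runners_started_total
//	  - autoscaling_agent_runners_shutdown_total
//	  - autoscaling_agent_runners_running
//
// which is at worst transiently off by one during concurrent updates.
```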

sharnoff (Member, Author):

Makes sense. I think I care most about getting visibility into this ASAP, but agree that this should be changed in the future. Opened #198 for this.

sharnoff merged commit 649ace1 into main on Apr 19, 2023.
sharnoff deleted the sharnoff/agent-runner-panics-metrics branch on April 19, 2023 at 07:18.