agent: Add runner panics metrics #180
Conversation
Typically, panics would be visible in other ways, like K8s events. But because the autoscaler-agent isolates panics just to the threads handling a single VM, these can go unnoticed unless we do something about it.
},
func() float64 {
    globalstate.lock.Lock()
    defer globalstate.lock.Unlock()
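The hunk above is only a fragment of the diff. For context, a minimal self-contained sketch of the pattern under review might look like the following; the type and metric names here are illustrative stand-ins, not the agent's actual code. The callback runs on every metrics scrape and takes the global lock plus each per-runner lock:

```go
package metrics

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative stand-ins for the agent's real state types.
type runnerState struct {
	lock     sync.Mutex
	panicked bool
}

type globalState struct {
	lock    sync.Mutex
	runners map[string]*runnerState
}

// registerPanickedGauge registers a GaugeFunc whose callback walks every
// runner under the global lock on each metrics scrape.
func registerPanickedGauge(reg prometheus.Registerer, global *globalState) {
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "autoscaling_agent_panicked_vm_runners", // hypothetical metric name
			Help: "Number of VM runners whose threads have panicked",
		},
		func() float64 {
			global.lock.Lock()
			defer global.lock.Unlock()

			count := 0
			for _, r := range global.runners {
				r.lock.Lock() // per-runner lock taken while holding the global lock
				if r.panicked {
					count++
				}
				r.lock.Unlock()
			}
			return float64(count)
		},
	))
}
```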
It looks like these functions are called pretty often (the frequency wasn't super clear to me; maybe once per metrics scrape? maybe there's another interval?), and I worry some about the lock contention, particularly given the global lock and the fact that we need to take and release a bunch of per-runner locks while holding it.
In the short term this is probably fine, in the longer term:
- maybe panicked can be an atomic so we don't need the inner lock? (rough sketch after this list)
- maybe we can have metrics for the number started, the number gracefully shut down, and the number currently running, and use subtraction to infer this number.
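A rough sketch of the first suggestion, reusing the illustrative types from the earlier snippet (again, hypothetical names rather than the actual agent code): if the per-runner panicked flag is an `atomic.Bool`, the scrape callback still takes the global lock to iterate the map, but no longer needs any per-runner locks.

```go
package metrics

import (
	"sync"
	"sync/atomic"
)

// Sketch: per-runner panicked flag as an atomic, so the metrics callback
// only needs the global lock to iterate the runner map.
type runnerState struct {
	panicked atomic.Bool // set from the runner thread's recover() handler
}

type globalState struct {
	lock    sync.Mutex
	runners map[string]*runnerState
}

func countPanicked(global *globalState) float64 {
	global.lock.Lock()
	defer global.lock.Unlock()

	count := 0
	for _, r := range global.runners {
		if r.panicked.Load() { // lock-free read of the per-runner flag
			count++
		}
	}
	return float64(count)
}
```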
Called once per metrics scrape, yep. Every 10s, currently (although maybe that's way too often).
I also considered the lock contention issue. I think realistically it's not a problem, although it would be nice if it weren't.
I didn't want metrics for the number started / number shut down / etc. because that creates a "multiple sources of truth" situation, which realistically is fine, but felt kinda meh.
I don't think that using subtraction really counts as multiple sources of truth: this only ends up being inaccurate if there are lots of updates during metrics collection and the atomics in prom are slow (which they aren't really), or if we end up incorrectly recording data (which can always happen?)
Faced with "maybe we'll contend on a global lock" vs "there's a small chance of transient off-by-one errors", I'd pick the second one every time.
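For completeness, a hedged sketch of the counter-based variant being discussed, with hypothetical metric names: the agent bumps plain counters/gauges at the points where runners start, shut down gracefully, or change running state, and the panicked count is derived by subtraction at query time rather than computed under any lock.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric names; not the agent's actual metrics.
var (
	runnersStarted = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_agent_vm_runners_started_total",
		Help: "Total number of VM runners started",
	})
	runnersShutdown = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_agent_vm_runners_shutdown_total",
		Help: "Total number of VM runners that shut down gracefully",
	})
	runnersRunning = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "autoscaling_agent_vm_runners_running",
		Help: "Number of VM runners currently running",
	})
)

// The panicked count is then inferred at query time, e.g. in PromQL:
//
//	autoscaling_agent_vm_runners_started_total
//	  - autoscaling_agent_vm_runners_shutdown_total
//	  - autoscaling_agent_vm_runners_running
//
// Any inaccuracy is transient: an update racing with a scrape can leave the
// result off by one until the next scrape.
```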
Makes sense. I think I care most about getting visibility into this ASAP, but agree that this should be changed in the future. Opened #198 for this.