
agent: Add runner panics metrics #180

Merged: 2 commits into main from sharnoff/agent-runner-panics-metrics, Apr 19, 2023
Conversation

sharnoff (Member):

Typically, panics would be visible in other ways, like K8s events. But because the autoscaler-agent isolates panics just to the threads handling a single VM, these can go unnoticed unless we do something about it.
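A minimal sketch of that pattern, using a hypothetical `runnerPanics` counter and `runVM` helper (neither is the PR's actual code; the PR itself exposes the count through a `GaugeFunc` over the agent's state, as the diff excerpt further down shows):

```go
package agent

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter; the metric name is illustrative, not the PR's.
var runnerPanics = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "autoscaling_agent_runner_panics_total",
	Help: "Panics recovered in per-VM runner goroutines",
})

// runVM launches the per-VM runner in its own goroutine and recovers any
// panic there, so a crash affects only that VM's handling. Recording the
// recovery in a metric is what keeps such panics from going unnoticed.
func runVM(vmName string, run func() error) {
	go func() {
		defer func() {
			if v := recover(); v != nil {
				runnerPanics.Inc()
				log.Printf("runner for VM %s panicked: %v", vmName, v)
			}
		}()
		if err := run(); err != nil {
			log.Printf("runner for VM %s exited with error: %v", vmName, err)
		}
	}()
}
```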

sharnoff requested a review from tychoish, April 15, 2023 09:15
Review comment on this part of the diff (the metrics callback that takes the agent's global lock):

```go
	},
	func() float64 {
		globalstate.lock.Lock()
		defer globalstate.lock.Unlock()
```
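For readers without the full diff, a rough sketch of the shape under discussion, using hypothetical `agentState`/`runnerState` types (the real field and type names may differ): the callback runs on every scrape, holding the global lock while also taking each runner's lock to read its `panicked` flag.

```go
package agent

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical state types standing in for the agent's real ones.
type runnerState struct {
	lock     sync.Mutex
	panicked bool
}

type agentState struct {
	lock    sync.Mutex
	runners map[string]*runnerState
}

func registerPanickedMetric(reg prometheus.Registerer, globalstate *agentState) {
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{
			Name: "autoscaling_agent_runners_panicked", // illustrative name
			Help: "Number of per-VM runners that have panicked",
		},
		// Runs on every metrics scrape: the global lock is held while each
		// runner's lock is taken in turn, which is the contention concern
		// raised in the review comment below.
		func() float64 {
			globalstate.lock.Lock()
			defer globalstate.lock.Unlock()

			count := 0
			for _, r := range globalstate.runners {
				r.lock.Lock()
				if r.panicked {
					count++
				}
				r.lock.Unlock()
			}
			return float64(count)
		},
	))
}
```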
tychoish (Contributor):

It looks like these functions are called pretty often (the frequency wasn't super clear to me, but maybe once per metrics scrape? maybe there's another interval?), and I'm somewhat worried about the lock contention, particularly given the global lock and the fact that we need to take and release a bunch of per-runner locks while holding it.

In the short term this is probably fine; in the longer term:

  • maybe panicked can be an atomic, so we don't need the inner lock? (see the sketch after this list)
  • maybe we can have metrics for the number started, number gracefully shut down, and number currently running, and use subtraction to infer the panicked count.
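A minimal sketch of the first idea, under the assumption (hypothetical, not from the PR) that `panicked` becomes an `atomic.Bool` on each runner's state: the scrape-time walk still takes the global lock, but no longer touches the per-runner locks.

```go
package agent

import (
	"sync"
	"sync/atomic"
)

// Hypothetical runner state: panicked is an atomic, set once by the
// runner's recover() handler and read lock-free by the metrics callback.
type runnerState struct {
	lock     sync.Mutex // still guards the rest of the runner's state
	panicked atomic.Bool
}

type agentState struct {
	lock    sync.Mutex
	runners map[string]*runnerState
}

// countPanicked is what the GaugeFunc callback would call on each scrape;
// only the global lock is held, and the inner per-runner locks are untouched.
func (s *agentState) countPanicked() float64 {
	s.lock.Lock()
	defer s.lock.Unlock()

	count := 0
	for _, r := range s.runners {
		if r.panicked.Load() {
			count++
		}
	}
	return float64(count)
}
```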

sharnoff (Member, Author):

Called once per metrics scrape, yep. Every 10s, currently (although maybe that's way too often).

I also considered the lock contention issue. I think realistically it's not a problem, although it would be nice if it weren't a concern at all.

I didn't want metrics for number started / number shut down / etc. because that creates a "multiple sources of truth" situation, which realistically is fine but felt kinda meh.

tychoish (Contributor), Apr 19, 2023:

I don't think that using subtraction really counts as multiple sources of truth: this only ends up being inaccurate if there are lots of updates during metrics collection and the atomics in prom are slow (which they aren't, really), or if we end up incorrectly recording data (which can always happen?).

Faced with "maybe we'll contend on a global lock" vs. "there's a small chance of transient off-by-one errors", I'd pick the second one every time.
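A sketch of that direction, with hypothetical metric names: everything is updated at event time, so the scrape path takes no locks at all, and the panicked count falls out of subtraction with at worst a transient off-by-one.

```go
package agent

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metrics matching the suggestion above; all three are
// updated where the event happens, never during a scrape.
var (
	runnersStarted = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_agent_runners_started_total",
		Help: "Total runners started",
	})
	runnersShutdown = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_agent_runners_shutdown_total",
		Help: "Total runners that shut down gracefully",
	})
	runnersRunning = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "autoscaling_agent_runners_running",
		Help: "Runners currently running",
	})
)

// Usage at event time (no locks needed on scrape):
//
//	start:             runnersStarted.Inc(); runnersRunning.Inc()
//	graceful shutdown: runnersShutdown.Inc(); runnersRunning.Dec()
//	recovered panic:   runnersRunning.Dec()
//
// The panicked count is then inferred by subtraction, e.g. in PromQL:
//
//	autoscaling_agent_runners_started_total
//	  - autoscaling_agent_runners_shutdown_total
//	  - autoscaling_agent_runners_running
//
// which is at worst transiently off by one during concurrent updates.
```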

sharnoff (Member, Author):

Makes sense. I think I care most about getting visibility into this ASAP, but agree that this should be changed in the future. Opened #198 for this.

sharnoff merged commit 649ace1 into main on Apr 19, 2023.
sharnoff deleted the sharnoff/agent-runner-panics-metrics branch on April 19, 2023 at 07:18.