-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metric: add debug information to panicking goroutines in Snapshot #140266
metric: add debug information to panicking goroutines in Snapshot #140266
Conversation
Issue cockroachdb#136186 introduced a seemingly impossible situation in which the internal properties of the prometheus histogram implementation mismatched in a way which caused a panic. The properties specifically, where `h` is a histogram were `h.upperBounds` and `h.counts[0/1].buckets`. These properties are slices, initialized once, whose references are never modified, lengths never changed, and are never appended to. The length of the `counts` property is set by the length of the `upperBounds` property, which should guarantee that they are always the same length, however given the panic in the above ticket, `upperBounds` somehow became longer than `counts[0/1]`. While this change cannot tell us how the lengths of each array got to a certain way, it will tell us exactly what the lengths are, which would be the next step if it were to recur. The test in this PR is built in with a time bomb so that if it goes a year without being relevant, the test will start failing, encouraging us to move on and focus on other things. Epic: none Fixes: cockroachdb#136186 Release note: None
cutoff := time.Date(2026, 2, 1, 0, 0, 0, 0, time.UTC) | ||
if time.Now().After(cutoff) { | ||
// Remove this test and the `collectHistogramBoundsCountsLengths` function. | ||
t.Fatalf("This test has expired! It's now %v, which is after %v", time.Now(), cutoff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the zest gpt..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty cool! I haven't seen this test pattern before :)
When do we expected debugAugmentPanic
to work though (release cycle wise). Will this be back-ported to older versions?
// differ. | ||
func (h *Histogram) debugAugmentPanic() { | ||
if r := recover(); r != nil { | ||
stack := debug.Stack() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to consider using this instead:
cockroach/pkg/util/debugutil/debugutil.go
Line 62 in 1eceee1
func Stack() SafeStack { |
Not sure how this panic will get propagated to us (logs? sentry?). If it's through logs, it might get redacted.
Issue #136186 introduced a seemingly impossible situation in which the internal properties of the prometheus histogram implementation mismatched in a way which caused a panic.
The properties specifically, where
h
is a histogram wereh.upperBounds
andh.counts[0/1].buckets
. These properties are slices, initialized once, whose references are never modified, lengths never changed, and are never appended to. The length of thecounts
property is set by the length of theupperBounds
property, which should guarantee that they are always the same length, however given the panic in the above ticket,upperBounds
somehow became longer thancounts[0/1]
.While this change cannot tell us how the lengths of each array got to a certain way, it will tell us exactly what the lengths are, which would be the next step if it were to recur. The test in this PR is built in with a time bomb so that if it goes a year without being relevant, the test will start failing, encouraging us to move on and focus on other things.
Epic: none
Fixes: #136186
Release note: None