Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metric: add debug information to panicking goroutines in Snapshot #140266

Conversation

angles-n-daemons
Copy link
Contributor

Issue #136186 introduced a seemingly impossible situation in which the internal properties of the prometheus histogram implementation mismatched in a way which caused a panic.

The properties specifically, where h is a histogram were h.upperBounds and h.counts[0/1].buckets. These properties are slices, initialized once, whose references are never modified, lengths never changed, and are never appended to. The length of the counts property is set by the length of the upperBounds property, which should guarantee that they are always the same length, however given the panic in the above ticket, upperBounds somehow became longer than counts[0/1].

While this change cannot tell us how the lengths of each array got to a certain way, it will tell us exactly what the lengths are, which would be the next step if it were to recur. The test in this PR is built in with a time bomb so that if it goes a year without being relevant, the test will start failing, encouraging us to move on and focus on other things.

Epic: none
Fixes: #136186

Release note: None

Issue cockroachdb#136186 introduced a seemingly impossible situation in which the
internal properties of the prometheus histogram implementation
mismatched in a way which caused a panic.

The properties specifically, where `h` is a histogram were
`h.upperBounds` and `h.counts[0/1].buckets`. These properties are
slices, initialized once, whose references are never modified, lengths
never changed, and are never appended to. The length of the `counts`
property is set by the length of the `upperBounds` property, which
should guarantee that they are always the same length, however given the
panic in the above ticket, `upperBounds` somehow became longer than
`counts[0/1]`.

While this change cannot tell us how the lengths of each array got to a
certain way, it will tell us exactly what the lengths are, which would
be the next step if it were to recur. The test in this PR is built in
with a time bomb so that if it goes a year without being relevant, the
test will start failing, encouraging us to move on and focus on other
things.

Epic: none
Fixes: cockroachdb#136186

Release note: None
@angles-n-daemons angles-n-daemons requested review from a team as code owners January 31, 2025 20:50
@angles-n-daemons angles-n-daemons requested review from arjunmahishi and aa-joshi and removed request for a team January 31, 2025 20:50
@cockroach-teamcity
Copy link
Member

This change is Reviewable

cutoff := time.Date(2026, 2, 1, 0, 0, 0, 0, time.UTC)
if time.Now().After(cutoff) {
// Remove this test and the `collectHistogramBoundsCountsLengths` function.
t.Fatalf("This test has expired! It's now %v, which is after %v", time.Now(), cutoff)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the zest gpt..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty cool! I haven't seen this test pattern before :)


When do we expected debugAugmentPanic to work though (release cycle wise). Will this be back-ported to older versions?

// differ.
func (h *Histogram) debugAugmentPanic() {
if r := recover(); r != nil {
stack := debug.Stack()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to consider using this instead:

func Stack() SafeStack {

Not sure how this panic will get propagated to us (logs? sentry?). If it's through logs, it might get redacted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants