Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: generate debug output on marshaling panics #35202

Merged
merged 1 commit into from
Feb 26, 2019

Conversation

tbg
Copy link
Member

@tbg tbg commented Feb 26, 2019

We're seeing panics from within gogoproto marshaling that indicate that
we're mutating a protobuf while it's being marshaled (thus increasing
the size necessary to marshal it, which is not reflected in the slice
being marshaled into).

Since we haven't managed to figure this out just by thinking hard, it's
time to add some debugging into the mix.

Since this hasn't popped up during our testrace builds, I assume it's
either rare enough or just not tickled in any of the tests (many of
which don't even run under race because things get too slow).

My hope is that looking at the bytes we get ouf of this logging we'll
see something that looks out of place and can trace it down.

Touches #34241.

Release note: None

@tbg tbg requested a review from a team February 26, 2019 13:02
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@jordanlewis jordanlewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: good idea

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

@tbg
Copy link
Member Author

tbg commented Feb 26, 2019

TFTR!

bors r=jordanlewis

@craig
Copy link
Contributor

craig bot commented Feb 26, 2019

Build failed (retrying...)

We're seeing panics from within gogoproto marshaling that indicate that
we're mutating a protobuf while it's being marshaled (thus increasing
the size necessary to marshal it, which is not reflected in the slice
being marshaled into).

Since we haven't managed to figure this out just by thinking hard, it's
time to add some debugging into the mix.

Since this hasn't popped up during our testrace builds, I assume it's
either rare enough or just not tickled in any of the tests (many of
which don't even run under `race` because things get too slow).

My hope is that looking at the bytes we get ouf of this logging we'll
see something that looks out of place and can trace it down.

Touches cockroachdb#34241.

Release note: None
@tbg tbg force-pushed the fix/scanreq-debug branch from ce01b52 to ed6a303 Compare February 26, 2019 15:18
@craig
Copy link
Contributor

craig bot commented Feb 26, 2019

Canceled

@tbg
Copy link
Member Author

tbg commented Feb 26, 2019

Oops, had triggered too early. Linter is happy now.

bors r=jordanlewis

craig bot pushed a commit that referenced this pull request Feb 26, 2019
35148: server/debug: expose panicparse's UI at /debug/pprof/goroutineui r=jordanlewis a=tbg

All but last commit is #35147.

----

[panicparse] is a nifty tool that preprocesses goroutine dumps with the
goal of making them more digestible. To do so, it groups "similar" stacks
and tries to highlight system vs user code.

The grouping in particular is helpful since the situations in which we
stare at goroutine dumps are often the same situations in which there are
tons of goroutines all over the place. And even in a happy cluster, our
thread pools show up with high multiplicity and occupy enormous amounts of
terminal real estate.

The UI sets some defaults that are hopefully sane. First, it won't try to
let panicparse rummage through the source files to improve the display of
arguments, as the source files won't be available in prod
(and trying to find them pp will log annoying messages). Second, we operate
at the most lenient similarity where two stack frames are considered
"similar" no matter what the arguments to the method are. This groups most
aggressively which I think is what we want, though if we find out otherwise
it's always easy to download the raw dump and to use panicparse locally.
Or, of course, we can plumb a GET parameter that lets you chose the
similarity strategy.

[panicparse]: https://github.com/maruel/panicparse/

Here's a sample:

![image](https://user-images.githubusercontent.com/5076964/53244768-271d0980-36ac-11e9-9c2c-8bae8a0896ba.png)


Release note (admin ui change): Provide a colorized and aggregate overview over
the active goroutines (at /debug/pprof/goroutineui), useful for internal
debugging.

35201: storage: maybe fix iterator leak on mtc test shutdown r=bdarnell a=tbg

In a multiTestContext during shut down, a node's stopper may close the
RocksDB engine while a request is still holding on to an iterator unless
we take proper precautions.
This is a whack-a-mole, but these failures seem to be rare, so hopefully
they vanish after this patch. This isn't a problem in production since
`(*Node).Batch` passes the request through a Stopper task already
(though this class of problem may well exist).

Fixes #35173.

Release note: None

35202: kv: generate debug output on marshaling panics r=jordanlewis a=tbg

We're seeing panics from within gogoproto marshaling that indicate that
we're mutating a protobuf while it's being marshaled (thus increasing
the size necessary to marshal it, which is not reflected in the slice
being marshaled into).

Since we haven't managed to figure this out just by thinking hard, it's
time to add some debugging into the mix.

Since this hasn't popped up during our testrace builds, I assume it's
either rare enough or just not tickled in any of the tests (many of
which don't even run under `race` because things get too slow).

My hope is that looking at the bytes we get ouf of this logging we'll
see something that looks out of place and can trace it down.

Touches #34241.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 26, 2019

Build succeeded

@craig craig bot merged commit ed6a303 into cockroachdb:master Feb 26, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 12, 2019
This disables a portion of the additional checking added in cockroachdb#35202 because it
increases the rate of panics. This case should be re-enabled with the complete
resolution of cockroachdb#34241.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 12, 2019
35663: kv: disable size check on marshaling to reduce panics r=ajwerner a=ajwerner

This disables a portion of the additional checking added in #35202 because it
increases the rate of panics. This case should be re-enabled with the complete
resolution of #34241.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@tbg tbg deleted the fix/scanreq-debug branch March 13, 2019 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants