
server/debug: expose panicparse's UI at /debug/pprof/goroutineui #35148

Merged
merged 6 commits into cockroachdb:master from fix/panicparse
Feb 26, 2019

Conversation

tbg
Member

@tbg tbg commented Feb 22, 2019

All but last commit is #35147.


[panicparse](https://github.com/maruel/panicparse/) is a nifty tool that preprocesses goroutine dumps with the
goal of making them more digestible. To do so, it groups "similar" stacks
and tries to highlight system vs. user code.

The grouping in particular is helpful since the situations in which we
stare at goroutine dumps are often the same situations in which there are
tons of goroutines all over the place. And even in a happy cluster, our
thread pools show up with high multiplicity and occupy enormous amounts of
terminal real estate.

The UI sets some defaults that are hopefully sane. First, it won't try to
let panicparse rummage through the source files to improve the display of
arguments, as the source files won't be available in prod
(and when trying to find them, panicparse will log annoying messages). Second,
we operate at the most lenient similarity, where two stack frames are considered
"similar" no matter what the arguments to the method are. This groups most
aggressively, which I think is what we want, though if we find out otherwise
it's always easy to download the raw dump and to use panicparse locally.
Or, of course, we can plumb a GET parameter that lets you choose the
similarity strategy.
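
For reference, a minimal sketch of these defaults, assuming panicparse v1's `stack` package API (`ParseDump`, `Aggregate`, and the `AnyValue` similarity constant) as I recall it; the fixed-size buffer is a simplification (the real code grows it, see the review thread below):

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"runtime"

	"github.com/maruel/panicparse/stack"
)

func main() {
	// Grab a dump of all goroutines. A fixed 1mb buffer keeps the sketch
	// short; real code would grow the buffer until the dump fits.
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true /* all goroutines */)

	// guesspaths=false: don't rummage through source files to improve the
	// display of arguments (they won't be available in prod).
	c, err := stack.ParseDump(bytes.NewReader(buf[:n]), ioutil.Discard, false)
	if err != nil {
		log.Fatal(err)
	}

	// AnyValue is the most lenient similarity: two frames are "similar" no
	// matter what the arguments to the method are, so grouping is maximal.
	for _, b := range stack.Aggregate(c.Goroutines, stack.AnyValue) {
		first := "<empty stack>"
		if calls := b.Signature.Stack.Calls; len(calls) > 0 {
			first = calls[0].Func.Raw
		}
		fmt.Printf("%4d goroutine(s) at %s\n", len(b.IDs), first)
	}
}
```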

Here's a sample:

![image](https://user-images.githubusercontent.com/5076964/53244768-271d0980-36ac-11e9-9c2c-8bae8a0896ba.png)

Release note (admin ui change): Provide a colorized and aggregated overview of
the active goroutines (at /debug/pprof/goroutineui), useful for internal
debugging.

@tbg tbg requested review from a team February 22, 2019 13:13
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg requested a review from jordanlewis February 25, 2019 11:01
tbg added 4 commits February 25, 2019 12:28
When pprofui CPU profiling is active, add the statement tag and
anonymized statement string to the goroutine labels.

For example, this is what you can see when running

    ./bin/workload run kv --read-percent 50 --init

```
$ pprof -seconds 10 http://localhost:8080/debug/pprof/ui/profile
[...]
(pprof) tags
 stmt.anonymized: Total 7.9s
                  4.0s (50.57%): UPSERT INTO kv(k, v) VALUES ($1, $2)
                  3.9s (49.43%): SELECT k, v FROM kv WHERE k IN ($1,)

 stmt.tag: Total 7.9s
           4.0s (50.57%): INSERT
           3.9s (49.43%): SELECT
```

The dot graphs are similarly annotated, though they require `dot` to be
installed on the machine and thus won't be as useful on the pprofui
itself.
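
The labels themselves go through the standard `runtime/pprof` API. Here's a minimal sketch of attaching a statement's tag and anonymized text (the wrapper function is hypothetical; only the `stmt.tag`/`stmt.anonymized` keys match the output above):

```go
package main

import (
	"context"
	"runtime/pprof"
)

// runWithStmtLabels executes fn with the statement's tag and anonymized
// text attached as profiler labels. CPU samples taken while fn runs are
// attributed to these labels, which is what produces the breakdown shown
// by `tags` above.
func runWithStmtLabels(ctx context.Context, tag, anonymized string, fn func(context.Context)) {
	labels := pprof.Labels("stmt.tag", tag, "stmt.anonymized", anonymized)
	pprof.Do(ctx, labels, fn)
}

func main() {
	runWithStmtLabels(context.Background(), "SELECT",
		"SELECT k, v FROM kv WHERE k IN ($1,)",
		func(ctx context.Context) {
			// ... execute the statement ...
		})
}
```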

Profile tags are not propagated across RPC boundaries. That is, a node
may have high CPU as a result of SQL queries not originating at the
node itself, and no labels will be available.

But perusing this diff, you may notice that any moving part in the
system can sniff whether profiling is active and add labels itself,
so in principle we could add the application name or any other
information that is propagated along with the transaction on the
recipient node, and track down problems that way.

We may also be able to add tags based on RangeIDs to identify ranges
which cause high CPU load. The possibilities are endless, and with this
infra in place, it's trivial to quickly iterate on what's useful.
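
As a sketch of that idea (all names here are hypothetical, not from this diff): a process-wide flag flipped by the profiling endpoint lets any code path cheaply decide whether it's worth attaching labels at all:

```go
package main

import (
	"context"
	"runtime/pprof"
	"sync/atomic"
)

// profilerActive would be flipped by the profiling endpoint for the
// duration of a CPU profile.
var profilerActive int32

// maybeLabel attaches a label to fn only while a profile is being taken,
// so the labeling overhead isn't paid in the common case.
func maybeLabel(ctx context.Context, key, value string, fn func(context.Context)) {
	if atomic.LoadInt32(&profilerActive) == 0 {
		fn(ctx)
		return
	}
	pprof.Do(ctx, pprof.Labels(key, value), fn)
}

func main() {
	atomic.StoreInt32(&profilerActive, 1) // the profiling endpoint would do this
	maybeLabel(context.Background(), "range", "r42", func(ctx context.Context) {
		// ... work attributed to range r42 while profiling is active ...
	})
}
```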

Closes cockroachdb#30930.

Release note (admin ui change): Running nodes can now be CPU profiled in
a way that breaks down CPU usage by query (some restrictions apply).
The UI URLs also work with pprof and have the nice side effect of letting
the node add extra profiling information as applicable; pointing users at
the raw endpoints instead would make that information easy to forget. The
raw endpoints are still there, so they can be used if necessary for some
reason (that reason would constitute a bug).

Release note: None
@tbg
Member Author

tbg commented Feb 25, 2019

PS @jordanlewis I have a WIP change that tags some of our standard goroutines so that they show up last when sorting by blocked-duration. But it's invasive enough that I'd rather polish it and send it out separately later.

@tbg
Member Author

tbg commented Feb 26, 2019

Friendly ping, @jordanlewis.

Member

@jordanlewis jordanlewis left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @tbg)


pkg/server/debug/goroutineui/dump.go, line 33 at r6 (raw file):

```go
// We don't know how big the traces are, so grow a few times if they don't fit. Start large, though.
var trace []byte
for n := 1 << 20; /* 1mb */ n <= (1 << 29); /* 512mb */ n *= 2 {
```

512 mb seems giant, but maybe this really does happen in practice?
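
For context, a self-contained sketch of the pattern those lines implement (stdlib only; everything beyond the quoted lines is reconstructed, not the file's actual contents):

```go
package main

import (
	"fmt"
	"runtime"
)

// dumpGoroutines captures the stacks of all goroutines. The required
// buffer size isn't known up front, so it retries with a doubled buffer,
// starting at 1mb and giving up at 512mb.
func dumpGoroutines() []byte {
	var trace []byte
	for n := 1 << 20; /* 1mb */ n <= (1 << 29); /* 512mb */ n *= 2 {
		trace = make([]byte, n)
		nbytes := runtime.Stack(trace, true /* all goroutines */)
		if nbytes < len(trace) {
			return trace[:nbytes] // the dump fit
		}
		// The buffer filled up completely, so the dump may be truncated.
		// Retry with a larger buffer.
	}
	return trace // best effort: a potentially truncated 512mb dump
}

func main() {
	fmt.Printf("captured %d bytes of goroutine stacks\n", len(dumpGoroutines()))
}
```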

@tbg
Member Author

tbg commented Feb 26, 2019

TFTR!

I hope we won't see 512mb in practice, but one user has repeatedly seen 500k goroutines due to some fast-acting leak. It's unclear that this endpoint will do anything but kill the server in that case anyway, though. I'm open to changing that limit, but will merge for now.

bors r=jordanlewis

craig bot pushed a commit that referenced this pull request Feb 26, 2019
35148: server/debug: expose panicparse's UI at /debug/pprof/goroutineui r=jordanlewis a=tbg

All but last commit is #35147.

----

35201: storage: maybe fix iterator leak on mtc test shutdown r=bdarnell a=tbg

In a multiTestContext during shutdown, a node's stopper may close the
RocksDB engine while a request is still holding on to an iterator,
unless we take proper precautions.
This is whack-a-mole, but these failures seem to be rare, so hopefully
they vanish after this patch. This isn't a problem in production, since
`(*Node).Batch` passes the request through a Stopper task already
(though this class of problem may well exist).

Fixes #35173.

Release note: None

35202: kv: generate debug output on marshaling panics r=jordanlewis a=tbg

We're seeing panics from within gogoproto marshaling that indicate that
we're mutating a protobuf while it's being marshaled (thus increasing
the size necessary to marshal it, which is not reflected in the slice
being marshaled into).

Since we haven't managed to figure this out just by thinking hard, it's
time to add some debugging into the mix.

Since this hasn't popped up during our testrace builds, I assume it's
either rare enough or just not tickled in any of the tests (many of
which don't even run under `race` because things get too slow).

My hope is that by looking at the bytes we get out of this logging, we'll
see something that looks out of place and can track it down.

Touches #34241.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
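
To make the suspected failure mode in 35202 concrete: gogoproto's generated code computes a message's size, allocates a buffer, and then writes the fields into it. Here's a hand-rolled sketch of that pattern (the `msg` type and all names are hypothetical, and the panic is timing-dependent):

```go
package main

import (
	"fmt"
	"time"
)

// msg stands in for a generated protobuf message.
type msg struct{ payload []byte }

func (m *msg) size() int { return len(m.payload) }

// marshalTo panics with a slice bounds error if m.payload grew after
// size() was computed -- the class of crash the logging above targets.
func (m *msg) marshalTo(buf []byte) int {
	return copy(buf[:len(m.payload)], m.payload)
}

func main() {
	m := &msg{payload: make([]byte, 8)}
	go func() {
		for { // illegal concurrent mutation of the message
			m.payload = append(m.payload, 0)
		}
	}()
	for i := 0; i < 1000; i++ {
		buf := make([]byte, m.size())
		m.marshalTo(buf)
		time.Sleep(time.Microsecond)
	}
	fmt.Println("no panic this run (the race is timing-dependent)")
}
```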
@craig
Contributor

craig bot commented Feb 26, 2019

Build succeeded

@craig craig bot merged commit 9d7ca0f into cockroachdb:master Feb 26, 2019
@tbg tbg deleted the fix/panicparse branch March 13, 2019 11:51