server/debug: expose panicparse's UI at /debug/pprof/goroutineui #35148
Conversation
When pprofui CPU profiling is active, add the statement tag and anonymized statement string to the goroutine labels. For example, this is what you can see when running `./bin/workload run kv --read-percent 50 --init`:

```
$ pprof -seconds 10 http://localhost:8080/debug/pprof/ui/profile
[...]
(pprof) tags
 stmt.anonymized: Total 7.9s
                  4.0s (50.57%): UPSERT INTO kv(k, v) VALUES ($1, $2)
                  3.9s (49.43%): SELECT k, v FROM kv WHERE k IN ($1,)
 stmt.tag: Total 7.9s
           4.0s (50.57%): INSERT
           3.9s (49.43%): SELECT
```

The dot graphs are similarly annotated, though they require `dot` to be installed on the machine and thus won't be as useful on the pprofui itself.

Profile tags are not propagated across RPC boundaries. That is, a node may show high CPU as a result of SQL queries that did not originate at the node itself, and no labels will be available for that work. But as this diff shows, any moving part in the system can sniff whether profiling is active and add labels of its own, so in principle the recipient node could add the application name (or any other information propagated along with the transaction) and track down problems that way. We may also be able to add tags based on RangeIDs to identify ranges that cause high CPU load. The possibilities are endless, and with this infrastructure in place, it's trivial to iterate quickly on what's useful.

Closes cockroachdb#30930.

Release note (admin ui change): Running nodes can now be CPU profiled in a way that breaks down CPU usage by query (some restrictions apply).
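To illustrate the mechanism, here is a minimal sketch of attaching statement labels via the standard `runtime/pprof` label API. The label keys (`stmt.tag`, `stmt.anonymized`) match the output above, but `execStatement` and `profilingActive` are hypothetical stand-ins, not CockroachDB's actual plumbing.

```go
// Sketch only: execStatement and profilingActive are hypothetical names.
package sqlprofiling

import (
	"context"
	"runtime/pprof"
)

// execStatement runs work under pprof labels when CPU profiling is active,
// so that profile samples can be broken down by statement tag and by the
// anonymized statement string (the "stmt.tag" / "stmt.anonymized" keys seen
// in the pprof output above).
func execStatement(ctx context.Context, tag, anonymized string, work func(context.Context)) {
	if !profilingActive() {
		work(ctx)
		return
	}
	labels := pprof.Labels("stmt.tag", tag, "stmt.anonymized", anonymized)
	// pprof.Do applies the labels to the current goroutine while work runs;
	// goroutines spawned inside inherit them.
	pprof.Do(ctx, labels, work)
}

// profilingActive is a placeholder for however the server decides that a
// profiler is currently attached (e.g. a flag flipped by the pprofui handler).
func profilingActive() bool { return true }
```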
The UI URLs also work for pprof and have the nice side effect of letting the node add extra profiling information as applicable; if we exposed only the raw endpoints, that extra information would likely be forgotten. The raw endpoints are still there, so they can be used if necessary for some reason (though that reason would constitute a bug).

Release note: None
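As a rough illustration (not the actual CockroachDB handler), the UI endpoint can wrap the stock `net/http/pprof` handler and flip a flag while the profile is collected, which is what lets the rest of the server know it should attach labels. The mux path and the `cpuProfiling` variable are assumptions for this sketch.

```go
package debugui

import (
	"net/http"
	"net/http/pprof"
	"sync/atomic"
)

// cpuProfiling is a hypothetical flag that the rest of the server can consult
// (compare profilingActive in the previous sketch).
var cpuProfiling int32

// RegisterProfileUI wires up a /debug/pprof/ui/profile endpoint that marks
// profiling as active for the duration of the request and then delegates to
// the standard net/http/pprof CPU profile handler.
func RegisterProfileUI(mux *http.ServeMux) {
	mux.HandleFunc("/debug/pprof/ui/profile", func(w http.ResponseWriter, r *http.Request) {
		atomic.StoreInt32(&cpuProfiling, 1)
		defer atomic.StoreInt32(&cpuProfiling, 0)
		pprof.Profile(w, r) // collects a CPU profile for ?seconds=N
	})
}
```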
[panicparse] is a nifty tool that preprocesses goroutine dumps with the goal of making them more digestible. To do so, it groups "similar" stacks and tries to highlight system vs. user code.

The grouping in particular is helpful since the situations in which we stare at goroutine dumps are often the same situations in which there are tons of goroutines all over the place. And even in a happy cluster, our thread pools show up with high multiplicity and occupy enormous amounts of terminal real estate.

The UI sets some defaults that are hopefully sane. First, it won't let panicparse rummage through the source files to improve the display of arguments, as the source files won't be available in prod (and panicparse would log annoying messages while trying to find them). Second, we operate at the most lenient similarity, where two stack frames are considered "similar" no matter what the arguments to the method are. This groups most aggressively, which I think is what we want; if we find out otherwise, it's always easy to download the raw dump and run panicparse locally. Or, of course, we can plumb a GET parameter that lets you choose the similarity strategy.

[panicparse]: https://github.com/maruel/panicparse/

Release note: None
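For concreteness, here is a rough sketch of how a dump handler might drive panicparse, assuming the v1 `stack` package API that was current at the time (`ParseDump` with source-path guessing disabled, and `Aggregate` with the loosest `AnyValue` similarity); function and field names should be checked against the vendored version.

```go
package goroutineui

import (
	"bytes"
	"io/ioutil"

	"github.com/maruel/panicparse/stack"
)

// aggregate parses a raw goroutine dump (as produced by runtime.Stack) and
// groups the goroutines into buckets of "similar" stacks, ready to be
// rendered as HTML or text.
func aggregate(dump []byte) ([]*stack.Bucket, error) {
	// guesspaths=false: don't let panicparse search the local source tree,
	// since source files generally aren't present on production nodes.
	ctx, err := stack.ParseDump(bytes.NewReader(dump), ioutil.Discard, false /* guesspaths */)
	if err != nil {
		return nil, err
	}
	// AnyValue is the loosest similarity: frames match regardless of
	// argument values, which groups most aggressively.
	return stack.Aggregate(ctx.Goroutines, stack.AnyValue), nil
}
```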
PS @jordanlewis I have a WIP change that tags some of our standard goroutines so that they show up last when sorting by blocked-duration. But it's invasive enough that I'll polish it and send it out separately later.
Friendly ping, @jordanlewis.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @tbg)
pkg/server/debug/goroutineui/dump.go, line 33 at r6 (raw file):
```go
// We don't know how big the traces are, so grow a few times if they don't fit. Start large, though.
var trace []byte
for n := 1 << 20; /* 1mb */ n <= (1 << 29); /* 512mb */ n *= 2 {
```
512 mb seems giant, but maybe this really does happen in practice?
TFTR! I hope we won't see 512mb in practice, but one user has repeatedly seen 500k goroutines due to some fast-acting leak. It's unclear that this endpoint will do anything but kill the server in that case anyway, though. I'm open to changing that limit, but will merge for now.

bors r=jordanlewis
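For context, the quoted loop is the usual grow-and-retry pattern around `runtime.Stack`. A self-contained sketch of the whole capture follows; the function name is mine, and the body may differ in detail from the file under review.

```go
package goroutineui

import "runtime"

// captureAllStacks returns a dump of every goroutine's stack. runtime.Stack
// truncates its output if the buffer is too small and gives no size hint, so
// we retry with a doubled buffer until the dump fits, capping at 512 MB.
func captureAllStacks() []byte {
	var trace []byte
	for n := 1 << 20; /* 1mb */ n <= (1 << 29); /* 512mb */ n *= 2 {
		trace = make([]byte, n)
		nbytes := runtime.Stack(trace, true /* all goroutines */)
		if nbytes < len(trace) {
			return trace[:nbytes]
		}
	}
	return trace // may be truncated at 512 MB
}
```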
35148: server/debug: expose panicparse's UI at /debug/pprof/goroutineui r=jordanlewis a=tbg

All but the last commit is #35147.

----

[panicparse] is a nifty tool that preprocesses goroutine dumps with the goal of making them more digestible. To do so, it groups "similar" stacks and tries to highlight system vs. user code.

The grouping in particular is helpful since the situations in which we stare at goroutine dumps are often the same situations in which there are tons of goroutines all over the place. And even in a happy cluster, our thread pools show up with high multiplicity and occupy enormous amounts of terminal real estate.

The UI sets some defaults that are hopefully sane. First, it won't let panicparse rummage through the source files to improve the display of arguments, as the source files won't be available in prod (and panicparse would log annoying messages while trying to find them). Second, we operate at the most lenient similarity, where two stack frames are considered "similar" no matter what the arguments to the method are. This groups most aggressively, which I think is what we want; if we find out otherwise, it's always easy to download the raw dump and run panicparse locally. Or, of course, we can plumb a GET parameter that lets you choose the similarity strategy.

[panicparse]: https://github.com/maruel/panicparse/

Here's a sample:

Release note (admin ui change): Provide a colorized and aggregated overview of the active goroutines (at /debug/pprof/goroutineui), useful for internal debugging.

35201: storage: maybe fix iterator leak on mtc test shutdown r=bdarnell a=tbg

In a multiTestContext during shutdown, a node's stopper may close the RocksDB engine while a request is still holding on to an iterator unless we take proper precautions. This is whack-a-mole, but these failures seem to be rare, so hopefully they vanish after this patch. This isn't a problem in production since `(*Node).Batch` passes the request through a Stopper task already (though this class of problem may well exist).

Fixes #35173.

Release note: None

35202: kv: generate debug output on marshaling panics r=jordanlewis a=tbg

We're seeing panics from within gogoproto marshaling that indicate that we're mutating a protobuf while it's being marshaled (thus increasing the size necessary to marshal it, which is not reflected in the slice being marshaled into). Since we haven't managed to figure this out just by thinking hard, it's time to add some debugging into the mix.

Since this hasn't popped up during our testrace builds, I assume it's either rare enough or just not tickled in any of the tests (many of which don't even run under `race` because things get too slow). My hope is that by looking at the bytes we get out of this logging, we'll see something that looks out of place and be able to trace it down.

Touches #34241.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
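The #35202 change is only described at a high level here. The following is a hedged sketch of the general recover-and-dump pattern it alludes to; the names, the package, and the exact information logged are my assumptions, not the actual patch.

```go
package kvdebug

import "log"

// marshalerTo matches the MarshalTo method that gogoproto generates for
// every message.
type marshalerTo interface {
	MarshalTo(data []byte) (int, error)
}

// marshalWithDebug marshals msg into buf. If the generated marshaler panics
// (e.g. because the message was mutated while being marshaled and no longer
// fits the pre-sized buffer), it logs the message and buffer size before
// re-panicking, so the offending field can be spotted in the logs.
func marshalWithDebug(buf []byte, msg marshalerTo) (int, error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("marshaling panic: %v; buf len %d; message: %+v", r, len(buf), msg)
			panic(r)
		}
	}()
	return msg.MarshalTo(buf)
}
```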
Build succeeded
All but the last commit is #35147.
panicparse is a nifty tool that preprocesses goroutine dumps with the
goal of making them more digestible. To do so, it groups "similar" stacks
and tries to highlight system vs user code.
The grouping in particular is helpful since the situations in which we
stare at goroutine dumps are often the same situations in which there are
tons of goroutines all over the place. And even in a happy cluster, our
thread pools show up with high multiplicity and occupy enormous amounts of
terminal real estate.
The UI sets some defaults that are hopefully sane. First, it won't let
panicparse rummage through the source files to improve the display of
arguments, as the source files won't be available in prod (and panicparse
would log annoying messages while trying to find them). Second, we operate
at the most lenient similarity, where two stack frames are considered
"similar" no matter what the arguments to the method are. This groups most
aggressively, which I think is what we want; if we find out otherwise, it's
always easy to download the raw dump and run panicparse locally. Or, of
course, we can plumb a GET parameter that lets you choose the similarity
strategy.
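If the GET parameter idea were pursued, the mapping could be as small as the sketch below; the handler shape and parameter name are assumptions, and the similarity constants are from panicparse v1's `stack` package and should be checked against the vendored copy.

```go
package goroutineui

import (
	"net/http"

	"github.com/maruel/panicparse/stack"
)

// similarityFromRequest maps an optional ?si= query parameter onto a
// panicparse similarity level, defaulting to the most lenient grouping.
func similarityFromRequest(r *http.Request) stack.Similarity {
	switch r.URL.Query().Get("si") {
	case "exact-flags":
		return stack.ExactFlags
	case "exact-lines":
		return stack.ExactLines
	case "any-pointer":
		return stack.AnyPointer
	default:
		return stack.AnyValue
	}
}
```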
Here's a sample:
Release note (admin ui change): Provide a colorized and aggregated overview of
the active goroutines (at /debug/pprof/goroutineui), useful for internal
debugging.