server/debug: expose panicparse's UI at /debug/pprof/goroutineui #35148
Conversation
When pprofui CPU profiling is active, add the statement tag and anonymized statement string to the goroutine labels. For example, this is what you can see when running `./bin/workload run kv --read-percent 50 --init`:

```
$ pprof -seconds 10 http://localhost:8080/debug/pprof/ui/profile
[...]
(pprof) tags
 stmt.anonymized: Total 7.9s
                  4.0s (50.57%): UPSERT INTO kv(k, v) VALUES ($1, $2)
                  3.9s (49.43%): SELECT k, v FROM kv WHERE k IN ($1,)
 stmt.tag: Total 7.9s
           4.0s (50.57%): INSERT
           3.9s (49.43%): SELECT
```

The dot graphs are similarly annotated, though they require `dot` to be installed on the machine and thus won't be as useful on the pprofui itself.

Profile tags are not propagated across RPC boundaries. That is, a node may show high CPU as a result of SQL queries that did not originate at the node itself, and no labels will be available for that work. But as this diff shows, any moving part in the system can sniff whether profiling is active and add labels of its own, so in principle the recipient node could add the application name (or any other information propagated along with the transaction) and track down problems that way. We may also be able to add tags based on RangeIDs to identify ranges that cause high CPU load. The possibilities are endless, and with this infrastructure in place, it's trivial to iterate quickly on what's useful.

Closes cockroachdb#30930.

Release note (admin ui change): Running nodes can now be CPU profiled in a way that breaks down CPU usage by query (some restrictions apply).
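To illustrate the mechanism, here is a minimal sketch of attaching statement labels via the standard `runtime/pprof` label API. The label keys (`stmt.tag`, `stmt.anonymized`) match the output above, but `execStatement` and `profilingActive` are hypothetical stand-ins, not CockroachDB's actual plumbing.

```go
// Sketch only: execStatement and profilingActive are hypothetical names.
package sqlprofiling

import (
	"context"
	"runtime/pprof"
)

// execStatement runs work under pprof labels when CPU profiling is active,
// so that profile samples can be broken down by statement tag and by the
// anonymized statement string (the "stmt.tag" / "stmt.anonymized" keys seen
// in the pprof output above).
func execStatement(ctx context.Context, tag, anonymized string, work func(context.Context)) {
	if !profilingActive() {
		work(ctx)
		return
	}
	labels := pprof.Labels("stmt.tag", tag, "stmt.anonymized", anonymized)
	// pprof.Do applies the labels to the current goroutine while work runs;
	// goroutines spawned inside inherit them.
	pprof.Do(ctx, labels, work)
}

// profilingActive is a placeholder for however the server decides that a
// profiler is currently attached (e.g. a flag flipped by the pprofui handler).
func profilingActive() bool { return true }
```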
The UI URLs also work for pprof and have the nice side effect of letting the node add extra profiling information as applicable; if we exposed only the raw endpoints, that extra information would likely be forgotten. The raw endpoints are still there, so they can be used if necessary for some reason (though that reason would constitute a bug).

Release note: None
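As a rough illustration (not the actual CockroachDB handler), the UI endpoint can wrap the stock `net/http/pprof` handler and flip a flag while the profile is collected, which is what lets the rest of the server know it should attach labels. The mux path and the `cpuProfiling` variable are assumptions for this sketch.

```go
package debugui

import (
	"net/http"
	"net/http/pprof"
	"sync/atomic"
)

// cpuProfiling is a hypothetical flag that the rest of the server can consult
// (compare profilingActive in the previous sketch).
var cpuProfiling int32

// RegisterProfileUI wires up a /debug/pprof/ui/profile endpoint that marks
// profiling as active for the duration of the request and then delegates to
// the standard net/http/pprof CPU profile handler.
func RegisterProfileUI(mux *http.ServeMux) {
	mux.HandleFunc("/debug/pprof/ui/profile", func(w http.ResponseWriter, r *http.Request) {
		atomic.StoreInt32(&cpuProfiling, 1)
		defer atomic.StoreInt32(&cpuProfiling, 0)
		pprof.Profile(w, r) // collects a CPU profile for ?seconds=N
	})
}
```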
[panicparse] is a nifty tool that preprocesses goroutine dumps with the goal of making them more digestible. To do so, it groups "similar" stacks and tries to highlight system vs. user code.

The grouping in particular is helpful since the situations in which we stare at goroutine dumps are often the same situations in which there are tons of goroutines all over the place. And even in a happy cluster, our thread pools show up with high multiplicity and occupy enormous amounts of terminal real estate.

The UI sets some defaults that are hopefully sane. First, it won't let panicparse rummage through the source files to improve the display of arguments, as the source files won't be available in prod (and panicparse would log annoying messages while trying to find them). Second, we operate at the most lenient similarity, where two stack frames are considered "similar" no matter what the arguments to the method are. This groups most aggressively, which I think is what we want; if we find out otherwise, it's always easy to download the raw dump and run panicparse locally. Or, of course, we can plumb a GET parameter that lets you choose the similarity strategy.

[panicparse]: https://github.com/maruel/panicparse/

Release note: None
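For concreteness, here is a rough sketch of how a dump handler might drive panicparse, assuming the v1 `stack` package API that was current at the time (`ParseDump` with source-path guessing disabled, and `Aggregate` with the loosest `AnyValue` similarity); function and field names should be checked against the vendored version.

```go
package goroutineui

import (
	"bytes"
	"io/ioutil"

	"github.com/maruel/panicparse/stack"
)

// aggregate parses a raw goroutine dump (as produced by runtime.Stack) and
// groups the goroutines into buckets of "similar" stacks, ready to be
// rendered as HTML or text.
func aggregate(dump []byte) ([]*stack.Bucket, error) {
	// guesspaths=false: don't let panicparse search the local source tree,
	// since source files generally aren't present on production nodes.
	ctx, err := stack.ParseDump(bytes.NewReader(dump), ioutil.Discard, false /* guesspaths */)
	if err != nil {
		return nil, err
	}
	// AnyValue is the loosest similarity: frames match regardless of
	// argument values, which groups most aggressively.
	return stack.Aggregate(ctx.Goroutines, stack.AnyValue), nil
}
```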
PS @jordanlewis I have a WIP change that tags some of our standard goroutines so that they show up last when sorting by blocked-duration. But it's invasive enough that I'll polish it and send it out separately later.
Friendly ping, @jordanlewis.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @tbg)
pkg/server/debug/goroutineui/dump.go, line 33 at r6 (raw file):
```go
// We don't know how big the traces are, so grow a few times if they don't fit. Start large, though.
var trace []byte
for n := 1 << 20; /* 1mb */ n <= (1 << 29); /* 512mb */ n *= 2 {
```
512 mb seems giant, but maybe this really does happen in practice?
TFTR! I hope we won't see 512mb in practice, but one user has repeatedly seen 500k goroutines due to some fast-acting leak. It's unclear that this endpoint will do anything but kill the server in that case anyway, though. I'm open to changing that limit, but will merge for now.

bors r=jordanlewis
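For context, the quoted loop is the usual grow-and-retry pattern around `runtime.Stack`. A self-contained sketch of the whole capture follows; the function name is mine, and the body may differ in detail from the file under review.

```go
package goroutineui

import "runtime"

// captureAllStacks returns a dump of every goroutine's stack. runtime.Stack
// truncates its output if the buffer is too small and gives no size hint, so
// we retry with a doubled buffer until the dump fits, capping at 512 MB.
func captureAllStacks() []byte {
	var trace []byte
	for n := 1 << 20; /* 1mb */ n <= (1 << 29); /* 512mb */ n *= 2 {
		trace = make([]byte, n)
		nbytes := runtime.Stack(trace, true /* all goroutines */)
		if nbytes < len(trace) {
			return trace[:nbytes]
		}
	}
	return trace // may be truncated at 512 MB
}
```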
35148: server/debug: expose panicparse's UI at /debug/pprof/goroutineui r=jordanlewis a=tbg

All but the last commit is #35147.

----

[panicparse] is a nifty tool that preprocesses goroutine dumps with the goal of making them more digestible. To do so, it groups "similar" stacks and tries to highlight system vs. user code.

The grouping in particular is helpful since the situations in which we stare at goroutine dumps are often the same situations in which there are tons of goroutines all over the place. And even in a happy cluster, our thread pools show up with high multiplicity and occupy enormous amounts of terminal real estate.

The UI sets some defaults that are hopefully sane. First, it won't let panicparse rummage through the source files to improve the display of arguments, as the source files won't be available in prod (and panicparse would log annoying messages while trying to find them). Second, we operate at the most lenient similarity, where two stack frames are considered "similar" no matter what the arguments to the method are. This groups most aggressively, which I think is what we want; if we find out otherwise, it's always easy to download the raw dump and run panicparse locally. Or, of course, we can plumb a GET parameter that lets you choose the similarity strategy.

[panicparse]: https://github.com/maruel/panicparse/

Here's a sample:

Release note (admin ui change): Provide a colorized and aggregated overview of the active goroutines (at /debug/pprof/goroutineui), useful for internal debugging.

35201: storage: maybe fix iterator leak on mtc test shutdown r=bdarnell a=tbg

In a multiTestContext during shutdown, a node's stopper may close the RocksDB engine while a request is still holding on to an iterator unless we take proper precautions. This is whack-a-mole, but these failures seem to be rare, so hopefully they vanish after this patch. This isn't a problem in production since `(*Node).Batch` passes the request through a Stopper task already (though this class of problem may well exist).

Fixes #35173.

Release note: None

35202: kv: generate debug output on marshaling panics r=jordanlewis a=tbg

We're seeing panics from within gogoproto marshaling that indicate that we're mutating a protobuf while it's being marshaled (thus increasing the size necessary to marshal it, which is not reflected in the slice being marshaled into). Since we haven't managed to figure this out just by thinking hard, it's time to add some debugging into the mix.

Since this hasn't popped up during our testrace builds, I assume it's either rare enough or just not tickled in any of the tests (many of which don't even run under `race` because things get too slow). My hope is that by looking at the bytes we get out of this logging, we'll see something that looks out of place and be able to trace it down.

Touches #34241.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
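The #35202 change is only described at a high level here. The following is a hedged sketch of the general recover-and-dump pattern it alludes to; the names, the package, and the exact information logged are my assumptions, not the actual patch.

```go
package kvdebug

import "log"

// marshalerTo matches the MarshalTo method that gogoproto generates for
// every message.
type marshalerTo interface {
	MarshalTo(data []byte) (int, error)
}

// marshalWithDebug marshals msg into buf. If the generated marshaler panics
// (e.g. because the message was mutated while being marshaled and no longer
// fits the pre-sized buffer), it logs the message and buffer size before
// re-panicking, so the offending field can be spotted in the logs.
func marshalWithDebug(buf []byte, msg marshalerTo) (int, error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("marshaling panic: %v; buf len %d; message: %+v", r, len(buf), msg)
			panic(r)
		}
	}()
	return msg.MarshalTo(buf)
}
```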
Build succeeded
All but the last commit is #35147.
panicparse is a nifty tool that preprocesses goroutine dumps with the
goal of making them more digestible. To do so, it groups "similar" stacks
and tries to highlight system vs user code.
The grouping in particular is helpful since the situations in which we
stare at goroutine dumps are often the same situations in which there are
tons of goroutines all over the place. And even in a happy cluster, our
thread pools show up with high multiplicity and occupy enormous amounts of
terminal real estate.
The UI sets some defaults that are hopefully sane. First, it won't let
panicparse rummage through the source files to improve the display of
arguments, as the source files won't be available in prod (and panicparse
would log annoying messages while trying to find them). Second, we operate
at the most lenient similarity, where two stack frames are considered
"similar" no matter what the arguments to the method are. This groups most
aggressively, which I think is what we want; if we find out otherwise, it's
always easy to download the raw dump and run panicparse locally. Or, of
course, we can plumb a GET parameter that lets you choose the similarity
strategy.
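If the GET parameter idea were pursued, the mapping could be as small as the sketch below; the handler shape and parameter name are assumptions, and the similarity constants are from panicparse v1's `stack` package and should be checked against the vendored copy.

```go
package goroutineui

import (
	"net/http"

	"github.com/maruel/panicparse/stack"
)

// similarityFromRequest maps an optional ?si= query parameter onto a
// panicparse similarity level, defaulting to the most lenient grouping.
func similarityFromRequest(r *http.Request) stack.Similarity {
	switch r.URL.Query().Get("si") {
	case "exact-flags":
		return stack.ExactFlags
	case "exact-lines":
		return stack.ExactLines
	case "any-pointer":
		return stack.AnyPointer
	default:
		return stack.AnyValue
	}
}
```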
Here's a sample:
Release note (admin ui change): Provide a colorized and aggregated overview of
the active goroutines (at /debug/pprof/goroutineui), useful for internal
debugging.