Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cli/start: ensure startup panics are reported to stderr #91823

Merged
merged 3 commits into from
Nov 24, 2022

Conversation

knz
Copy link
Contributor

@knz knz commented Nov 14, 2022

Fixes #91700.

This fixes a regression introduced in
fc7bfdf (in 2017!)

The recover() call from RecoverAndReportPanic doesn't work unless the function is called via defer directly. It can't be called from another function e.g. a closure with defer func() { ... myPanicCatcher() ... }.

This commit clarifies this in the API docs for the logcrash package and fixes the only call site that was incorrect.

Release note (bug fix): Server crashes that occur during startup are now more clearly reported in logs and the standard error output.

@knz knz requested review from andreimatei, tbg and a team November 14, 2022 12:43
@knz knz requested review from a team as code owners November 14, 2022 12:43
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz force-pushed the 20221114-start-panics branch from 76e9501 to 23927cf Compare November 14, 2022 12:43
@knz knz force-pushed the 20221114-start-panics branch from 23927cf to bcc243f Compare November 14, 2022 12:48
@knz knz requested a review from a team as a code owner November 14, 2022 13:34
@knz knz requested a review from a team November 14, 2022 13:34
@knz knz requested a review from a team as a code owner November 14, 2022 13:34
@knz
Copy link
Contributor Author

knz commented Nov 14, 2022

ci barking on unrelated flake #91828.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/start.go line 662 at r1 (raw file):

				// The settings.Values, if available, determines whether a Sentry
				// report should be sent. No Sentry report is sent if sv is nil.
				logcrash.ReportPanic(ctx, sv, r, 1)

pls comment the 1 inline


pkg/util/log/logcrash/crash_reporting.go line 137 at r1 (raw file):

// RecoverAndReportNonfatalPanic is an alternative RecoverAndReportPanic that
// does not re-panic in Release builds.
// sv may not be nil.

Isn't this an unusual comment?

Code quote:

sv may not be nil.

pkg/util/log/logcrash/crash_reporting.go line 153 at r1 (raw file):

// does not re-panic in release builds.
// sv may not be nil.
func ReportNonfatalPanic(ctx context.Context, sv *settings.Values, r interface{}, depth int) {

consider adding a comment about r


pkg/util/stop/stopper.go line 552 at r4 (raw file):

	// avoids stalls and helps some tests in `./cli` finish cleanly (where
	// panics happen on purpose).
	if r := recover(); r != nil {

This whole recovery business in the stopper strikes me as dubious. Out of the motivations in the comment above, I only buy "avoid stalls". I think we can achieve that with putting a log.Fatal() inside this recover block, and nothing else. I would get rid of the new EmergencyStop; make the one caller outside of this file also log.Fatal() instead.


pkg/sql/rowexec/processor_utils_test.go line 222 at r3 (raw file):

		panic(r)
	} else {
		p.config.FlowCtx.EvalCtx.Stop(ctx)

here aren't you breaking your _must_ be called in a defer... comment? Consider relaxing that comment.

@knz knz force-pushed the 20221114-start-panics branch from 8183c5e to b2356f2 Compare November 14, 2022 19:48
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/cli/start.go line 662 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pls comment the 1 inline

Done.


pkg/util/log/logcrash/crash_reporting.go line 137 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Isn't this an unusual comment?

I don't understand your comment.


pkg/util/log/logcrash/crash_reporting.go line 153 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider adding a comment about r

Done.


pkg/util/stop/stopper.go line 552 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This whole recovery business in the stopper strikes me as dubious. Out of the motivations in the comment above, I only buy "avoid stalls". I think we can achieve that with putting a log.Fatal() inside this recover block, and nothing else. I would get rid of the new EmergencyStop; make the one caller outside of this file also log.Fatal() instead.

I don't like the idea of log.Fatal for use in unit tests.
Another argument against it is that it prevents the leak prevention checker from running.


pkg/sql/rowexec/processor_utils_test.go line 222 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

here aren't you breaking your _must_ be called in a defer... comment? Consider relaxing that comment.

Done.

@knz knz force-pushed the 20221114-start-panics branch from b2356f2 to 65af6c6 Compare November 14, 2022 20:07
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/sql/sem/eval/context.go line 18 at r7 (raw file):

	"time"

	apd "github.com/cockroachdb/apd/v3"

see if this is a good change. I don't have a problem with it, but maybe your editor is not like most people's if it added it (the vast majority of importers to not have the alias)?


pkg/sql/sem/eval/context.go line 439 at r7 (raw file):

// EmergencyStop should be called in the recover() branch of a context
// shutdown when `defer evalCtx.Stop()` is not possible/desirable.
func (ec *Context) EmergencyStop(c context.Context) {

similar to the stopper discussion, I would not add this and leave the handful of tests that use it vulnerable to a bogus panic about some account being closed without memory returned to it (or whatever TestingMon.EmergencyStop() is supposed to help with).


pkg/util/log/logcrash/crash_reporting.go line 137 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I don't understand your comment.

I mean generally speaking, pointer args are assumed to be non-nil, unless otherwise specified. Is this sv special enough to warrant a comment? (I definitely did not mean adding more words to the comment :) )
I think you did have a comment about some other sv somewhere being potentially nil, but it's not in this commit, and it's probably not reason enough to pollute all these other functions.


pkg/util/log/logcrash/crash_reporting.go line 154 at r5 (raw file):

// the recovered panic object.
// The caller is responsible for ensuring that the argument sv is non-nil.
func ReportNonfatalPanic(ctx context.Context, sv *settings.Values, r interface{}, depth int) {

does anyone call this function? I would get rid of RecoverAndReportNonfatalPanic() (I consider it harmful), rather than introducing (or at least exporting) more brothers to it.


pkg/util/stop/stopper.go line 552 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I don't like the idea of log.Fatal for use in unit tests.
Another argument against it is that it prevents the leak prevention checker from running.

What I'm actually bothered by here is the introduction of EmergecyStop; otherwise I don't necessarily have a problem with the old code. But EmergencyStop I think is too complicated; it's hard to understand when one needs to call it and why, etc. And it leads to a cascade of uncertainty - for example, does transientCluster.Close() now need a comment to say that it should only be used in a defer? EmergencyStop is only used in transientCluster.Close(), which is used only by two tests. I say let those tests vulnerable to the danger of the Stopper.Stop() call deadlocking, and don't touch this code at all.

One thing I would do to this code, though, is unify this recover block with the recovery part of the defer above. I think the

		if r := recover(); r != nil {
			s.handlePanic(ctx, r)
		}

part can be massaged into this block. As written, it's quite hard to keep track of what's going on.


pkg/util/stop/stopper.go line 591 at r8 (raw file):

//	     stopper.Stop(ctx)
//	}()
func (s *Stopper) EmergencyStop(ctx context.Context) {

sprinkle some comment saying that this is best effort

@knz knz force-pushed the 20221114-start-panics branch from 65af6c6 to 40af148 Compare November 15, 2022 18:51
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @tbg)


pkg/sql/sem/eval/context.go line 439 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

similar to the stopper discussion, I would not add this and leave the handful of tests that use it vulnerable to a bogus panic about some account being closed without memory returned to it (or whatever TestingMon.EmergencyStop() is supposed to help with).

Done.


pkg/util/log/logcrash/crash_reporting.go line 137 at r1 (raw file):

unless otherwise specified. Is this sv special enough to warrant a comment?

Yes, because some other functions in this file support sv being nil, I found it coherent to clarify when it must not be nil so that there's no open question.


pkg/util/log/logcrash/crash_reporting.go line 154 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

does anyone call this function? I would get rid of RecoverAndReportNonfatalPanic() (I consider it harmful), rather than introducing (or at least exporting) more brothers to it.

Yes alas we have a few calls to it remaining.


pkg/util/stop/stopper.go line 552 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

What I'm actually bothered by here is the introduction of EmergecyStop; otherwise I don't necessarily have a problem with the old code. But EmergencyStop I think is too complicated; it's hard to understand when one needs to call it and why, etc. And it leads to a cascade of uncertainty - for example, does transientCluster.Close() now need a comment to say that it should only be used in a defer? EmergencyStop is only used in transientCluster.Close(), which is used only by two tests. I say let those tests vulnerable to the danger of the Stopper.Stop() call deadlocking, and don't touch this code at all.

One thing I would do to this code, though, is unify this recover block with the recovery part of the defer above. I think the

		if r := recover(); r != nil {
			s.handlePanic(ctx, r)
		}

part can be massaged into this block. As written, it's quite hard to keep track of what's going on.

I've re-written this bit. Do you like it better?


pkg/util/stop/stopper.go line 591 at r8 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

sprinkle some comment saying that this is best effort

Done.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/util/log/logcrash/crash_reporting.go line 154 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Yes alas we have a few calls to it remaining.

a few calls to what? This ReportNonfatalPanic that you're introducing now only has the one caller above. I would not introduce it, or at least not export it.


pkg/util/stop/stopper.go line 552 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I've re-written this bit. Do you like it better?

I like that EmergencyStop is gone.

Other than that, I personally like the old comment around here better. You've moved the "avoids stalls" part around, which I think is the important one; in the old phrasing, it was pretty clear that the "avoid stalls" refers to the "would likely block" - so the stalls are about stopper tasks not terminating. In the new phrasing, it's not clear what these stalls refer to - are they internal to the stopper, or not.

I continue to stand by the suggestion to merge this block with the defered recover above. I understood that, they way the defered one is written, it will catch panics from the Close() calls below, but that comes at the cost of convoluted code and I don't think it's worth it. We've lived without catching those panics forever cause the code was broken (which you're fixing), right? I say we continue to live without it. In tests, such a panic will be caught by leaktest, for better or worse, and the goroutine leak detection will be inhibited.

I don't care too much about all this though, so feel free to do what you want if you disagree.

@tbg tbg removed their request for review November 16, 2022 08:47
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)


pkg/util/stop/stopper.go line 552 at r4 (raw file):

I continue to stand by the suggestion to merge this block with the defered recover above. I understood that, they way the defered one is written, it will catch panics from the Close() calls below, but that comes at the cost of convoluted code and I don't think it's worth it. We've lived without catching those panics forever cause the code was broken (which you're fixing), right? I say we continue to live without it. In tests, such a panic will be caught by leaktest, for better or worse, and the goroutine leak detection will be inhibited.

Instead of the defered recover above that catches panics from closers in the case when Stopper.Stop() is not called while unwinding from a panic, what I would find useful would be catching panics in go c.Close() below (inside the other goroutine) - so when the closers are called on this panic recovery path. Cause that's when the damn memory monitors panic because memory hasn't been release to them. And when they panic, that panic ends up printed, and the original panic doesn't make it to the output.
This just happened to me now, and it's not the first time.

@knz
Copy link
Contributor Author

knz commented Nov 16, 2022

ok i can do this too

@knz knz force-pushed the 20221114-start-panics branch from 40af148 to b0c7db7 Compare November 17, 2022 12:36
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did one more round of improvements. PTAL.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/util/log/logcrash/crash_reporting.go line 154 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

a few calls to what? This ReportNonfatalPanic that you're introducing now only has the one caller above. I would not introduce it, or at least not export it.

Done.


pkg/util/stop/stopper.go line 552 at r4 (raw file):

it was pretty clear that the "avoid stalls" refers to the "would likely block" - so the stalls are about stopper tasks not terminating.

Gotcha. I restored that part.

what I would find useful would be catching panics in go c.Close() below (inside the other goroutine)

Done.

@knz knz force-pushed the 20221114-start-panics branch 2 times, most recently from 31680bd to 2e2fb5f Compare November 17, 2022 12:43
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz)


pkg/util/stop/stopper.go line 552 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

it was pretty clear that the "avoid stalls" refers to the "would likely block" - so the stalls are about stopper tasks not terminating.

Gotcha. I restored that part.

what I would find useful would be catching panics in go c.Close() below (inside the other goroutine)

Done.

Last pitch for removing the deferred recover on line 541.


pkg/util/stop/stopper.go line 580 at r16 (raw file):

			// panic recovery which will ensure panic objects go to logs.
			defer s.recover(ctx)
			s.Quiesce(ctx)

Quiesce has its own recover inside, doesn't it? Which, btw, I'd get rid of, along with getting rid of the defer s.recover here; I don't think they're worth having. I don't think Quiesce is likely to ever panic.
Short of getting rid of them, I'd make it such that, when called on this recovery path, Quiesce doesn't re-panic on panics it recovers internally. Because what we want to see is the original panic - see below.


pkg/util/stop/stopper.go line 598 at r16 (raw file):

		// (If we let the panic below happen too early, the async
		// errors from Quiesce/Close may never get a chance to get reported.)
		time.Sleep(2 * time.Second)

I think this sleep makes sense on its own (although I'd make it more like 5ms instead of 2s for the common enough case where every test in a pkg starts panicking; I don't want to wait forever for that run), but I think its effects on what gets printed are the wrong way around. What we really want to see is the original panic (r). Seeing other panics from the closers is secondary. With this sleep, we're going to see only the closer panic, right?
What I'd do is print r as soon as we recover it, and then perhaps also pass it into s.recover to be printed alongside every other panic from closers, and also I wouldn't repanic on closer panics.

@knz knz force-pushed the 20221114-start-panics branch 2 times, most recently from 8927371 to 923116d Compare November 18, 2022 13:35
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we're getting there soon.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/util/stop/stopper.go line 552 at r4 (raw file):

Instead of the defered recover above that catches panics from closers in the case when Stopper.Stop() is not called while unwinding from a panic, what I would find useful would be catching panics in go c.Close() below (inside the other goroutine) - so when the closers are called on this panic recovery path

I've re-read this and I believe we need both.
The above defer (for when Stop is not called while unwinding), applies for panics caused in the synchronous path (at the end of the function). We do need to recover from (and log) those.

Meanwhile, we also need the recover/log for the async panics below (as you suggested) for the recover that applies during unwinding. IMHO they are distinct.

Last pitch for removing the deferred recover on line 541.

I've peeled it out of the way where it wasn't needed any more. Let me know what you think.


pkg/util/stop/stopper.go line 580 at r16 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Quiesce has its own recover inside, doesn't it? Which, btw, I'd get rid of, along with getting rid of the defer s.recover here; I don't think they're worth having. I don't think Quiesce is likely to ever panic.
Short of getting rid of them, I'd make it such that, when called on this recovery path, Quiesce doesn't re-panic on panics it recovers internally. Because what we want to see is the original panic - see below.

Let's keep the one in Quiesce.

I agree we want to let the original panic appear in logs. But we can achieve this by reporting it earlier (as per your comment below). I've changed that.

Meanwhile I think it's important to let all the panic objects propagate however. If there's a panic under Quiesce/Close, that's non-recoverable even if the code "above" Stop() decides to absorb its own panic.


pkg/util/stop/stopper.go line 598 at r16 (raw file):

its effects on what gets printed are the wrong way around

Good catch. I added the logging as suggested.

Again I don't think we should avoid the re-panic. They are signalling something bad and we shouldn't paper over them.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: but I would not backport this to any releases.

but :), I'd like to drag @tbg into this and argue about the past. I think the panic recovery that was added in #45983 is a bad idea. If that wasn't there, I would push for this here patch to look differently. So I want to see if you'd be fine with removing it.
No other leaktest on the Internet recovers panics, and generally I don't think any Go testing framework does. Recovering panics in tests is not really a thing because you can only do it, at best, on the test's main goroutine - a panic on any other goroutine will still kill the package run. With our complicated code, even catching panics on the main goroutine doesn't work reliably - because such a recovered panic will frequently cause a secondary panic on a Closer registered with the Stopper (e.g. a memory account being closed without all the memory returned to it). Also, recovering panics leads to leaked goroutines, causes us to disable the leaktest for future tests, and generally just leads to pollution. Plus, putting this in the leaktest is too magic. It took me a good while to figure out why some tests had one behavior and others didn't, and I was very surprised to find this hidden recover.
I say we stop doing it - at least until we get annoyed enough by panics again.
WDYT?

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)


pkg/util/log/logcrash/crash_reporting.go line 154 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Done.

inline this function into the caller


pkg/util/stop/stopper.go line 552 at r4 (raw file):

Last pitch for removing the deferred recover on line 541.

I've peeled it out of the way where it wasn't needed any more. Let me know what you think.

I think what you have now is better, thanks.


pkg/util/stop/stopper.go line 580 at r16 (raw file):

Meanwhile I think it's important to let all the panic objects propagate however. If there's a panic under Quiesce/Close, that's non-recoverable even if the code "above" Stop() decides to absorb its own panic.

Let's separate the Quiesce from the Closers.

  • Quiesce simply does not panic. There are no callbacks in there. It doesn't need any recovery logic (the recover() call inside it, and the sleep above). The recover call inside it was added by excessive zeal.
  • For the Closers, I don't really agree that they're panics happening when we're already recovering from a panic need to be re-panicked. A panic in a Closer is not any less recoverable than the panic that's stopping the stopper; if anything, it is more recoverable. I think the panic recovery that we do in the leaktest is a bad idea but, if we like that recovery, then I think we definitely do not want it hindered by a panicking closer.

pkg/util/stop/stopper.go line 540 at r17 (raw file):

	}

	// In any case, this Stopper is done for once we return from here.

I wouldn't add this comment. The "in any case" will only confuse most readers.


pkg/util/stop/stopper.go line 558 at r17 (raw file):

		// quiesce and closes asynchronously. This hopefully causes a
		// number of routines to still terminate, which simplifies
		// subsequent goroutine dumps. It may  avoids stalls and helps

double space before "avoids"


pkg/util/stop/stopper.go line 573 at r17 (raw file):

			// below, and cause the go runtime to terminate the process too
			// early. So we wait a little on the return path.
			defer time.Sleep(50 * time.Millisecond)

I would not add this sleep; I don't think it's worth it - Quiesce can't really panic. Any line of code we can remove from around here I think is worth removing.


pkg/util/stop/stopper.go line 605 at r17 (raw file):

	// is catching a panic already encountered in the caller.

	// NB: Quiesce has its own panic recovery/logging code. No need to

I wouldn't add this NB :P


pkg/util/stop/stopper.go line 609 at r17 (raw file):

	s.Quiesce(ctx)

	// The following defer handles panics encountered on the synchronous

nit: this comment doesn't add anything (the code is straight-forward to read without it), and has the potential to become stale. Consider removing.

@knz knz force-pushed the 20221114-start-panics branch 2 times, most recently from f3d2856 to f203419 Compare November 23, 2022 15:50
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

wrt backports: I suggest we exclude the last commit (the one under discussion) from the backports.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


pkg/util/log/logcrash/crash_reporting.go line 154 at r5 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

inline this function into the caller

Done.


pkg/util/stop/stopper.go line 580 at r16 (raw file):

I don't really agree that they're panics happening when we're already recovering from a panic need to be re-panicked.

Now I see what you mean. Since we're already going to propagate a panic object (the original one), we may as well eat up the Closer panics instead of letting them propagate to the go runtime. I'm OK with that.

Done.

Quiesce: The recover call inside it was added by excessive zeal.

I checked the git log and it was there from the beginning. I think there's wisdom in regularity -- it makes the code easier to read and avoids raising the question "why do some methods recover panics and others not".


pkg/util/stop/stopper.go line 598 at r16 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

its effects on what gets printed are the wrong way around

Good catch. I added the logging as suggested.

Again I don't think we should avoid the re-panic. They are signalling something bad and we shouldn't paper over them.

I checked manually and 50ms or less was too short. With 100ms we're good. This is non-ordinary code anyway.


pkg/util/stop/stopper.go line 540 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I wouldn't add this comment. The "in any case" will only confuse most readers.

Done.


pkg/util/stop/stopper.go line 558 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

double space before "avoids"

Done.


pkg/util/stop/stopper.go line 573 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I would not add this sleep; I don't think it's worth it - Quiesce can't really panic. Any line of code we can remove from around here I think is worth removing.

Done.


pkg/util/stop/stopper.go line 605 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I wouldn't add this NB :P

Done.


pkg/util/stop/stopper.go line 609 at r17 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: this comment doesn't add anything (the code is straight-forward to read without it), and has the potential to become stale. Consider removing.

Done.

@knz knz force-pushed the 20221114-start-panics branch from f203419 to f2f9870 Compare November 23, 2022 17:14
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

but for the record, I think this commit had it right, and all the panic recovery that came to the stopper afterwards was misguided. My opinion is that we should go back to a simpler time - we should stop recovering in leaktest, and we should figure out something else, out of the regular stopper.Stop() code, for the cli tests that cause panics (which I think were the original reason why some recovery was added to the stopper).
You never said where you stand on the leaktest, and I am curious.
I wish @tbg would also opine, as the one who introduced the leaktest recovery.


about backports, as a matter of practice, I'd ask why should we backport anything here? The default should be a pretty strong no backport. The first thing you need to tick off on the backport PR says

Patches should only be created for serious issues or test-only changes.

Is that met here? Me complaining that I don't get the output I want when I'm introducing server startup bugs during development does not count.
I don't take that backport checklist too seriously at time (but I also almost never backport anything), but I don't see a clear reason to backport this. If you do, however, I'd approve the first commit; I think the rest are clearly unjustifiable.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)


pkg/util/stop/stopper.go line 542 at r19 (raw file):

	defer func() {
		if r := recover(); r != nil {
			s.handlePanic(ctx, r)

this change on its own I think has the problem that unregister(s) is no longer called when stopper.Stop() is called while unwinding from a panic. That sounds bad?
This is solved in the later commit, and I think this commit should be squashed anyway as I say elsewhere.


pkg/util/stop/stopper.go line 540 at r21 (raw file):
the comments on the OnPanic options say

// When Stop() is invoked during stack unwinding, OnPanic is also invoked, but
// Stop() may not have carried out its duties.

That's no longer true, is it? (I don't know if it was ever true)
The OnPanic option is not used by anyone; let's take the opportunity to delete it - particularly since it's getting hard to explain when the thing runs and when it doesn't.


pkg/util/stop/stopper.go line 608 at r21 (raw file):

	s.Quiesce(ctx)

	defer func() {

This can be defer s.recover(), right? And in that case, I think there's no more use of the handlePanic method introduced in a prior commit; let's remove it. Consider squashing that commit with this one.

knz added 3 commits November 24, 2022 15:00
This fixes a regression introduced in
fc7bfdf (in 2017!)

The `recover()` call from `RecoverAndReportPanic` doesn't work unless
the function is called via `defer` directly. It can't be called from
another function e.g. a closure with `defer func() {
... myPanicCatcher() ... }`.

This commit clarifies this in the API docs for the `logcrash` package
and fixes the only call site that was incorrect.

Release note (bug fix): Server crashes that occur during startup are
now more clearly reported in logs and the standard error output.
@knz knz force-pushed the 20221114-start-panics branch from f2f9870 to 934012f Compare November 24, 2022 14:00
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this commit had it right

LOL, that's been a while.
image.png

which I think were the original reason why some recovery was added to the stopper

It's not just that - we also wanted to capture all panics happening during server shutdown and ensure they appear in logs. Which, because of the go runtime vs stderr kerfduddle, wasn't happening on its own.

Me complaining that I don't get the output I want when I'm introducing server startup bugs during development does not count.

👍 works for me.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/util/stop/stopper.go line 542 at r19 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this change on its own I think has the problem that unregister(s) is no longer called when stopper.Stop() is called while unwinding from a panic. That sounds bad?
This is solved in the later commit, and I think this commit should be squashed anyway as I say elsewhere.

You're right. I'll squash them together before merge.


pkg/util/stop/stopper.go line 540 at r21 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

the comments on the OnPanic options say

// When Stop() is invoked during stack unwinding, OnPanic is also invoked, but
// Stop() may not have carried out its duties.

That's no longer true, is it? (I don't know if it was ever true)
The OnPanic option is not used by anyone; let's take the opportunity to delete it - particularly since it's getting hard to explain when the thing runs and when it doesn't.

Nice. Let's simplify.


pkg/util/stop/stopper.go line 608 at r21 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

This can be defer s.recover(), right? And in that case, I think there's no more use of the handlePanic method introduced in a prior commit; let's remove it. Consider squashing that commit with this one.

Yay, that works.

@knz
Copy link
Contributor Author

knz commented Nov 24, 2022

okay this looks good now (and much simpler)

bors r=andreimatei

@craig
Copy link
Contributor

craig bot commented Nov 24, 2022

Build succeeded:

@craig craig bot merged commit 6475cce into cockroachdb:master Nov 24, 2022
@tbg
Copy link
Member

tbg commented Nov 24, 2022

I wish @tbg would also opine, as the one who introduced the leaktest recovery.

I'm fine with removing this. This was added somewhat opportunistically to duct tape around the poor intra-test isolation that we get in Go tests.

@knz knz deleted the 20221114-start-panics branch November 26, 2022 19:46
Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with removing this. This was added somewhat opportunistically to duct tape around the poor intra-test isolation that we get in Go tests.

Cool. Here's what I think of the whole topic: #92627. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

log: panics during server startup don't make it to the console
4 participants