Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acceptance: test rapid restarts and early queries #19591

Merged
merged 4 commits into from
Oct 30, 2017

Conversation

tbg
Copy link
Member

@tbg tbg commented Oct 26, 2017

See #19559, #19579, #19571 for issues touched by this test (all of which
it reproduced before the preceding commits fixed them).

The changes here (outside of the test) ensure that a single-node instance starts
as quickly as possible, without waiting for ports to be ready or for replication
to have occurred. This was done to remove any additional delay in
TestRapidRestarts that could otherwise obscur bugs (for example, consider that
going through ./cockroach quit was not able to tickle any of the bugs).

I was also tempted to "bump" the version upgrade tests to use 1.1 instead of 1.0
since that would have removed a bunch of cruft (since 1.0 doesn't write its
listener files), but I decided against that. As a result, some of the new
accessors look a little unclean, but they should straighten themselves out once
we only rely on listener files.

@tbg tbg requested a review from a team as a code owner October 26, 2017 23:37
@tbg tbg requested a review from a team October 26, 2017 23:37
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@a-robinson
Copy link
Contributor

I didn't review the last commit yet because you said not to.

Fixes #19493.

That's a PR -- which issue did you mean?


Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3.
Review status: 2 of 9 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/cli/start.go, line 694 at r3 (raw file):

			serverStatusMu.Lock()
			serverStatusMu.draining = true
			needToDrain := serverStatusMu.started

I might change this from needToDrain to drainingIsSafe or something along those lines, since this is changing the logic from aggressively draining whenever the server might have started to only draining if we're positive the server has started (and potentially not draining even if it has started).


Comments from Reviewable

tbg added a commit to tbg/cockroach that referenced this pull request Oct 27, 2017
Extracted from cockroachdb#19591
which also adds a regression test. The tests in that PR pass, but
to make sure I verified manually using

```
while true; do curl -O /dev/null http://localhost:8080/_status/vars; done
```

and `./cockroach start --insecure` in another window, ten times.

Fixes cockroachdb#19559.
@tbg
Copy link
Member Author

tbg commented Oct 29, 2017

#19579, updated the message.


Review status: 2 of 9 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


pkg/cli/start.go, line 694 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I might change this from needToDrain to drainingIsSafe or something along those lines, since this is changing the logic from aggressively draining whenever the server might have started to only draining if we're positive the server has started (and potentially not draining even if it has started).

Makes sense, renamed to drainingIsSafe.


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Oct 29, 2017

Pushed an update, review to your heart's content! The test has already found an additional error, but it'll take some looking into. edit: added a commit that fixes that one, too.

@tbg tbg force-pushed the startup-crash branch 3 times, most recently from 15a585a to 5d8c911 Compare October 29, 2017 21:46
@nvanbenschoten
Copy link
Member

Reviewed 1 of 1 files at r2, 8 of 8 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, 6 of 7 files at r8.
Review status: 8 of 9 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


pkg/acceptance/rapid_restart_test.go, line 33 at r8 (raw file):

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
	"github.com/pkg/errors"

move up


pkg/acceptance/rapid_restart_test.go, line 71 at r8 (raw file):

		// -1 corresponds to "interrupted", i.e. receiving SIGINT before setting up
		// our own handlers, and 1 corresponds to receiving SIGINT after.
		if status := s.ExitStatus(); status != -1 && status != 1 {

nit: I think this could be cleaned up using the following structure.

status := s.ExitStatus()
switch status {
case -1:
    // -1 corresponds to ...
case 1:
    // 1 corresponds to ...
default:
    return ...
}

pkg/acceptance/rapid_restart_test.go, line 84 at r8 (raw file):

	// LocalCluster that skip all the waiting so that we get to kill the process
	// early in its boot sequence.
	cfg.Nodes = cfg.Nodes[:1]

Do you think there's value in having one of these tests for a multi-node cluster?


pkg/acceptance/rapid_restart_test.go, line 103 at r8 (raw file):

		// would get exit code 255, typically.
		//
		// FIXINPR(tschottdorf): real signal handling (platform specific, the below is darwin).

Isn't this TODO already covered/discussed in unexpectedExitCode?


pkg/acceptance/rapid_restart_test.go, line 122 at r8 (raw file):

					base := c.URL(ctx, 0)
					if base != "" {
						// Torture the prometheus endpoint to prevent regression of #19559.

We don't want to do this for fresh nodes?


pkg/acceptance/version_upgrade_test.go, line 69 at r8 (raw file):

	// Verify that the nodes are *really* at the versions configured. This
	// tests the CI harness.
	log.Info(ctx, "verify that configured versions are actual versions")

super nit: verify -> verifying


pkg/acceptance/localcluster/cluster.go, line 706 at r8 (raw file):

}

func addrFromFile(f string) string {

nit: this should probably have a more general name, like readFileIfExists


pkg/cli/start.go, line 656 at r7 (raw file):

	var returnErr error

	noDrain := make(chan struct{}) // closed if interrupted very early

noDrain doesn't convey very much meaning when only reading the bottom of this function. Consider stopWithoutDrain.


pkg/cli/start.go, line 698 at r7 (raw file):

			drainingIsSafe := serverStatusMu.started
			serverStatusMu.Unlock()
			if !drainingIsSafe {

Add a comment that checking drainingIsSafe here can race with it being set but that it's ok to not drain even if the Server has started (it's just not desirable).


pkg/cli/start.go, line 754 at r7 (raw file):

server drained ...

did it? Your call if it's worth printing a different message here.


Comments from Reviewable

@bdarnell
Copy link
Contributor

Reviewed 1 of 1 files at r2, 1 of 1 files at r3, 7 of 7 files at r4, 1 of 1 files at r7, 7 of 7 files at r8.
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


pkg/acceptance/rapid_restart_test.go, line 58 at r8 (raw file):

	if err == nil {
		// Server shut down cleanly. Note that returning `err` here would create
		// an error interface wrapping a nil *ExitError, which is *not* nil

For this reason it's probably a better idea to rename err to exitErr.


Comments from Reviewable

tbg added 3 commits October 30, 2017 14:09
This fixes a panic that would occur if the RocksDB instance was closed
with moving parts still trying to use it.
@tbg
Copy link
Member Author

tbg commented Oct 30, 2017

Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


pkg/acceptance/rapid_restart_test.go, line 33 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

move up

Done.


pkg/acceptance/rapid_restart_test.go, line 58 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

For this reason it's probably a better idea to rename err to exitErr.

Done.


pkg/acceptance/rapid_restart_test.go, line 71 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I think this could be cleaned up using the following structure.

status := s.ExitStatus()
switch status {
case -1:
    // -1 corresponds to ...
case 1:
    // 1 corresponds to ...
default:
    return ...
}

Done.


pkg/acceptance/rapid_restart_test.go, line 84 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think there's value in having one of these tests for a multi-node cluster?

It wouldn't hurt, but it's more than I've set out to do here. Also, the acceptance tests tend to not be set up to return early, so it might require some more refactors.


pkg/acceptance/rapid_restart_test.go, line 103 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Isn't this TODO already covered/discussed in unexpectedExitCode?

Yup, forgot to remove this. Done.


pkg/acceptance/rapid_restart_test.go, line 122 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We don't want to do this for fresh nodes?

Sure, why not. Added.


pkg/acceptance/version_upgrade_test.go, line 69 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

super nit: verify -> verifying

Done.


pkg/acceptance/localcluster/cluster.go, line 706 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this should probably have a more general name, like readFileIfExists

Done.


pkg/cli/start.go, line 656 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

noDrain doesn't convey very much meaning when only reading the bottom of this function. Consider stopWithoutDrain.

Done.


pkg/cli/start.go, line 698 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a comment that checking drainingIsSafe here can race with it being set but that it's ok to not drain even if the Server has started (it's just not desirable).

Done.


pkg/cli/start.go, line 754 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

server drained ...

did it? Your call if it's worth printing a different message here.

Done.


Comments from Reviewable

@nvanbenschoten
Copy link
Member

:lgtm:


Reviewed 4 of 7 files at r12, 1 of 1 files at r13.
Review status: 6 of 9 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/acceptance/version_upgrade_test.go, line 69 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done.

Probably don't need a separate commit for this 😛


Comments from Reviewable

@tbg
Copy link
Member Author

tbg commented Oct 30, 2017

Review status: 6 of 9 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/acceptance/version_upgrade_test.go, line 69 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Probably don't need a separate commit for this 😛

You might have a point.. let me see what I can do.


Comments from Reviewable

See cockroachdb#19559, cockroachdb#19579, cockroachdb#19571 for issues touched by this test (all of which
it reproduced before the preceding commits fixed them).

The changes here (outside of the test) ensure that a single-node instance starts
as quickly as possible, without waiting for ports to be ready or for replication
to have occurred. This was done to remove any additional delay in
`TestRapidRestarts` that could otherwise obscur bugs (for example, consider that
going through `./cockroach quit` was not able to tickle any of the bugs). #
Please enter the commit message for your changes. Lines starting # with '#' will
be ignored, and an empty message aborts the commit.

I was also tempted to "bump" the version upgrade tests to use 1.1 instead of 1.0
since that would have removed a bunch of cruft (since 1.0 doesn't write its
listener files), but I decided against that. As a result, some of the new
accessors look a little unclean, but they should straighten themselves out once
we only rely on listener files.
@tbg
Copy link
Member Author

tbg commented Oct 30, 2017

TFTRs!

@tbg tbg merged commit d63168f into cockroachdb:master Oct 30, 2017
@tbg tbg deleted the startup-crash branch October 30, 2017 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants