Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: readd overlap check for preemptive-on-uninitialized snapshots #33010

Merged
merged 1 commit into from
Dec 13, 2018

Conversation

tbg
Copy link
Member

@tbg tbg commented Dec 11, 2018

It was (recently) accidentally removed in #32817 and caused nightly
failures
.

Release note: None

@tbg tbg requested a review from a team December 11, 2018 12:25
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested a review from benesch December 11, 2018 12:26
Copy link
Contributor

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh!

@tbg
Copy link
Member Author

tbg commented Dec 11, 2018

TFTR! Doh indeed. So easy to slip in silly mistakes with "reverts" that aren't actually reverts.

bors r=benesch

@craig
Copy link
Contributor

craig bot commented Dec 11, 2018

Build failed

@tbg tbg force-pushed the fix/preemptive-snap-forgets-chec branch from 777643b to 7b3bfc7 Compare December 11, 2018 13:25
@tbg
Copy link
Member Author

tbg commented Dec 11, 2018

bors r=benesch

Thanks for nothing, staticcheck.

@craig
Copy link
Contributor

craig bot commented Dec 11, 2018

Build failed

It was (recently) accidentally removed in cockroachdb#32817 and was caused [nightly
failures].

[nightly failures]: cockroachdb#32135 (comment)

Release note: None
@tbg tbg force-pushed the fix/preemptive-snap-forgets-chec branch from 7b3bfc7 to 3ef7f68 Compare December 11, 2018 13:47
@tbg
Copy link
Member Author

tbg commented Dec 11, 2018

OK, that one was on me. Third one's the charm...

bors r=benesch

@craig
Copy link
Contributor

craig bot commented Dec 11, 2018

Build failed

@tbg
Copy link
Member Author

tbg commented Dec 11, 2018

Ugh:

acceptance/bank/node-restart
* ERROR: cockroach server exited with error: failed to start store: [n2,s2]: cannot addReplicaInternalLocked; range [n2,s2,r32/3:/{Table/53/1/7…-Max}] has overlapping range r33:/{Table/53/1/266-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=1]
*

Going to have to take another close look at this later.

@tbg
Copy link
Member Author

tbg commented Dec 13, 2018

Oh, I think this might be a merge bug?

The (heavily filtered) log below shows the following sequence of events:

  1. n2 announces it's going to merge r31:/{Table/53/1/525-Max} into r33/3:/Table/53/1/{266-525}
  2. n2 says it's going to replicaGC r31/3 (as part of the merge trigger)
  3. n2 gets killed
  4. n2 restarts and fails because it sees the post-merge r31 (reaching all the way to /Max) but also r32.

So this looks like an atomicity failure while extending the LHS during the merge trigger. We've extended it, but failed to atomically remove the RHS. Going to take a look.

6ce48ec89eae> I181211 13:53:41.319876 94 storage/replica_command.go:492  [n2,merge,s2,r33/3:/Table/53/1/{266-525}] initiating a merge of r31:/{Table/53/1/525-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=2] into this range
6ce48ec89eae> I181211 13:53:41.364031 206 storage/store.go:2561  [n3,s3,r33/2:/Table/53/1/{266-525}] removing replica r31/2
6ce48ec89eae> I181211 13:53:41.364059 176 storage/store.go:2561  [n2,s2,r33/3:/Table/53/1/{266-525}] removing replica r31/3
6ce48ec89eae> I181211 13:53:41.364146 180 storage/store.go:2561  [n1,s1,r33/1:/Table/53/1/{266-525}] removing replica r31/1
6ce48ec89eae> E181211 13:53:43.503144 1 cli/error.go:229  cockroach server exited with error: failed to start store: [n2,s2]: cannot addReplicaInternalLocked; range [n2,s2,r32/3:/{Table/53/1/7…-Max}] has overlapping range r33:/{Table/53/1/266-Max} [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=1]

@tbg
Copy link
Member Author

tbg commented Dec 13, 2018

Wait, no... the overlapping range isn't failing on r31 but r32, which is yet another range... Hmm. Taking another look.

@tbg
Copy link
Member Author

tbg commented Dec 13, 2018

cluster merges r31 [525-710) and r32 [710-Max)
n2 processes it, removing replica 32/3
n2 gets killed and restarts
cluster merges r33 [266-525) and r31 [525-Max)
n2 process it, removing replica r31/3
n2 restarts and finds n32/3 [710-Max) which should have been gc'ed, as well as r33 [266-Max) (which is supposed to be there).

My reading of this is that the first time n2 got killed, somehow data for r32 stuck around. This didn't become obvious during
the first restart, but then the second merge created a replica overlapping that keyspace which triggered the fatal error after
another restart.

n2 applied its last snapshot at 13:52:58 while the above history begins at
15:53:35, over half a minute later. So I don't think an erroneously accepted
snapshot is responsible here.

I ran 25 iterations of the test locally without issues. I think I got very lucky catching this in CI.

The code to carry out the atomic removal of the RHS is here:

https://github.com/tbg/cockroach/blob/a147f24d94b4ecb99162c0808e4a037f29166658/pkg/storage/replica.go#L5493-L5514

One explanation of the bug would be if the RHS' range descriptor's deletion intent never got committed, but the rest of the merge
did. We ignore intents when initializing the replicas of a node, so r32 could "come back" in that way.

Unfortunately, I've audited the code and LeftDesc in the snippet below is actually the merged descriptor, and so
the intents for both range descriptors should be removed atomically with the commit.

if mergeTrigger := args.InternalCommitTrigger.GetMergeTrigger(); mergeTrigger != nil {
// If this is a merge, then use the post-merge descriptor to determine
// which intents are local (note that for a split, we want to use the
// pre-split one instead because it's larger).
desc = &mergeTrigger.LeftDesc
}

The idea would've been that r32's intent never got removed. This would check out with the log,
because n1 carried out the merge of r32 and got killed immediately after it.

I'm going to run the test with some printf debugging to see if I can spot anything off about merges resolving their intents.

@tbg
Copy link
Member Author

tbg commented Dec 13, 2018

Also I'll move to a separate issue, I'm confident enough that this isn't the fault of this PR.

bors r=petermattis

craig bot pushed a commit that referenced this pull request Dec 13, 2018
33010: storage: readd overlap check for preemptive-on-uninitialized snapshots r=petermattis a=tbg

It was (recently) accidentally removed in #32817 and caused [nightly
failures].

[nightly failures]: #32135 (comment)

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg
Copy link
Member Author

tbg commented Dec 13, 2018

#33120

@craig
Copy link
Contributor

craig bot commented Dec 13, 2018

Build succeeded

@craig craig bot merged commit 3ef7f68 into cockroachdb:master Dec 13, 2018
@tbg tbg requested a review from petermattis December 13, 2018 12:21
@tbg tbg deleted the fix/preemptive-snap-forgets-chec branch December 13, 2018 12:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants