storage: readd overlap check for preemptive-on-uninitialized snapshots #33010
Doh!
TFTR! Doh indeed. So easy to slip in silly mistakes with "reverts" that aren't actually reverts.
bors r=benesch
Build failed
777643b to 7b3bfc7
bors r=benesch
Thanks for nothing, staticcheck.
Build failed
It was (recently) accidentally removed in cockroachdb#32817 and caused [nightly failures].
[nightly failures]: cockroachdb#32135 (comment)
Release note: None
7b3bfc7 to 3ef7f68
OK, that one was on me. Third one's the charm...
bors r=benesch
Build failed
Ugh:
Going to have to take another close look at this later.
Oh, I think this might be a merge bug? The (heavily filtered) log below shows the following sequence of events:
So this looks like an atomicity failure while extending the LHS during the merge trigger. We've extended it, but failed to atomically remove the RHS. Going to take a look.
Wait, no... the overlapping range isn't failing on r31 but r32, which is yet another range... Hmm. Taking another look.
cluster merges r31 [525-710) and r32 [710-Max)
My reading of this is that the first time n2 got killed, somehow data for r32 stuck around. This didn't become obvious during
n2 applied its last snapshot at 13:52:58 while the above history begins at
I ran 25 iterations of the test locally without issues. I think I got very lucky catching this in CI.
The code to carry out the atomic removal of the RHS is here:
One explanation of the bug would be if the RHS' range descriptor's deletion intent never got committed, but the rest of the merge
Unfortunately, I've audited the code and
cockroach/pkg/storage/batcheval/cmd_end_transaction.go
Lines 430 to 435 in 3c76b15
The idea would've been that r32's intent never got removed. This would check out with the log. I'm going to run the test with some printf debugging to see if I can spot anything off about merges resolving their intents.
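For readers following along, here is a minimal sketch of the kind of overlap check being discussed, applied to the r31/r32 situation above. It is not the actual code in pkg/storage: the `span` and `replica` types, the `firstOverlap` helper, the integer keys, and the `maxKey` constant are all hypothetical stand-ins chosen for illustration. It assumes key spans are half-open intervals, as in the `[525-710)` notation above.

```go
package main

import "fmt"

// maxKey is a hypothetical stand-in for the "Max" end key in the log above.
const maxKey = 1 << 30

// span is a half-open key interval [start, end). Real keys are byte
// strings; integers are used here only to mirror the log excerpt.
type span struct {
	start, end int
}

// overlaps reports whether two half-open spans share at least one key.
// Adjacent spans such as [525, 710) and [710, maxKey) do not overlap.
func (a span) overlaps(b span) bool {
	return a.start < b.end && b.start < a.end
}

// replica is a hypothetical record of a range already present on a store.
type replica struct {
	rangeID int
	keys    span
}

// firstOverlap returns the ID of the first existing replica whose span
// overlaps the incoming snapshot's span, skipping the replica that belongs
// to the incoming range itself.
func firstOverlap(existing []replica, incomingID int, incoming span) (int, bool) {
	for _, r := range existing {
		if r.rangeID == incomingID {
			continue
		}
		if r.keys.overlaps(incoming) {
			return r.rangeID, true
		}
	}
	return 0, false
}

func main() {
	// Stale data for r32 is still on the store after the merge.
	store := []replica{{rangeID: 32, keys: span{710, maxKey}}}

	// A snapshot for the merged r31, which now covers [525, maxKey).
	if id, ok := firstOverlap(store, 31, span{525, maxKey}); ok {
		fmt.Printf("snapshot for r31 overlaps existing r%d\n", id)
	}
}
```

Under these assumptions, a snapshot for the post-merge r31 is rejected because the leftover r32 still claims part of its keyspace, which matches the shape of the failure described above.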
Also I'll move to a separate issue, I'm confident enough that this isn't the fault of this PR.
bors r=petermattis
33010: storage: readd overlap check for preemptive-on-uninitialized snapshots r=petermattis a=tbg
It was (recently) accidentally removed in #32817 and caused [nightly failures].
[nightly failures]: #32135 (comment)
Release note: None
Co-authored-by: Tobias Schottdorf <[email protected]>
Build succeeded
It was (recently) accidentally removed in #32817 and caused nightly
failures.
Release note: None