ccl/backupccl: TestDataDriven_backup_dropped_descriptors failed #120929
cc @cockroachdb/disaster-recovery
OK, this is still reproducible on master via:
Looks like this fails before a backup-restore sequence. We expect to observe a descriptor in the dropped state, but we can't find it (it fails at backup-dropped-descriptors:147):
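Roughly, the kind of check that's failing looks like this. This is a sketch, not the actual data-driven directives or test code; the `crdb_internal.tables` query and the `DROP` state string are my assumptions about what the file verifies:

```go
package backupccl_test // hypothetical placement, for illustration only

import (
	gosql "database/sql"
	"testing"
)

// checkDroppedDescriptor asserts that a dropped table's descriptor is still
// around, just marked as dropped, until the GC job clears it. The query and
// the "DROP" state value are assumptions about what the real test checks.
func checkDroppedDescriptor(t *testing.T, db *gosql.DB, name string) {
	t.Helper()
	var state string
	err := db.QueryRow(
		`SELECT state FROM crdb_internal.tables WHERE name = $1 ORDER BY table_id DESC LIMIT 1`,
		name,
	).Scan(&state)
	if err != nil {
		t.Fatalf("no descriptor found for %s: %v", name, err)
	}
	if state != "DROP" {
		t.Fatalf("expected %s to be in the dropped state, got %q", name, state)
	}
}
```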
OK, I think I've bisected this to #120144.
ccl/backupccl.TestDataDriven_backup_dropped_descriptors failed with artifacts on master @ 215ece44cfba2ab950570f9d03cf29599e1ad088:
ccl/backupccl.TestDataDriven_backup_dropped_descriptors failed on master @ 7488e090daa588c4d7c0f828c8006bb9b13a90f6:
Parameters:
It looks like the test is making an assumption about object IDs that isn't really guaranteed. The test is failing at this line:
I added one more assertion as a sanity check:
Then, when I ran with:
So I wonder if something about #120144 made it more likely for descriptor ID generation to create IDs in a different order.
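The failing line and the extra assertion aren't quoted above, but to illustrate the kind of assumption I mean (the ID literal, the table name, and the helper are made up for this sketch), a check that compares a name-resolved ID against a hard-coded value breaks as soon as allocation order changes:

```go
package backupccl_test // illustrative only

import (
	gosql "database/sql"
	"testing"
)

// assumedFooID is only correct if descriptor IDs are allocated in exactly
// the order the test expects.
const assumedFooID = 109

func fragileIDCheck(t *testing.T, db *gosql.DB) {
	t.Helper()
	var id int
	if err := db.QueryRow(
		`SELECT id FROM system.namespace WHERE name = 'foo'`,
	).Scan(&id); err != nil {
		t.Fatal(err)
	}
	if id != assumedFooID {
		// Any transaction retry or concurrent allocation shifts the ID and
		// turns this into a flake.
		t.Fatalf("expected descriptor ID %d for foo, got %d", assumedFooID, id)
	}
}
```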
It would be a relief if all these test flakes were due to an incorrect assumption, but I'm not sure that's the root cause:
Well, the theory I've proposed is that the change made it more likely for the incorrect assumption to be violated. I don't yet have a solid theory for why, but it could be something like: the extra work that happens in order to apply the ExcludeDataFromBackup flag to the relevant spans inside system tables makes the descriptor ID generation code more likely to hit transaction retries.
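To make that mechanism concrete, here is a rough sketch under that assumption. This is not the actual allocator code; it just models ID allocation as a side effect that re-runs on every automatic retry, which is how extra retries could reorder the IDs that end up committed. The `objects` table and `allocateID` callback are hypothetical:

```go
package example // illustrative only

import (
	"context"
	gosql "database/sql"

	"github.com/cockroachdb/cockroach-go/v2/crdb"
)

// createObject models the theory above: the ID is handed out as a side
// effect inside a retryable transaction closure, so every automatic retry
// burns another ID. Two sessions racing (and retrying) can therefore commit
// IDs in a different order than the one the test expects.
func createObject(ctx context.Context, db *gosql.DB, allocateID func() int64) error {
	return crdb.ExecuteTx(ctx, db, nil, func(tx *gosql.Tx) error {
		id := allocateID() // re-runs on every retry of this closure
		_, err := tx.ExecContext(ctx, `INSERT INTO objects (id) VALUES ($1)`, id)
		return err
	})
}
```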
Yeah, I agree, maybe it isn't the root cause. But IMO the example I posted above where I added
does prove that the test is incorrect in assuming that the ID of
To check what's going on, I added some logging around whenever we generate a new descriptor ID, as well as whenever we auto-retry a transaction. Here's what happened:
This shows that creating the
In general, retry errors are always possible here, so I definitely think the test should be updated to account for that. However, I don't have a good explanation yet for why #120144 made it more likely for the
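For reference, the instrumentation was roughly of this shape (function names and call sites here are guesses, not the actual CockroachDB code paths): log each freshly allocated descriptor ID and each automatic retry, then interleave the two streams to see whether a retry consumed an ID the test expected somewhere else.

```go
package example // illustrative only

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// Hypothetical logging hooks: one fires when a new descriptor ID is handed
// out, the other when a transaction is automatically retried.
func logNewDescriptorID(ctx context.Context, id int64) {
	log.Infof(ctx, "allocated new descriptor ID %d", id)
}

func logAutoRetry(ctx context.Context, attempt int, cause error) {
	log.Infof(ctx, "auto-retrying txn (attempt %d): %v", attempt, cause)
}
```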
From our docs
Maybe it's something like:
I can test that theory by disabling lease transfers.
Sounds good. Agreed that tests should not assume a specific descriptor ID. I can try updating this test to be descID-agnostic. If it's green under stress, then I think we can remove the Release blocker for this. If your lease-transfer theory holds, the roachtest failure could be explained by a bad bug in backup/restore logic that assumes something about descriptor ID values. But no need to hypothesize yet :)
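A sketch of the general shape I have in mind (my suggestion, not the eventual fix): resolve the ID by name at runtime and phrase later expectations in terms of that value instead of a hard-coded literal.

```go
package backupccl_test // hypothetical placement, for illustration only

import (
	gosql "database/sql"
	"testing"
)

// lookupDescID resolves a descriptor ID by name instead of assuming a value
// based on allocation order. Simplified: it ignores parent database/schema,
// so it assumes the name is unique in system.namespace.
func lookupDescID(t *testing.T, db *gosql.DB, name string) int64 {
	t.Helper()
	var id int64
	if err := db.QueryRow(
		`SELECT id FROM system.namespace WHERE name = $1`, name,
	).Scan(&id); err != nil {
		t.Fatalf("resolving descriptor ID for %s: %v", name, err)
	}
	return id
}
```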
I tried using the BlockTransferTarget testing knob, but the test still hit a transaction retry:
As another experiment, I changed |
ccl/backupccl.TestDataDriven_backup_dropped_descriptors failed on master @ 1084d88d63187bf66a2ee078e2f7ffd89cdfefba:
Parameters:
attempt=1
run=26
shard=9
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-36951