-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references #86255
upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references #86255
Conversation
5180eb1
to
6d61d66
Compare
@ajwerner In the short-term fix #82859, you also relaxed the validation logic to permit 0-valued column ID in sequence's If so, I will add on top one more commit to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 5 of 5 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @Xiang-Gu)
pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go
line 82 at r3 (raw file):
anyUpdate := false for i, ref := range seqDesc.DependedOnBy { invalidColumnIDFreq := freqOfColumnIDIn(ref.ColumnIDs, 0)
Have you seen cases where this is more than 1? Can you add commentary explaining how that happens and what it means?
pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go
line 162 at r3 (raw file):
} } return false
add this method to descpb.IDs
?
Code quote:
func descIDsContains(slice []descpb.ID, target descpb.ID) bool {
for _, elem := range slice {
if elem == target {
return true
}
}
return false
6d61d66
to
7d04216
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I addressed the comments and I think it's ready for a look again!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go
line 82 at r3 (raw file):
Previously, ajwerner wrote…
Have you seen cases where this is more than 1? Can you add commentary explaining how that happens and what it means?
I haven't seen any customer run into such issue, but it's possible to do the following in v21.1
CREATE SEQUENCE s;
CREATE TABLE t (i INT PRIMARY KEY);
followed by
ALTER TABLE t ADD COLUMN j INT DEFAULT nextval('s'), ADD COLUMN k DEFAULT currval('s')
The sequence s
will have a DependedOnBy
field be
"dependedOnBy": [
{
"byId": true,
"columnIds": [
0,
0
],
"id": 52,
"indexId": 0
}
],
and the two newly added columns, j
and k
, will both have
"usesSequenceIds": [
53
],
I added more comments about this.
pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go
line 162 at r3 (raw file):
Previously, ajwerner wrote…
add this method to
descpb.IDs
?
done.
7d04216
to
02c79e9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r6, 5 of 6 files at r7, 2 of 2 files at r8, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @Xiang-Gu)
pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references_external_test.go
line 74 at r8 (raw file):
CREATE SEQUENCE s; CREATE TABLE tbl (i INT PRIMARY KEY); ALTER TABLE tbl ADD COLUMN j INT DEFAULT nextval('s');
Can I be annoying and ask you to also add a test for the more complex multi-column case, and can you add a test where one descriptor references multiple sequences?
02c79e9
to
3368fe1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajwerner To add more, complex tests, I need to relax some validation and postDeserializationChanges such that duplicate 0-valued column IDs are permitted when I inject corrupt descriptors in the test. The relevant changes are in commit 4. I'd like to know your thought about that change.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references_external_test.go
line 74 at r8 (raw file):
Previously, ajwerner wrote…
Can I be annoying and ask you to also add a test for the more complex multi-column case, and can you add a test where one descriptor references multiple sequences?
done
3368fe1
to
1ae4dec
Compare
@@ -671,13 +671,6 @@ func maybeRemoveDuplicateIDsInRefs(d *descpb.TableDescriptor) (hasChanged bool) | |||
for i := range d.DependedOnBy { | |||
ref := &d.DependedOnBy[i] | |||
s := catalog.MakeTableColSet(ref.ColumnIDs...).Ordered() | |||
// Also strip away O-IDs, which may have made their way in here in the past. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to make sure I understood it correctly -- the purpose of keeping the 0 column id is that the user can be aware that there are dependents that can't be back referenced, and thus they will do manual changes. Is this correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was some postDeserializationChanges code we used to have where we want to duduplicate column IDs. We used to have some additional logic that says "if there are any 0-valued column ID and non-zero valued column IDs, we will remove the 0-valued column ID". I'm not sure why we had that but that had caused some of my newly added test to fail (where we would construct a corrupt sequence descriptor with a back reference with column ID [2, 0]) so I removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when was this code added? Does this mean we need to check on back-references when there are no referencing column IDs?
var lastUpgradedID descpb.ID | ||
// Upgrade each sequence, one at a time, until we exhaust all of them. | ||
for { | ||
done, idToUpgrade, err := findNextTableToUpgrade(ctx, d.InternalExecutor, lastUpgradedID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have both findNextTableToUpgrade
and maybeUpdateInvalidColumnIdsInSequenceBackReferencesBestEffort
encompassed in a d.CollectionFactory.Txn()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What difference will that make? Will there be any correctness concern the way this is currently implemented?
We used tenantDeps.InternalExecutor
to retrieve all the sequence IDs in the cluster, and then for each of them, we used tenantDeps.CollectionFactory.Txn
to perform an upgrade. Will there be any scenario where this pattern of usage might yield incorrect results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the descriptor we obtained from findNextTableOrViewToUpgrade
be modified or dropped by another goroutine before we execute the upgrade?
Sorry that I'm not 100% sure if my concern is valid -- please take it as an open question rather than a change request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point now -- collectionFactory.TxnWithExecutor
nicely gives us an internal executor that is tied to the collection. I have changed the code as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the descriptor could have been modified in parallel. The good news, I think, is that the logic here is monotone. If there was a parallel modification, we'll see it when we retrieve the descriptor and we'll notice that no upgrade needs to happen. Do you see some hazard I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the explanation! That's the only hazard I could think of here.
6d1f94c
to
1231d0a
Compare
func updateInvalidColumnIDsInSequenceBackReferences( | ||
ctx context.Context, _ clusterversion.ClusterVersion, d upgrade.TenantDeps, _ *jobs.Job, | ||
) error { | ||
// Upgrade each sequence, one at a time, until we exhaust all of them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect this will be fine, but if you wanted to be conservative, you could use a new transaction for each sequence you need to upgrade. Right now, you're going to have a long-running transaction holding locks on the descriptor table. I don't know if that's desirable.
I'm okay with this migration taking O(sequences to be repaired)
transactions. You'll need to return some indication that maybeUpdateInvalidColumnIdsInSequenceBackReferencesBestEffort
did indeed do something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed it back to one sequence per transaction. I agreed that it's not desirable to have potentially long running transactions holding up locks on a table.
One question though pertaining to the new interface though is:
If I have a collectionFactory, I can now do collectionFactory.TxnWithExecutor(..., desc.Collection, sqlutil.InternalExecutor...)
where I can have access to both a desc.Collection
and a InternalExecutor
, and the internal executor is tied to the desc collection.
But now, tenantDeps
has both a collectionFactory
and a internalExecutor
. I'm using tenantDeps.ie
to retrieve the next sequence ID to upgrade first, and later I use tenantDeps.CollectionFactory.TxnWithExecutor(...., _ sqlutil.InternalExecutor)
for the actual upgrade logic without using the provided internal executor.
Will there be any correctness concerns, similar to what Jane said here?
@@ -671,13 +671,6 @@ func maybeRemoveDuplicateIDsInRefs(d *descpb.TableDescriptor) (hasChanged bool) | |||
for i := range d.DependedOnBy { | |||
ref := &d.DependedOnBy[i] | |||
s := catalog.MakeTableColSet(ref.ColumnIDs...).Ordered() | |||
// Also strip away O-IDs, which may have made their way in here in the past. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when was this code added? Does this mean we need to check on back-references when there are no referencing column IDs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that in the past we have seen 0-valued column IDs and we just remove them as a way to fix the corruption. The original code will only remove 0-valued column IDs if they are the only column IDs. @postamar please correct me if I was wrong.
This change was necessary only to enable one complex test case where I will have a corrupt back reference with ColumnIDs= [0, 2]
and I don't want this logic to remove the 0
, which is used by the upgrade logic to detect there was an issue.
Alternatively, we can instead don't do detection and go straight to upgrade each and every sequence by re-constructing their back-references from the referencing table's columns' usesSequenceIDs
fields. What do you think?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @ZhouXing19)
1231d0a
to
07e252b
Compare
That all makes sense. I think it's good to remove the logic, as you have done. My question is: when was that logic added. If it was before the current release, I think we should go and repair all sequences.
If we do this, we'll need to take some care to make sure the migration is not too expensive. The code where we share an internal executor and descs collection will do a good job here thanks to @jasonmchan's work. My preference is that we keep the code similar to how you had it before, sharing an internal executor and descs collection, but restarting a new transaction (and, thus, a new scan) every time we actually perform a repair. I expect the repairs to be rare. |
f1c7cca
to
0bee82c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajwerner My latest change implemented a "iteration-till-fixed-point" approach where I start a new transaction with an internal executor (i.e. collectionFactory.TxnWithExecutor
) to scan all sequences, re-construct their expected ColumnIds
in their back-references, and update them if they are not equal to the actual values. If one such update/repair happened, we stop right there and re-start a new transaction that does the same process again, until there is no sequence to repair, at which point we have reached a "fixed point".
This way, we used an internal executor with the right collection tied to it, and we don't risk holding locks of the table too long if we do all the repair in one transaction.
Please let me know if you have any comments. Thank you!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @Xiang-Gu, and @ZhouXing19)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than paginating the descriptors as you run the repair, LGTM
ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor, | ||
) error { | ||
// Upgrade each sequence, one at a time, until we exhaust all of them. | ||
var lastUpgradedID descpb.ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With some care, you can pull this variable outside the loop. Or, rather, you can pull a separate variable you write to outside the loop and update it only if the transaction commit successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ajwerner I've changed how we paginate the descriptors during repair and now battling CI flakes, but think you might want to take one last look again, just in case.
0bee82c
to
b55fbf0
Compare
var seqID descpb.ID | ||
var done bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unfortunately I think this is hazardous. Consider a case where you retry, you might have assigned to seqID
. Try something like this:
var lastSeqID descpb.ID
for {
var curSeqID descpb.ID
var done bool
if err := d.CollectionFactory.TxnWithExecutor(ctx, d.DB, d.SessionData, func(
ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor,
) (err error) {
for {
done, curSeqID, err = findNextTableToUpgrade(ctx, ie, lastSeqID,
func(table *descpb.TableDescriptor) bool {
return table.IsSequence()
})
if err != nil || done {
return err
}
// Sequence `nextIdToUpgrade` might contain back reference with invalid column IDs. If so, we need to
// update them with valid column IDs.
hasUpgrade, err := maybeUpdateInvalidColumnIdsInSequenceBackReferences(ctx, txn, curSeqID, descriptors)
if err != nil {
return err
}
if hasUpgrade {
return nil
}
}
}); err != nil {
return err
}
// Break out of the loop if we upgraded all sequences.
if done {
break
}
lastSeqID = curSeqID
}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, retries, good point! It should work if we add one line to your snippet
var lastSeqID descpb.ID
for {
var curSeqID descpb.ID
var done bool
if err := d.CollectionFactory.TxnWithExecutor(ctx, d.DB, d.SessionData, func(
ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor,
) (err error) {
for {
done, curSeqID, err = findNextTableToUpgrade(ctx, ie, lastSeqID,
func(table *descpb.TableDescriptor) bool {
return table.IsSequence()
})
if err != nil || done {
return err
}
// Sequence `nextIdToUpgrade` might contain back reference with invalid column IDs. If so, we need to
// update them with valid column IDs.
hasUpgrade, err := maybeUpdateInvalidColumnIdsInSequenceBackReferences(ctx, txn, curSeqID, descriptors)
if err != nil {
return err
}
if hasUpgrade {
return nil
}
+ lastSeqID = curSeqID
}
}); err != nil {
return err
}
// Break out of the loop if we upgraded all sequences.
if done {
break
}
lastSeqID = curSeqID
}
Otherwise it will hang
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that also does not work. Instead, I think you need to initialize curSeqID
to lastSeqID
above the inner for loop and use curSeqID
to kick off the iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's right. How about my latest change?
{
// Scan all sequences and repair those with 0-valued column IDs in their back references,
// one transaction for each repair.
var lastSeqID descpb.ID
for {
var currSeqID descpb.ID
var done bool
if err := d.CollectionFactory.TxnWithExecutor(ctx, d.DB, d.SessionData, func(
ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor,
) (err error) {
currSeqID = lastSeqID
for {
done, currSeqID, err = findNextTableToUpgrade(ctx, ie, currSeqID,
func(table *descpb.TableDescriptor) bool {
return table.IsSequence()
})
if err != nil || done {
return err
}
// Sequence `nextIdToUpgrade` might contain back reference with invalid column IDs. If so, we need to
// update them with valid column IDs.
hasUpgrade, err := maybeUpdateInvalidColumnIdsInSequenceBackReferences(ctx, txn, currSeqID, descriptors)
if err != nil {
return err
}
if hasUpgrade {
return nil
}
}
}); err != nil {
return err
}
// Break out of the loop if we upgraded all sequences.
if done {
break
}
lastSeqID = currSeqID
}
return nil
}
9ef3225
to
52618bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
52618bd
to
b980281
Compare
Slightly refactored the function that returns the next table descriptor to be upgraded.
This upgrade attempts to update invalid column IDs in sequence back references, if any, on a best-effort basis. The context of the need for such an upgrade can be found in cockroachdb#82576. The summary there is bugs in prior versions might cause sequence descriptor corruption where their back references might contain a column ID 0. We ought to figure out what the actual column ID is and update such invalid column IDs.
Previously, in post deserialization changes, we deduplicate column IDs in a sequence's back references. In particular, if there are any 0-valued column IDs, we remove them altogether unless this is the only column ID in the reference. This commit removed that to allow a ref with column ids like "[0, 2]".
We inject a descriptor that was created in v21.1 and hence contains the buggy invalid column ID in the sequence back reference that the upgrade attempts to update. We exercise the cluster upgrade and assert that our logic indeed updates the invalid column ID to the expected column ID.
b980281
to
1cc5b55
Compare
TFTR! The changes after the last LGTM is only rebasing and resolving some merge conflict. No functional or implementation changes were made. bors r+ |
Build failed (retrying...): |
Build failed (retrying...): |
Build succeeded: |
commit 1: a small refactoring of existing function
commit 2: added a field to TenantDeps
commit 3: added a cluster version upgrade that attempts to update
invalid column ID in seq's back references due to bugs in prior versions.
See below for details
commit 4: wrote a test for this newly added upgrade
Previously, in version 21.1 and prior,
ADD COLUMN DEFAULT nextval('s')
will incorrectly store a 0-valued column id in the sequence 's' back reference
dependedOnBy
because we added this dependency before allocating anID to the newly added column. Customers ran into issues when upgrading
to v22.1 so we relaxed the validation logic as a quick short-term fix, as detailed
in #82859.
Now we want to give this issue a more proper treatment by adding a cluster
upgrade (to v22.2) that attempts to detect such invalid column ID issues and
update them with the correct column ID. This PR does exactly this.
Fixes: #83124
Release note: None
Release justification: fixed a release blocker that will resolve invalid column ID
appearance in sequence's back referenced, caused by bugs in older binaries.