Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references #86255

Conversation

Xiang-Gu
Copy link
Contributor

@Xiang-Gu Xiang-Gu commented Aug 16, 2022

commit 1: a small refactoring of existing function
commit 2: added a field to TenantDeps
commit 3: added a cluster version upgrade that attempts to update
invalid column ID in seq's back references due to bugs in prior versions.
See below for details
commit 4: wrote a test for this newly added upgrade

Previously, in version 21.1 and prior, ADD COLUMN DEFAULT nextval('s')
will incorrectly store a 0-valued column id in the sequence 's' back reference
dependedOnBy because we added this dependency before allocating an
ID to the newly added column. Customers ran into issues when upgrading
to v22.1 so we relaxed the validation logic as a quick short-term fix, as detailed
in #82859.

Now we want to give this issue a more proper treatment by adding a cluster
upgrade (to v22.2) that attempts to detect such invalid column ID issues and
update them with the correct column ID. This PR does exactly this.

Fixes: #83124

Release note: None

Release justification: fixed a release blocker that will resolve invalid column ID
appearance in sequence's back referenced, caused by bugs in older binaries.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 5180eb1 to 6d61d66 Compare August 16, 2022 20:18
@Xiang-Gu
Copy link
Contributor Author

@ajwerner In the short-term fix #82859, you also relaxed the validation logic to permit 0-valued column ID in sequence's dependedOnBy field. With this cluster version upgrade, which updates 0-valued column ID on a best effort basis, do you think we should also add back the validation logic?

If so, I will add on top one more commit to this PR.

@Xiang-Gu Xiang-Gu changed the title upgrade: Added an upgrade for updating invalid column ID in seq back references upgrades,upgrade,upgrades_test: Added an upgrade for updating invalid column ID in seq's back references Aug 16, 2022
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 5 of 5 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Xiang-Gu)


pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go line 82 at r3 (raw file):

		anyUpdate := false
		for i, ref := range seqDesc.DependedOnBy {
			invalidColumnIDFreq := freqOfColumnIDIn(ref.ColumnIDs, 0)

Have you seen cases where this is more than 1? Can you add commentary explaining how that happens and what it means?


pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go line 162 at r3 (raw file):

		}
	}
	return false

add this method to descpb.IDs?

Code quote:

func descIDsContains(slice []descpb.ID, target descpb.ID) bool {
	for _, elem := range slice {
		if elem == target {
			return true
		}
	}
	return false

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 6d61d66 to 7d04216 Compare August 18, 2022 14:49
Copy link
Contributor Author

@Xiang-Gu Xiang-Gu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I addressed the comments and I think it's ready for a look again!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go line 82 at r3 (raw file):

Previously, ajwerner wrote…

Have you seen cases where this is more than 1? Can you add commentary explaining how that happens and what it means?

I haven't seen any customer run into such issue, but it's possible to do the following in v21.1

CREATE SEQUENCE s;
CREATE TABLE t (i INT PRIMARY KEY);

followed by

ALTER TABLE t ADD COLUMN j INT DEFAULT nextval('s'), ADD COLUMN k DEFAULT currval('s')

The sequence s will have a DependedOnBy field be

          "dependedOnBy": [
              {
                  "byId": true,
                  "columnIds": [
                      0,
                      0
                  ],
                  "id": 52,
                  "indexId": 0
              }
          ],

and the two newly added columns, j and k, will both have

                  "usesSequenceIds": [
                      53
                  ],

I added more comments about this.


pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references.go line 162 at r3 (raw file):

Previously, ajwerner wrote…

add this method to descpb.IDs?

done.

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 7d04216 to 02c79e9 Compare August 18, 2022 15:04
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r6, 5 of 6 files at r7, 2 of 2 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Xiang-Gu)


pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references_external_test.go line 74 at r8 (raw file):

							CREATE SEQUENCE s;
							CREATE TABLE tbl (i INT PRIMARY KEY);
		          ALTER TABLE tbl ADD COLUMN j INT DEFAULT nextval('s');

Can I be annoying and ask you to also add a test for the more complex multi-column case, and can you add a test where one descriptor references multiple sequences?

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 02c79e9 to 3368fe1 Compare August 18, 2022 21:04
Copy link
Contributor Author

@Xiang-Gu Xiang-Gu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajwerner To add more, complex tests, I need to relax some validation and postDeserializationChanges such that duplicate 0-valued column IDs are permitted when I inject corrupt descriptors in the test. The relevant changes are in commit 4. I'd like to know your thought about that change.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/upgrade/upgrades/update_invalid_column_ids_in_sequence_back_references_external_test.go line 74 at r8 (raw file):

Previously, ajwerner wrote…

Can I be annoying and ask you to also add a test for the more complex multi-column case, and can you add a test where one descriptor references multiple sequences?

done

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 3368fe1 to 1ae4dec Compare August 18, 2022 22:56
@@ -671,13 +671,6 @@ func maybeRemoveDuplicateIDsInRefs(d *descpb.TableDescriptor) (hasChanged bool)
for i := range d.DependedOnBy {
ref := &d.DependedOnBy[i]
s := catalog.MakeTableColSet(ref.ColumnIDs...).Ordered()
// Also strip away O-IDs, which may have made their way in here in the past.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to make sure I understood it correctly -- the purpose of keeping the 0 column id is that the user can be aware that there are dependents that can't be back referenced, and thus they will do manual changes. Is this correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was some postDeserializationChanges code we used to have where we want to duduplicate column IDs. We used to have some additional logic that says "if there are any 0-valued column ID and non-zero valued column IDs, we will remove the 0-valued column ID". I'm not sure why we had that but that had caused some of my newly added test to fail (where we would construct a corrupt sequence descriptor with a back reference with column ID [2, 0]) so I removed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when was this code added? Does this mean we need to check on back-references when there are no referencing column IDs?

var lastUpgradedID descpb.ID
// Upgrade each sequence, one at a time, until we exhaust all of them.
for {
done, idToUpgrade, err := findNextTableToUpgrade(ctx, d.InternalExecutor, lastUpgradedID,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have both findNextTableToUpgrade and maybeUpdateInvalidColumnIdsInSequenceBackReferencesBestEffort encompassed in a d.CollectionFactory.Txn()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What difference will that make? Will there be any correctness concern the way this is currently implemented?

We used tenantDeps.InternalExecutor to retrieve all the sequence IDs in the cluster, and then for each of them, we used tenantDeps.CollectionFactory.Txn to perform an upgrade. Will there be any scenario where this pattern of usage might yield incorrect results?

Copy link
Collaborator

@ZhouXing19 ZhouXing19 Aug 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the descriptor we obtained from findNextTableOrViewToUpgrade be modified or dropped by another goroutine before we execute the upgrade?
Sorry that I'm not 100% sure if my concern is valid -- please take it as an open question rather than a change request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point now -- collectionFactory.TxnWithExecutor nicely gives us an internal executor that is tied to the collection. I have changed the code as you suggested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the descriptor could have been modified in parallel. The good news, I think, is that the logic here is monotone. If there was a parallel modification, we'll see it when we retrieve the descriptor and we'll notice that no upgrade needs to happen. Do you see some hazard I'm missing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation! That's the only hazard I could think of here.

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch 2 times, most recently from 6d1f94c to 1231d0a Compare August 22, 2022 16:18
func updateInvalidColumnIDsInSequenceBackReferences(
ctx context.Context, _ clusterversion.ClusterVersion, d upgrade.TenantDeps, _ *jobs.Job,
) error {
// Upgrade each sequence, one at a time, until we exhaust all of them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this will be fine, but if you wanted to be conservative, you could use a new transaction for each sequence you need to upgrade. Right now, you're going to have a long-running transaction holding locks on the descriptor table. I don't know if that's desirable.

I'm okay with this migration taking O(sequences to be repaired) transactions. You'll need to return some indication that maybeUpdateInvalidColumnIdsInSequenceBackReferencesBestEffort did indeed do something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it back to one sequence per transaction. I agreed that it's not desirable to have potentially long running transactions holding up locks on a table.

One question though pertaining to the new interface though is:
If I have a collectionFactory, I can now do collectionFactory.TxnWithExecutor(..., desc.Collection, sqlutil.InternalExecutor...) where I can have access to both a desc.Collection and a InternalExecutor, and the internal executor is tied to the desc collection.

But now, tenantDeps has both a collectionFactory and a internalExecutor. I'm using tenantDeps.ie to retrieve the next sequence ID to upgrade first, and later I use tenantDeps.CollectionFactory.TxnWithExecutor(...., _ sqlutil.InternalExecutor) for the actual upgrade logic without using the provided internal executor.

Will there be any correctness concerns, similar to what Jane said here?

@@ -671,13 +671,6 @@ func maybeRemoveDuplicateIDsInRefs(d *descpb.TableDescriptor) (hasChanged bool)
for i := range d.DependedOnBy {
ref := &d.DependedOnBy[i]
s := catalog.MakeTableColSet(ref.ColumnIDs...).Ordered()
// Also strip away O-IDs, which may have made their way in here in the past.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when was this code added? Does this mean we need to check on back-references when there are no referencing column IDs?

Copy link
Contributor Author

@Xiang-Gu Xiang-Gu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that in the past we have seen 0-valued column IDs and we just remove them as a way to fix the corruption. The original code will only remove 0-valued column IDs if they are the only column IDs. @postamar please correct me if I was wrong.

This change was necessary only to enable one complex test case where I will have a corrupt back reference with ColumnIDs= [0, 2] and I don't want this logic to remove the 0, which is used by the upgrade logic to detect there was an issue.

Alternatively, we can instead don't do detection and go straight to upgrade each and every sequence by re-constructing their back-references from the referencing table's columns' usesSequenceIDs fields. What do you think?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @ZhouXing19)

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 1231d0a to 07e252b Compare August 22, 2022 20:23
@ajwerner
Copy link
Contributor

That all makes sense. I think it's good to remove the logic, as you have done. My question is: when was that logic added. If it was before the current release, I think we should go and repair all sequences.

Alternatively, we can instead don't do detection and go straight to upgrade each and every sequence by re-constructing their back-references from the referencing table's columns' usesSequenceIDs fields. What do you think?

If we do this, we'll need to take some care to make sure the migration is not too expensive. The code where we share an internal executor and descs collection will do a good job here thanks to @jasonmchan's work. My preference is that we keep the code similar to how you had it before, sharing an internal executor and descs collection, but restarting a new transaction (and, thus, a new scan) every time we actually perform a repair. I expect the repairs to be rare.

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch 2 times, most recently from f1c7cca to 0bee82c Compare August 23, 2022 14:37
Copy link
Contributor Author

@Xiang-Gu Xiang-Gu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajwerner My latest change implemented a "iteration-till-fixed-point" approach where I start a new transaction with an internal executor (i.e. collectionFactory.TxnWithExecutor) to scan all sequences, re-construct their expected ColumnIds in their back-references, and update them if they are not equal to the actual values. If one such update/repair happened, we stop right there and re-start a new transaction that does the same process again, until there is no sequence to repair, at which point we have reached a "fixed point".

This way, we used an internal executor with the right collection tied to it, and we don't risk holding locks of the table too long if we do all the repair in one transaction.

Please let me know if you have any comments. Thank you!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @Xiang-Gu, and @ZhouXing19)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than paginating the descriptors as you run the repair, LGTM

ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor,
) error {
// Upgrade each sequence, one at a time, until we exhaust all of them.
var lastUpgradedID descpb.ID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some care, you can pull this variable outside the loop. Or, rather, you can pull a separate variable you write to outside the loop and update it only if the transaction commit successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajwerner I've changed how we paginate the descriptors during repair and now battling CI flakes, but think you might want to take one last look again, just in case.

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 0bee82c to b55fbf0 Compare August 23, 2022 19:15
Comment on lines 35 to 36
var seqID descpb.ID
var done bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately I think this is hazardous. Consider a case where you retry, you might have assigned to seqID. Try something like this:

  var lastSeqID descpb.ID
	for {
	  var curSeqID descpb.ID
    var done bool
		if err := d.CollectionFactory.TxnWithExecutor(ctx, d.DB, d.SessionData, func(
			ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor,
		) (err error) {
			for {
					done, curSeqID, err = findNextTableToUpgrade(ctx, ie, lastSeqID,
					func(table *descpb.TableDescriptor) bool {
						return table.IsSequence()
					})
				if err != nil || done {
					return err
				}

				// Sequence `nextIdToUpgrade` might contain back reference with invalid column IDs. If so, we need to
				// update them with valid column IDs.
				hasUpgrade, err := maybeUpdateInvalidColumnIdsInSequenceBackReferences(ctx, txn, curSeqID, descriptors)
				if err != nil {
					return err
				}
				if hasUpgrade {
					return nil
				}
			}
		}); err != nil {
			return err
		}

		// Break out of the loop if we upgraded all sequences.
		if done {
			break
		}
		lastSeqID = curSeqID
	}
	```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, retries, good point! It should work if we add one line to your snippet

  var lastSeqID descpb.ID
	for {
	  var curSeqID descpb.ID
    var done bool
		if err := d.CollectionFactory.TxnWithExecutor(ctx, d.DB, d.SessionData, func(
			ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor,
		) (err error) {
			for {
					done, curSeqID, err = findNextTableToUpgrade(ctx, ie, lastSeqID,
					func(table *descpb.TableDescriptor) bool {
						return table.IsSequence()
					})
				if err != nil || done {
					return err
				}

				// Sequence `nextIdToUpgrade` might contain back reference with invalid column IDs. If so, we need to
				// update them with valid column IDs.
				hasUpgrade, err := maybeUpdateInvalidColumnIdsInSequenceBackReferences(ctx, txn, curSeqID, descriptors)
				if err != nil {
					return err
				}
				if hasUpgrade {
					return nil
				}
+              lastSeqID = curSeqID
			}
		}); err != nil {
			return err
		}

		// Break out of the loop if we upgraded all sequences.
		if done {
			break
		}
		lastSeqID = curSeqID
	}

Otherwise it will hang

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that also does not work. Instead, I think you need to initialize curSeqID to lastSeqIDabove the inner for loop and use curSeqID to kick off the iteration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's right. How about my latest change?

{
	// Scan all sequences and repair those with 0-valued column IDs in their back references,
	// one transaction for each repair.
	var lastSeqID descpb.ID

	for {
		var currSeqID descpb.ID
		var done bool
		if err := d.CollectionFactory.TxnWithExecutor(ctx, d.DB, d.SessionData, func(
			ctx context.Context, txn *kv.Txn, descriptors *descs.Collection, ie sqlutil.InternalExecutor,
		) (err error) {
			currSeqID = lastSeqID
			for {
				done, currSeqID, err = findNextTableToUpgrade(ctx, ie, currSeqID,
					func(table *descpb.TableDescriptor) bool {
						return table.IsSequence()
					})
				if err != nil || done {
					return err
				}

				// Sequence `nextIdToUpgrade` might contain back reference with invalid column IDs. If so, we need to
				// update them with valid column IDs.
				hasUpgrade, err := maybeUpdateInvalidColumnIdsInSequenceBackReferences(ctx, txn, currSeqID, descriptors)
				if err != nil {
					return err
				}
				if hasUpgrade {
					return nil
				}
			}
		}); err != nil {
			return err
		}

		// Break out of the loop if we upgraded all sequences.
		if done {
			break
		}
		lastSeqID = currSeqID
	}

	return nil
}

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch 2 times, most recently from 9ef3225 to 52618bd Compare August 24, 2022 16:43
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from 52618bd to b980281 Compare August 25, 2022 14:47
Slightly refactored the function that returns the next table
descriptor to be upgraded.
This upgrade attempts to update invalid column IDs in sequence back
references, if any, on a best-effort basis.

The context of the need for such an upgrade can be found in cockroachdb#82576.
The summary there is bugs in prior versions might cause sequence
descriptor corruption where their back references might contain a column
ID 0. We ought to figure out what the actual column ID is and update
such invalid column IDs.
Previously, in post deserialization changes, we deduplicate column IDs
in a sequence's back references. In particular, if there are any
0-valued column IDs, we remove them altogether unless this is the only
column ID in the reference. This commit removed that to allow a ref
with column ids like "[0, 2]".
We inject a descriptor that was created in v21.1 and hence contains the
buggy invalid column ID in the sequence back reference that the upgrade
attempts to update. We exercise the cluster upgrade and assert that
our logic indeed updates the invalid column ID to the expected column
ID.
@Xiang-Gu Xiang-Gu force-pushed the upgrade_invalid_column_IDs_in_sequence_back_references branch from b980281 to 1cc5b55 Compare August 26, 2022 14:11
@Xiang-Gu Xiang-Gu marked this pull request as ready for review August 26, 2022 15:40
@Xiang-Gu Xiang-Gu requested a review from a team August 26, 2022 15:40
@Xiang-Gu
Copy link
Contributor Author

TFTR!

The changes after the last LGTM is only rebasing and resolving some merge conflict. No functional or implementation changes were made.

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 26, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 26, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 26, 2022

Build succeeded:

@craig craig bot merged commit db1554b into cockroachdb:master Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql/catalog/tabledesc,upgrade: migrate away invalid sequence references
4 participants