Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: allow the in-memory version to be the next fence version #99967

Conversation

ajwerner
Copy link
Contributor

In SHOW CLUSTER SETTING version we wait for the stored version to match the in-memory version. The in-memory version get pushed to the fence but the fence is not persisted, so, while migrations are ongoing, we won't see these values match. In practice this is a problem these days because the migrations take a long time.

Epic: none

Informs #99894

Release note: None

@ajwerner ajwerner requested a review from renatolabs March 29, 2023 19:20
@blathers-crl
Copy link

blathers-crl bot commented Mar 29, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/allow-fence-version-in-show-cluster-setting branch from 92f9c73 to dfd76d5 Compare March 29, 2023 20:16
@renatolabs
Copy link
Contributor

In practice this is a problem these days because the migrations take a long time.

Do you mean individual migrations or all migrations that happen when we upgrade from two released binaries? I believe the former is what is relevant here, and as I mentioned in #99894, I don't see migrations taking anywhere near 2mins around the timestamp of the failure 🤔.

Another thing is that the "show cluster setting" retry loop has basically no backoff, right? So wouldn't it be unlikely that we're hitting a fence every time for 2 minutes?

@ajwerner
Copy link
Contributor Author

Do you mean individual migrations or all migrations that happen when we upgrade from two released binaries?

All migrations. The deal is that at the start of a migration, we move the in-memory version to the next fence version (but we don't write the fence). Then when the migration completes, we move the in-memory version and then write to KV.

@ajwerner
Copy link
Contributor Author

The point being that while migrations are running, it is likely that the in-memory version is at the fence.

@ajwerner
Copy link
Contributor Author

Another thing is that the "show cluster setting" retry loop has basically no backoff, right? So wouldn't it be unlikely that we're hitting a fence every time for 2 minutes?

retry.WithMaxAttempts very much does exponential backoff

return retry.WithMaxAttempts(ctx, retry.Options{}, math.MaxInt32, func() error {

@ajwerner
Copy link
Contributor Author

What's funky is I don't see the upgrade migrations taking very long on master now.

@@ -94,7 +94,8 @@ func (p *planner) getCurrentEncodedVersionSettingValue(
}

localRawVal := []byte(s.Get(&st.SV))
if !bytes.Equal(localRawVal, kvRawVal) {
if !bytes.Equal(localRawVal, kvRawVal) &&
!localIsNextFence(localRawVal, kvRawVal) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a comment here explaining what you are in the commit message.

@ajwerner ajwerner force-pushed the ajwerner/allow-fence-version-in-show-cluster-setting branch from dfd76d5 to 4efa1a7 Compare March 30, 2023 23:04
@ajwerner
Copy link
Contributor Author

@irfansharif I added more structure, improved the error message, and added testing. PTAL

@ajwerner ajwerner force-pushed the ajwerner/allow-fence-version-in-show-cluster-setting branch 2 times, most recently from 59b9c83 to d5b0036 Compare March 31, 2023 02:16
In `SHOW CLUSTER SETTING version` we wait for the stored version to match
the in-memory version. The in-memory version get pushed to the fence but
the fence is not persisted, so, while migrations are ongoing, we won't see
these values match. In practice this is a problem these days because the
migrations take a long time.

Epic: none

Informs cockroachdb#99894

Release note (bug fix): Fixed a bug which could cause `SHOW CLUSTER SETTING
version` to hang and return an opaque error while cluster finalization is
ongoing.
@ajwerner ajwerner force-pushed the ajwerner/allow-fence-version-in-show-cluster-setting branch from d5b0036 to 5c3bb1f Compare March 31, 2023 02:30
@ajwerner ajwerner added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Mar 31, 2023
@ajwerner
Copy link
Contributor Author

For your amusement, here's ChatGPT's summary of this change as a short poem:

In a cluster, settings compared,
Two byte slices, a challenge dared,
Decoding checks if they align,
Fence versions considered, by design.

If matching found, nil returned,
But difference breeds concern,
An error speaks of values clashed,
Cluster safety, our goal unmasked.

@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 31, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 31, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 31, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants