release-22.1: add checkpointing for raftAppliedIndexTermMigration #84909

Merged


erikgrinaker

@erikgrinaker erikgrinaker commented Jul 22, 2022

The raftAppliedIndexTermMigration upgrade migration could be
unreliable. It iterates over all ranges and runs a Migrate request
which must be applied on all replicas. However, if any ranges merge or
replicas are unavailable, the migration fails and starts over from the
beginning. In large clusters with many ranges, this meant that it might
never complete.

This patch makes the upgrade more robust, by retrying each Migrate
request 5 times, and checkpointing the progress after every fifth batch
(1000 ranges), allowing resumption on failure. At some point this should
be made part of the migration infrastructure.

Resolves #84073.

Release note (bug fix): the 22.1 upgrade migration "21.2-56: populate
RangeAppliedState.RaftAppliedIndexTerm for all ranges" is now more
resilient to failures, by checkpointing its progress in case it fails
and must be restarted. This migration must be applied across all
ranges and replicas in the system, and can fail with 'operation "wait
for Migrate application" timed out' if any replicas are temporarily
unavailable, which is increasingly likely to happen in large clusters
with many ranges -- previously, this would restart the migration from
the start, and might never make it all the way through.

Release justification: makes upgrades more robust.

@erikgrinaker erikgrinaker requested review from irfansharif, sumeerbhola and a team July 22, 2022 15:05
@erikgrinaker erikgrinaker self-assigned this Jul 22, 2022
@blathers-crl

blathers-crl bot commented Jul 22, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@irfansharif irfansharif left a comment

Is this worth adding a test for? (I assume we've manually verified the change.)

@erikgrinaker erikgrinaker force-pushed the v22.1-raft-migration branch from 3124f01 to 271d25c Compare July 25, 2022 18:54
@blathers-crl blathers-crl bot requested a review from irfansharif July 25, 2022 18:54
@erikgrinaker

erikgrinaker commented Jul 25, 2022

> Is this worth adding a test for? (I assume we've manually verified the change.)

I hear ya, but it's a pretty annoying test to write, since it's going to need a bunch of testing knobs and stuff -- it'd be easier to test if we generalized this, but I don't really have time to get sidetracked before stability.

And yes, I've manually verified that this works on migration errors, coordinator failures, and pause/resume.

@erikgrinaker erikgrinaker force-pushed the v22.1-raft-migration branch from 271d25c to db989ff Compare July 25, 2022 20:50

@irfansharif irfansharif left a comment

A few more words in the release note about the symptoms observed before this patch would be helpful, methinks.

@erikgrinaker erikgrinaker force-pushed the v22.1-raft-migration branch 3 times, most recently from a62dfb1 to c9db8d7 Compare July 25, 2022 21:44
@erikgrinaker

erikgrinaker commented Jul 25, 2022

> A few more words in the release note about the symptoms observed before this patch would be helpful, methinks.

How's this?

@irfansharif

It's great, thanks!

@erikgrinaker erikgrinaker force-pushed the v22.1-raft-migration branch from c9db8d7 to 0508edc Compare July 26, 2022 11:28
@erikgrinaker erikgrinaker merged commit 5b4e798 into cockroachdb:release-22.1 Jul 26, 2022
craig bot pushed a commit that referenced this pull request Jul 27, 2022
84875: backupccl: handle range keys in BACKUP r=erikgrinaker a=msbutler

Previously, BACKUP would not back up range tombstones. With this patch, BACKUPs
with revision_history will back up range tombstones. Non-revision-history backups
are not affected by this diff because MVCCExportToSST already filters all
tombstones out of the backup.

Specifically, this patch replaces the iterators used in the backup_processor
with the pebbleIterator, which has built-in range key support. This refactor
introduces a 5% regression in backup runtime, even when the backup has no range
keys, though #83051 hopes to address this gap. See details below on the
benchmark experiment.

At this point, a backup with range keys is restorable, thanks to #84214. Note
that the restore codebase still touches iterators that are not range key aware.
This is not a problem because restored data does not have range keys, and
neither do the empty ranges into which restore writes data. These iterators
(e.g. in SSTBatcher and in CheckSSTConflicts) will be updated when #70428 gets
fixed.

Fixes #71155

Release note: none

To benchmark this diff, the following commands were run at SHA a5ccdc3, with
and without this commit, over three trials:
```
roachprod create -n 5 --gce-machine-type=n2-standard-16 $CLUSTER
roachprod put $CLUSTER [build] cockroach

roachprod wipe $CLUSTER; roachprod start $CLUSTER;
roachprod run $CLUSTER:1 -- "./cockroach workload init bank --rows 1000000000"
roachprod sql $CLUSTER:1 -- -e "BACKUP INTO 'gs://somebucket/michael-rangkey?AUTH=implicit'"
```

The backup on the control binary took on average 478 seconds with a standard
deviation of 13 seconds, while the backup with the treatment binary took on
average 499 seconds with a standard deviation of 8 seconds.

84883: kvserver: add server-side transaction retry metrics r=arulajmani a=arulajmani

This patch adds a few new metrics to track successful/failed
server-side transaction retries. Specifically, whenever we attempt
to retry a read or write batch or run into a read within uncertainty
interval error, we increment specific counters indicating if the
retry was successful or not.

Release note: None

85074: upgrades: add checkpointing for `raftAppliedIndexTermMigration` r=irfansharif a=erikgrinaker

Forward-port of #84909, for posterity.

----

The `raftAppliedIndexTermMigration` upgrade migration could be
unreliable. It iterates over all ranges and runs a `Migrate` request
which must be applied on all replicas. However, if any ranges merge or
replicas are unavailable, the migration fails and starts over from the
beginning. In large clusters with many ranges, this meant that it might
never complete.

This patch makes the upgrade more robust, by retrying each `Migrate`
request 5 times, and checkpointing the progress after every fifth batch
(1000 ranges), allowing resumption on failure. At some point this should
be made part of the migration infrastructure.

NB: This fix was initially submitted for 22.1, and even though the
migration will be removed for 22.2, it is forward-ported for posterity.

Release note: None

85086: eval: stop ignoring all ResolveOIDFromOID errors r=ajwerner a=rafiss

fixes #84448

The decision about whether an error is safe to ignore is made at the
place where the error is created/returned. This way, the callers don't
need to be aware of any new error codes that the implementation may
start returning in the future.

Release note (bug fix): Fixed incorrect error handling that could cause
casts to OID types to fail in some cases.

Co-authored-by: Michael Butler
Co-authored-by: Arul Ajmani
Co-authored-by: Erik Grinaker
Co-authored-by: Rafi Shamim
@erikgrinaker erikgrinaker deleted the v22.1-raft-migration branch July 27, 2022 09:16