server: introduce a hook for short-running migrations #42822
Conversation
4c1f312 to 23ef7bb
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/batcheval/cmd_migrate.go, line 64 at r2 (raw file):
// Time to migrate by deleting the legacy key. The downstream-of-Raft
// code will atomically rewrite the truncated state (supplied via the
// side effect) into the new unreplicated key.
Just curious: what prevents a migrate and a truncate command from being proposed concurrently and ordered in some arbitrary way?
And is it possible that there is already a truncate that will delete the legacy key but has just not been committed and applied yet, so this code will repeat some unnecessary work but is otherwise harmless?
For one, the mechanism only reaches replicas that are members of the raft group, that is, it won't touch replicas which are gc'able. For the migrations at hand this means that in 20.2 there may - in theory - still be replicas that have a replicated truncated state. I believe that our recent efforts around not re-using replicas across replicaIDs have made sure that this isn't an issue for this particular migration, but in general it will have to remain on the radar.
This should be in a comment somewhere.
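As a reader aid, here is a minimal sketch of the downstream-of-Raft rewrite being discussed, assuming the usual pkg/storage/engine helpers; the function name and simplified signature are my own, not the PR's code. The point is that the rewrite is idempotent: deleting an already-deleted legacy key is a no-op, so racing with a truncation that removed it only repeats harmless work.

// Sketch (hypothetical helper, not the PR's code): atomically move the
// truncated state from the legacy replicated key to the unreplicated key.
// Deleting an already-deleted legacy key is a no-op, which is why racing
// with a log truncation that has removed it is merely wasted work.
func migrateTruncatedState(
	ctx context.Context, rw engine.ReadWriter, rangeID roachpb.RangeID, ts *roachpb.RaftTruncatedState,
) error {
	// Remove the legacy replicated key (no-op if it is already gone).
	legacyKey := keys.RaftTruncatedStateLegacyKey(rangeID)
	if err := engine.MVCCDelete(ctx, rw, nil /* ms */, legacyKey, hlc.Timestamp{}, nil /* txn */); err != nil {
		return err
	}
	// Rewrite the state under the new unreplicated key.
	return engine.MVCCPutProto(
		ctx, rw, nil /* ms */, keys.RaftTruncatedStateKey(rangeID), hlc.Timestamp{}, nil /* txn */, ts,
	)
}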
Reviewed 2 of 2 files at r1, 34 of 34 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, and @tbg)
pkg/roachpb/api.proto, line 1480 at r2 (raw file):
// any legacy modes that they are currently in. When this command returns, the
// ranges are ready to run with the most up to date cluster version supported
// by this binary.
nit: this binary? which binary?
pkg/server/server.go, line 704 at r2 (raw file):
LeaseHolderCache: s.distSender.LeaseHolderCache(),
VersionUpgradeHook: func(ctx context.Context, newV roachpb.Version) error {
	if !cluster.Version.ActiveVersionOrEmpty(ctx, s.st).Less(newV) {
For my understanding, when is this ever true? Maybe I'm misunderstanding what "allowed to transition" means in set_cluster_settings.go.
pkg/server/server.go, line 712 at r2 (raw file):
req.EndKey = keys.MaxKey
b.AddRawRequest(&req)
// TODO(tbg): make sure all stores have synced once to persist any
For my understanding, why does this need to happen?
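To connect the fragments above, here is a rough sketch of the hook shape they appear to come from, reconstructed only from the quoted lines; the MigrateRequest type and fields and the use of s.db are assumptions, not the PR's exact code.

VersionUpgradeHook: func(ctx context.Context, newV roachpb.Version) error {
	if !cluster.Version.ActiveVersionOrEmpty(ctx, s.st).Less(newV) {
		// Nothing to do: the active version is already at or past newV.
		return nil
	}
	// Nudge every range out of its legacy modes before the version bump.
	var req roachpb.MigrateRequest // assumed name of the ranged write introduced here
	req.Key = keys.LocalMax
	req.EndKey = keys.MaxKey
	var b client.Batch
	b.AddRawRequest(&req)
	// TODO(tbg): make sure all stores have synced once to persist any
	// applied migrations before declaring the upgrade done.
	return s.db.Run(ctx, &b)
},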
pkg/storage/replica_raft_truncation_test.go, line 88 at r2 (raw file):
}
apply, err := handleTruncatedStateBelowRaft(ctx, &prevTruncatedState, newTruncatedState, loader, eng, false /* assert */)
nit: assert -> assertNoLegacy.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/batcheval/cmd_migrate.go, line 64 at r2 (raw file):
Just curious: what prevents a migrate and a truncate command from being proposed concurrently and ordered in some arbitrary way?
The span latches, IIUC. Both commands latch on keys.RaftTruncatedStateLegacyKey, though I'd like to confirm that this is the case.
I'll be picking this up after @nvanbenschoten (and/or @danhhz) take a look at what's here already.
This approach looks good to me. I share the concern you have about replicas that aren't part of a raft group not hearing about the migration, but I'm not sure what we can do about this. One thought is to try to flush out these replicas on node startup by forcing them to apply all of the committed entries they know about.
The other thing we should note here is that WaitForApplication requires all replicas to be online. This means that cluster versions can no longer be upgraded while any nodes are offline. I don't believe this was the case before. This change in the availability of cluster upgrades was intentional in #39182, but I want to double-check that it's also intentional here.
Reviewed 2 of 2 files at r1, 34 of 34 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @sumeerbhola, and @tbg)
pkg/roachpb/method.go, line 161 at r2 (raw file):
// RangeStats returns the MVCC statistics for a range.
RangeStats
// Migrate proactively forces a range to transition out of any legacy
"a range or span of ranges"
pkg/server/server.go, line 704 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
For my understanding, when is this ever true? Maybe I'm misunderstanding what "allowed to transition" means in set_cluster_settings.go.
+1. Could you give this a comment?
pkg/sql/exec_util.go, line 535 at r2 (raw file):
// VersionUpgradeHook is called after validating a `SET CLUSTER SETTING
// version` but before executing it. It can carry out certain kinds of
Consider making this a bit more explicit -- if the hook does not complete without error, the cluster version will not be upgraded.
pkg/storage/client_replica_test.go, line 2962 at r2 (raw file):
}{
	{"ts=new,as=new", stateloader.TruncatedStateUnreplicated},
	{"ts=legacy,as=new", stateloader.TruncatedStateLegacyReplicated},
ts=new,as=legacy isn't possible? Because ts=new implies at least one Raft entry, which would have necessarily triggered the applied state key migration? I'd spell out why somewhere.
pkg/storage/replica_write.go, line 212 at r2 (raw file):
desc := r.Desc()
// NB: waitForApplication already has a timeout.
propResult.Err = roachpb.NewError(waitForApplication(
I was concerned that either the pre-apply Raft entry acknowledgment change or the Raft entry application batching change could have broken this, but I just confirmed that neither does. waitForApplication waits on the ReplicaState's LeaseAppliedIndex field to move past its desired max lease index, so those kinds of changes are compatible.
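To illustrate why those changes are compatible, a simplified sketch of the waiting logic (not the actual waitForApplication implementation; the fetchLAI callback stands in for whatever internal RPC reports a replica's applied state):

// Poll each replica until its LeaseAppliedIndex has caught up to the
// max lease index assigned to our command, or the context expires.
func waitForLeaseAppliedIndex(
	ctx context.Context,
	replicas []roachpb.ReplicaDescriptor,
	fetchLAI func(context.Context, roachpb.ReplicaDescriptor) (uint64, error),
	target uint64,
) error {
	for _, repl := range replicas {
		for {
			lai, err := fetchLAI(ctx, repl)
			if err != nil {
				return err
			}
			if lai >= target {
				break // this replica has applied the command
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(50 * time.Millisecond):
				// Retry; batching/async application only delays when the
				// index advances, it never rewinds it.
			}
		}
	}
	return nil
}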
pkg/storage/batcheval/cmd_migrate.go, line 39 at r2 (raw file):
// below-Raft migrations. When this command returns, all of the below-Raft
// features known to be available are enabled.
func Migrate(
What is the plan for removing sub-migrations from here? I'd push for restructuring this a little bit such that individual migrations are self-contained (see pkg/sqlmigrations/migrations.go). This would include each migration configuring its own declared keys and evaluation function. We could then discuss exactly when each migration can be removed and potentially test each migration in isolation.
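As a strawman for that restructuring (the type and registry names below are made up for illustration, not anything in this PR): each sub-migration would bundle its own key declarations and evaluation logic, keyed by the cluster version that introduces it, so it can be tested and later deleted in isolation.

// Hypothetical per-version registry of self-contained below-Raft migrations.
type subMigration struct {
	// declareKeys adds the spans this migration needs to latch and write.
	declareKeys func(rangeID roachpb.RangeID, spans *spanset.SpanSet)
	// eval performs the rewrite for a single range and returns any
	// side effects to be applied below Raft.
	eval func(ctx context.Context, rw engine.ReadWriter, rangeID roachpb.RangeID) (result.Result, error)
}

// Migrate would run every entry up to the requested version; removing a
// migration in a later release is then just deleting its map entry (and
// its tests).
var subMigrations = map[roachpb.Version]subMigration{
	// e.g. entries for the truncated-state and applied-state migrations.
}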
pkg/storage/batcheval/cmd_migrate.go, line 64 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Just curious: what prevents a migrate and a truncate command from being proposed concurrently and ordered in some arbitrary way?
The span latches, IIUC. Both commands latch on keys.RaftTruncatedStateLegacyKey, though I'd like to confirm that this is the case.
Yes, the latch on keys.RaftTruncatedStateLegacyKey should prevent this.
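Concretely, the mutual exclusion comes from both commands declaring a write span over that key, along the lines of the sketch below (simplified signature, illustrative only):

// Both Migrate and TruncateLog declare a non-MVCC write latch over the
// legacy truncated state key, so the spanlatch manager serializes their
// evaluation regardless of proposal order.
func declareTruncatedStateMigrationKeys(rangeID roachpb.RangeID, spans *spanset.SpanSet) {
	spans.AddNonMVCC(spanset.SpanReadWrite, roachpb.Span{
		Key: keys.RaftTruncatedStateLegacyKey(rangeID),
	})
}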
pkg/storage/stateloader/initial.go, line 73 at r2 (raw file):
s.Lease = &lease
s.GCThreshold = &gcThreshold
if truncStateType != TruncatedStateLegacyReplicatedAndNoAppliedKey {
nit: I think this would be a lot less surprising if this was written like:
s.UsingAppliedStateKey = true
if truncStateType == TruncatedStateLegacyReplicatedAndNoAppliedKey {
	// For testing only. Outside of tests, all new Ranges begin life using
	// the applied state key.
	s.UsingAppliedStateKey = false
}
I am not sure I have the right understanding of this in-flight snapshot issue. As I understand it, 20.1 will have code that understands these specific migrations, but 20.2 will remove that code. This includes the code to process these migrations -- in the general case, though it seems that the migrations here do not require any special code below raft -- is that correct?
Say everything has switched to 20.1 and for a range a migrate has been added at raft index 15, but not yet committed and applied at all replicas. So the migrate is not "done" (has not returned to the client). Before it is applied, a snapshot is taken at index 10 to send to some follower -- this follower must be 20.1 since one is not allowed to switch to 20.2 until these migrations are "done". Then the migration is "done". Is it possible for that follower to switch to 20.2 and receive that snapshot? I realize I don't understand the lifecycle of a snapshot -- is one generated for a particular follower, and can it no longer be used if that receiving node restarts? I guess there is another race in that the snapshot starts getting generated before "done", and by the time it is generated the migration is "done" and the follower has switched to 20.2, so we send it the snapshot and the raft log from 11-15. But in general it may not know how to apply the migrate command at 15?
If there is indeed a problem (that I am missing) could one make the "done" condition be that all the replicas have truncated their raft logs to after the migrate command? Then an old snapshot is useless since the replica that applies it has to request another snapshot.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @sumeerbhola, and @tbg)
@sumeerbhola Raft snapshots are basically generated from a RocksDB snapshot, and fundamentally snapshots can be delayed arbitrarily (though that doesn't really tend to happen in practice); a lot of time can pass between the initial opening of the iterator (at the sender) and the ingestion of the snapshot (at the recipient).
Snapshots serialize with raft commands and they'll never "rewind" the applied index, so once a replica is migrated, it stays migrated (assuming we properly synced as we ought to do in this particular case). If a replica is unmigrated but an active part of the range, then the currently written Migrate command will wait for the command to be applied either way, so nothing goes wrong. If the snapshot contains the migration (i.e. has it applied) then there's also no problem (that's what every snapshot, roughly, will look like once the migration has succeeded).
The problem is with old snapshots that create a replica that has since been removed. Something like this:
- snapshot gets created on range with members {s1,s2,s3}, addressed to s3
- snapshot gets held up
- s3 gets removed, range now only has {s1,s2}
- cluster gets upgraded (migrates replicas on s1,s2)
- snapshot arrives, creates replica s3
This replica created in the last step is gc'able, but we don't want that replica to come into existence in the first place. Similarly, if we go through the above history without snapshots, any removal can leave a replica around for a bit of time, and this replica isn't (and won't be) "migrated". Both will never serve traffic, but just that they exist (and have a few moving parts) potentially causes bugs in code paths that have since had the migration removed.
My current thinking is that it's worth adding a version to the replica state which is initially zero around here:
UsingAppliedStateKey bool `protobuf:"varint,11,opt,name=using_applied_state_key,json=usingAppliedStateKey,proto3" json:"using_applied_state_key,omitempty"`
I'll flesh that out as I return to this PR.
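To make the idea concrete, a purely hypothetical sketch (the Version field does not exist yet, and the package and field names here are assumptions): a version on the replica state would let a recipient refuse snapshots that predate the migration.

// Hypothetical check when receiving a snapshot: if the snapshot's replica
// state carries a version (zero for pre-migration senders) below what this
// store requires, reject it rather than instantiating a stale replica.
func shouldRejectSnapshot(snapState storagepb.ReplicaState, required roachpb.Version) bool {
	return snapState.Version.Less(required) // Version is the proposed new field
}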
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @sumeerbhola, and @tbg)
Thanks for clarifying. Regarding "Both will never serve traffic, but just that they exist (and have a few moving parts) potentially causes bugs in code paths that have since had the migration removed.": if a node creates a Replica but is not a member of the raft group, so not evaluating any proposals or serving reads, why would the code look at the contents of the persistent state for that replica - was there a particular case you had in mind?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @nvanbenschoten, @sumeerbhola, and @tbg)
A replica that isn't supposed to exist only exists because the surrounding store hasn't yet realized that this replica should be deleted. As such, the replica will behave like any other, except it won't be very successful at doing much (because its raft peers won't talk to it). But it means there's a large surface area to worry about, and it'll be much cleaner to have the migration system ensure that this kind of rogue stale replica does not exist anymore anywhere in the system by the time the migration has completed.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @nvanbenschoten, @sumeerbhola, and @tbg)
// Migrate ensures that the range proactively carries out any outstanding
// below-Raft migrations. When this command returns, all of the below-Raft
// features known to be available are enabled.
func Migrate(
This would look up the actual code from some kind of map[roachpb.Version]Command (or other structure) @irfansharif
This PR onboards the first real long-running migration using the infrastructure we've been building up within pkg/migration. It adds in the final missing pieces described in our original RFC (cockroachdb#48843). These components were originally prototyped in cockroachdb#56107. The migration in question (which happens to be a below-Raft one, first attempted in cockroachdb#42822) now lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

The missing pieces we introduce alongside this migration are:

a. The `Migrate` KV request. This forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific version, moving out of any legacy modes they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system.

b. `IterateRangeDescriptors`. This provides a handle on every range descriptor in the system, which callers can then use to send out arbitrary KV requests in order to run arbitrary KV-level migrations. These requests will typically just be the `Migrate` request, with added code next to the `Migrate` command implementation to do the specific KV-level things intended for the specified version.

c. The `system.migrations` table. We use it to store metadata about ongoing migrations for external visibility/introspection. The schema (listed below) is designed with an eye towards scriptability. We want operators to be able to programmatically use this table to control their upgrade processes, if needed.

    CREATE TABLE public.migrations (
        version STRING NOT NULL,
        status STRING NOT NULL,
        description STRING NOT NULL,
        start TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP,
        completed TIMESTAMP NULL,
        progress STRING NULL,
        CONSTRAINT "primary" PRIMARY KEY (version ASC),
        FAMILY "primary" (version, status, description, start, completed, progress)
    )

Release note (general change): Cluster version upgrades, as initiated by `SET CLUSTER SETTING version = <major>-<minor>`, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized.

Release note (general change): We introduce a new `system.migrations` table for introspection into crdb internal data migrations. These migrations are the "maintenance duties" mentioned above. The table surfaces the currently ongoing migrations, the previously completed migrations, and in the case of failure, the errors from the last failed attempt.
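A rough sketch of how the pieces described above might compose; the IterateRangeDescriptors signature is simplified and the MigrateRequest field names are assumptions, so read it as a shape rather than the real API:

// Walk every range descriptor and ask KV to migrate that range's span to
// the target version, waiting for durable application on all replicas.
func migrateAllRanges(
	ctx context.Context,
	db *kv.DB,
	target roachpb.Version,
	iterateRangeDescriptors func(context.Context, func(roachpb.RangeDescriptor) error) error,
) error {
	return iterateRangeDescriptors(ctx, func(desc roachpb.RangeDescriptor) error {
		var req roachpb.MigrateRequest
		req.Key = desc.StartKey.AsRawKey()
		req.EndKey = desc.EndKey.AsRawKey()
		req.Version = target
		var b kv.Batch
		b.AddRawRequest(&req)
		return db.Run(ctx, &b)
	})
}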
TODO: see if this migration is actually "short-running". That is, in a
sufficiently large cluster, does this cause significant load?
This is a baby version of #39182 that handles only short-running migrations
but is useful in itself because it allows us to migrate fully into the
following two KV below-Raft migrations:
1. use the RangeAppliedState on all ranges
2. use the unreplicated TruncatedState on all ranges
These two migrations have been around for a while and have been getting
in the way of things. While ideally we introduce #39182 in the near future
and use it to address a larger class of migration concerns, this PR serves
as a starting point and work done on #39182 should be able to absorb this
PR with relative ease.
With this PR, legacy code related to 1) and 2) above can be removed from
`master` once the 20.1 branch is cut, i.e. roughly in April 2020.
The main ingredients in this PR are
a) introduce a hook that is called during `SET CLUSTER SETTING version`,
   after the change has been validated but before it is made.
b) introduce a KV-level `Migrate` ranged write command that triggers
   the migrations for 1) and 2) above. It is called from the hook for
   all of the keyspace.
c) before returning to the client, `Migrate` waits for the command to
   be durably applied on all followers.
Trying to get this 100% correct has proven tricky, perhaps foreshadowing
similar issues that await us once we try #39182 in earnest. For one, the
mechanism only reaches replicas that are members of the raft group, that is, it
won't touch replicas which are gc'able. For the migrations at hand this
means that in 20.2 there may - in theory - still be replicas that have a
replicated truncated state. I believe that our recent efforts around not
re-using replicas across replicaIDs have made sure that this isn't an issue
for this particular migration, but in general it will have to remain on the
radar. Similarly, it seems hard to prove conclusively that no snapshot is
in-flight that would set up a new follower with a state predating the
explicit migration, though it would be exceptionally rare in practice.
Release note (general change): version upgrades now perform maintenance
duties that may slightly delay the `SET CLUSTER SETTING version` command
and may cause a small amount of additional load on the cluster while an
upgrade is being finalized.