server: introduce a hook for short-running migrations #42822
Conversation
4c1f312 to 23ef7bb
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/batcheval/cmd_migrate.go, line 64 at r2 (raw file):
// Time to migrate by deleting the legacy key. The downstream-of-Raft
// code will atomically rewrite the truncated state (supplied via the
// side effect) into the new unreplicated key.
Just curious: what prevents a migrate and a truncate command from being proposed concurrently and ordered in some arbitrary way?
And is it possible that there is already a truncate that will delete the legacy key but has just not been committed and applied yet, so this code will repeat some unnecessary work but is otherwise harmless?
For one, the mechanism only reaches replicas that are members of the raft group, that is, it won't touch replicas which are gc'able. For the migrations at hand this means that in 20.2 there may - in theory - still be replicas that have a replicated truncated state. I believe that our recent efforts around not re-using replicas across replicaIDs have made sure that this isn't an issue for this particular migration, but in general it will have to remain on the radar.
This should be in a comment somewhere.
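As a reader aid, here is a minimal sketch of the downstream-of-Raft rewrite being discussed, assuming the usual pkg/storage/engine helpers; the function name and simplified signature are my own, not the PR's code. The point is that the rewrite is idempotent: deleting an already-deleted legacy key is a no-op, so racing with a truncation that removed it only repeats harmless work.

// Sketch (hypothetical helper, not the PR's code): atomically move the
// truncated state from the legacy replicated key to the unreplicated key.
// Deleting an already-deleted legacy key is a no-op, which is why racing
// with a log truncation that has removed it is merely wasted work.
func migrateTruncatedState(
	ctx context.Context, rw engine.ReadWriter, rangeID roachpb.RangeID, ts *roachpb.RaftTruncatedState,
) error {
	// Remove the legacy replicated key (no-op if it is already gone).
	legacyKey := keys.RaftTruncatedStateLegacyKey(rangeID)
	if err := engine.MVCCDelete(ctx, rw, nil /* ms */, legacyKey, hlc.Timestamp{}, nil /* txn */); err != nil {
		return err
	}
	// Rewrite the state under the new unreplicated key.
	return engine.MVCCPutProto(
		ctx, rw, nil /* ms */, keys.RaftTruncatedStateKey(rangeID), hlc.Timestamp{}, nil /* txn */, ts,
	)
}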
Reviewed 2 of 2 files at r1, 34 of 34 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, and @tbg)
pkg/roachpb/api.proto, line 1480 at r2 (raw file):
// any legacy modes that they are currently in. When this command returns, the
// ranges are ready to run with the most up to date cluster version supported
// by this binary.
nit: this binary? which binary?
pkg/server/server.go, line 704 at r2 (raw file):
LeaseHolderCache: s.distSender.LeaseHolderCache(),
VersionUpgradeHook: func(ctx context.Context, newV roachpb.Version) error {
	if !cluster.Version.ActiveVersionOrEmpty(ctx, s.st).Less(newV) {
For my understanding, when is this ever true? Maybe I'm misunderstanding what "allowed to transition" means in set_cluster_settings.go.
pkg/server/server.go, line 712 at r2 (raw file):
req.EndKey = keys.MaxKey
b.AddRawRequest(&req)
// TODO(tbg): make sure all stores have synced once to persist any
For my understanding, why does this need to happen?
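To connect the fragments above, here is a rough sketch of the hook shape they appear to come from, reconstructed only from the quoted lines; the MigrateRequest type and fields and the use of s.db are assumptions, not the PR's exact code.

VersionUpgradeHook: func(ctx context.Context, newV roachpb.Version) error {
	if !cluster.Version.ActiveVersionOrEmpty(ctx, s.st).Less(newV) {
		// Nothing to do: the active version is already at or past newV.
		return nil
	}
	// Nudge every range out of its legacy modes before the version bump.
	var req roachpb.MigrateRequest // assumed name of the ranged write introduced here
	req.Key = keys.LocalMax
	req.EndKey = keys.MaxKey
	var b client.Batch
	b.AddRawRequest(&req)
	// TODO(tbg): make sure all stores have synced once to persist any
	// applied migrations before declaring the upgrade done.
	return s.db.Run(ctx, &b)
},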
pkg/storage/replica_raft_truncation_test.go, line 88 at r2 (raw file):
}
apply, err := handleTruncatedStateBelowRaft(ctx, &prevTruncatedState, newTruncatedState, loader, eng, false /* assert */)
nit: assert -> assertNoLegacy.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @nvanbenschoten, @sumeerbhola, and @tbg)
pkg/storage/batcheval/cmd_migrate.go, line 64 at r2 (raw file):
Just curious: what prevents a migrate and a truncate command from being proposed concurrently and ordered in some arbitrary way?
The span latches, IIUC. Both commands latch on keys.RaftTruncatedStateLegacyKey, though I'd like to confirm that this is the case.
I'll be picking this up after @nvanbenschoten (and/or @danhhz) take a look at what's here already.
This approach looks good to me. I share the concern you have about replicas that aren't part of a raft group not hearing about the migration, but I'm not sure what we can do about this. One thought is to try to flush out these replicas on node startup by forcing them to apply all of the committed entries they know about.
The other thing we should note here is that WaitForApplication requires all replicas to be online. This means that cluster versions can no longer be upgraded while any nodes are offline. I don't believe this was the case before. This change in the availability of cluster upgrades was intentional in #39182, but I want to double-check that it's also intentional here.
Reviewed 2 of 2 files at r1, 34 of 34 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @sumeerbhola, and @tbg)
pkg/roachpb/method.go, line 161 at r2 (raw file):
// RangeStats returns the MVCC statistics for a range.
RangeStats
// Migrate proactively forces a range to transition out of any legacy
"a range or span of ranges"
pkg/server/server.go, line 704 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
For my understanding, when is this ever true? Maybe I'm misunderstanding what "allowed to transition" means in set_cluster_settings.go.
+1. Could you give this a comment?
pkg/sql/exec_util.go, line 535 at r2 (raw file):
// VersionUpgradeHook is called after validating a `SET CLUSTER SETTING
// version` but before executing it. It can carry out certain kinds of
Consider making this a bit more explicit -- if the hook does not complete without error, the cluster version will not be upgraded.
pkg/storage/client_replica_test.go, line 2962 at r2 (raw file):
}{
	{"ts=new,as=new", stateloader.TruncatedStateUnreplicated},
	{"ts=legacy,as=new", stateloader.TruncatedStateLegacyReplicated},
ts=new,as=legacy isn't possible? Because ts=new implies at least one Raft entry, which would have necessarily triggered the applied state key migration? I'd spell out why somewhere.
pkg/storage/replica_write.go, line 212 at r2 (raw file):
desc := r.Desc()
// NB: waitForApplication already has a timeout.
propResult.Err = roachpb.NewError(waitForApplication(
I was concerned that either the pre-apply Raft entry acknowledgment change or the Raft entry application batching change could have broken this, but I just confirmed that neither does. waitForApplication waits on the ReplicaState's LeaseAppliedIndex field to move past its desired max lease index, so those kinds of changes are compatible.
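To illustrate why those changes are compatible, a simplified sketch of the waiting logic (not the actual waitForApplication implementation; the fetchLAI callback stands in for whatever internal RPC reports a replica's applied state):

// Poll each replica until its LeaseAppliedIndex has caught up to the
// max lease index assigned to our command, or the context expires.
func waitForLeaseAppliedIndex(
	ctx context.Context,
	replicas []roachpb.ReplicaDescriptor,
	fetchLAI func(context.Context, roachpb.ReplicaDescriptor) (uint64, error),
	target uint64,
) error {
	for _, repl := range replicas {
		for {
			lai, err := fetchLAI(ctx, repl)
			if err != nil {
				return err
			}
			if lai >= target {
				break // this replica has applied the command
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(50 * time.Millisecond):
				// Retry; batching/async application only delays when the
				// index advances, it never rewinds it.
			}
		}
	}
	return nil
}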
pkg/storage/batcheval/cmd_migrate.go, line 39 at r2 (raw file):
// below-Raft migrations. When this command returns, all of the below-Raft
// features known to be available are enabled.
func Migrate(
What is the plan for removing sub-migrations from here? I'd push for restructuring this a little bit such that individual migrations are self-contained (see pkg/sqlmigrations/migrations.go). This would include each migration configuring its own declared keys and evaluation function. We could then discuss exactly when each migration can be removed and potentially test each migration in isolation.
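As a strawman for that restructuring (the type and registry names below are made up for illustration, not anything in this PR): each sub-migration would bundle its own key declarations and evaluation logic, keyed by the cluster version that introduces it, so it can be tested and later deleted in isolation.

// Hypothetical per-version registry of self-contained below-Raft migrations.
type subMigration struct {
	// declareKeys adds the spans this migration needs to latch and write.
	declareKeys func(rangeID roachpb.RangeID, spans *spanset.SpanSet)
	// eval performs the rewrite for a single range and returns any
	// side effects to be applied below Raft.
	eval func(ctx context.Context, rw engine.ReadWriter, rangeID roachpb.RangeID) (result.Result, error)
}

// Migrate would run every entry up to the requested version; removing a
// migration in a later release is then just deleting its map entry (and
// its tests).
var subMigrations = map[roachpb.Version]subMigration{
	// e.g. entries for the truncated-state and applied-state migrations.
}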
pkg/storage/batcheval/cmd_migrate.go, line 64 at r2 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Just curious: what prevents a migrate and a truncate command from being proposed concurrently and ordered in some arbitrary way?
The span latches, IIUC. Both commands latch on keys.RaftTruncatedStateLegacyKey, though I'd like to confirm that this is the case.
Yes, the latch on keys.RaftTruncatedStateLegacyKey should prevent this.
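Concretely, the mutual exclusion comes from both commands declaring a write span over that key, along the lines of the sketch below (simplified signature, illustrative only):

// Both Migrate and TruncateLog declare a non-MVCC write latch over the
// legacy truncated state key, so the spanlatch manager serializes their
// evaluation regardless of proposal order.
func declareTruncatedStateMigrationKeys(rangeID roachpb.RangeID, spans *spanset.SpanSet) {
	spans.AddNonMVCC(spanset.SpanReadWrite, roachpb.Span{
		Key: keys.RaftTruncatedStateLegacyKey(rangeID),
	})
}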
pkg/storage/stateloader/initial.go, line 73 at r2 (raw file):
s.Lease = &lease
s.GCThreshold = &gcThreshold
if truncStateType != TruncatedStateLegacyReplicatedAndNoAppliedKey {
nit: I think this would be a lot less surprising if this was written like:
s.UsingAppliedStateKey = true
if truncStateType == TruncatedStateLegacyReplicatedAndNoAppliedKey {
	// For testing only. Outside of tests, all new Ranges begin life using
	// the applied state key.
	s.UsingAppliedStateKey = false
}
I am not sure I have the right understanding of this in-flight snapshot issue. As I understand it, 20.1 will have code that understands these specific migrations, but 20.2 will remove that code. This includes the code to process these migrations -- in the general case, though it seems that the migrations here do not require any special code below raft -- is that correct?
Say everything has switched to 20.1 and for a range a migrate has been added at raft index 15, but not yet committed and applied at all replicas. So the migrate is not "done" (has not returned to the client). Before it is applied, a snapshot is taken at index 10 to send to some follower -- this follower must be 20.1 since one is not allowed to switch to 20.2 until these migrations are "done". Then the migration is "done". Is it possible for that follower to switch to 20.2 and receive that snapshot? I realize I don't understand the lifecycle of a snapshot -- is one generated for a particular follower, and can it no longer be used if that receiving node restarts? I guess there is another race in that the snapshot starts getting generated before "done", and by the time it is generated the migration is "done" and the follower has switched to 20.2, so we send it the snapshot and the raft log from 11-15. But in general it may not know how to apply the migrate command at 15?
If there is indeed a problem (that I am missing) could one make the "done" condition be that all the replicas have truncated their raft logs to after the migrate command? Then an old snapshot is useless since the replica that applies it has to request another snapshot.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @sumeerbhola, and @tbg)
@sumeerbhola Raft snapshots are basically generated from a RocksDB snapshot, and fundamentally snapshots can be delayed arbitrarily (though that doesn't really tend to happen in practice); a lot of time can pass between the initial opening of the iterator (at the sender) and the ingestion of the snapshot (at the recipient).
Snapshots serialize with raft commands and they'll never "rewind" the applied index, so once a replica is migrated, it stays migrated (assuming we properly synced as we ought to do in this particular case). If a replica is unmigrated but an active part of the range, then the currently written Migrate command will wait for the command to be applied either way, so nothing goes wrong. If the snapshot contains the migration (i.e. has it applied) then there's also no problem (that's what every snapshot, roughly, will look like once the migration has succeeded).
The problem is with old snapshots that create a replica that has since been removed. Something like this:
- snapshot gets created on range with members {s1,s2,s3}, addressed to s3
- snapshot gets held up
- s3 gets removed, range now only has {s1,s2}
- cluster gets upgraded (migrates replicas on s1,s2)
- snapshot arrives, creates replica s3
This replica created in the last step is gc'able, but we don't want that replica to come into existence in the first place. Similarly, if we go through the above history without snapshots, any removal can leave a replica around for a bit of time, and this replica isn't (and won't be) "migrated". Both will never serve traffic, but just that they exist (and have a few moving parts) potentially causes bugs in code paths that have since had the migration removed.
My current thinking is that it's worth adding a version to the replica state which is initially zero around here:
UsingAppliedStateKey bool `protobuf:"varint,11,opt,name=using_applied_state_key,json=usingAppliedStateKey,proto3" json:"using_applied_state_key,omitempty"`
I'll flesh that out as I return to this PR.
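To make the idea concrete, a purely hypothetical sketch (the Version field does not exist yet, and the package and field names here are assumptions): a version on the replica state would let a recipient refuse snapshots that predate the migration.

// Hypothetical check when receiving a snapshot: if the snapshot's replica
// state carries a version (zero for pre-migration senders) below what this
// store requires, reject it rather than instantiating a stale replica.
func shouldRejectSnapshot(snapState storagepb.ReplicaState, required roachpb.Version) bool {
	return snapState.Version.Less(required) // Version is the proposed new field
}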
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @sumeerbhola, and @tbg)
Thanks for clarifying. Regarding "Both will never serve traffic, but just that they exist (and have a few moving parts) potentially causes bugs in code paths that have since had the migration removed.": if a node creates a Replica but is not a member of the raft group, so not evaluating any proposals or serving reads, why would the code look at the contents of the persistent state for that replica - was there a particular case you had in mind?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @nvanbenschoten, @sumeerbhola, and @tbg)
A replica that isn't supposed to exist only exists because the surrounding store hasn't yet realized that this replica should be deleted. As such, the replica will behave like any other, except it won't be very successful at doing much (because its raft peers won't talk to it). But it means there's a large surface area to worry about, and it'll be much cleaner to have the migration system ensure that this kind of rogue stale replica does not exist anymore anywhere in the system by the time the migration has completed.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz, @irfansharif, @nvanbenschoten, @sumeerbhola, and @tbg)
// Migrate ensures that the range proactively carries out any outstanding
// below-Raft migrations. When this command returns, all of the below-Raft
// features known to be available are enabled.
func Migrate(
This would look up the actual code from some kind of map[roachpb.Version]Command (or other structure) @irfansharif
This PR onboards the first real long-running migration using the infrastructure we've been building up within pkg/migration. It adds in the final missing pieces described in our original RFC (cockroachdb#48843). These components were originally prototyped in cockroachdb#56107. The migration in question (which happens to be a below-Raft one, first attempted in cockroachdb#42822) now lets us do the following:

i. Use the RangeAppliedState on all ranges
ii. Use the unreplicated TruncatedState on all ranges

The missing pieces we introduce alongside this migration are:

a. The `Migrate` KV request. This forces all ranges overlapping with the request spans to execute the (below-raft) migrations corresponding to the specific version, moving out of any legacy modes they may currently be in. KV waits for this command to durably apply on all the replicas before returning, guaranteeing to the caller that all pre-migration state has been completely purged from the system.

b. `IterateRangeDescriptors`. This provides a handle on every range descriptor in the system, which callers can then use to send out arbitrary KV requests in order to run arbitrary KV-level migrations. These requests will typically just be the `Migrate` request, with added code next to the `Migrate` command implementation to do the specific KV-level things intended for the specified version.

c. The `system.migrations` table. We use it to store metadata about ongoing migrations for external visibility/introspection. The schema (listed below) is designed with an eye towards scriptability. We want operators to be able to programmatically use this table to control their upgrade processes, if needed.

    CREATE TABLE public.migrations (
        version STRING NOT NULL,
        status STRING NOT NULL,
        description STRING NOT NULL,
        start TIMESTAMP NOT NULL DEFAULT now():::TIMESTAMP,
        completed TIMESTAMP NULL,
        progress STRING NULL,
        CONSTRAINT "primary" PRIMARY KEY (version ASC),
        FAMILY "primary" (version, status, description, start, completed, progress)
    )

Release note (general change): Cluster version upgrades, as initiated by `SET CLUSTER SETTING version = <major>-<minor>`, now perform internal maintenance duties that will delay how long it takes for the command to complete. The delay is proportional to the amount of data currently stored in the cluster. The cluster will also experience a small amount of additional load during this period while the upgrade is being finalized.

Release note (general change): We introduce a new `system.migrations` table for introspection into crdb internal data migrations. These migrations are the "maintenance duties" mentioned above. The table surfaces the currently ongoing migrations, the previously completed migrations, and in the case of failure, the errors from the last failed attempt.
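A rough sketch of how the pieces described above might compose; the IterateRangeDescriptors signature is simplified and the MigrateRequest field names are assumptions, so read it as a shape rather than the real API:

// Walk every range descriptor and ask KV to migrate that range's span to
// the target version, waiting for durable application on all replicas.
func migrateAllRanges(
	ctx context.Context,
	db *kv.DB,
	target roachpb.Version,
	iterateRangeDescriptors func(context.Context, func(roachpb.RangeDescriptor) error) error,
) error {
	return iterateRangeDescriptors(ctx, func(desc roachpb.RangeDescriptor) error {
		var req roachpb.MigrateRequest
		req.Key = desc.StartKey.AsRawKey()
		req.EndKey = desc.EndKey.AsRawKey()
		req.Version = target
		var b kv.Batch
		b.AddRawRequest(&req)
		return db.Run(ctx, &b)
	})
}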
TODO: see if this migration is actually "short-running". That is, in a
sufficiently large cluster, does this cause significant load?
This is a baby version of #39182 that handles only short-running migrations
but is useful in itself because it allows us to migrate fully into the
following two KV below-Raft migrations:
1. use the RangeAppliedState on all ranges
2. use the unreplicated TruncatedState on all ranges
These two migrations have been around for a while and have been getting
in the way of things. While ideally we introduce #39182 in the near future
and use it to address a larger class of migration concerns, this PR serves
as a starting point and work done on #39182 should be able to absorb this
PR with relative ease.
With this PR, legacy code related to 1) and 2) above can be removed from
`master` once the 20.1 branch is cut, i.e. roughly in April 2020.
The main ingredients in this PR are
a) introduce a hook that is called during `SET CLUSTER SETTING version`,
   after the change has been validated but before it is made.
b) introduce a KV-level `Migrate` ranged write command that triggers
   the migrations for 1) and 2) above. It is called from the hook for
   all of the keyspace.
c) before returning to the client, `Migrate` waits for the command to
   be durably applied on all followers.
Trying to get this 100% correct has proven tricky, perhaps foreshadowing
similar issues that await us once we try #39182 in earnest. For one, the
mechanism only reaches replicas that are members of the raft group, that is, it
won't touch replicas which are gc'able. For the migrations at hand this
means that in 20.2 there may - in theory - still be replicas that have a
replicated truncated state. I believe that our recent efforts around not
re-using replicas across replicaIDs have made sure that this isn't an issue
for this particular migration, but in general it will have to remain on the
radar. Similarly, it seems hard to prove conclusively that no snapshot is
in-flight that would set up a new follower with a state predating the
explicit migration, though it would be exceptionally rare in practice.
Release note (general change): version upgrades now perform maintenance
duties that may slightly delay the `SET CLUSTER SETTING version` command
and may cause a small amount of additional load on the cluster while an
upgrade is being finalized.