
kvserver: improve below-Raft migrations #72931

Open
4 of 7 tasks
erikgrinaker opened this issue Nov 18, 2021 · 16 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Nov 18, 2021

Long-running migrations can send a MigrateRequest for migrations that must be applied below Raft. This request is special in that it only succeeds once it has been applied to all known replicas of the range -- it is not sufficient simply to commit it to the Raft log following acknowledgement from a quorum of replicas.

applicationErr := waitForApplication(
	ctx, r.store.cfg.NodeDialer, desc.RangeID, desc.Replicas().Descriptors(),
	// We wait for an index >= that of the migration command.
	r.GetLeaseAppliedIndex())

This requirement guarantees that no state machine replica relies on legacy, unmigrated state. However, it means that all replicas of all ranges in the cluster must be available and up-to-date, with a 5-second timeout before giving up. Any retries are currently left to the migration code itself. For example, postSeparatedIntentsMigration retries 5 times for a given range and then fails the entire migration, which must then restart from scratch:

err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), 5, func() error {
	err := deps.DB.Migrate(ctx, start, end, cv.Version)
	if err != nil {
		log.Infof(ctx, "[batch %d/??] error when running no-op Migrate on range r%d: %s", batchIdx, desc.RangeID, err)
	}
	return err
})

This could be improved in several ways:

  • Consider whether the requirement that all replicas for all ranges are available and up-to-date is necessary, or even viable in large clusters.
  • Introduce migration helpers to do automatic batching, checkpointing, and retries of such migrations across ranges, to avoid having to implement this in each separate migration. The helper should also optimistically continue applying the migration to additional ranges even when one fails (see the sketch after this list).
  • Introduce a knob to control fan-out, i.e. how many ranges are migrated at a time, to allow operators to speed up migrations with an acceptable hit on foreground traffic.
  • Use smaller txns (with low priority) when iterating over the full set of range descriptors, to reduce contention on meta2. Since these migrations run for a long time, locking up meta2 for the entire period is not ideal.
  • Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
  • Make the application timeout configurable, or simply rely on the passed client context (which would be controlled by the new migration infrastructure mentioned above).
  • Improve the UX by failing with an informative error message explaining that the migration cannot be completed because replicas Y and Z of range X are unavailable/behind/uninitialized/etc., and why all replicas must go through the migration before the upgrade can be finalized.
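
To make the second bullet concrete, here is a loose sketch of what such a helper could look like, reusing the existing IterateRangeDescriptors and DB.Migrate hooks shown elsewhere in this issue. The names migrateRanges, migrateRangesOptions, and saveCheckpoint are hypothetical, and the checkpoint/second-pass handling is only indicative; this is an illustration of batching, per-range retries, and optimistic continuation, not a proposed implementation.

package upgrades

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/upgrade"
	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/retry"
	"github.com/cockroachdb/errors"
)

// migrateRangesOptions is a hypothetical bundle of knobs for a generic
// below-Raft migration helper.
type migrateRangesOptions struct {
	pageSize        int // range descriptors fetched per batch
	perRangeRetries int // attempts per range before marking it as failed
	// saveCheckpoint persists the resume point so a restarted job can pick
	// up where it left off instead of rescanning the whole keyspace.
	saveCheckpoint func(ctx context.Context, resumeKey roachpb.RKey) error
}

// migrateRanges sends a below-Raft Migrate request to every range, retrying
// each range a bounded number of times and continuing past individual
// failures rather than aborting the entire migration on the first error.
func migrateRanges(
	ctx context.Context,
	cv clusterversion.ClusterVersion,
	deps upgrade.SystemDeps,
	opts migrateRangesOptions,
) error {
	var failed []roachpb.RangeID
	if err := deps.Cluster.IterateRangeDescriptors(ctx, opts.pageSize, func() {},
		func(descs ...roachpb.RangeDescriptor) error {
			for _, desc := range descs {
				start, end := desc.StartKey, desc.EndKey
				err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), opts.perRangeRetries, func() error {
					return deps.DB.Migrate(ctx, start, end, cv.Version)
				})
				if err != nil {
					// Optimistically keep going; revisit stragglers at the end.
					log.Infof(ctx, "migrating r%d failed, will revisit: %v", desc.RangeID, err)
					failed = append(failed, desc.RangeID)
				}
			}
			// Checkpoint after each batch of descriptors.
			if len(descs) > 0 {
				return opts.saveCheckpoint(ctx, descs[len(descs)-1].EndKey)
			}
			return nil
		}); err != nil {
		return err
	}
	if len(failed) > 0 {
		// A second pass over just the failed ranges (or a job retry with
		// exponential backoff) would go here; the sketch only reports them.
		return errors.Newf("failed to migrate %d ranges: %v", len(failed), failed)
	}
	return nil
}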

Jira issue: CRDB-11351

Epic CRDB-39898

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Nov 18, 2021
@irfansharif
Contributor

Thanks for filing. Here's the internal thread that prompted this issue.

@ajwerner
Contributor

We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.

func waitForApplication(
	ctx context.Context,
	dialer *nodedialer.Dialer,
	rangeID roachpb.RangeID,
	replicas []roachpb.ReplicaDescriptor,
	leaseIndex uint64,
) error {
	if dialer == nil && len(replicas) == 1 {
		// This early return supports unit tests (testContext{}) that also
		// want to perform merges.
		return nil
	}
	return contextutil.RunWithTimeout(ctx, "wait for application", 5*time.Second, func(ctx context.Context) error {
		g := ctxgroup.WithContext(ctx)
		for _, repl := range replicas {
			repl := repl // copy for goroutine
			g.GoCtx(func(ctx context.Context) error {
				conn, err := dialer.Dial(ctx, repl.NodeID, rpc.DefaultClass)
				if err != nil {
					return errors.Wrapf(err, "could not dial n%d", repl.NodeID)
				}
				_, err = NewPerReplicaClient(conn).WaitForApplication(ctx, &WaitForApplicationRequest{
					StoreRequestHeader: StoreRequestHeader{NodeID: repl.NodeID, StoreID: repl.StoreID},
					RangeID:            rangeID,
					LeaseIndex:         leaseIndex,
				})
				return err
			})
		}
		return g.Wait()
	})
}
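
One thing worth calling out about the snippet above: it fans out a WaitForApplication RPC to every replica in parallel, and the whole group shares a single hard-coded 5-second timeout, so one slow, behind, or unreachable replica fails the call for the entire range. That is the failure mode most of the bullets in the issue description are trying to soften.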

@ajwerner
Contributor

Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.

This should be the case. Is there evidence that it is not?

@erikgrinaker
Contributor Author

We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.

Well yes, that's what we currently do here:

if ba.Requests[0].GetMigrate() != nil && propResult.Err == nil {
	// Migrate is special since it wants commands to be durably
	// applied on all peers, which we achieve via waitForApplication.
	//
	// We don't have to worry about extant snapshots creating
	// replicas that start at an index before this Migrate request.
	// Snapshots that don't include the recipient (as specified by
	// replicaID and descriptor in the snap vs. the replicaID of the
	// raft instance) are discarded by the recipient, and we're
	// already checking against all replicas in the descriptor below
	// (which include learner replicas currently in the process of
	// receiving snapshots). Snapshots are also discarded unless
	// they move the LAI forward, so we're not worried about old
	// snapshots (with indexes preceding the MLAI here)
	// instantiating pre-migrated state in anyway. We also have a
	// separate mechanism to ensure replicas with older versions are
	// purged from the system[1]. This is driven by a higher-level
	// orchestration layer[2], these are the replicas that we don't
	// have a handle on here as they're eligible for GC (but may
	// still hit replica evaluation code paths with pre-migrated
	// state, unless explicitly purged).
	//
	// It's possible that between the proposal returning and the
	// call to r.Desc() below, the descriptor has already changed.
	// But the only thing that matters is that r.Desc() is at least
	// as up to date as the descriptor the command applied on
	// previously. If a replica got removed - fine,
	// waitForApplication will fail; we will have to cope with that.
	// If one got added - it was likely already a learner when we
	// migrated (in which case waitForApplication will know about
	// it). If that's not the case, we'll note that the Migrate
	// command also declares a read latch on the range descriptor.
	// The replication change will have thus serialized after the
	// migration, and so the snapshot will also include the
	// post-migration state.
	//
	// TODO(irfansharif): In a cluster that is constantly changing
	// its replica sets, it's possible to get into a situation
	// where a Migrate command never manages to complete - all it
	// takes is a single range in each attempt to throw things off.
	// Perhaps an error in waitForApplication should lead to a retry
	// of just the one RPC instead of propagating an error for the
	// entire migrate invocation.
	//
	// [1]: See PurgeOutdatedReplicas from the Migration service.
	// [2]: pkg/migration
	desc := r.Desc()
	// NB: waitForApplication already has a timeout.
	applicationErr := waitForApplication(
		ctx, r.store.cfg.NodeDialer, desc.RangeID, desc.Replicas().Descriptors(),
		// We wait for an index >= that of the migration command.
		r.GetLeaseAppliedIndex())
	propResult.Err = roachpb.NewError(applicationErr)
}

The problem is that this command has to succeed for every range, on every replica, in one go (with a few tight retries); otherwise the entire migration fails and has to restart. This can be a problem in large clusters with many ranges (in this case, 400,000 ranges).
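
As a rough back-of-the-envelope illustration (assuming independent per-range attempts, which is optimistic): if each range's Migrate succeeds within its retry budget with probability 0.99999, the chance that all 400,000 ranges succeed in a single pass is about 0.99999^400,000 ≈ e^-4 ≈ 2%, so the migration would be expected to restart dozens of times even with a very low per-range failure rate.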

Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.

This should be the case. Is there evidence that it is not?

No, just something we should verify.

@irfansharif
Contributor

One thing we're sorely lacking is an integration/qualification/acceptance test that runs these migrations at very large scale. It could have shaken out some of these low timeouts, for example, or further validated the need for generic checkpointing.

craig bot pushed a commit that referenced this issue Nov 22, 2021
72266: colfetcher: populate tableoids on the whole batch at once r=yuzefovich a=yuzefovich

Previously, we were populating the `tableoid` system column (if requested)
when finalizing each row. However, the OID value is constant, so we can
populate it when finalizing a batch. `finalizeBatch` becomes no longer
inlinable, but we're trading a "per row conditional and Set operation"
for a "per batch additional function call", and I think it's probably
worth it.

Release note: None

72946: ui: save filters on cache for Statements page r=maryliag a=maryliag

Previously, a sort selection was not maintained when the page changed
(e.g. coming back from Statement details). This commit saves the
selected value to be reused.

Partially addresses #71851

Showing behaviour: https://www.loom.com/share/681ca9d80f7145faa111b6aacab417f9

Release note: None

72987: kvserver: increase `Migrate` application timeout to 1 minute r=tbg,ajwerner,miretskiy a=erikgrinaker

**kvserver: increase Migrate application timeout to 1 minute**

This increases the timeout when waiting for application of a `Migrate`
command on all range replicas to 1 minute, up from 5 seconds. It also
adds a cluster setting `kv.migration.migrate_application.timeout` to
control this.

When encountering a range that's e.g. undergoing rebalancing, it can
take a long time for a learner replica to receive a snapshot and respond
to this request, which would cause the timeout to trigger. This is
especially likely in clusters with many ranges and frequent rebalancing
activity.

Touches #72931.

Release note (bug fix): The timeout when checking for Raft application
of upgrade migrations has been increased from 5 seconds to 1 minute, and
is now controllable via the cluster setting
`kv.migration.migrate_application_timeout`. This makes migrations much
less likely to fail in clusters with ongoing rebalancing activity during
upgrade migrations.

**migration: add informative log message for sep intents migrate failure**

The separated intents migration has been seen to go into failure loops
in the wild, with a generic "context deadline exceeded" error. This adds
a more informative log entry with additional hints on how to resolve the
problem.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
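
As a rough illustration of the shape of that first commit (not the actual PR diff): a duration cluster setting replaces the hard-coded 5*time.Second inside waitForApplication. The RegisterDurationSetting call shown here uses the older form without a settings class (the signature varies between releases), and the setting name is quoted from the commit message above, which spells it both with a dot and with an underscore.

package kvserver

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// migrateApplicationTimeout controls how long a Migrate command waits for
// application on all replicas of a range before giving up.
var migrateApplicationTimeout = settings.RegisterDurationSetting(
	"kv.migration.migrate_application.timeout", // the release note spells it migrate_application_timeout
	"timeout for a Migrate command to be applied across all replicas of a range",
	1*time.Minute,
)

// Inside waitForApplication, the timeout would then come from the setting
// rather than being hard-coded, e.g.:
//
//   timeout := migrateApplicationTimeout.Get(&st.SV)
//   return contextutil.RunWithTimeout(ctx, "wait for application", timeout, ...)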
@irfansharif
Contributor

Introduce migration helpers to do automatic batching, checkpointing and retries of such migrations across ranges, to avoid having to implement this in each separate migration. It should also optimistically continue applying it to additional ranges even when one fails.

This is worth doing, perhaps as part of #84073.

Introduce knob to control fan out for how many ranges are migrated at a time, to allow operators to speed up migrations with acceptable hit on foreground traffic.

A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here:

func raftAppliedIndexTermMigration(
	ctx context.Context, cv clusterversion.ClusterVersion, deps upgrade.SystemDeps, _ *jobs.Job,
) error {
	var batchIdx, numMigratedRanges int
	init := func() { batchIdx, numMigratedRanges = 1, 0 }
	if err := deps.Cluster.IterateRangeDescriptors(ctx, defaultPageSize, init, func(descriptors ...roachpb.RangeDescriptor) error {
		for _, desc := range descriptors {
			// ...
@ajwerner
Contributor

A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here

That linked library should not be particularly difficult to generalize.

@knz
Contributor

knz commented Aug 4, 2022

Would it make sense to also lower the txn priority inside IterateRangeDescriptors (and possibly the implicit txn used for the MigrateRequest), to reduce the visible contention on meta2 from SQL?

@erikgrinaker
Contributor Author

I wouldn't use a transaction at all, and would instead scan smaller batches of up-to-date descriptors with individual scan requests. No objection to doing these with low priority, but we specifically don't want a snapshot of meta2 -- we want fresh data.

@irfansharif
Contributor

Using smaller txns (and lowering their priority if still needed) to fetch batches of range descriptors makes sense.
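
As a loose sketch of what those smaller, low-priority scans could look like (the helper name scanRangeDescriptorsLowPriority and the paging logic are made up; db.Txn, txn.SetUserPriority, txn.Scan, and the keys.Meta2Prefix/Meta2KeyMax bounds are assumed to behave as in recent versions of the kv client):

package upgrades

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// scanRangeDescriptorsLowPriority pages through meta2 using small, short-lived,
// low-priority transactions rather than one long-running txn, so the migration
// never holds meta2 locks for long and yields to foreground traffic on
// contention. Each page is fresh as of its own txn -- no consistent snapshot
// of all of meta2 is taken, which is what we want here.
func scanRangeDescriptorsLowPriority(
	ctx context.Context, db *kv.DB, pageSize int64, fn func(roachpb.RangeDescriptor) error,
) error {
	next := keys.Meta2Prefix
	for {
		var kvs []kv.KeyValue
		if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
			if err := txn.SetUserPriority(roachpb.MinUserPriority); err != nil {
				return err
			}
			var err error
			kvs, err = txn.Scan(ctx, next, keys.Meta2KeyMax, pageSize)
			return err
		}); err != nil {
			return err
		}
		if len(kvs) == 0 {
			return nil
		}
		for _, row := range kvs {
			var desc roachpb.RangeDescriptor
			if err := row.ValueProto(&desc); err != nil {
				return err
			}
			if err := fn(desc); err != nil {
				return err
			}
		}
		// Resume the next page just past the last meta2 key we saw.
		next = kvs[len(kvs)-1].Key.Next()
	}
}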

@irfansharif
Contributor

I'll apologize for not having done all this originally; this simple thing has caused me much grief. Had I kicked the tires more at realistic cluster sizes (100k+ ranges with ambient split/merge activity), all this would have been much more apparent.

@knz
Contributor

knz commented Aug 4, 2022

Irfan, using smaller txns with lower priority is not mentioned in the issue Erik linked. Mind filing a new follow-up issue for that (separate) change?

@irfansharif
Contributor

Do you mean this issue? I've added a bullet point here.

@knz
Contributor

knz commented Aug 4, 2022

thanks!

@erikgrinaker
Contributor Author

Since we'll be adding a fair bit of concurrency here to improve throughput for trivial migrations, we should probably integrate this with admission control (AC) somehow as well, to avoid expensive migrations overloading the cluster.

@erikgrinaker
Contributor Author

Adding O-support, since we've had several escalations about below-Raft migrations stalling upgrades.

@erikgrinaker erikgrinaker added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Oct 25, 2023