
kvserver: improve below-Raft migrations #72931

Open
4 of 7 tasks
erikgrinaker opened this issue Nov 18, 2021 · 16 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Nov 18, 2021

Long-running migrations can send a MigrateRequest for migrations that must be applied below Raft. This request is special in that it only succeeds once it has been applied to all known replicas of the range -- it is not sufficient simply to commit it to the Raft log following acknowledgement from a quorum of replicas.

applicationErr := waitForApplication(
	ctx, r.store.cfg.NodeDialer, desc.RangeID, desc.Replicas().Descriptors(),
	// We wait for an index >= that of the migration command.
	r.GetLeaseAppliedIndex())

This requirement guarantees that no state machine replica relies on legacy, unmigrated state. However, it means that all replicas of all ranges in the cluster must be available and up-to-date, with a 5-second timeout before giving up. Any retries are currently left to the migration code itself. For example, postSeparatedIntentsMigration retries 5 times for a given range and then fails the entire migration, which must then restart from scratch:

err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), 5, func() error {
	err := deps.DB.Migrate(ctx, start, end, cv.Version)
	if err != nil {
		log.Infof(ctx, "[batch %d/??] error when running no-op Migrate on range r%d: %s", batchIdx, desc.RangeID, err)
	}
	return err
})

This could be improved in several ways:

  • Consider whether the requirement that all replicas for all ranges are available and up-to-date is necessary, or even viable in large clusters.
  • Introduce migration helpers to do automatic batching, checkpointing, and retries of such migrations across ranges, to avoid having to implement this in each separate migration. The helper should also optimistically continue applying the migration to additional ranges even when one fails (see the sketch after this list).
  • Introduce a knob to control fan-out, i.e. how many ranges are migrated at a time, to allow operators to speed up migrations with an acceptable hit on foreground traffic.
  • Use smaller txns (with low priority) when iterating over the full set of range descriptors, to reduce contention on meta2. Since these migrations run for a long time, locking up meta2 for the entire period is not ideal.
  • Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
  • Make the application timeout configurable, or simply rely on the passed client context (which would be controlled by the new migration infrastructure mentioned above).
  • Improve the UX by failing with an informative error message explaining that the migration cannot be completed because replicas Y and Z of range X are unavailable/behind/uninitialized/etc., and why all replicas must go through the migration before the upgrade can be finalized.
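
To make the second bullet concrete, here is a loose sketch of what such a helper could look like, reusing the existing IterateRangeDescriptors and DB.Migrate hooks shown elsewhere in this issue. The names migrateRanges, migrateRangesOptions, and saveCheckpoint are hypothetical, and the checkpoint/second-pass handling is only indicative; this is an illustration of batching, per-range retries, and optimistic continuation, not a proposed implementation.

package upgrades

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/clusterversion"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/upgrade"
	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/retry"
	"github.com/cockroachdb/errors"
)

// migrateRangesOptions is a hypothetical bundle of knobs for a generic
// below-Raft migration helper.
type migrateRangesOptions struct {
	pageSize        int // range descriptors fetched per batch
	perRangeRetries int // attempts per range before marking it as failed
	// saveCheckpoint persists the resume point so a restarted job can pick
	// up where it left off instead of rescanning the whole keyspace.
	saveCheckpoint func(ctx context.Context, resumeKey roachpb.RKey) error
}

// migrateRanges sends a below-Raft Migrate request to every range, retrying
// each range a bounded number of times and continuing past individual
// failures rather than aborting the entire migration on the first error.
func migrateRanges(
	ctx context.Context,
	cv clusterversion.ClusterVersion,
	deps upgrade.SystemDeps,
	opts migrateRangesOptions,
) error {
	var failed []roachpb.RangeID
	if err := deps.Cluster.IterateRangeDescriptors(ctx, opts.pageSize, func() {},
		func(descs ...roachpb.RangeDescriptor) error {
			for _, desc := range descs {
				start, end := desc.StartKey, desc.EndKey
				err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), opts.perRangeRetries, func() error {
					return deps.DB.Migrate(ctx, start, end, cv.Version)
				})
				if err != nil {
					// Optimistically keep going; revisit stragglers at the end.
					log.Infof(ctx, "migrating r%d failed, will revisit: %v", desc.RangeID, err)
					failed = append(failed, desc.RangeID)
				}
			}
			// Checkpoint after each batch of descriptors.
			if len(descs) > 0 {
				return opts.saveCheckpoint(ctx, descs[len(descs)-1].EndKey)
			}
			return nil
		}); err != nil {
		return err
	}
	if len(failed) > 0 {
		// A second pass over just the failed ranges (or a job retry with
		// exponential backoff) would go here; the sketch only reports them.
		return errors.Newf("failed to migrate %d ranges: %v", len(failed), failed)
	}
	return nil
}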

Jira issue: CRDB-11351

Epic CRDB-39898

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Nov 18, 2021
@irfansharif
Contributor

Thanks for filing. Here's the internal thread that prompted this issue.

@ajwerner
Contributor

We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.

func waitForApplication(
	ctx context.Context,
	dialer *nodedialer.Dialer,
	rangeID roachpb.RangeID,
	replicas []roachpb.ReplicaDescriptor,
	leaseIndex uint64,
) error {
	if dialer == nil && len(replicas) == 1 {
		// This early return supports unit tests (testContext{}) that also
		// want to perform merges.
		return nil
	}
	return contextutil.RunWithTimeout(ctx, "wait for application", 5*time.Second, func(ctx context.Context) error {
		g := ctxgroup.WithContext(ctx)
		for _, repl := range replicas {
			repl := repl // copy for goroutine
			g.GoCtx(func(ctx context.Context) error {
				conn, err := dialer.Dial(ctx, repl.NodeID, rpc.DefaultClass)
				if err != nil {
					return errors.Wrapf(err, "could not dial n%d", repl.NodeID)
				}
				_, err = NewPerReplicaClient(conn).WaitForApplication(ctx, &WaitForApplicationRequest{
					StoreRequestHeader: StoreRequestHeader{NodeID: repl.NodeID, StoreID: repl.StoreID},
					RangeID:            rangeID,
					LeaseIndex:         leaseIndex,
				})
				return err
			})
		}
		return g.Wait()
	})
}
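
One thing worth calling out about the snippet above: it fans out a WaitForApplication RPC to every replica in parallel, and the whole group shares a single hard-coded 5-second timeout, so one slow, behind, or unreachable replica fails the call for the entire range. That is the failure mode most of the bullets in the issue description are trying to soften.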

@ajwerner
Contributor

Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.

This should be the case. Is there evidence that it is not?

@erikgrinaker
Contributor Author

We do have an existing mechanism in kvserver to wait for application. The merge protocol relies on it. We could consider doing something like that during the execution and not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement.

Well yes, that's what we currently do here:

if ba.Requests[0].GetMigrate() != nil && propResult.Err == nil {
	// Migrate is special since it wants commands to be durably
	// applied on all peers, which we achieve via waitForApplication.
	//
	// We don't have to worry about extant snapshots creating
	// replicas that start at an index before this Migrate request.
	// Snapshots that don't include the recipient (as specified by
	// replicaID and descriptor in the snap vs. the replicaID of the
	// raft instance) are discarded by the recipient, and we're
	// already checking against all replicas in the descriptor below
	// (which include learner replicas currently in the process of
	// receiving snapshots). Snapshots are also discarded unless
	// they move the LAI forward, so we're not worried about old
	// snapshots (with indexes preceding the MLAI here)
	// instantiating pre-migrated state in anyway. We also have a
	// separate mechanism to ensure replicas with older versions are
	// purged from the system[1]. This is driven by a higher-level
	// orchestration layer[2], these are the replicas that we don't
	// have a handle on here as they're eligible for GC (but may
	// still hit replica evaluation code paths with pre-migrated
	// state, unless explicitly purged).
	//
	// It's possible that between the proposal returning and the
	// call to r.Desc() below, the descriptor has already changed.
	// But the only thing that matters is that r.Desc() is at least
	// as up to date as the descriptor the command applied on
	// previously. If a replica got removed - fine,
	// waitForApplication will fail; we will have to cope with that.
	// If one got added - it was likely already a learner when we
	// migrated (in which case waitForApplication will know about
	// it). If that's not the case, we'll note that the Migrate
	// command also declares a read latch on the range descriptor.
	// The replication change will have thus serialized after the
	// migration, and so the snapshot will also include the
	// post-migration state.
	//
	// TODO(irfansharif): In a cluster that is constantly changing
	// its replica sets, it's possible to get into a situation
	// where a Migrate command never manages to complete - all it
	// takes is a single range in each attempt to throw things off.
	// Perhaps an error in waitForApplication should lead to a retry
	// of just the one RPC instead of propagating an error for the
	// entire migrate invocation.
	//
	// [1]: See PurgeOutdatedReplicas from the Migration service.
	// [2]: pkg/migration
	desc := r.Desc()
	// NB: waitForApplication already has a timeout.
	applicationErr := waitForApplication(
		ctx, r.store.cfg.NodeDialer, desc.RangeID, desc.Replicas().Descriptors(),
		// We wait for an index >= that of the migration command.
		r.GetLeaseAppliedIndex())
	propResult.Err = roachpb.NewError(applicationErr)
}

The problem is that this command has to succeed for every range, on every replica, in one go (with a few tight retries); otherwise the entire migration fails and has to restart. This can be a problem in large clusters with many ranges (in this case, 400,000 ranges).
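
As a rough back-of-the-envelope illustration (assuming independent per-range attempts, which is optimistic): if each range's Migrate succeeds within its retry budget with probability 0.99999, the chance that all 400,000 ranges succeed in a single pass is about 0.99999^400,000 ≈ e^-4 ≈ 2%, so the migration would be expected to restart dozens of times even with a very low per-range failure rate.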

Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.

This should be the case. Is there evidence that it is not?

No, just something we should verify.

@irfansharif
Contributor

One thing we're sorely lacking is an integration/qualification/acceptance test that runs these migrations at very large scale. It could have shaken out some of these low timeouts, for example, or further validated the need for generic checkpointing.

craig bot pushed a commit that referenced this issue Nov 22, 2021
72266: colfetcher: populate tableoids on the whole batch at once r=yuzefovich a=yuzefovich

Previously, we were populating the `tableoid` system column (if requested)
when finalizing each row. However, the OID value is constant, so we can
populate it when finalizing a batch. `finalizeBatch` becomes no longer
inlinable, but we're trading a "per row conditional and Set operation"
for a "per batch additional function call", and I think it's probably
worth it.

Release note: None

72946: ui: save filters on cache for Statements page r=maryliag a=maryliag

Previously, a sort selection was not maintained when the page changed
(e.g. coming back from Statement details). This commit saves the
selected value to be reused.

Partially addresses #71851

Showing behaviour: https://www.loom.com/share/681ca9d80f7145faa111b6aacab417f9

Release note: None

72987: kvserver: increase `Migrate` application timeout to 1 minute r=tbg,ajwerner,miretskiy a=erikgrinaker

**kvserver: increase Migrate application timeout to 1 minute**

This increases the timeout when waiting for application of a `Migrate`
command on all range replicas to 1 minute, up from 5 seconds. It also
adds a cluster setting `kv.migration.migrate_application.timeout` to
control this.

When encountering a range that's e.g. undergoing rebalancing, it can
take a long time for a learner replica to receive a snapshot and respond
to this request, which would cause the timeout to trigger. This is
especially likely in clusters with many ranges and frequent rebalancing
activity.

Touches #72931.

Release note (bug fix): The timeout when checking for Raft application
of upgrade migrations has been increased from 5 seconds to 1 minute, and
is now controllable via the cluster setting
`kv.migration.migrate_application_timeout`. This makes migrations much
less likely to fail in clusters with ongoing rebalancing activity during
upgrade migrations.

**migration: add informative log message for sep intents migrate failure**

The separated intents migration has been seen to go into failure loops
in the wild, with a generic "context deadline exceeded" error. This adds
a more informative log entry with additional hints on how to resolve the
problem.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
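
As a rough illustration of the shape of that first commit (not the actual PR diff): a duration cluster setting replaces the hard-coded 5*time.Second inside waitForApplication. The RegisterDurationSetting call shown here uses the older form without a settings class (the signature varies between releases), and the setting name is quoted from the commit message above, which spells it both with a dot and with an underscore.

package kvserver

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// migrateApplicationTimeout controls how long a Migrate command waits for
// application on all replicas of a range before giving up.
var migrateApplicationTimeout = settings.RegisterDurationSetting(
	"kv.migration.migrate_application.timeout", // the release note spells it migrate_application_timeout
	"timeout for a Migrate command to be applied across all replicas of a range",
	1*time.Minute,
)

// Inside waitForApplication, the timeout would then come from the setting
// rather than being hard-coded, e.g.:
//
//   timeout := migrateApplicationTimeout.Get(&st.SV)
//   return contextutil.RunWithTimeout(ctx, "wait for application", timeout, ...)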
@irfansharif
Contributor

Introduce migration helpers to do automatic batching, checkpointing and retries of such migrations across ranges, to avoid having to implement this in each separate migration. It should also optimistically continue applying it to additional ranges even when one fails.

This is worth doing, perhaps as part of #84073.

Introduce knob to control fan out for how many ranges are migrated at a time, to allow operators to speed up migrations with acceptable hit on foreground traffic.

A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here:

func raftAppliedIndexTermMigration(
	ctx context.Context, cv clusterversion.ClusterVersion, deps upgrade.SystemDeps, _ *jobs.Job,
) error {
	var batchIdx, numMigratedRanges int
	init := func() { batchIdx, numMigratedRanges = 1, 0 }
	if err := deps.Cluster.IterateRangeDescriptors(ctx, defaultPageSize, init, func(descriptors ...roachpb.RangeDescriptor) error {
		for _, desc := range descriptors {
			// ...
@ajwerner
Contributor

A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here

That linked library should not be particularly difficult to generalize.

@knz
Contributor

knz commented Aug 4, 2022

Would it make sense to also lower the txn priority inside IterateRangeDescriptors (and possibly the implicit txn used for the MigrateRequest), to reduce the visible contention on meta2 from SQL?

@erikgrinaker
Contributor Author

I wouldn't use a transaction at all, and would instead scan smaller batches of up-to-date descriptors with individual scan requests. No objection to doing these with low priority, but we specifically don't want a snapshot of meta2 -- we want fresh data.

@irfansharif
Contributor

Using smaller txns (and lowering their priority if still needed) to fetch batches of range descriptors makes sense.
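
As a loose sketch of what those smaller, low-priority scans could look like (the helper name scanRangeDescriptorsLowPriority and the paging logic are made up; db.Txn, txn.SetUserPriority, txn.Scan, and the keys.Meta2Prefix/Meta2KeyMax bounds are assumed to behave as in recent versions of the kv client):

package upgrades

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// scanRangeDescriptorsLowPriority pages through meta2 using small, short-lived,
// low-priority transactions rather than one long-running txn, so the migration
// never holds meta2 locks for long and yields to foreground traffic on
// contention. Each page is fresh as of its own txn -- no consistent snapshot
// of all of meta2 is taken, which is what we want here.
func scanRangeDescriptorsLowPriority(
	ctx context.Context, db *kv.DB, pageSize int64, fn func(roachpb.RangeDescriptor) error,
) error {
	next := keys.Meta2Prefix
	for {
		var kvs []kv.KeyValue
		if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
			if err := txn.SetUserPriority(roachpb.MinUserPriority); err != nil {
				return err
			}
			var err error
			kvs, err = txn.Scan(ctx, next, keys.Meta2KeyMax, pageSize)
			return err
		}); err != nil {
			return err
		}
		if len(kvs) == 0 {
			return nil
		}
		for _, row := range kvs {
			var desc roachpb.RangeDescriptor
			if err := row.ValueProto(&desc); err != nil {
				return err
			}
			if err := fn(desc); err != nil {
				return err
			}
		}
		// Resume the next page just past the last meta2 key we saw.
		next = kvs[len(kvs)-1].Key.Next()
	}
}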

@irfansharif
Contributor

I'll apologize for not having done all this originally; this simple thing has caused me much grief. Had I kicked the tires more at realistic cluster sizes (100k+ ranges with ambient split/merge activity), all this would have been much more apparent.

@knz
Contributor

knz commented Aug 4, 2022

Irfan, using smaller txns with lower priority is not mentioned in the issue Erik linked. Mind filing a new follow-up issue for that (separate) change?

@irfansharif
Contributor

Do you mean this issue? I've added a bullet point here.

@knz
Contributor

knz commented Aug 4, 2022

thanks!

@erikgrinaker
Contributor Author

Since we'll be adding a fair bit of concurrency here to improve throughput for trivial migrations, we should probably integrate this with admission control (AC) somehow as well, to avoid expensive migrations overloading the cluster.

@erikgrinaker
Contributor Author

Adding O-support, since we've had several escalations about below-Raft migrations stalling upgrades.

@erikgrinaker erikgrinaker added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Oct 25, 2023