kvserver: improve below-Raft migrations #72931
Comments
Thanks for filing. Here's the internal thread that prompted this issue.
We do have an existing mechanism in kvserver to wait for application; the merge protocol relies on it. We could consider doing something like that during execution, not acknowledging the batch until it's been fully replicated. It wouldn't be too hard to implement. cockroach/pkg/kv/kvserver/replica_command.go Lines 790 to 821 in b57af7d
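In rough terms the idea is to hold off acknowledging the batch until every replica reports having applied the command. A minimal sketch, not the actual replica_command.go code; `ReplicaClient` and `AppliedIndex` are hypothetical stand-ins for the per-replica RPCs the real mechanism uses:

```go
package migrationsketch

import (
	"context"
	"time"

	"golang.org/x/sync/errgroup"
)

// ReplicaClient is a hypothetical handle for asking a single replica how far
// it has applied the Raft log.
type ReplicaClient interface {
	AppliedIndex(ctx context.Context) (uint64, error)
}

// waitForApplication polls every replica in parallel and returns only once
// each one has applied the command at logIndex (or the context expires).
func waitForApplication(ctx context.Context, replicas []ReplicaClient, logIndex uint64) error {
	g, ctx := errgroup.WithContext(ctx)
	for _, r := range replicas {
		r := r // capture loop variable
		g.Go(func() error {
			for {
				idx, err := r.AppliedIndex(ctx)
				if err != nil {
					return err
				}
				if idx >= logIndex {
					return nil // this replica has caught up
				}
				select {
				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(50 * time.Millisecond):
					// Keep polling; a learner waiting on a snapshot can
					// take a while to catch up.
				}
			}
		})
	}
	return g.Wait()
}
```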
This should be the case. Is there evidence that it is not?
Well yes, that's what we currently do here: cockroach/pkg/kv/kvserver/replica_write.go Lines 206 to 259 in 455cddd
The problem is that this command has to succeed for every range on every replica in one go (with a few tight retries), otherwise the entire migration fails and has to restart. This can be a problem in large clusters with many ranges (in this case, 400,000 ranges).
No, just something we should verify.
One thing we're sorely lacking is an integration/qualification/acceptance test that runs these migrations at very large scales. It could've shaken out some of these low timeouts, for example, or further validated the need for generic checkpointing.
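For reference, "generic checkpointing" could look roughly like the sketch below; `ProgressStore` and the range-iteration callback are hypothetical placeholders, not existing APIs. The point is that a restarted migration resumes from the last persisted key instead of starting over:

```go
package migrationsketch

import "context"

// ProgressStore is a hypothetical persistence layer (e.g. a system table)
// recording how far a long-running migration has gotten.
type ProgressStore interface {
	Load(ctx context.Context, migration string) (resumeKey []byte, err error)
	Save(ctx context.Context, migration string, resumeKey []byte) error
}

// forEachRangeWithCheckpoint walks ranges starting from the persisted resume
// key, applies fn to each one, and checkpoints progress periodically so that
// a restart re-does at most the ranges since the last checkpoint.
func forEachRangeWithCheckpoint(
	ctx context.Context,
	store ProgressStore,
	migration string,
	iterRanges func(ctx context.Context, from []byte, fn func(rangeStartKey []byte) error) error,
	fn func(rangeStartKey []byte) error,
) error {
	resumeKey, err := store.Load(ctx, migration)
	if err != nil {
		return err
	}
	const checkpointEvery = 64 // arbitrary interval for persisting progress
	n := 0
	return iterRanges(ctx, resumeKey, func(rangeStartKey []byte) error {
		if err := fn(rangeStartKey); err != nil {
			return err
		}
		n++
		if n%checkpointEvery == 0 {
			return store.Save(ctx, migration, rangeStartKey)
		}
		return nil
	})
}
```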
72266: colfetcher: populate tableoids on the whole batch at once r=yuzefovich a=yuzefovich

Previously, we were populating the `tableoid` system column (if requested) when finalizing each row. However, the OID value is constant, so we can populate it when finalizing a batch. `finalizeBatch` becomes no longer inlinable, but we're trading a per-row conditional and Set operation for a per-batch additional function call, and I think it's probably worth it. Release note: None

72946: ui: save filters on cache for Statements page r=maryliag a=maryliag

Previously, the sort selection was not maintained when the page changed (e.g. coming back from Statement details). This commit saves the selected value to be used. Partially addresses #71851. Showing behaviour: https://www.loom.com/share/681ca9d80f7145faa111b6aacab417f9 Release note: None

72987: kvserver: increase `Migrate` application timeout to 1 minute r=tbg,ajwerner,miretskiy a=erikgrinaker

**kvserver: increase Migrate application timeout to 1 minute**

This increases the timeout when waiting for application of a `Migrate` command on all range replicas to 1 minute, up from 5 seconds. It also adds a cluster setting `kv.migration.migrate_application.timeout` to control this. When encountering a range that's e.g. undergoing rebalancing, it can take a long time for a learner replica to receive a snapshot and respond to this request, which would cause the timeout to trigger. This is especially likely in clusters with many ranges and frequent rebalancing activity. Touches #72931.

Release note (bug fix): The timeout when checking for Raft application of upgrade migrations has been increased from 5 seconds to 1 minute, and is now controllable via the cluster setting `kv.migration.migrate_application_timeout`. This makes migrations much less likely to fail in clusters with ongoing rebalancing activity during upgrade migrations.

**migration: add informative log message for sep intents migrate failure**

The separated intents migration has been seen to go into failure loops in the wild, with a generic "context deadline exceeded" error. This adds a more informative log entry with additional hints on how to resolve the problem. Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>
This is worth doing, perhaps as part of #84073.
A library in pkg/upgrades to do this sort of thing would prevent us from writing one-range-at-a-time migrations, like we do here: cockroach/pkg/upgrade/upgrades/raft_applied_index_term.go Lines 40 to 46 in ea13a52
That linked library should not be particularly difficult to generalize.
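A sketch of the shape such a helper in pkg/upgrades might take, assuming a hypothetical `migrateRange` callback that issues the per-range MigrateRequest; the point is bounded fan-out rather than migrating one range at a time:

```go
package migrationsketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// migrateRangesConcurrently applies migrateRange to every range start key,
// keeping at most `concurrency` (must be > 0) requests in flight at once.
func migrateRangesConcurrently(
	ctx context.Context,
	rangeStartKeys [][]byte,
	concurrency int,
	migrateRange func(ctx context.Context, startKey []byte) error,
) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(concurrency) // bound the number of in-flight Migrate requests
	for _, key := range rangeStartKeys {
		key := key // capture loop variable
		g.Go(func() error {
			return migrateRange(ctx, key)
		})
	}
	return g.Wait()
}
```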
Would it make sense to also lower the txn priority inside `IterateRangeDescriptors` (and possibly the implicit txn used for the `MigrateRequest`), to reduce the visible contention on meta2 from SQL?
I wouldn't use a transaction at all, and instead scan smaller batches of up-to-date descriptors with individual scan requests. No objection to doing these with low priority, but we specifically don't want a snapshot of meta2; we want fresh data.
Using smaller txns (and lowering their priority if still needed) to fetch batches of range descriptors makes sense.
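A minimal sketch of that approach, with hypothetical `Descriptor` and `scanBatch` placeholders rather than the real kv client API: each batch is an independent, bounded scan resumed from the last key seen, so no long-lived transaction (or snapshot) is held open against meta2:

```go
package migrationsketch

import (
	"bytes"
	"context"
)

// Descriptor is a hypothetical pared-down range descriptor.
type Descriptor struct {
	StartKey, EndKey []byte
}

// iterateRangeDescriptors visits descriptors in batches of batchSize. Each
// batch is fetched by an independent scan, so every batch sees fresh data.
func iterateRangeDescriptors(
	ctx context.Context,
	batchSize int,
	scanBatch func(ctx context.Context, from []byte, limit int) ([]Descriptor, error),
	visit func(Descriptor) error,
) error {
	var resumeKey []byte
	for {
		descs, err := scanBatch(ctx, resumeKey, batchSize)
		if err != nil {
			return err
		}
		if len(descs) == 0 {
			return nil // no more descriptors
		}
		for _, d := range descs {
			if err := visit(d); err != nil {
				return err
			}
		}
		next := descs[len(descs)-1].EndKey
		if bytes.Equal(next, resumeKey) {
			return nil // defensive: avoid spinning without forward progress
		}
		resumeKey = next // resume the next batch after the last range seen
	}
}
```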
I'll apologize for not having done all this originally; this simple thing has caused me much grief. Had I kicked the tires more on realistic cluster sizes (100k+ ranges with ambient split/merge activity), all this would've been much more apparent.
Irfan, using smaller txns with lower priority is not mentioned on the issue Erik linked. Mind filing a new follow-up issue for that (separate) change?
Do you mean this issue? I've added a bullet point here.
Thanks!
Since we'll be adding a fair bit of concurrency here to improve throughput for trivial migrations, we should probably integrate this with admission control (AC) somehow as well, to avoid expensive migrations overloading the cluster.
Adding O-support, since we've had several escalations about below-Raft migrations stalling upgrades.
Long-running migrations can send a `MigrateRequest` for migrations that must be applied below Raft. This request is special in that it only succeeds once it has been applied to all known replicas of the range -- it is not sufficient simply to commit it to the Raft log following acknowledgement from a quorum of replicas. cockroach/pkg/kv/kvserver/replica_write.go Lines 254 to 257 in 455cddd
This requirement guarantees that no state machine replicas rely on legacy, unmigrated state. However, it means all replicas for all ranges in the cluster must be available and up-to-date, with a 5-second timeout before giving up. Any retries are currently left to the migration code itself. For example, `postSeparatedIntentsMigration` uses 5 retries for a given range and then fails the entire migration, which has to restart from scratch: cockroach/pkg/migration/migrations/separated_intents.go Lines 557 to 563 in 4df8ac2
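To make the failure mode concrete, the current behavior amounts to roughly the following. This is a simplified sketch, not the actual separated_intents.go code; `sendMigrate` is a hypothetical stand-in for issuing the MigrateRequest for a single range:

```go
package migrationsketch

import (
	"context"
	"fmt"
)

// migrateAllRanges mirrors the current behavior: each range gets a handful
// of tight retries, and exhausting them fails the whole pass. Nothing that
// already succeeded is remembered, so the caller starts over from scratch.
func migrateAllRanges(
	ctx context.Context,
	rangeStartKeys [][]byte,
	sendMigrate func(ctx context.Context, startKey []byte) error,
) error {
	const maxAttempts = 5
	for _, key := range rangeStartKeys {
		var err error
		for attempt := 1; attempt <= maxAttempts; attempt++ {
			if err = sendMigrate(ctx, key); err == nil {
				break
			}
		}
		if err != nil {
			// With hundreds of thousands of ranges, a single slow or
			// rebalancing range forces the entire migration to restart.
			return fmt.Errorf("migrating range starting at %q: %w", key, err)
		}
	}
	return nil
}
```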
This could be improved in several ways:
- Consider whether the requirement that all replicas for all ranges are available and up-to-date is necessary, or even viable in large clusters.

Jira issue: CRDB-11351
Epic CRDB-39898