Commit
73288: kv: apply limited timeout to snapshots waiting in reservation queue r=tbg,erikgrinaker a=nvanbenschoten

Alternative to #46655.

This commit introduces a new cluster setting called `kv.snapshot_receiver.queue_timeout_fraction` which dictates the fraction of a snapshot's total timeout that it is allowed to spend queued on the receiver waiting for a reservation.

Enforcement of this snapshotApplySem-scoped timeout is intended to prevent starvation of snapshots in cases where a queue of snapshots waiting for reservations builds and no single snapshot acquires the semaphore with sufficient time to complete, but each holds the semaphore long enough to ensure that later snapshots in the queue encounter this same situation. This is a case of FIFO queuing + timeouts leading to starvation. By rejecting snapshot attempts earlier, we ensure that those that do acquire the semaphore have sufficient time to complete.

The commit adds a new test called `TestReserveSnapshotQueueTimeoutAvoidsStarvation` which reproduces this starvation without the fix. With the fix, the test passes and goodput never collapses to 0.

This is an alternative to strict LIFO queueing (#46655) and an alternative to Adaptive LIFO queueing (https://queue.acm.org/detail.cfm?id=2839461). The former avoids starvation but at the expense of fairness even under low but steady concurrency. The latter avoids compromising on fairness until it switches from FIFO to LIFO, but is fairly complex. The approach taken in this PR is a compromise that does not trade fairness under low concurrency and is still relatively simple, but does retain some risk of starvation in the case where `totalTimeout - queueTimeout < processingTime`. The default settings ensure that `processingTime` needs to be at least `30s` (assuming `kv.queue.process.guaranteed_time_budget` is used) before this will become a problem in practice.

Release notes (bug fix): Raft snapshots no longer risk starvation under very high concurrency. Before this fix, it was possible that a thundering herd of Raft snapshots could be starved and prevented from succeeding due to timeouts, which were accompanied by errors like `error rate limiting bulk io write: context deadline exceeded`.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Showing 10 changed files with 338 additions and 30 deletions.