admission: improvements to KV admission #65957
Comments
Should we do something about
Yes. Since this is happening through
Cool. For context, I'm raising this in the context of the recent issues involving interactions between the bespoke rate limiting in cockroach/pkg/kv/kvserver/store.go (lines 181 to 193 in 14bd70c).
Yep, I noticed the silliness with rate limiting after acquiring latches (when it got fixed). I don't think admission control would have helped -- since the CPU would not be saturated, it would allow low priority requests through. It is not aware that latches are a resource too, and that reducing latch contention due to low priority traffic will allow high priority traffic to get higher throughput. Modeling latches as a resource would be hard since we'd need to track contended spans.
67717: admission,server: scope store overload admission control to write operations on the overloaded store r=sumeerbhola a=sumeerbhola

Previously KVWork was all admitted through a single WorkQueue, and this WorkQueue was paired with a GrantCoordinator that took into account both CPU overload and storage overload. This meant a single overloaded store (in a multi-store setting) would slow down all KV operations, including reads and operations on stores that were fine.

We now have multiple WorkQueues, one per store, that KV writes go through before reaching the shared WorkQueue that all operations go through. The per-store WorkQueues are associated with their own GrantCoordinators that listen to health state for their respective stores. The per-store queues do not care about CPU and therefore do not do grant chaining. Since admission through these per-store queues happens before the shared queue, a store bottleneck will not cause KV slots in the shared queue to be taken by requests that are still waiting for admission elsewhere. The reverse can happen, and is considered acceptable: per-store tokens, when a store is overloaded, can be used by requests that are now waiting for admission in the shared WorkQueue because of a CPU bottleneck.

The code is significantly refactored for the above: NewGrantCoordinators returns a container called StoreGrantCoordinators, which lazily initializes the relevant per-store GrantCoordinators when it first fetches Pebble metrics, in addition to the shared GrantCoordinator. The server code now integrates with both types, and the code in Node.Batch will sometimes subject a request to two WorkQueues. The PebbleMetricsProvider now includes StoreIDs, and the periodic ticking that fetches these metrics at 1min intervals (with 1s ticks) is moved to StoreGrantCoordinators. This simplifies the ioLoadListener, which no longer does the periodic ticking, and eliminates some testing-only abstractions. The per-store WorkQueues share the same metrics, which represent an aggregate across these queues.

Informs #65957

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
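To make the ordering described above concrete (per-store write admission first, shared CPU-oriented admission second), here is a minimal sketch. The types and method names are hypothetical stand-ins, not the actual pkg/util/admission API; only the two-stage admit path and the read fast-path reflect the description.

```go
package admissionsketch

import (
	"context"
	"fmt"
)

// workQueue is a hypothetical stand-in for an admission WorkQueue:
// Admit blocks until the associated granter has capacity for the work.
type workQueue interface {
	Admit(ctx context.Context, priority int) error
}

// storeAdmitters pairs one write-admission queue per store with the
// shared queue that all KV work passes through.
type storeAdmitters struct {
	perStore map[int32]workQueue // keyed by StoreID; reacts to store (IO) overload
	shared   workQueue           // CPU-oriented slots shared by all KV work
}

// admitKV sketches the ordering: writes are admitted by the per-store
// queue first (so an overloaded store throttles only its own writes),
// and only then take a slot in the shared queue; reads skip the
// per-store stage entirely.
func (a *storeAdmitters) admitKV(ctx context.Context, storeID int32, isWrite bool, pri int) error {
	if isWrite {
		if q, ok := a.perStore[storeID]; ok {
			if err := q.Admit(ctx, pri); err != nil {
				return fmt.Errorf("per-store admission: %w", err)
			}
		}
	}
	// Shared (CPU) admission happens last, so a store bottleneck never
	// holds a shared KV slot while waiting elsewhere.
	return a.shared.Admit(ctx, pri)
}
```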
69050: sql: Set txn deadline to session expiry for mt pods. r=ajwerner a=rimadeodhar

For ephemeral SQL pods in the multi-tenant environment, the transaction deadline should be set to min(sqlliveness.Session expiration, leased descriptor duration). Since a SQL instance ID can be reclaimed by another SQL pod once the session for the old pod has expired, we need to ensure that all transactions for a SQL pod commit before the session for that SQL pod expires. In single-tenant environments, the session expiry will not be used.

Release note: None

Release justification: bug fix for new sqlinstance functionality

Resolves: #67719

69241: tests,ui: add a multi-store test with single overloaded store r=sumeerbhola a=sumeerbhola

Admission control will slow down writes to that store without affecting operations on the other store.

Also fixed some related graphs on the overload dashboard:
- the "IO tokens exhausted" graph should have been using a rate
- the kv-stores metrics were missing

Informs #65955, #65957

Release note: None

Co-authored-by: rimadeodhar <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
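The deadline rule in the #69050 description above amounts to taking the earlier of the two expirations, and only consulting the session expiry for ephemeral multi-tenant pods. A minimal sketch, with hypothetical names rather than the actual sql/sqlliveness types:

```go
package deadlinesketch

import "time"

// txnDeadline returns the deadline to apply to a transaction: the leased
// descriptor expiration, clamped to the sqlliveness session expiration
// when running as an ephemeral (multi-tenant) SQL pod. The names here are
// illustrative stand-ins, not the actual CockroachDB types.
func txnDeadline(leaseExpiration, sessionExpiration time.Time, isEphemeralSQLPod bool) time.Time {
	deadline := leaseExpiration
	if isEphemeralSQLPod && sessionExpiration.Before(deadline) {
		// Every transaction from this pod must commit before its session
		// expires, since the instance ID can then be reclaimed by another pod.
		deadline = sessionExpiration
	}
	return deadline
}
```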
They are now marked with AdmissionHeader_ROOT_KV, which stops them from bypassing admission control. The priority is set to admission.LowPri since these are background activities that should not be preferred over interactive user traffic.

Informs cockroachdb#65957

Release note: None
71109: kvserver,backupccl: make export, gc subject to admission control r=sumeerbhola a=sumeerbhola

They are now marked with AdmissionHeader_ROOT_KV, which stops them from bypassing admission control. The priority is set to admission.NormalPri since there are cases where they do need to complete in a very timely manner. There are TODOs in the integration code to make more sophisticated priority assignments and use LowPri when possible.

Informs #65957

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
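The change described in #71109 boils down to attaching an admission header with an explicit source and priority to requests that previously carried none. The sketch below uses simplified, hypothetical types (the real definitions live elsewhere in CockroachDB and differ in detail) just to show the shape of that marking:

```go
package admissionheadersketch

// Simplified, hypothetical stand-ins for the admission header and
// priority constants used in this sketch.
type admissionSource int

const (
	sourceOther  admissionSource = iota // historically treated as "bypass admission control"
	sourceRootKV                        // work originating at the KV layer itself (e.g. export, GC)
)

const (
	lowPri    = -10 // background work that should yield to user traffic
	normalPri = 0   // default for work that must complete in a timely manner
)

type admissionHeader struct {
	Priority   int
	CreateTime int64 // nanoseconds; used for ordering within a priority
	Source     admissionSource
}

// markRootKV builds the kind of header that export/GC-style requests now
// carry: a ROOT_KV source so they no longer bypass admission control, and
// (per the PR above) NormalPri for now, with TODOs to use LowPri when possible.
func markRootKV(createTimeNanos int64) admissionHeader {
	return admissionHeader{
		Priority:   normalPri,
		CreateTime: createTimeNanos,
		Source:     sourceRootKV,
	}
}
```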
Partial backport of: tests,ui: add a multi-store test with single overloaded store

Admission control will slow down writes to that store without affecting operations on the other store.

Also fixed some related graphs on the overload dashboard:
- the "IO tokens exhausted" graph should have been using a rate
- the kv-stores metrics were missing

Informs cockroachdb#65955, cockroachdb#65957

Release note: None
Part of cockroachdb#65957.

Changefeed backfills, given their scan-heavy nature, can be fairly CPU-intensive. In cockroachdb#89656 we introduced a roachtest demonstrating the latency impact backfills can have on a moderately CPU-saturated cluster. Similar to what we saw for backups, this CPU-heavy nature can elevate Go scheduling latencies, which in turn translate to foreground latency impact.

This commit integrates the rangefeed catchup scan with the elastic CPU limiter we introduced in cockroachdb#86638; this is one of two optional halves of changefeed backfills. The second half is the initial scan -- scan requests issued over some keyspan as of some timestamp. For that we simply rely on the existing slots mechanism but now set a lower priority bit (BulkNormalPri) -- cockroachdb#88733. Experimentally we observed that during initial scans the encoding routines in changefeedccl are the most impactful CPU-wise, something cockroachdb#89589 can help with. We leave admission integration of parallel worker goroutines to future work (cockroachdb#90089).

Unlike export requests, rangefeed catchup scans are non-preemptible. The rangefeed RPC is a streaming one, and the catchup scan is done during stream setup, so we don't have resumption tokens to propagate up to the caller like we did for backups. We still want CPU-bound work going through admission control to only use 100ms of CPU time, to exercise better control over scheduling latencies. To that end, we introduce the following component, used within the rangefeed catchup iterator:

```go
// Pacer is used in tight loops (CPU-bound) for non-preemptible
// elastic work. Callers are expected to invoke Pace() every loop
// iteration and Close() once done. Internally this type integrates
// with the elastic CPU work queue, acquiring tokens for the CPU work
// being done, and blocking if tokens are unavailable. This allows
// for a form of cooperative scheduling with elastic CPU granters.
type Pacer struct

func (p *Pacer) Pace(ctx context.Context) error { ... }
func (p *Pacer) Close() { ... }
```

Release note: None
89709: kv,rangefeed: integrate catchup scans with elastic cpu r=irfansharif a=irfansharif

Part of #65957. First commit is from #89656.

Changefeed backfills, given their scan-heavy nature, can be fairly CPU-intensive. In #89656 we introduced a roachtest demonstrating the latency impact backfills can have on a moderately CPU-saturated cluster. Similar to what we saw for backups, this CPU-heavy nature can elevate Go scheduling latencies, which in turn translate to foreground latency impact.

This commit integrates the rangefeed catchup scan with the elastic CPU limiter we introduced in #86638; this is one of two optional halves of changefeed backfills. The second half is the initial scan -- scan requests issued over some keyspan as of some timestamp. For that we simply rely on the existing slots mechanism but now set a lower priority bit (BulkNormalPri). Experimentally we observed that during initial scans the encoding routines in changefeedccl are the most impactful CPU-wise, something #89589 can help with. We leave admission integration of parallel worker goroutines to future work.

Unlike export requests, rangefeed catchup scans are non-preemptible. The rangefeed RPC is a streaming one, and the catchup scan is done during stream setup, so we don't have resumption tokens to propagate up to the caller like we did for backups. We still want CPU-bound work going through admission control to use at most 100ms of CPU time (it makes for an ineffective controller otherwise), and to that end we introduce the following component, used within the rangefeed catchup iterator:

```go
// Pacer is used in tight loops (CPU-bound) for non-preemptible
// elastic work. Callers are expected to invoke Pace() every loop
// iteration and Close() once done. Internally this type integrates
// with the elastic CPU work queue, acquiring tokens for the CPU work
// being done, and blocking if tokens are unavailable. This allows
// for a form of cooperative scheduling with elastic CPU granters.
type Pacer struct

func (p *Pacer) Pace(ctx context.Context) error { ... }
func (p *Pacer) Close() { ... }
```

Release note: None

Co-authored-by: irfan sharif <[email protected]>
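As a usage illustration of the Pacer contract quoted above (Pace() on every iteration, Close() once done), here is a hedged sketch of a catchup-scan-style loop; the iterator and pacer types here are hypothetical stand-ins, and only the Pace/Close calling pattern follows the documented contract:

```go
// Package pacersketch illustrates the calling pattern for a Pacer-like
// type: Pace once per iteration of a CPU-bound loop, Close when done.
// The pacer and iterator types are stand-ins, not the actual
// rangefeed/admission implementations.
package pacersketch

import "context"

type pacer interface {
	Pace(ctx context.Context) error // may block until elastic CPU tokens are available
	Close()                         // returns any unused tokens
}

type catchupIter interface {
	Valid() bool
	Next()
	EmitEvent() error // stand-in for decoding the KV and sending it on the stream
}

// runCatchupScan is a sketch of a non-preemptible, CPU-bound loop that
// cooperatively yields to the elastic CPU granter via the pacer.
func runCatchupScan(ctx context.Context, it catchupIter, p pacer) error {
	defer p.Close()
	for ; it.Valid(); it.Next() {
		// Block here if no elastic CPU tokens are available, keeping each
		// uninterrupted CPU burst short and scheduling latencies in check.
		if err := p.Pace(ctx); err != nil {
			return err
		}
		if err := it.EmitEvent(); err != nil {
			return err
		}
	}
	return nil
}
```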
Closing this old issue since statements like "Most work initiated from within KV bypasses admission control. This is not appropriate especially for background work like GC, backups etc" are stale and have been addressed.
Misc TODOs/limitations in KV admission control:

- Most work initiated from within KV bypasses admission control, since it does not set an AdmissionHeader. (Export, GC are now subject to admission control -- kvserver,backupccl: make export, gc subject to admission control #71109 -- but there are TODOs regarding priority assignment.)
- There is no KVLowPriWork WorkKind, which means even low-priority KV work will get prioritized over user-facing SQL processing. (Edit: with the elastic classification, this is no longer true, since low-priority KV work goes through a different grant coordinator.)
- RangeFeedRequest is not subject to admission control.
- A WorkQueue that is hooked up to a Granter that only uses write overload as a resource signal.

Jira issue: CRDB-7813