kv,storage: re-consider compaction concurrency for multi-store nodes #74697

Open
irfansharif opened this issue Jan 11, 2022 · 5 comments
Assignees
sumeerbhola
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster T-storage Storage Team

Comments

@irfansharif (Contributor) commented Jan 11, 2022

Describe the problem

We use a default of 3 cores per store to run compactions (see COCKROACH_ROCKSDB_CONCURRENCY). For multi-store setups with insufficient cores, that may be far too many. It may also be that we want to update our guidance on the number of cores recommended for a given number of stores. In a recent escalation we observed that a high store count + compaction debt + a low core count led to a large percentage of the nodes' total CPU being consumed entirely by compactions. The CPU being pegged in this manner was disruptive to foreground traffic.

Currently the compaction concurrency for a store defaults to min(3, numCPUs). This isn't multi-store-aware at all, as we could have a lot of CPUs but not enough to give every store 3 of them for concurrent compactions.

Expected behavior

Automatic configuration of compaction concurrency to min(3, numCPUs/numStores) at the very least. Guidance for what an appropriate number of cores is for a given number of stores. Or compaction concurrency that reflects the total number of cores available relative to the total number of stores (presumably after experimentation of our own).
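For illustration, a minimal sketch of that default, assuming we also floor at one compaction per store; the function name and the floor of 1 are mine, not existing CockroachDB code:

```go
package main

import "fmt"

// defaultCompactionConcurrency sketches the store-count-aware default
// suggested above: min(3, numCPUs/numStores), floored at 1 so every
// store can always run at least one compaction. The name and the floor
// are assumptions for illustration.
func defaultCompactionConcurrency(numCPUs, numStores int) int {
	if numStores < 1 {
		numStores = 1
	}
	perStore := numCPUs / numStores
	if perStore < 1 {
		perStore = 1
	}
	if perStore > 3 {
		perStore = 3
	}
	return perStore
}

func main() {
	// 8 vCPUs spread across 8 stores: 1 concurrent compaction per store
	// instead of today's 3.
	fmt.Println(defaultCompactionConcurrency(8, 8)) // 1
	// 32 vCPUs across 4 stores: capped at the existing default of 3.
	fmt.Println(defaultCompactionConcurrency(32, 4)) // 3
}
```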

Jira issue: CRDB-12216

Epic CRDB-41111

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. labels Jan 11, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Jan 11, 2022
@jbowens (Collaborator) commented Jan 11, 2022

Linking this to cockroachdb/pebble#1329, the broader issue of adjusting resource utilization of background Pebble tasks.

Each store has independent disk-bandwidth and IOPS constraints, but CPU is shared. I think we'll need something adaptive, like what's discussed in cockroachdb/pebble#1329, to avoid saturating CPU while also sufficiently utilizing disk bandwidth.

@sumeerbhola (Collaborator)

For a non-adaptive solution, we could simply have a shared limit across stores. The difficulty is how to roll this out to existing CockroachDB users that have clusters with multiple stores. Presumably they have already fiddled with the individual store setting (or are fine with the default) -- we don't want them to suddenly have reduced concurrency. We could have something that only applies to new clusters, but that seems error prone.
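A minimal sketch of what such a non-adaptive shared limit could look like, assuming a node-wide semaphore that every store acquires before starting a compaction; the type and function names here are illustrative, not existing Pebble/CockroachDB code:

```go
package main

import "fmt"

// sharedCompactionLimiter is an illustrative node-wide cap on concurrent
// compactions, shared by every store on the node. Each store acquires a
// slot before starting a compaction and releases it afterwards.
type sharedCompactionLimiter struct {
	slots chan struct{}
}

func newSharedCompactionLimiter(maxConcurrent int) *sharedCompactionLimiter {
	return &sharedCompactionLimiter{slots: make(chan struct{}, maxConcurrent)}
}

// TryAcquire reports whether a compaction slot was available and, if so,
// claims it.
func (l *sharedCompactionLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a previously acquired slot.
func (l *sharedCompactionLimiter) Release() { <-l.slots }

func main() {
	// One limiter per node, handed to all stores at startup; each store's
	// own MaxConcurrentCompactions setting would remain as a local ceiling.
	lim := newSharedCompactionLimiter(3)
	if lim.TryAcquire() {
		fmt.Println("store 1: compaction admitted")
		lim.Release()
	}
}
```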

@jbowens jbowens moved this to 24.2 candidates in [Deprecated] Storage Jun 4, 2024
@BabuSrithar BabuSrithar added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Jul 26, 2024
@itsbilal (Contributor)

More context on the O-testcluster label: we've hit the issue of high CPU usage with compactions on multi-store DRT clusters and had to dial down compaction concurrency manually. Ideally this would be automated, so at least every store's max compaction concurrency setting gets set to min(3, numCPUs/numStores) as opposed to the current min(3, numCPUs).

@nameisbhaskar (Contributor)

Archive.zip
Uploading the CPU profiles of drt-large node 1. More details in the thread - https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1722423058416819

@itsbilal (Contributor) commented Aug 1, 2024

I did a quick analysis of large1.cpuprof.2024-07-29T23_58_53.227.80.pprof in the above comment, coming off of the drt-large cluster's n1. Looking at the Pebble logs from the node itself, I see that an avg of 4 concurrent compactions were live on the node in the 10 clock-seconds (= 160 cpu-seconds) the profile spans.

That would mean 40 cpu-seconds would go towards compactions in the profile if all a compaction did was CPU work. Instead, we see 36 profiled cpu-seconds in runCompaction, and of those 36, ~2s are in fread and ~2s in fwrite, leaving 32 cpu-seconds of non-IO CPU work, or around 80% of the 40s. From this we can estimate that roughly 80% of a compaction is CPU time, assuming sufficiently fast disks, which seems to be the case on drt-large because we have more NVMe local SSD bandwidth than we can drive with our (limited) CPUs.
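Restating that arithmetic as a small, self-contained snippet (every input is a number already quoted above; nothing new is assumed):

```go
package main

import "fmt"

func main() {
	// All inputs below are the numbers quoted in the comment above.
	const (
		avgConcurrentCompactions = 4.0  // from the node's Pebble logs
		profileClockSeconds      = 10.0 // wall-clock span of the profile
		runCompactionCPUSeconds  = 36.0 // profiled cpu-s in runCompaction
		freadCPUSeconds          = 2.0  // profiled cpu-s in fread
		fwriteCPUSeconds         = 2.0  // profiled cpu-s in fwrite
	)
	// Upper bound if a compaction were purely CPU work: 4 * 10 = 40 cpu-s.
	upperBound := avgConcurrentCompactions * profileClockSeconds
	// Non-IO CPU time attributed to compactions: 36 - 2 - 2 = 32 cpu-s.
	nonIO := runCompactionCPUSeconds - freadCPUSeconds - fwriteCPUSeconds
	// 32 / 40 = 0.8, i.e. ~80% of a compaction is CPU time.
	fmt.Printf("estimated CPU fraction of a compaction: %.0f%%\n", nonIO/upperBound*100)
}
```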

80% CPU utilization in a compaction does seem fairly high, but looking at where the CPU time is spent, it makes sense: most of it goes to decoding blocks and snappy-decompressing them, then encoding the write-side blocks and snappy-compressing them. I don't think the 80% estimate is significantly off from the true amount of CPU time spent in compactions, although on other clusters/machines where we drive IO/disk utilization higher than we do on drt-large, the CPU fraction is likely lower.

This estimate could be useful in trying to determine how to divvy-up CPUs for concurrent compactions on nodes that have a lot of stores.

anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this issue Aug 28, 2024
This change adds a new compaction pool which enforces a global max
compaction concurrency in a multi-store configuration. Each Pebble store
(i.e. an instance of *DB) still maintains its own per-store compaction
concurrency which is controlled by `opts.MaxConcurrentCompactions`.
However, in a multi-store configuration, disk I/O is a per-store resource
while CPU is shared across stores. A significant portion of compaction
is CPU-intensive, and so this ensures that excessive compactions don't
interrupt foreground CPU tasks even if the disks are capable of handling
the additional throughput from those compactions.

The shared compaction concurrency only applies to automatic compactions.
This means that delete-only compactions are excluded because they are
expected to be cheap, as are flushes because they should never be
blocked.

Fixes: cockroachdb#3813
Informs: cockroachdb/cockroach#74697
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 27, 2025
CompactionScheduler is an interface that encompasses (a) our current
compaction scheduling behavior, (b) compaction scheduling in a multi-store
setting that adds a per-node limit in addition to the per-store limit, and
prioritizes across stores, (c) compaction scheduling that includes (b) plus
is aware of resource usage and can prioritize across stores and across
other long-lived work in addition to compactions (e.g. range snapshot
reception).

CompactionScheduler calls into DB and the DB calls into the
CompactionScheduler. This requires some care in specification of the
synchronization expectations, to avoid deadlock. For the most part, the
complexity is borne by the CompactionScheduler -- see the code comments
for details.

ConcurrencyLimitScheduler is an implementation for (a), and is paired with
a single DB. It has no knowledge of delete-only compactions, so we have
redefined the meaning of Options.MaxConcurrentCompactions, as discussed
in the code comment.

CompactionScheduler has some related interfaces/structs:
- CompactionGrantHandle is used to report the start and end of the
  compaction, and frequently report the written bytes, and CPU consumption.
  In the implementation of CompactionGrantHandle provided by CockroachDB's
  AC component, the CPU consumption will use the grunning package.
- WaitingCompaction is a struct used to prioritize the DB's compaction
  relative to other long-lived work (including compactions by other DBs).
  makeWaitingCompaction is a helper that constructs this struct.

For integrating the CompactionScheduler with DB, there are a number of
changes:
- The entry paths to ask to schedule a compaction are reduced to 1, by
  removing DB.maybeScheduleCompactionPicker. The only path is
  DB.maybeScheduleCompaction.
- versionSet.{curCompactionConcurrency,pickedCompactionCache} are added
  to satisfy the interface expected by CompactionScheduler. Specifically,
  pickedCompactionCache allows us to safely cache a pickedCompaction
  that cannot be run. There is some commentary on the worst-case waste
  in compaction picking -- with the default ConcurrencyLimitScheduler
  on average there should be no wastage.
- versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two
  atomic variables introduced to implement DB.GetAllowedWithoutPermission,
  which allows the DB to adjust what minimum compaction concurrency it
  desires based on the backlog of automatic and manual compactions. The
  encoded logic is meant to be equivalent to our current logic.

The CompactionSlot and CompactionLimiter introduced in a recent PR are
deleted. CompactionGrantHandle is analogous to CompactionSlot, and allows for
measuring of CPU usage since the implementation of CompactionScheduler in AC
will explicitly monitor usage and capacity. CompactionScheduler is analogous to
CompactionLimiter. CompactionLimiter had a non-queueing interface in
that it never called into the DB. This worked since the only events that
allowed another compaction to run were also ones that caused another
call to maybeScheduleCompaction. This is not true when a
CompactionScheduler is scheduling across multiple DBs, or managing a
compaction and other long-lived work (snapshot reception), since something
unrelated to the DB can cause resources to become available to run a
compaction.

There is a partial implementation of a resource aware scheduler in
https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs cockroachdb#3813, cockroachdb/cockroach#74697, cockroachdb#1329
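Going purely off the description in this commit message, the sketch below shows one possible shape for these interfaces; aside from the type names and GetAllowedWithoutPermission, which the message itself mentions, every method name and signature is an illustrative guess, not the actual Pebble API:

```go
// Package pebblesketch is illustrative only; none of these declarations
// are the actual Pebble API, just a reading of the commit message above.
package pebblesketch

// CompactionScheduler captures the scheduling behavior described above:
// the DB asks the scheduler for permission to run compactions, and the
// scheduler calls back into the DB when capacity frees up.
type CompactionScheduler interface {
	// Register associates a DB with the scheduler (signature assumed).
	Register(db DBForCompaction)
	// TrySchedule asks whether a waiting compaction may run now; if it
	// cannot, the scheduler is expected to call back into the DB later.
	TrySchedule(wc WaitingCompaction) (CompactionGrantHandle, bool)
}

// CompactionGrantHandle reports the start and end of a granted compaction
// and periodically reports written bytes and CPU consumption (per the
// commit message, CockroachDB's AC implementation would use grunning).
type CompactionGrantHandle interface {
	Started()
	CumulativeStats(writtenBytes uint64, cpuNanos int64)
	Done()
}

// WaitingCompaction carries whatever the scheduler needs to prioritize
// this DB's compaction against other long-lived work; the field here is
// a placeholder.
type WaitingCompaction struct {
	Score float64
}

// DBForCompaction is the callback surface a DB exposes to the scheduler.
type DBForCompaction interface {
	// GetAllowedWithoutPermission is named in the commit message; the
	// return value here is a guess (a minimum desired concurrency).
	GetAllowedWithoutPermission() int
	// RunScheduledCompaction is a hypothetical callback the scheduler
	// would use to tell the DB it may start a compaction now.
	RunScheduledCompaction(grant CompactionGrantHandle) bool
}
```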
@sumeerbhola sumeerbhola self-assigned this Feb 14, 2025