
db: introduce CompactionScheduler and integrate with DB #4297

Open: wants to merge 1 commit into master from the compaction_scheduler branch

Conversation

sumeerbhola (Collaborator)

CompactionScheduler is an interface that encompasses: (a) our current compaction scheduling behavior; (b) compaction scheduling in a multi-store setting, which adds a per-node limit on top of the per-store limit and prioritizes across stores; and (c) compaction scheduling that includes (b) and is additionally aware of resource usage, so it can prioritize across stores and across other long-lived work besides compactions (e.g. range snapshot reception).

CompactionScheduler calls into the DB and the DB calls into the CompactionScheduler. This requires some care in specifying the synchronization expectations, to avoid deadlock. For the most part, the complexity is borne by the CompactionScheduler -- see the code comments for details.
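
To make the bidirectional contract concrete, here is a minimal Go sketch of the two sides. The names DBForCompaction, TrySchedule, GetWaitingCompaction, Schedule, and GetAllowedWithoutPermission all appear in this PR or its review, but the exact signatures below are illustrative rather than the PR's final API (WaitingCompaction and CompactionGrantHandle are sketched further below):

    // Illustrative sketch only; signatures are inferred from the discussion
    // in this PR and may not match the actual code.

    // DBForCompaction is the side of the DB that the scheduler calls into.
    type DBForCompaction interface {
        // GetAllowedWithoutPermission returns the number of compactions the
        // DB wants to run regardless of scheduler grants, based on its
        // backlog.
        GetAllowedWithoutPermission() int
        // GetWaitingCompaction returns whether the DB has a compaction
        // waiting to run, and if so, the information used to prioritize it
        // against other long-lived work. It may be called repeatedly, and
        // the DB must return its freshest state each time.
        GetWaitingCompaction() (bool, WaitingCompaction)
        // Schedule grants the DB permission to run a compaction under the
        // given grant handle. The return value is assumed here to indicate
        // whether the DB actually had a compaction to run.
        Schedule(CompactionGrantHandle) bool
    }

    // CompactionScheduler is the side that the DB calls into.
    type CompactionScheduler interface {
        // TrySchedule is called by the DB when it wants to run a
        // compaction. The scheduler may grant immediately, or decline and
        // later call DBForCompaction.Schedule when resources free up.
        TrySchedule() bool
    }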

ConcurrencyLimitScheduler is an implementation for (a), and is paired with a single DB. It has no knowledge of delete-only compactions, so we have redefined the meaning of Options.MaxConcurrentCompactions, as discussed in the code comment.
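
As a rough illustration of (a), a concurrency-limit-only scheduler can be little more than a counter guarded by a mutex. This is a hedged sketch, not the PR's implementation; the field names and the compactionEnded hook are assumptions (it assumes import "sync"):

    // Hedged sketch of a concurrency-limit-only scheduler paired with a
    // single DB; not the PR's code.
    type ConcurrencyLimitScheduler struct {
        mu      sync.Mutex
        running int // compactions currently granted
        limit   int // e.g. derived from Options.MaxConcurrentCompactions
    }

    // TrySchedule grants a slot if one is free. For a single DB, every
    // event that frees a slot also causes another call to
    // maybeScheduleCompaction (see the discussion of CompactionLimiter
    // below), so this scheduler never needs to call back into the DB on
    // its own.
    func (s *ConcurrencyLimitScheduler) TrySchedule() bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.running >= s.limit {
            return false
        }
        s.running++
        return true
    }

    // compactionEnded is assumed to be invoked (e.g. via the grant handle)
    // when a granted compaction finishes, freeing a slot.
    func (s *ConcurrencyLimitScheduler) compactionEnded() {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.running--
    }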

CompactionScheduler has some related interfaces/structs:

  • CompactionGrantHandle is used to report the start and end of a compaction, and to frequently report its written bytes and CPU consumption. In the implementation of CompactionGrantHandle provided by CockroachDB's AC component, CPU consumption will be measured using the grunning package.
  • WaitingCompaction is a struct used to prioritize the DB's compaction relative to other long-lived work (including compactions by other DBs). makeWaitingCompaction is a helper that constructs this struct. A sketch of both types follows this list.
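
A hedged sketch of these two types: the field set of WaitingCompaction is inferred from the review discussion below (instances compare as an (Optional, Priority, Score) tuple), and Started/MeasureCPU appear in the reviewed code; the exact field types, the MeasureCPU signature, and the omitted reporting methods are assumptions:

    // Sketch of WaitingCompaction; exact field types are assumed.
    type WaitingCompaction struct {
        // Optional is true for a compaction that is not needed for an
        // overall healthy LSM. Optional work loses to non-optional work.
        Optional bool
        // Priority is only compared across work with the same Optional
        // value.
        Priority int
        // Score is only compared across work with the same Optional and
        // Priority values.
        Score float64
    }

    // Sketch of CompactionGrantHandle. The handle is also used to
    // frequently report written bytes and to report completion; those
    // method names are not shown in this PR's excerpts, so they are
    // omitted here.
    type CompactionGrantHandle interface {
        // Started reports that the compaction has started executing. There
        // is no requirement to call it on the same goroutine that runs the
        // compaction.
        Started()
        // MeasureCPU is called from each of the goroutines that consume
        // CPU during the compaction (signature assumed).
        MeasureCPU()
    }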

For integrating the CompactionScheduler with DB, there are a number of changes:

  • The entry points for asking to schedule a compaction are reduced to one by removing DB.maybeScheduleCompactionPicker; the only remaining path is DB.maybeScheduleCompaction.
  • versionSet.{curCompactionConcurrency,pickedCompactionCache} are added to satisfy the interface expected by CompactionScheduler. Specifically, pickedCompactionCache allows us to safely cache a pickedCompaction that cannot be run. There is some commentary on the worst-case waste in compaction picking -- with the default ConcurrencyLimitScheduler on average there should be no wastage.
  • versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two atomic variables introduced to implement DB.GetAllowedWithoutPermission, which lets the DB adjust the minimum compaction concurrency it desires based on its backlog of automatic and manual compactions. The encoded logic is meant to be equivalent to our current logic; a hedged sketch follows this list.
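
A purely illustrative sketch of that last bullet: the two atomics let the scheduler call DB.GetAllowedWithoutPermission without acquiring DB.mu. The formula below is an assumption; the PR's actual logic is only described as equivalent to our current logic and is not reproduced here.

    // Purely illustrative; not the PR's code.
    func (d *DB) GetAllowedWithoutPermission() int {
        // Both values are atomics, so the scheduler can call this method
        // without holding DB.mu.
        automatic := int(d.mu.versions.curCompactionConcurrency.Load())
        manualBacklog := int(d.mu.compact.manualLen.Load())
        // Assumed shape: always allow at least one compaction, grow the
        // allowance with the manual backlog, and never exceed the
        // configured maximum.
        allowed := max(1, automatic)
        if manualBacklog > 0 {
            allowed += manualBacklog
        }
        return min(allowed, d.opts.MaxConcurrentCompactions())
    }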

The CompactionSlot and CompactionLimiter introduced in a recent PR are deleted. CompactionGrantHandle is analogous to CompactionSlot, and additionally allows measuring CPU usage, since the implementation of CompactionScheduler in AC will explicitly monitor usage and capacity. CompactionScheduler is analogous to CompactionLimiter. CompactionLimiter had a non-queueing interface, in that it never called into the DB. That worked because the only events that allowed another compaction to run were also ones that caused another call to maybeScheduleCompaction. This is no longer true when a CompactionScheduler is scheduling across multiple DBs, or managing a compaction alongside other long-lived work (snapshot reception), since something unrelated to the DB can cause resources to become available to run a compaction.

There is a partial implementation of a resource-aware scheduler in https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs #3813, cockroachdb/cockroach#74697, #1329

@sumeerbhola requested review from a team and @aadityasondhi on January 27, 2025 22:21

@sumeerbhola (Collaborator, Author) left a comment


Consider this a prototype. Once I have some experimental results (with CockroachDB integration), I will start cleaning this up into smaller changes for review. But it gives a good sense of the kind of changes this will entail.

Existing Pebble tests pass and stress meta ran for 3+ hours without failure.

Reviewable status: 0 of 22 files reviewed, all discussions resolved (waiting on @aadityasondhi)

@sumeerbhola force-pushed the compaction_scheduler branch 3 times, most recently from 9070f6a to a8b60cf on February 5, 2025 16:53
@RaduBerinde (Member) left a comment


Reviewable status: 0 of 22 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi)


compaction_scheduler.go line 76 at r1 (raw file):

	// by CompactionScheduler is long-lived (often executing for multiple
	// seconds).
	GetWaitingCompaction() (bool, WaitingCompaction)

In practice, would this be called more than once after TrySchedule? (if not, we can just pass the compaction information via TrySchedule; if yes, when? periodically?)


compaction_scheduler.go line 77 at r1 (raw file):

	// seconds).
	GetWaitingCompaction() (bool, WaitingCompaction)
	// Schedule grants the DB permission to run a compaction. The DB returns

Isn't this specifically permission to run the compaction it last returned via GetWaitingCompaction?


compaction_scheduler.go line 90 at r1 (raw file):

	// overall healthy LSM. This value can be compared across compactions and
	// other long-lived work.
	Optional bool

[nit] Isn't having a priority enough?


internal/base/compaction_grant_handle.go line 26 at r1 (raw file):

	// requirement to call it on the same goroutine that runs the compaction.
	Started()
	// MeasureCPU is called from each of the two goroutines that consume CPU

[nit] What does it do?

@sumeerbhola force-pushed the compaction_scheduler branch 2 times, most recently from 0f515ce to 9c0418d on February 14, 2025 14:04
@sumeerbhola (Collaborator, Author) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @RaduBerinde)


compaction_scheduler.go line 76 at r1 (raw file):

Previously, RaduBerinde wrote…

In practice, would this be called more than once after TrySchedule? (if not, we can just pass the compaction information via TrySchedule; if yes, when? periodically?)

Yes, it can be called many times, with no bound. The scheduler could keep calling it whenever it has some capacity, and pick some other long-lived work instead (another store's compaction, an incoming range snapshot). Hence the caching logic in pickedCompactionCache; the resulting worst-case overhead is discussed in a comment below.

The reason we don't pass this info in TrySchedule is that the scheduler has no way to know when it becomes stale. So it becomes Pebble's responsibility to provide the freshest state (either from the cache, or by picking a new compaction) when asked.
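
A hedged sketch of that DB-side responsibility (pickedCompactionCache and makeWaitingCompaction are names from this PR; the locking, the cache's method names, and the pickAnyCompaction helper are assumptions):

    // Illustrative only: return the freshest waiting compaction when asked.
    func (d *DB) GetWaitingCompaction() (bool, WaitingCompaction) {
        d.mu.Lock()
        defer d.mu.Unlock()
        // Reuse a previously picked compaction if one is cached; this is
        // what bounds the worst-case waste in compaction picking.
        if pc := d.mu.versions.pickedCompactionCache.peek(); pc != nil {
            return true, makeWaitingCompaction(pc) // signature assumed
        }
        // Otherwise pick afresh and cache the result, since the scheduler
        // may call again before it grants permission.
        pc := d.pickAnyCompaction() // assumed helper
        if pc == nil {
            return false, WaitingCompaction{}
        }
        d.mu.versions.pickedCompactionCache.add(pc) // assumed method name
        return true, makeWaitingCompaction(pc)
    }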


compaction_scheduler.go line 77 at r1 (raw file):

Previously, RaduBerinde wrote…

Isn't this specifically permission to run the compaction it last returned via GetWaitingCompaction?

Kind of. Pebble's state may have changed and it may run some other compaction. In practice, we are assuming that a DBForCompaction will run either what was returned in GetWaitingCompaction or something even more important.


compaction_scheduler.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] Isn't having a priority enough?

This part is very much subject to revision. The reason I went with this for the prototype is that we don't just have compactions vying for resources: incoming range snapshots also consume disk and CPU, so we would need a prioritization scheme that works across the two. Instead, we classify some Pebble compactions as optional, where the optional bit is decided by Pebble (not by the integration code in CockroachDB), and optional=true becomes the lowest importance. When the optional bit is the same, compactions win over incoming range snapshots.

@RaduBerinde (Member) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @sumeerbhola)


compaction_scheduler.go line 77 at r1 (raw file):

Previously, sumeerbhola wrote…

Kind of. Pebble's state may have changed and it may run some other compaction. In practice, we are assuming that a DBForCompaction will run either what was returned in GetWaitingCompaction or something even more important.

I think that's reasonable. We could make that explicit. Schedule(CompactionGrantHandle, Priority): permission granted to schedule a compaction of priority at least as high as the argument.


compaction_scheduler.go line 90 at r1 (raw file):

Previously, sumeerbhola wrote…

This part is very much subject to revision. The reason I went with this for the prototype is that we don't just have compactions that are vying for resources: incoming range snapshots also consume disk and cpu. And so we would need a way to come up with a prioritization scheme that worked across the two. Instead of that, we classify some Pebble compactions as optional, where this optional bit is decided by Pebble (not the integration code in CockroachDB), and then optional=true becomes the lowest importance. And when the optional bit is the same, compactions win over incoming range snapshots.

What I'm missing is why that isn't equivalent to having a single integer, where some of the integer range is reserved for optional compactions and some other part of the range is for non-optional compactions.

@sumeerbhola (Collaborator, Author) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @RaduBerinde)


compaction_scheduler.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

What I'm missing is why isn't that equivalent to having a single integer, where some of the integer range is only for optional compactions and some other part of the range is for non-optional compactions?

That is doable. But note that the heap comparator is a cleaner tuple comparison when optional is kept as a separate field: https://github.com/sumeerbhola/cockroach/blob/435f3332e75c89fa5ec27b981edf92110ee8eb32/pkg/util/admission/admit_long/granter.go#L942-L955 (from cockroachdb/cockroach#141501)
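
For concreteness, the tuple comparison being described can be written directly. This is a sketch of the idea, not a copy of the linked code; it assumes larger Priority and Score values are more important:

    // isHigherImportance reports whether a should be preferred over b:
    // non-optional work first, then higher Priority, then higher Score.
    func isHigherImportance(a, b WaitingCompaction) bool {
        if a.Optional != b.Optional {
            return !a.Optional
        }
        if a.Priority != b.Priority {
            return a.Priority > b.Priority
        }
        return a.Score > b.Score
    }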

@RaduBerinde (Member) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @sumeerbhola)


compaction_scheduler.go line 90 at r1 (raw file):

Previously, sumeerbhola wrote…

That is doable. But note the comparator for the heap is a cleaner tuple comparison due to keeping optional as a separate field https://github.com/sumeerbhola/cockroach/blob/435f3332e75c89fa5ec27b981edf92110ee8eb32/pkg/util/admission/admit_long/granter.go#L942-L955 (from cockroachdb/cockroach#141501)

Understood. So we are comparing them like an (Optional, Priority, Score) tuple; that wasn't clear from the comments. Maybe say for Priority: "it is only compared across compactions with the same Optional value", and for Score: "it is only compared across compactions with the same Optional and Priority values". Or just say at the top that it represents, and compares like, an (Optional, Priority, Score) tuple.

@sumeerbhola force-pushed the compaction_scheduler branch 2 times, most recently from a8b60cf to 9e66e28 on February 28, 2025 23:17
@sumeerbhola requested a review from @jbowens on February 28, 2025 23:22
@sumeerbhola marked this pull request as ready for review on February 28, 2025 23:22
@sumeerbhola (Collaborator, Author) left a comment


@RaduBerinde @jbowens
Now that the approach has been partially validated with the integration prototype in cockroachdb/cockroach#141501 and the results in cockroachdb/cockroach#136615 (comment), I'd like to get a detailed review and then merge this PR.
I owe a test for ConcurrencyLimitScheduler, which will be quite straightforward; I'd like to wait on that until the rest is fully reviewed, since interface or other changes can affect ConcurrencyLimitScheduler.

Reviewable status: 0 of 23 files reviewed, 3 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)


compaction_scheduler.go line 77 at r1 (raw file):

Previously, RaduBerinde wrote…

I think that's reasonable. We could make that explicit. Schedule(CompactionGrantHandle, Priority): permission granted to schedule a compaction of priority at least as high as the argument.

We could. But it isn't clear that this race is worth complicating the interface and bookkeeping for, since for Pebble compactions the importance of what the DB wants to run can't really decrease (there is just a backlog of work that keeps increasing).


compaction_scheduler.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

Understood. So we are comparing them like a (Optional, Priority, Score) tuple; it wasn't clear from the comments. Maybe say for priority: "it is only compared across compactions with the same Optional value" and for score: "it is only compared across compactions with the same Optional and Priority values. Or just say at the top that it represents and compares like an (Optional, Priority, Score).

Done


internal/base/compaction_grant_handle.go line 26 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] What does it do?

Done.


internal/compact/run.go line 194 at r3 (raw file):

			})
			// TODO: need to give the GrantHandle to the writer so it can account on
			// all its goroutines.

I will do this in a follow-up PR, so that the main interface and refactoring can be merged first.

@sumeerbhola force-pushed the compaction_scheduler branch from 9e66e28 to a4a526f on March 2, 2025 16:20