
db: introduce CompactionScheduler and integrate with DB #4297

Open: wants to merge 1 commit into master from the compaction_scheduler branch

Conversation

sumeerbhola (Collaborator)

CompactionScheduler is an interface that encompasses: (a) our current compaction scheduling behavior; (b) compaction scheduling in a multi-store setting, which adds a per-node limit on top of the per-store limit and prioritizes across stores; and (c) compaction scheduling that includes (b) and is additionally aware of resource usage, so it can prioritize across stores and across other long-lived work besides compactions (e.g. range snapshot reception).

CompactionScheduler calls into the DB and the DB calls into the CompactionScheduler. This requires some care in specifying the synchronization expectations, to avoid deadlock. For the most part, the complexity is borne by the CompactionScheduler -- see the code comments for details.
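
To make the bidirectional contract concrete, here is a minimal Go sketch of the two sides. The names DBForCompaction, TrySchedule, GetWaitingCompaction, Schedule, and GetAllowedWithoutPermission all appear in this PR or its review, but the exact signatures below are illustrative rather than the PR's final API (WaitingCompaction and CompactionGrantHandle are sketched further below):

    // Illustrative sketch only; signatures are inferred from the discussion
    // in this PR and may not match the actual code.

    // DBForCompaction is the side of the DB that the scheduler calls into.
    type DBForCompaction interface {
        // GetAllowedWithoutPermission returns the number of compactions the
        // DB wants to run regardless of scheduler grants, based on its
        // backlog.
        GetAllowedWithoutPermission() int
        // GetWaitingCompaction returns whether the DB has a compaction
        // waiting to run, and if so, the information used to prioritize it
        // against other long-lived work. It may be called repeatedly, and
        // the DB must return its freshest state each time.
        GetWaitingCompaction() (bool, WaitingCompaction)
        // Schedule grants the DB permission to run a compaction under the
        // given grant handle. The return value is assumed here to indicate
        // whether the DB actually had a compaction to run.
        Schedule(CompactionGrantHandle) bool
    }

    // CompactionScheduler is the side that the DB calls into.
    type CompactionScheduler interface {
        // TrySchedule is called by the DB when it wants to run a
        // compaction. The scheduler may grant immediately, or decline and
        // later call DBForCompaction.Schedule when resources free up.
        TrySchedule() bool
    }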

ConcurrencyLimitScheduler is an implementation for (a), and is paired with a single DB. It has no knowledge of delete-only compactions, so we have redefined the meaning of Options.MaxConcurrentCompactions, as discussed in the code comment.
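
As a rough illustration of (a), a concurrency-limit-only scheduler can be little more than a counter guarded by a mutex. This is a hedged sketch, not the PR's implementation; the field names and the compactionEnded hook are assumptions (it assumes import "sync"):

    // Hedged sketch of a concurrency-limit-only scheduler paired with a
    // single DB; not the PR's code.
    type ConcurrencyLimitScheduler struct {
        mu      sync.Mutex
        running int // compactions currently granted
        limit   int // e.g. derived from Options.MaxConcurrentCompactions
    }

    // TrySchedule grants a slot if one is free. For a single DB, every
    // event that frees a slot also causes another call to
    // maybeScheduleCompaction (see the discussion of CompactionLimiter
    // below), so this scheduler never needs to call back into the DB on
    // its own.
    func (s *ConcurrencyLimitScheduler) TrySchedule() bool {
        s.mu.Lock()
        defer s.mu.Unlock()
        if s.running >= s.limit {
            return false
        }
        s.running++
        return true
    }

    // compactionEnded is assumed to be invoked (e.g. via the grant handle)
    // when a granted compaction finishes, freeing a slot.
    func (s *ConcurrencyLimitScheduler) compactionEnded() {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.running--
    }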

CompactionScheduler has some related interfaces/structs:

  • CompactionGrantHandle is used to report the start and end of a compaction, and to frequently report its written bytes and CPU consumption. In the implementation of CompactionGrantHandle provided by CockroachDB's AC component, CPU consumption will be measured using the grunning package.
  • WaitingCompaction is a struct used to prioritize the DB's compaction relative to other long-lived work (including compactions by other DBs). makeWaitingCompaction is a helper that constructs this struct. A sketch of both types follows this list.
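
A hedged sketch of these two types: the field set of WaitingCompaction is inferred from the review discussion below (instances compare as an (Optional, Priority, Score) tuple), and Started/MeasureCPU appear in the reviewed code; the exact field types, the MeasureCPU signature, and the omitted reporting methods are assumptions:

    // Sketch of WaitingCompaction; exact field types are assumed.
    type WaitingCompaction struct {
        // Optional is true for a compaction that is not needed for an
        // overall healthy LSM. Optional work loses to non-optional work.
        Optional bool
        // Priority is only compared across work with the same Optional
        // value.
        Priority int
        // Score is only compared across work with the same Optional and
        // Priority values.
        Score float64
    }

    // Sketch of CompactionGrantHandle. The handle is also used to
    // frequently report written bytes and to report completion; those
    // method names are not shown in this PR's excerpts, so they are
    // omitted here.
    type CompactionGrantHandle interface {
        // Started reports that the compaction has started executing. There
        // is no requirement to call it on the same goroutine that runs the
        // compaction.
        Started()
        // MeasureCPU is called from each of the goroutines that consume
        // CPU during the compaction (signature assumed).
        MeasureCPU()
    }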

For integrating the CompactionScheduler with DB, there are a number of changes:

  • The entry points for asking to schedule a compaction are reduced to one by removing DB.maybeScheduleCompactionPicker; the only remaining path is DB.maybeScheduleCompaction.
  • versionSet.{curCompactionConcurrency,pickedCompactionCache} are added to satisfy the interface expected by CompactionScheduler. Specifically, pickedCompactionCache allows us to safely cache a pickedCompaction that cannot be run. There is some commentary on the worst-case waste in compaction picking -- with the default ConcurrencyLimitScheduler on average there should be no wastage.
  • versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two atomic variables introduced to implement DB.GetAllowedWithoutPermission, which lets the DB adjust the minimum compaction concurrency it desires based on its backlog of automatic and manual compactions. The encoded logic is meant to be equivalent to our current logic; a hedged sketch follows this list.
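
A purely illustrative sketch of that last bullet: the two atomics let the scheduler call DB.GetAllowedWithoutPermission without acquiring DB.mu. The formula below is an assumption; the PR's actual logic is only described as equivalent to our current logic and is not reproduced here.

    // Purely illustrative; not the PR's code.
    func (d *DB) GetAllowedWithoutPermission() int {
        // Both values are atomics, so the scheduler can call this method
        // without holding DB.mu.
        automatic := int(d.mu.versions.curCompactionConcurrency.Load())
        manualBacklog := int(d.mu.compact.manualLen.Load())
        // Assumed shape: always allow at least one compaction, grow the
        // allowance with the manual backlog, and never exceed the
        // configured maximum.
        allowed := max(1, automatic)
        if manualBacklog > 0 {
            allowed += manualBacklog
        }
        return min(allowed, d.opts.MaxConcurrentCompactions())
    }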

The CompactionSlot and CompactionLimiter introduced in a recent PR are deleted. CompactionGrantHandle is analogous to CompactionSlot, and additionally allows measuring CPU usage, since the implementation of CompactionScheduler in AC will explicitly monitor usage and capacity. CompactionScheduler is analogous to CompactionLimiter. CompactionLimiter had a non-queueing interface, in that it never called into the DB. That worked because the only events that allowed another compaction to run were also ones that caused another call to maybeScheduleCompaction. This is no longer true when a CompactionScheduler is scheduling across multiple DBs, or managing a compaction alongside other long-lived work (snapshot reception), since something unrelated to the DB can cause resources to become available to run a compaction.

There is a partial implementation of a resource-aware scheduler in https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs #3813, cockroachdb/cockroach#74697, #1329

@sumeerbhola requested review from a team and @aadityasondhi on January 27, 2025 22:21

@sumeerbhola (Collaborator, Author) left a comment


Consider this a prototype. Once I have some experimental results (with CockroachDB integration), I will start cleaning this up into smaller changes for review. But it gives a good sense of the kind of changes this will entail.

Existing Pebble tests pass and stress meta ran for 3+ hours without failure.

Reviewable status: 0 of 22 files reviewed, all discussions resolved (waiting on @aadityasondhi)

@sumeerbhola force-pushed the compaction_scheduler branch 3 times, most recently from 9070f6a to a8b60cf on February 5, 2025 16:53
@RaduBerinde (Member) left a comment


Reviewable status: 0 of 22 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi)


compaction_scheduler.go line 76 at r1 (raw file):

	// by CompactionScheduler is long-lived (often executing for multiple
	// seconds).
	GetWaitingCompaction() (bool, WaitingCompaction)

In practice, would this be called more than once after TrySchedule? (if not, we can just pass the compaction information via TrySchedule; if yes, when? periodically?)


compaction_scheduler.go line 77 at r1 (raw file):

	// seconds).
	GetWaitingCompaction() (bool, WaitingCompaction)
	// Schedule grants the DB permission to run a compaction. The DB returns

Isn't this specifically permission to run the compaction it last returned via GetWaitingCompaction?


compaction_scheduler.go line 90 at r1 (raw file):

	// overall healthy LSM. This value can be compared across compactions and
	// other long-lived work.
	Optional bool

[nit] Isn't having a priority enough?


internal/base/compaction_grant_handle.go line 26 at r1 (raw file):

	// requirement to call it on the same goroutine that runs the compaction.
	Started()
	// MeasureCPU is called from each of the two goroutines that consume CPU

[nit] What does it do?

@sumeerbhola force-pushed the compaction_scheduler branch 2 times, most recently from 0f515ce to 9c0418d on February 14, 2025 14:04
@sumeerbhola (Collaborator, Author) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @RaduBerinde)


compaction_scheduler.go line 76 at r1 (raw file):

Previously, RaduBerinde wrote…

In practice, would this be called more than once after TrySchedule? (if not, we can just pass the compaction information via TrySchedule; if yes, when? periodically?)

Yes, it can be called many times, with no bound. The scheduler could keep calling it whenever it has some capacity, and pick some other long-lived work instead (another store's compaction, an incoming range snapshot). Hence the caching logic in pickedCompactionCache; the resulting worst-case overhead is discussed in a comment below.

The reason we don't pass this info in TrySchedule is that the scheduler has no way to know when it becomes stale. So it becomes Pebble's responsibility to provide the freshest state (either from the cache, or by picking a new compaction) when asked.
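
A hedged sketch of that DB-side responsibility (pickedCompactionCache and makeWaitingCompaction are names from this PR; the locking, the cache's method names, and the pickAnyCompaction helper are assumptions):

    // Illustrative only: return the freshest waiting compaction when asked.
    func (d *DB) GetWaitingCompaction() (bool, WaitingCompaction) {
        d.mu.Lock()
        defer d.mu.Unlock()
        // Reuse a previously picked compaction if one is cached; this is
        // what bounds the worst-case waste in compaction picking.
        if pc := d.mu.versions.pickedCompactionCache.peek(); pc != nil {
            return true, makeWaitingCompaction(pc) // signature assumed
        }
        // Otherwise pick afresh and cache the result, since the scheduler
        // may call again before it grants permission.
        pc := d.pickAnyCompaction() // assumed helper
        if pc == nil {
            return false, WaitingCompaction{}
        }
        d.mu.versions.pickedCompactionCache.add(pc) // assumed method name
        return true, makeWaitingCompaction(pc)
    }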


compaction_scheduler.go line 77 at r1 (raw file):

Previously, RaduBerinde wrote…

Isn't this specifically permission to run the compaction it last returned via GetWaitingCompaction?

Kind of. Pebble's state may have changed and it may run some other compaction. In practice, we are assuming that a DBForCompaction will run either what was returned in GetWaitingCompaction or something even more important.


compaction_scheduler.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] Isn't having a priority enough?

This part is very much subject to revision. The reason I went with this for the prototype is that we don't just have compactions vying for resources: incoming range snapshots also consume disk and CPU, so we would need a prioritization scheme that works across the two. Instead, we classify some Pebble compactions as optional, where the optional bit is decided by Pebble (not by the integration code in CockroachDB), and optional=true becomes the lowest importance. When the optional bit is the same, compactions win over incoming range snapshots.

@RaduBerinde (Member) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @sumeerbhola)


compaction_scheduler.go line 77 at r1 (raw file):

Previously, sumeerbhola wrote…

Kind of. Pebble's state may have changed and it may run some other compaction. In practice, we are assuming that a DBForCompaction will run either what was returned in GetWaitingCompaction or something even more important.

I think that's reasonable. We could make that explicit. Schedule(CompactionGrantHandle, Priority): permission granted to schedule a compaction of priority at least as high as the argument.


compaction_scheduler.go line 90 at r1 (raw file):

Previously, sumeerbhola wrote…

This part is very much subject to revision. The reason I went with this for the prototype is that we don't just have compactions that are vying for resources: incoming range snapshots also consume disk and cpu. And so we would need a way to come up with a prioritization scheme that worked across the two. Instead of that, we classify some Pebble compactions as optional, where this optional bit is decided by Pebble (not the integration code in CockroachDB), and then optional=true becomes the lowest importance. And when the optional bit is the same, compactions win over incoming range snapshots.

What I'm missing is why that isn't equivalent to having a single integer, where some of the integer range is reserved for optional compactions and some other part of the range is for non-optional compactions.

@sumeerbhola (Collaborator, Author) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @RaduBerinde)


compaction_scheduler.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

What I'm missing is why isn't that equivalent to having a single integer, where some of the integer range is only for optional compactions and some other part of the range is for non-optional compactions?

That is doable. But note that the heap comparator is a cleaner tuple comparison when optional is kept as a separate field: https://github.com/sumeerbhola/cockroach/blob/435f3332e75c89fa5ec27b981edf92110ee8eb32/pkg/util/admission/admit_long/granter.go#L942-L955 (from cockroachdb/cockroach#141501)
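
For concreteness, the tuple comparison being described can be written directly. This is a sketch of the idea, not a copy of the linked code; it assumes larger Priority and Score values are more important:

    // isHigherImportance reports whether a should be preferred over b:
    // non-optional work first, then higher Priority, then higher Score.
    func isHigherImportance(a, b WaitingCompaction) bool {
        if a.Optional != b.Optional {
            return !a.Optional
        }
        if a.Priority != b.Priority {
            return a.Priority > b.Priority
        }
        return a.Score > b.Score
    }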

@RaduBerinde (Member) left a comment


Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @sumeerbhola)


compaction_scheduler.go line 90 at r1 (raw file):

Previously, sumeerbhola wrote…

That is doable. But note the comparator for the heap is a cleaner tuple comparison due to keeping optional as a separate field https://github.com/sumeerbhola/cockroach/blob/435f3332e75c89fa5ec27b981edf92110ee8eb32/pkg/util/admission/admit_long/granter.go#L942-L955 (from cockroachdb/cockroach#141501)

Understood. So we are comparing them like an (Optional, Priority, Score) tuple; that wasn't clear from the comments. Maybe say for Priority: "it is only compared across compactions with the same Optional value", and for Score: "it is only compared across compactions with the same Optional and Priority values". Or just say at the top that it represents, and compares like, an (Optional, Priority, Score) tuple.

@sumeerbhola force-pushed the compaction_scheduler branch 2 times, most recently from a8b60cf to 9e66e28 on February 28, 2025 23:17
@sumeerbhola requested a review from @jbowens on February 28, 2025 23:22
@sumeerbhola marked this pull request as ready for review on February 28, 2025 23:22
@sumeerbhola (Collaborator, Author) left a comment


@RaduBerinde @jbowens
Now that the approach has been partially validated with the integration prototype in cockroachdb/cockroach#141501 and the results in cockroachdb/cockroach#136615 (comment), I'd like to get a detailed review and then merge this PR.
I owe a test for ConcurrencyLimitScheduler, which will be quite straightforward; I'd like to wait on that until the rest is fully reviewed, since interface or other changes can affect ConcurrencyLimitScheduler.

Reviewable status: 0 of 23 files reviewed, 3 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)


compaction_scheduler.go line 77 at r1 (raw file):

Previously, RaduBerinde wrote…

I think that's reasonable. We could make that explicit. Schedule(CompactionGrantHandle, Priority): permission granted to schedule a compaction of priority at least as high as the argument.

We could. But it isn't clear that this race is worth complicating the interface and bookkeeping for, since for Pebble compactions the importance of what the DB wants to run can't really decrease (there is just a backlog of work that keeps increasing).


compaction_scheduler.go line 90 at r1 (raw file):

Previously, RaduBerinde wrote…

Understood. So we are comparing them like a (Optional, Priority, Score) tuple; it wasn't clear from the comments. Maybe say for priority: "it is only compared across compactions with the same Optional value" and for score: "it is only compared across compactions with the same Optional and Priority values. Or just say at the top that it represents and compares like an (Optional, Priority, Score).

Done


internal/base/compaction_grant_handle.go line 26 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] What does it do?

Done.


internal/compact/run.go line 194 at r3 (raw file):

			})
			// TODO: need to give the GrantHandle to the writer so it can account on
			// all its goroutines.

I will do this in a follow-up PR, so that the main interface and refactoring can be merged first.

@sumeerbhola force-pushed the compaction_scheduler branch from 9e66e28 to a4a526f on March 2, 2025 16:20