db: introduce CompactionScheduler and integrate with DB #4297
base: master
Conversation
Consider this a prototype. Once I have some experimental results (with CockroachDB integration), I will start cleaning this up into smaller changes for review.
But it gives a good sense for the kind of changes this will entail.
Existing Pebble tests pass and stress meta ran for 3+ hours without failure.
Reviewable status: 0 of 22 files reviewed, all discussions resolved (waiting on @aadityasondhi)
Force-pushed 9070f6a to a8b60cf
Reviewable status: 0 of 22 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi)
compaction_scheduler.go
line 76 at r1 (raw file):
// by CompactionScheduler is long-lived (often executing for multiple
// seconds).
GetWaitingCompaction() (bool, WaitingCompaction)

In practice, would this be called more than once after TrySchedule? (if not, we can just pass the compaction information via TrySchedule; if yes, when? periodically?)
compaction_scheduler.go
line 77 at r1 (raw file):
// seconds).
GetWaitingCompaction() (bool, WaitingCompaction)
// Schedule grants the DB permission to run a compaction. The DB returns

Isn't this specifically permission to run the compaction it last returned via GetWaitingCompaction?
compaction_scheduler.go
line 90 at r1 (raw file):
// overall healthy LSM. This value can be compared across compactions and
// other long-lived work.
Optional bool
[nit] Isn't having a priority enough?
internal/base/compaction_grant_handle.go
line 26 at r1 (raw file):
// requirement to call it on the same goroutine that runs the compaction.
Started()
// MeasureCPU is called from each of the two goroutines that consume CPU
[nit] What does it do?
Force-pushed 0f515ce to 9c0418d
Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @RaduBerinde)
compaction_scheduler.go
line 76 at r1 (raw file):
Previously, RaduBerinde wrote…
In practice, would this be called more than once after TrySchedule? (if not, we can just pass the compaction information via TrySchedule; if yes, when? periodically?)
Yes, it can be called many times, with no bound. The scheduler could keep calling it whenever it has some capacity and pick some other long-lived work (another store's compaction, an incoming range snapshot). Hence the caching logic in pickedCompactionCache, which results in a worst-case overhead (discussed in a comment below).
The reason we don't pass this info in TrySchedule is that the scheduler has no way to know when it becomes stale. So it becomes Pebble's responsibility to provide the freshest state (either from the cache, or by picking a compaction) when asked.
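As a rough illustration of that caching idea, here is a hedged Go sketch (the pickCache name, its fields, and the use of a version counter are hypothetical; the actual pickedCompactionCache and pickedCompaction types in this PR will differ in detail):

```go
// Hypothetical sketch (not code from this PR): keep the most recently picked
// compaction and reuse it across repeated GetWaitingCompaction calls until
// the LSM version it was picked against changes, instead of re-running
// compaction picking every time.
type pickedCompaction struct{ /* inputs/outputs of the compaction picker */ }

type pickCache struct {
	pc   *pickedCompaction // cached pick; nil if none
	vers uint64            // version the pick was computed against
}

// get returns the cached pick if it is still valid for curVers.
func (c *pickCache) get(curVers uint64) (*pickedCompaction, bool) {
	if c.pc != nil && c.vers == curVers {
		return c.pc, true
	}
	return nil, false
}

// set caches a freshly picked compaction along with the version it saw.
func (c *pickCache) set(pc *pickedCompaction, vers uint64) {
	c.pc, c.vers = pc, vers
}

// invalidate drops the cached pick, e.g. once it has been handed out to run.
func (c *pickCache) invalidate() { c.pc = nil }
```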
compaction_scheduler.go
line 77 at r1 (raw file):
Previously, RaduBerinde wrote…
Isn't this specifically permission to run the compaction it last returned via GetWaitingCompaction?

Kind of. Pebble's state may have changed and it may run some other compaction. In practice, we are assuming that a DBForCompaction will run either what was returned in GetWaitingCompaction or something even more important.
compaction_scheduler.go
line 90 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] Isn't having a priority enough?
This part is very much subject to revision. The reason I went with this for the prototype is that we don't just have compactions that are vying for resources: incoming range snapshots also consume disk and cpu. And so we would need a way to come up with a prioritization scheme that worked across the two. Instead of that, we classify some Pebble compactions as optional, where this optional bit is decided by Pebble (not the integration code in CockroachDB), and then optional=true becomes the lowest importance. And when the optional bit is the same, compactions win over incoming range snapshots.
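To make that concrete, a hedged sketch of what WaitingCompaction could look like given this discussion (only the Optional field and part of its comment are quoted from the code above; the meaning of Optional, and the Priority and Score fields with their types, are inferred from the surrounding comments and may not match the actual struct):

```go
// WaitingCompaction describes the compaction the DB wants to run, so a
// CompactionScheduler can prioritize it against other long-lived work.
// Sketch only; the real struct in this PR may have different or extra fields.
type WaitingCompaction struct {
	// Optional is true for compactions that are not needed for an overall
	// healthy LSM (meaning inferred from the discussion). This value can be
	// compared across compactions and other long-lived work.
	Optional bool
	// Priority ranks work that has the same Optional value (type guessed).
	Priority int
	// Score breaks ties among work with the same Optional and Priority
	// values (type guessed).
	Score float64
}
```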
Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @sumeerbhola)
compaction_scheduler.go
line 77 at r1 (raw file):
Previously, sumeerbhola wrote…
Kind of. Pebble's state may have changed and it may run some other compaction. In practice, we are assuming that a DBForCompaction will run either what was returned in GetWaitingCompaction or something even more important.

I think that's reasonable. We could make that explicit: Schedule(CompactionGrantHandle, Priority): permission granted to schedule a compaction of priority at least as high as the argument.
compaction_scheduler.go
line 90 at r1 (raw file):
Previously, sumeerbhola wrote…
This part is very much subject to revision. The reason I went with this for the prototype is that we don't just have compactions that are vying for resources: incoming range snapshots also consume disk and cpu. And so we would need a way to come up with a prioritization scheme that worked across the two. Instead of that, we classify some Pebble compactions as optional, where this optional bit is decided by Pebble (not the integration code in CockroachDB), and then optional=true becomes the lowest importance. And when the optional bit is the same, compactions win over incoming range snapshots.
What I'm missing is why isn't that equivalent to having a single integer, where some of the integer range is only for optional compactions and some other part of the range is for non-optional compactions?
Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @RaduBerinde)
compaction_scheduler.go
line 90 at r1 (raw file):
Previously, RaduBerinde wrote…
What I'm missing is why isn't that equivalent to having a single integer, where some of the integer range is only for optional compactions and some other part of the range is for non-optional compactions?
That is doable. But note the comparator for the heap is a cleaner tuple comparison due to keeping optional as a separate field https://github.com/sumeerbhola/cockroach/blob/435f3332e75c89fa5ec27b981edf92110ee8eb32/pkg/util/admission/admit_long/granter.go#L942-L955 (from cockroachdb/cockroach#141501)
Reviewable status: 0 of 23 files reviewed, 4 unresolved discussions (waiting on @aadityasondhi and @sumeerbhola)
compaction_scheduler.go
line 90 at r1 (raw file):
Previously, sumeerbhola wrote…
That is doable. But note the comparator for the heap is a cleaner tuple comparison due to keeping optional as a separate field https://github.com/sumeerbhola/cockroach/blob/435f3332e75c89fa5ec27b981edf92110ee8eb32/pkg/util/admission/admit_long/granter.go#L942-L955 (from cockroachdb/cockroach#141501)
Understood. So we are comparing them like an (Optional, Priority, Score) tuple; it wasn't clear from the comments. Maybe say for priority: "it is only compared across compactions with the same Optional value" and for score: "it is only compared across compactions with the same Optional and Priority values." Or just say at the top that it represents, and compares like, an (Optional, Priority, Score) tuple.
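A minimal sketch of that tuple comparison, assuming the WaitingCompaction fields sketched in an earlier comment and assuming larger Priority and Score values mean more important work (the actual heap comparator lives in the linked CockroachDB branch and may use a different convention):

```go
// lessImportant reports whether a ranks below b when both are compared as
// (Optional, Priority, Score) tuples: non-optional work beats optional work;
// within the same Optional value, higher Priority wins; within the same
// Optional and Priority, higher Score wins.
func lessImportant(a, b WaitingCompaction) bool {
	if a.Optional != b.Optional {
		return a.Optional // optional work ranks below non-optional work
	}
	if a.Priority != b.Priority {
		return a.Priority < b.Priority
	}
	return a.Score < b.Score
}
```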
Force-pushed a8b60cf to 9e66e28
@RaduBerinde @jbowens
Now that the approach has been partially validated with the integration prototype in cockroachdb/cockroach#141501 and the results in cockroachdb/cockroach#136615 (comment), I'd like to get a detailed review and then merge this PR.
I owe a test for ConcurrencyLimitScheduler, which will be quite straightforward, so I'd like to wait until the rest is fully reviewed, since interface or other changes can affect ConcurrencyLimitScheduler.
Reviewable status: 0 of 23 files reviewed, 3 unresolved discussions (waiting on @aadityasondhi, @jbowens, and @RaduBerinde)
compaction_scheduler.go
line 77 at r1 (raw file):
Previously, RaduBerinde wrote…
I think that's reasonable. We could make that explicit: Schedule(CompactionGrantHandle, Priority): permission granted to schedule a compaction of priority at least as high as the argument.

We could. But it isn't clear that this race is worth complicating the interface and bookkeeping, since for Pebble compactions the importance of what the DB wants to run can't really decrease (there is just a backlog of work that increases).
compaction_scheduler.go
line 90 at r1 (raw file):
Previously, RaduBerinde wrote…
Understood. So we are comparing them like an (Optional, Priority, Score) tuple; it wasn't clear from the comments. Maybe say for priority: "it is only compared across compactions with the same Optional value" and for score: "it is only compared across compactions with the same Optional and Priority values." Or just say at the top that it represents, and compares like, an (Optional, Priority, Score) tuple.
Done
internal/base/compaction_grant_handle.go
line 26 at r1 (raw file):
Previously, RaduBerinde wrote…
[nit] What does it do?
Done.
internal/compact/run.go
line 194 at r3 (raw file):
})
// TODO: need to give the GrantHandle to the writer so it can account on
// all its goroutines.

I will do this in a followup PR, so that the main interface and refactoring can be merged first.
CompactionScheduler is an interface that encompasses (a) our current compaction scheduling behavior, (b) compaction scheduling in a multi-store setting that adds a per-node limit in addition to the per-store limit, and prioritizes across stores, (c) compaction scheduling that includes (b) plus is aware of resource usage and can prioritize across stores and across other long-lived work in addition to compactions (e.g. range snapshot reception).

CompactionScheduler calls into DB and the DB calls into the CompactionScheduler. This requires some care in specification of the synchronization expectations, to avoid deadlock. For the most part, the complexity is borne by the CompactionScheduler -- see the code comments for details.

ConcurrencyLimitScheduler is an implementation for (a), and is paired with a single DB. It has no knowledge of delete-only compactions, so we have redefined the meaning of Options.MaxConcurrentCompactions, as discussed in the code comment.

CompactionScheduler has some related interfaces/structs:
- CompactionGrantHandle is used to report the start and end of the compaction, and to frequently report the written bytes and CPU consumption. In the implementation of CompactionGrantHandle provided by CockroachDB's AC component, the CPU consumption will use the grunning package.
- WaitingCompaction is a struct used to prioritize the DB's compaction relative to other long-lived work (including compactions by other DBs). makeWaitingCompaction is a helper that constructs this struct.

For integrating the CompactionScheduler with DB, there are a number of changes:
- The entry paths to ask to schedule a compaction are reduced to 1, by removing DB.maybeScheduleCompactionPicker. The only path is DB.maybeScheduleCompaction.
- versionSet.{curCompactionConcurrency,pickedCompactionCache} are added to satisfy the interface expected by CompactionScheduler. Specifically, pickedCompactionCache allows us to safely cache a pickedCompaction that cannot be run. There is some commentary on the worst-case waste in compaction picking -- with the default ConcurrencyLimitScheduler on average there should be no wastage.
- versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two atomic variables introduced to implement DB.GetAllowedWithoutPermission, which allows the DB to adjust what minimum compaction concurrency it desires based on the backlog of automatic and manual compactions. The encoded logic is meant to be equivalent to our current logic.

The CompactionSlot and CompactionLimiter introduced in a recent PR are deleted. CompactionGrantHandle is analogous to CompactionSlot, and allows for measuring of CPU usage since the implementation of CompactionScheduler in AC will explicitly monitor usage and capacity. CompactionScheduler is analogous to CompactionLimiter. CompactionLimiter had a non-queueing interface in that it never called into the DB. This worked since the only events that allowed another compaction to run were also ones that caused another call to maybeScheduleCompaction. This is not true when a CompactionScheduler is scheduling across multiple DBs, or managing a compaction and other long-lived work (snapshot reception), since something unrelated to the DB can cause resources to become available to run a compaction.

There is a partial implementation of a resource aware scheduler in https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs cockroachdb#3813, cockroachdb/cockroach#74697, cockroachdb#1329
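Pulling together the names mentioned in this PR and thread, a rough Go sketch of how these pieces relate follows. The type and method names (CompactionScheduler, DBForCompaction, CompactionGrantHandle, WaitingCompaction, TrySchedule, GetWaitingCompaction, Schedule, Started, MeasureCPU) come from the discussion above, but the exact signatures, return types, parameter types, and any additional methods are guesses and may not match the actual code:

```go
// Package sketch holds a hypothetical outline of the scheduling interfaces
// described in this PR; it is not the code from the PR.
package sketch

// WaitingCompaction carries the (Optional, Priority, Score) information used
// to prioritize the DB's next compaction against other long-lived work
// (field set inferred from the review thread).
type WaitingCompaction struct {
	Optional bool
	Priority int
	Score    float64
}

// CompactionScheduler decides when a DB may run a compaction, possibly
// weighing it against other stores' compactions and other long-lived work
// such as incoming range snapshots.
type CompactionScheduler interface {
	// TrySchedule is called by the DB when it wants to run a compaction.
	// Signature guessed: if permission is granted immediately, the returned
	// handle is used to report progress; otherwise the scheduler will call
	// DBForCompaction.Schedule later, when capacity frees up.
	TrySchedule() (bool, CompactionGrantHandle)
}

// DBForCompaction is the interface the scheduler uses to call back into the DB.
type DBForCompaction interface {
	// GetWaitingCompaction reports whether the DB has a compaction it wants
	// to run, and how to prioritize it. It can be called repeatedly, since
	// work granted by the CompactionScheduler is long-lived.
	GetWaitingCompaction() (bool, WaitingCompaction)
	// Schedule grants the DB permission to run a compaction. (Return value
	// guessed; roughly, whether the DB actually ran one.)
	Schedule(CompactionGrantHandle) bool
}

// CompactionGrantHandle is used by the DB to report the start and end of a
// granted compaction, and to periodically report written bytes and CPU use.
type CompactionGrantHandle interface {
	// Started is called when the compaction begins executing; there is no
	// requirement to call it on the same goroutine that runs the compaction.
	Started()
	// MeasureCPU is called from each of the two goroutines that consume CPU
	// during the compaction (parameter guessed).
	MeasureCPU(goroutineIdx int)
	// Done is called when the compaction finishes (name guessed; the PR also
	// describes reporting written bytes, omitted here).
	Done()
}
```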
Force-pushed 9e66e28 to a4a526f