kv,storage: re-consider compaction concurrency for multi-store nodes #74697

Open
irfansharif opened this issue Jan 11, 2022 · 5 comments
Assignees
sumeerbhola
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster T-storage Storage Team

Comments

@irfansharif (Contributor) commented Jan 11, 2022

Describe the problem

We use a default of 3 cores per store to run compactions (see COCKROACH_ROCKSDB_CONCURRENCY). For multi-store setups with insufficient cores, that may be far too many. It may also be that we want to update our guidance on the number of cores recommended for a given number of stores. In a recent escalation we observed that a high store count + compaction debt + a low core count led to a large percentage of the nodes' total CPU being consumed entirely by compactions. The CPU being pegged in this manner was disruptive to foreground traffic.

Currently the compaction concurrency for a store defaults to min(3, numCPUs). This isn't multi-store-aware at all, as we could have a lot of CPUs but not enough to give every store 3 of them for concurrent compactions.

Expected behavior

Automatic configuration of compaction concurrency to min(3, numCPUs/numStores) at the very least. Guidance for what an appropriate number of cores is for a given number of stores. Or compaction concurrency that reflects the total number of cores available relative to the total number of stores (presumably after experimentation of our own).
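For illustration, a minimal sketch of that default, assuming we also floor at one compaction per store; the function name and the floor of 1 are mine, not existing CockroachDB code:

```go
package main

import "fmt"

// defaultCompactionConcurrency sketches the store-count-aware default
// suggested above: min(3, numCPUs/numStores), floored at 1 so every
// store can always run at least one compaction. The name and the floor
// are assumptions for illustration.
func defaultCompactionConcurrency(numCPUs, numStores int) int {
	if numStores < 1 {
		numStores = 1
	}
	perStore := numCPUs / numStores
	if perStore < 1 {
		perStore = 1
	}
	if perStore > 3 {
		perStore = 3
	}
	return perStore
}

func main() {
	// 8 vCPUs spread across 8 stores: 1 concurrent compaction per store
	// instead of today's 3.
	fmt.Println(defaultCompactionConcurrency(8, 8)) // 1
	// 32 vCPUs across 4 stores: capped at the existing default of 3.
	fmt.Println(defaultCompactionConcurrency(32, 4)) // 3
}
```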

Jira issue: CRDB-12216

Epic CRDB-41111

@irfansharif irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. labels Jan 11, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Jan 11, 2022
@jbowens (Collaborator) commented Jan 11, 2022

Linking this to cockroachdb/pebble#1329, the broader issue of adjusting resource utilization of background Pebble tasks.

Each store has independent disk-bandwidth and IOPS constraints, but CPU is shared. I think we'll need something adaptive, like what's discussed in cockroachdb/pebble#1329, to avoid saturating CPU while also sufficiently utilizing disk bandwidth.

@sumeerbhola (Collaborator)

For a non-adaptive solution, we could simply have a shared limit across stores. The difficulty is how to roll this out to existing CockroachDB users that have clusters with multiple stores. Presumably they have already fiddled with the individual store setting (or are fine with the default) -- we don't want them to suddenly have reduced concurrency. We could have something that only applies to new clusters, but that seems error prone.
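A minimal sketch of what such a non-adaptive shared limit could look like, assuming a node-wide semaphore that every store acquires before starting a compaction; the type and function names here are illustrative, not existing Pebble/CockroachDB code:

```go
package main

import "fmt"

// sharedCompactionLimiter is an illustrative node-wide cap on concurrent
// compactions, shared by every store on the node. Each store acquires a
// slot before starting a compaction and releases it afterwards.
type sharedCompactionLimiter struct {
	slots chan struct{}
}

func newSharedCompactionLimiter(maxConcurrent int) *sharedCompactionLimiter {
	return &sharedCompactionLimiter{slots: make(chan struct{}, maxConcurrent)}
}

// TryAcquire reports whether a compaction slot was available and, if so,
// claims it.
func (l *sharedCompactionLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release returns a previously acquired slot.
func (l *sharedCompactionLimiter) Release() { <-l.slots }

func main() {
	// One limiter per node, handed to all stores at startup; each store's
	// own MaxConcurrentCompactions setting would remain as a local ceiling.
	lim := newSharedCompactionLimiter(3)
	if lim.TryAcquire() {
		fmt.Println("store 1: compaction admitted")
		lim.Release()
	}
}
```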

@jbowens jbowens moved this to 24.2 candidates in [Deprecated] Storage Jun 4, 2024
@BabuSrithar BabuSrithar added the O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster label Jul 26, 2024
@itsbilal (Contributor)

More context on the O-testcluster label: we've hit the issue of high CPU usage with compactions on multi-store DRT clusters and had to dial down compaction concurrency manually. Ideally this would be automated, so at least every store's max compaction concurrency setting gets set to min(3, numCPUs/numStores) as opposed to the current min(3, numCPUs).

@nameisbhaskar (Contributor)

Archive.zip
Uploading the CPU profiles of drt-large node 1. More details in the thread - https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1722423058416819

@itsbilal (Contributor) commented Aug 1, 2024

I did a quick analysis of large1.cpuprof.2024-07-29T23_58_53.227.80.pprof in the above comment, coming off of the drt-large cluster's n1. Looking at the Pebble logs from the node itself, I see that an avg of 4 concurrent compactions were live on the node in the 10 clock-seconds (= 160 cpu-seconds) the profile spans.

That would mean 40 cpu-seconds would go towards compactions in the profile if all a compaction did was CPU work. Instead, we see 36 profiled cpu-seconds in runCompaction, and of those 36, ~2s are in fread and ~2s in fwrite, leaving 32 cpu-seconds of non-IO CPU work, or around 80% of the 40s. From this we can estimate that roughly 80% of a compaction is CPU time, assuming sufficiently fast disks, which seems to be the case on drt-large because we have more NVMe local SSD bandwidth than we can drive with our (limited) CPUs.
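Restating that arithmetic as a small, self-contained snippet (every input is a number already quoted above; nothing new is assumed):

```go
package main

import "fmt"

func main() {
	// All inputs below are the numbers quoted in the comment above.
	const (
		avgConcurrentCompactions = 4.0  // from the node's Pebble logs
		profileClockSeconds      = 10.0 // wall-clock span of the profile
		runCompactionCPUSeconds  = 36.0 // profiled cpu-s in runCompaction
		freadCPUSeconds          = 2.0  // profiled cpu-s in fread
		fwriteCPUSeconds         = 2.0  // profiled cpu-s in fwrite
	)
	// Upper bound if a compaction were purely CPU work: 4 * 10 = 40 cpu-s.
	upperBound := avgConcurrentCompactions * profileClockSeconds
	// Non-IO CPU time attributed to compactions: 36 - 2 - 2 = 32 cpu-s.
	nonIO := runCompactionCPUSeconds - freadCPUSeconds - fwriteCPUSeconds
	// 32 / 40 = 0.8, i.e. ~80% of a compaction is CPU time.
	fmt.Printf("estimated CPU fraction of a compaction: %.0f%%\n", nonIO/upperBound*100)
}
```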

80% CPU utilization in a compaction does seem fairly high, but looking at where the CPU time is spent, it makes sense: most of it goes to decoding blocks and snappy-decompressing them, then encoding the write-side blocks and snappy-compressing them. I don't think the 80% estimate is significantly off from the true amount of CPU time spent in compactions, although on other clusters/machines where we drive IO/disk utilization higher than we do on drt-large, the CPU fraction is likely lower.

This estimate could be useful in trying to determine how to divvy-up CPUs for concurrent compactions on nodes that have a lot of stores.

anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this issue Aug 28, 2024
This change adds a new compaction pool which enforces a global max
compaction concurrency in a multi-store configuration. Each Pebble store
(i.e. an instance of *DB) still maintains its own per-store compaction
concurrency which is controlled by `opts.MaxConcurrentCompactions`.
However, in a multi-store configuration, disk I/O is a per-store resource
while CPU is shared across stores. A significant portion of compaction
is CPU-intensive, and so this ensures that excessive compactions don't
interrupt foreground CPU tasks even if the disks are capable of handling
the additional throughput from those compactions.

The shared compaction concurrency only applies to automatic compactions.
This means that delete-only compactions are excluded because they are
expected to be cheap, as are flushes because they should never be
blocked.

Fixes: cockroachdb#3813
Informs: cockroachdb/cockroach#74697
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 27, 2025
CompactionScheduler is an interface that encompasses (a) our current
compaction scheduling behavior, (b) compaction scheduling in a multi-store
setting that adds a per-node limit in addition to the per-store limit, and
prioritizes across stores, (c) compaction scheduling that includes (b) plus
is aware of resource usage and can prioritize across stores and across
other long-lived work in addition to compactions (e.g. range snapshot
reception).

CompactionScheduler calls into DB and the DB calls into the
CompactionScheduler. This requires some care in specification of the
synchronization expectations, to avoid deadlock. For the most part, the
complexity is borne by the CompactionScheduler -- see the code comments
for details.

ConcurrencyLimitScheduler is an implementation for (a), and is paired with
a single DB. It has no knowledge of delete-only compactions, so we have
redefined the meaning of Options.MaxConcurrentCompactions, as discussed
in the code comment.

CompactionScheduler has some related interfaces/structs:
- CompactionGrantHandle is used to report the start and end of the
  compaction, and frequently report the written bytes, and CPU consumption.
  In the implementation of CompactionGrantHandle provided by CockroachDB's
  AC component, the CPU consumption will use the grunning package.
- WaitingCompaction is a struct used to prioritize the DB's compaction
  relative to other long-lived work (including compactions by other DBs).
  makeWaitingCompaction is a helper that constructs this struct.

For integrating the CompactionScheduler with DB, there are a number of
changes:
- The entry paths to ask to schedule a compaction are reduced to 1, by
  removing DB.maybeScheduleCompactionPicker. The only path is
  DB.maybeScheduleCompaction.
- versionSet.{curCompactionConcurrency,pickedCompactionCache} are added
  to satisfy the interface expected by CompactionScheduler. Specifically,
  pickedCompactionCache allows us to safely cache a pickedCompaction
  that cannot be run. There is some commentary on the worst-case waste
  in compaction picking -- with the default ConcurrencyLimitScheduler
  on average there should be no wastage.
- versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two
  atomic variables introduced to implement DB.GetAllowedWithoutPermission,
  which allows the DB to adjust what minimum compaction concurrency it
  desires based on the backlog of automatic and manual compactions. The
  encoded logic is meant to be equivalent to our current logic.

The CompactionSlot and CompactionLimiter introduced in a recent PR are
deleted. CompactionGrantHandle is analogous to CompactionSlot, and allows for
measuring of CPU usage since the implementation of CompactionScheduler in AC
will explicitly monitor usage and capacity. CompactionScheduler is analogous to
CompactionLimiter. CompactionLimiter had a non-queueing interface in
that it never called into the DB. This worked since the only events that
allowed another compaction to run were also ones that caused another
call to maybeScheduleCompaction. This is not true when a
CompactionScheduler is scheduling across multiple DBs, or managing a
compaction and other long-lived work (snapshot reception), since something
unrelated to the DB can cause resources to become available to run a
compaction.

There is a partial implementation of a resource aware scheduler in
https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs cockroachdb#3813, cockroachdb/cockroach#74697, cockroachdb#1329
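Going purely off the description in this commit message, the sketch below shows one possible shape for these interfaces; aside from the type names and GetAllowedWithoutPermission, which the message itself mentions, every method name and signature is an illustrative guess, not the actual Pebble API:

```go
// Package pebblesketch is illustrative only; none of these declarations
// are the actual Pebble API, just a reading of the commit message above.
package pebblesketch

// CompactionScheduler captures the scheduling behavior described above:
// the DB asks the scheduler for permission to run compactions, and the
// scheduler calls back into the DB when capacity frees up.
type CompactionScheduler interface {
	// Register associates a DB with the scheduler (signature assumed).
	Register(db DBForCompaction)
	// TrySchedule asks whether a waiting compaction may run now; if it
	// cannot, the scheduler is expected to call back into the DB later.
	TrySchedule(wc WaitingCompaction) (CompactionGrantHandle, bool)
}

// CompactionGrantHandle reports the start and end of a granted compaction
// and periodically reports written bytes and CPU consumption (per the
// commit message, CockroachDB's AC implementation would use grunning).
type CompactionGrantHandle interface {
	Started()
	CumulativeStats(writtenBytes uint64, cpuNanos int64)
	Done()
}

// WaitingCompaction carries whatever the scheduler needs to prioritize
// this DB's compaction against other long-lived work; the field here is
// a placeholder.
type WaitingCompaction struct {
	Score float64
}

// DBForCompaction is the callback surface a DB exposes to the scheduler.
type DBForCompaction interface {
	// GetAllowedWithoutPermission is named in the commit message; the
	// return value here is a guess (a minimum desired concurrency).
	GetAllowedWithoutPermission() int
	// RunScheduledCompaction is a hypothetical callback the scheduler
	// would use to tell the DB it may start a compaction now.
	RunScheduledCompaction(grant CompactionGrantHandle) bool
}
```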
@sumeerbhola sumeerbhola self-assigned this Feb 14, 2025