
db: automatically tune compaction concurrency based on available CPU/disk headroom and read-amp #1329

Open
sumeerbhola opened this issue Oct 7, 2021 · 26 comments

@sumeerbhola
Collaborator

sumeerbhola commented Oct 7, 2021

This issue was originally about the need to reduce tuning knobs in Pebble that govern performance more generally. Over time, some of these have been adjusted, or have had adjustments considered. The one major one that remains is MaxConcurrentCompactions, which gets set to min(NumCPU(), 3), fewer concurrent compactions than many beefy nodes with fast NVMe drives can handle. Pebble, along with Cockroach's admission control, should be able to schedule additional compactions in the presence of greater CPU and disk IO headroom to allow for greater compaction concurrency, without necessitating manual operator intervention to unlock additional performance.

There's a similar case to be made for increasing compaction concurrency even in the presence of heavy foreground write traffic, as it reduces the work necessary to incorporate future foreground write traffic into a well-formed LSM. Even in this case, adaptive increases in compaction concurrency will yield better foreground write performance in the longer run and can be considered, even though it might seem counterintuitive in the moment to take away more disk / CPU bandwidth for a background operation.

Original text of the issue follows below the horizontal line.


Pebble has various tuning knobs that affect its CPU and disk IOPS/bandwidth utilization. Examples include L0CompactionConcurrency, CompactionDebtConcurrency, MaxConcurrentCompactions, DeleteRangeFlushDelay, MinDeletionRate.
Some of these, like MinDeletionRate and DeleteRangeFlushDelay, affect how fast we reclaim disk space, so they are possibly not very interesting in most circumstances -- it is relatively easy to properly provision disk byte capacity.

The compaction ones affect how many concurrent compactions we can run, and being able to run more concurrent compactions would allow Pebble to handle a higher write throughput without having a misshapen LSM. CockroachDB accepts the default value for these knobs except for MaxConcurrentCompactions, which is set to min(numCPU, 3). These knob values are likely sub-optimal in most deployments: (a) the limit of 3 is likely to be too low on large machines, (b) even 3 compactions could be too many on a large machine if there is a spike in read-only user-facing traffic that wants to consume all CPU, or if disk IOPS/bandwidth are saturated due to user-facing traffic.
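For concreteness, a minimal sketch of that static default, assuming nothing beyond the standard library (the function name is made up; the real option wiring in Pebble/CockroachDB differs):

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultMaxConcurrentCompactions illustrates the static policy described
// above: cap concurrent compactions at min(NumCPU, 3), regardless of the
// actual CPU/disk headroom on the node.
func defaultMaxConcurrentCompactions() int {
	n := runtime.NumCPU()
	if n > 3 {
		n = 3
	}
	return n
}

func main() {
	fmt.Println("max concurrent compactions:", defaultMaxConcurrentCompactions())
}
```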

We have discussed ways to speed up compactions in the past by parallelizing them into multiple sub-compactions that partition the key span of the compaction. More recently, there are ideas to parallelize the compression/decompression of sstable blocks within a compaction.

These compaction-related ideas have the following need in common: detect the current resource utilization (or overload of a resource), and use that to adjust the amount of concurrent background work, so as to not affect foreground traffic. It is possible that such adjustments would need to be fine-grained and slow down an already started compaction.

Detecting resource utilization or overload for CPU is relatively easy -- the CockroachDB context already has hooks that look at goroutine scheduler state. Disk IOPS/bandwidth is harder: log commit latency is one signal, but it may not always be a good indicator, so we would need to run some overload experiments to understand this better.

Jira issue: PEBBLE-119

@bananabrick
Contributor

DeleteRangeFlushDelay and MinDeletionRate have default values of 0, and if we're not resetting those in cockroach (unless I'm missing where these are set), then we're not currently using the functionality they provide.

@sumeerbhola
Collaborator Author

DeleteRangeFlushDelay and MinDeletionRate have default values of 0, and if we're not resetting those in cockroach (unless I'm missing where these are set), then we're not currently using the functionality they provide.

We are setting these here https://github.com/cockroachdb/cockroach/blob/9208567a280d61e0621c154eff6e39ed900f77c0/pkg/storage/pebble.go#L346-L353

@sumeerbhola
Collaborator Author

The very high-level idea is to have some kind of centralized scheduler/pacer that is consulted before starting some background work, and at periodic intervals during that work. This could also be invoked for other background work like table stats and table validation. But the hard issues here are in working out the details.

@bananabrick
Contributor

bananabrick commented Oct 20, 2021

I ran some file system level benchmarks to see whether sampling sync latencies could be a way to detect high IO utilization, and whether we could rate limit based on that (by throttling writes when utilization is high).

Machine used:
aws c5d.4xlarge (the same machine which is used to run our nightly benchmarks).

Methodology
I ran each benchmark for 100 seconds, and each benchmark measures the latencies of syncing 1MB to disk while there are goroutines running in the background doing other file system level operations. Each benchmark differs only in the background operation being run.

Code used to run the benchmark: #1344
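For reference, a minimal sketch of that style of harness, assuming a plain Go program against the standard library (the real benchmark is in #1344; file names, iteration counts, and sizes here are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir := os.TempDir() // stand-in for the data directory under test

	// Background goroutine: unthrottled 4KB writes to a separate file.
	go func() {
		bg, err := os.Create(filepath.Join(dir, "bg-writes"))
		if err != nil {
			panic(err)
		}
		buf := make([]byte, 4<<10)
		for {
			if _, err := bg.Write(buf); err != nil {
				panic(err)
			}
		}
	}()

	// Foreground loop: time write+sync of 1MB payloads.
	fg, err := os.Create(filepath.Join(dir, "fg-sync"))
	if err != nil {
		panic(err)
	}
	defer fg.Close()
	payload := make([]byte, 1<<20)
	for i := 0; i < 100; i++ {
		start := time.Now()
		if _, err := fg.Write(payload); err != nil {
			panic(err)
		}
		if err := fg.Sync(); err != nil {
			panic(err)
		}
		fmt.Printf("write_sync: %v\n", time.Since(start))
	}
}
```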

Results

  1. No background ops.
____optype__elapsed_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
write_sync   100.0s       33573            335.7 0.825039 0.819199  0.950271  1.114111  4.456447
  2. A background goroutine which writes 4KB to the file in a loop with no rate limiting.
____optype__elapsed_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
write_sync   100.0s         3322             33.2 27.791096 0.884735 48.234495 1207.959551 2147.483647

The latencies increase quite a bit, but this is what they look like when the benchmark is running.

____optype__elapsed__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
write_sync      41s            8.0           30.7 20.971519 2550.136831 2550.136831 2550.136831
write_sync      42s            6.0           30.1 62.914559 260.046847 260.046847 260.046847
write_sync      43s            2.0           29.5 62.914559 1342.177279 1342.177279 1342.177279
write_sync      44s            0.0           28.8 0.000000 0.000000 0.000000 0.000000
write_sync      45s            0.0           28.2 0.000000 0.000000 0.000000 0.000000
write_sync      46s            7.0           27.7 41.943039 2818.572287 2818.572287 2818.572287
write_sync      47s            4.0           27.2 71.303167 318.767103 318.767103 318.767103
write_sync      48s            1.0           26.7 1409.286143 1409.286143 1409.286143 1409.286143
write_sync      49s            0.0           26.1 0.000000 0.000000 0.000000 0.000000
write_sync      50s            1.0           25.6 1275.068415 1275.068415 1275.068415 1275.068415
write_sync      51s            6.0           25.2 54.525951 1409.286143 1409.286143 1409.286143
write_sync      52s            1.0           24.8 201.326591 201.326591 201.326591 201.326591

The latencies vary quite a bit, but they remain high.

Any thoughts on other benchmarks I could run to determine if sync latencies are a good metric to use? I'm thinking I could use other kinds of background ops to determine if sync latencies always increase.

I'm also going to look into pebble's batch commit latencies to see if that's a good metric to use.

@itsbilal
Contributor

That sounds like an interesting experiment, thanks for running it! I would expect batch commit latencies to be an even better metric to measure user-observed write latencies for two reasons:

  1. it's directly along the write hot path and gives us a good measure of everything in the commit pipeline, and
  2. It measures both disk writes and syncs, and ends up being less FS-dependent

It's trickier to decide what to do when write latencies go up. Do we want compaction concurrency to go down to temporarily allow write-amp to go down, while also having a second mechanism to dial it up again if L0 read-amp goes unreasonably high? Or is admission control going to reduce the need for that second mechanism?

@bananabrick
Contributor

bananabrick commented Oct 20, 2021

Thanks for the response, since it's forcing me to think about what exactly my plan is here.

My worry with the batch commit latency is that it won't reflect actual IO utilization which is my goal. It might, if I normalize the batch commit latency by the batch size, but I haven't looked at the code here yet, so there might be more complexities with using batch commit latencies.

It's trickier to decide what to do when write latencies go up. Do we want compaction concurrency to go down to temporarily allow write-amp to go down, while also having a second mechanism to dial it up again if L0 read-amp goes unreasonably high?

I was thinking this would be a bit more generic, and so I haven't thought about compactions specifically. And my plan was to have confidence that any metric we use actually reflects our IO utilization, and not look at metrics like read amp or write amps.

Here's my idea which is high level, and I've yet to explore this.

Let's assume that we have some latency metric L which accurately reflects IO utilization. So, let's assume that when L increases we are doing more IO, and when L is stable or small we have more available IO.

Now we will sample L, and keep a moving average E, a deviation metric S, both more heavily weighted towards more recent values of L.

Let's say the rate limiter has some variable bytesPerSec, which determines the write throughput, and that bytesPerSec has min/max values.

We'll also have some max threshold for L called T, which will depend on both E and S. For example, it could be T = E + 10 * S (i.e., average + 10 * deviation). Now, we can have an algorithm which increases bytesPerSec if L is within T, and decreases bytesPerSec if L is greater than T.
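A rough sketch of that control loop in Go, assuming the signal L is sampled in seconds; the smoothing weight, the growth/shrink factors, and the 10x deviation multiplier are placeholders, not tuned values:

```go
package main

import "fmt"

// rateController adjusts a background-write budget (bytesPerSec) from samples
// of a latency signal L: keep an exponentially weighted average E and
// deviation S of L, and compare each new sample against T = E + 10*S.
type rateController struct {
	bytesPerSec      float64 // current budget for background writes
	minRate, maxRate float64
	avg, dev         float64 // E and S
	alpha            float64 // weight given to recent samples
	initialized      bool
}

func (c *rateController) observe(latencySec float64) {
	if !c.initialized {
		c.avg, c.initialized = latencySec, true
		return
	}
	diff := latencySec - c.avg
	c.avg += c.alpha * diff
	if diff < 0 {
		diff = -diff
	}
	c.dev += c.alpha * (diff - c.dev)

	threshold := c.avg + 10*c.dev // T = E + 10*S
	if latencySec <= threshold {
		c.bytesPerSec *= 1.1 // within T: grow the budget
	} else {
		c.bytesPerSec *= 0.8 // above T: shrink the budget
	}
	if c.bytesPerSec < c.minRate {
		c.bytesPerSec = c.minRate
	}
	if c.bytesPerSec > c.maxRate {
		c.bytesPerSec = c.maxRate
	}
}

func main() {
	c := &rateController{bytesPerSec: 32 << 20, minRate: 4 << 20, maxRate: 512 << 20, alpha: 0.2}
	for _, l := range []float64{0.001, 0.001, 0.002, 0.050, 0.001} {
		c.observe(l)
		fmt.Printf("L=%.3fs -> bytesPerSec=%.0f\n", l, c.bytesPerSec)
	}
}
```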

For compactions specifically, in this approach, we won't limit compactions at all if we have bytesPerSec available. One problem I can think of is a scenario where there are so many writes happening that L becomes high, and we therefore throttle compactions/flushes, which causes the memtables to fill up quickly. I don't know if that's a problem or not. Do we want to throttle compactions/flushes if there are many writes happening, but there's no IO available?

@itsbilal
Contributor

We both ask a similar question at the end about what to actually throttle when the problematic metric worsens, and I think the story there is going to be incomplete without admission control. We can do (or at least explore) two different types of throttles:

  1. Throttle down memtable writes when write IO latency is high beyond a high-watermark amount, to let compactions catch up.
  2. Throttle up concurrent compactions when write IO latency is low below a low-watermark amount (suggesting additional available resources), and retain existing checks to increase it when L0 read-amp is high.

You could change the code so the first compaction with no other compactions running is always scheduled, bypassing the pacer(s) completely. We could pace subsequent compactions - and that pacing can be mathematically determined by 1) how high the IO latency is, and 2) how low the L0 read-amp is. So high write-latency, low L0 read-amp gets throttled the most, low write-latency and high L0 read-amp gets the most concurrent compactions, and somewhere in between gets a moderate amount of pacing.

(Side note: the current pacer(s) only look at compaction debt, not L0 read-amp, so this might be a quick change to try out if you're experimenting with pacers).

The general idea is we're trying to compact a bit more aggressively when we have the resources to do so and can easily spend that cost without affecting user writes. And we want the ability to let read-amp fall behind to favour user writes when those are affected.

As for the choice of the metric, my concern was that syncs on files had the possibility of only measuring part of what we actually want to reduce, which is "are writes having to wait due to starved out IO". If a theoretical batch commit metric goes up because the average batch has to wait longer to apply due to batches ahead of it, that is an even earlier indicator of us wanting to become more write-friendly than actually letting IO starve out and then affect syncs or something.

All of the above is rough because I do expect it to be tuned by / guided by experimentation, so the experimentation you're doing is very valuable in helping make that decision. If fsync latency ends up being a solid indicator, I'm happy to go with it too.

@itsbilal
Contributor

On further thought: we probably never want to throttle writes to memtables (as that’ll just get more KV operations to queue up in memory and increase the chances of OOM), and we could probably leave flushes unpaced; there can only be one flush at a time anyway.

The better place for control of incoming write traffic is going to be admission control. Other than compaction concurrency and compaction pacing I’m not sure if we have control over that many throughput valves in Pebble.

The other thing that I’m trying to stay away from is having variables (whether automatic or manually set) around provisioned IOPS or throughput in MB/s. We might see noise from other stores, other processes, other “noisy neighbours” which we see from time to time when benchmarking on AWS, and we need to be able to react to that instead of trying to use up a theoretical target throughput.

@bananabrick
Contributor

bananabrick commented Oct 20, 2021

Thanks for the feedback! My idea was only at a high level, so it will be difficult to give concrete answers without further experimentation.

We both ask a similar question at the end about what to actually throttle when the problematic metric worsens, and I think the story there is going to be incomplete without admission control.

I want to throttle or increase bytesPerSecond according to L. Whatever mechanism we want to throttle will make calls into the rate limiter. It could be either flushes, or compactions, or both, or even other pathways which issue writes to the disk. We can have some notion of priority to make sure that compactions/flushes get higher priority, if we want.

I'm not trying to auto-tune the compaction concurrency, although, maybe that's something we should consider.

  1. Throttle down memtable writes when write IO latency is high beyond a high-watermark amount, to let compactions catch up.

I don't think we should be throttling foreground traffic using this rate limiter.

2. Throttle up concurrent compactions when write IO latency is low below a low-watermark amount (suggesting additional available resources), and retain existing checks to increase it when L0 read-amp is high.

I was thinking of making this rate limiter centre around the bytesPerSecond. I'm not sure if concurrent compactions are relevant. If we write both a cpu/io based rate limiter, then we can potentially get rid of the concurrent compactions variable.

We could pace subsequent compactions - and that pacing can be mathematically determined by 1) how high the IO latency is, and 2) how low the L0 read-amp is.

My understanding is that low L0 read-amp shouldn't limit compaction throughput. It should only be limited by not having enough IO resources available.

The general idea is we're trying to compact a bit more aggressively when we have the resources to do so and can easily spend that cost without affecting user writes. And we want the ability to let read-amp fall behind to favour user writes when those are affected.

This should happen naturally using the mechanism I described. If we don't have IO resources, then background write throughput will be limited. I'm guessing eventually, we'll stall user writes as the memtables fill up, which will free up some IO resources.

As for the choice of the metric, my concern was that syncs on files had the possibility of only measuring part of what we actually want to reduce, which is "are writes having to wait due to starved out IO". If a theoretical batch commit metric goes up because the average batch has to wait longer to apply due to batches ahead of it, that is an even earlier indicator of us wanting to become more write-friendly than actually letting IO starve out and then affect syncs or something.

I agree that file sync latency isn't guaranteed to work, but I'll be doing more benchmarking for that. I haven't looked at the batch commit code/that entire pipeline, but to me, batch commit latency depends on pebble level bottlenecks too, and it seems like we aren't directly measuring the filesystem/ssd level bottlenecks, which is what we want.

I'll have to read more about our batch commit mechanisms. I'm not against the idea at all. It'll probably be simpler to use that if it works anyway.

The other thing that I’m trying to stay away from is having variables (whether automatic or manually set) around provisioned IOPS or throughput in MB/s. We might see noise from other stores, other processes, other “noisy neighbours” which we see from time to time when benchmarking on AWS, and we need to be able to react to that instead of trying to use up a theoretical target throughput.

If you're referring to the bytesPerSecond variable, that's auto tuned according to the latencies, so I wouldn't say it's a theoretical limit. It'll be as high as it can be without causing increased latencies. I think RocksDB also has a similar variable in its rate limiter.

Other stores aren't an issue, because the rate limiter will be shared amongst all the stores, just like the block/table caches. Even if it weren't, I think bytesPerSecond should still just be the maximum throughput achievable without causing increased latencies.

@itsbilal
Contributor

The principle that Sumeer mentioned in the standup and something that I'm also thinking along the lines of, is twofold:

  1. Leave existing compaction processes largely untouched and unthrottled, like they already are, with the exception of concurrent compactions beyond the first one (more on this later). Leaving flushes and memtables/WAL writes unthrottled is also a good idea, as that's closest to the write-hot path observed by the user.
  2. Schedule additional compactions in a measured amount, when necessary (high L0 read-amp, as we already do by increasing compaction concurrency), OR when feasible (writes far from saturation). Any additional compactions beyond the first one could have the pacer set.

The definition of what "writes far from saturation" is, is something that's going to come from your experimentation, and what you're describing around bytesPerSecond and L will likely work once we validate it with experiments.

The reason why I keep bringing up compaction concurrency is because it's the most direct (and maybe the only?) way of actually increasing compaction throughput: do more of them at the same time! The change that Peter made to reduce compaction concurrency to below MaxConcurrentCompactions showed a significant improvement in write-heavy workloads, because concurrent compactions can steal write IOPS from WAL writes and flushes (which we don't want to slow down for the aforementioned reasons). So we probably want to reduce compaction concurrency when we see user writes slowing down (using whatever metric and thresholds you find most helpful), and increase it again if compactions fall too far behind (this mechanism already exists; it looks at L0 sublevel count).
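For context, a simplified sketch of the kind of heuristic being referenced here (not the actual Pebble picker code; the parameter names mirror the option names discussed earlier in this issue):

```go
package main

import "fmt"

// allowedCompactions is a simplified sketch of the existing heuristic: start
// from one compaction and allow more as L0 read-amp (sublevel count) and
// compaction debt grow, capped by MaxConcurrentCompactions.
func allowedCompactions(l0Sublevels int, compactionDebt, debtConcurrency uint64,
	l0CompactionConcurrency, maxConcurrent int) int {
	n := 1 + l0Sublevels/l0CompactionConcurrency
	if d := 1 + int(compactionDebt/debtConcurrency); d > n {
		n = d
	}
	if n > maxConcurrent {
		n = maxConcurrent
	}
	return n
}

func main() {
	// e.g. 12 L0 sublevels, 4GB of debt, default-ish knobs, and a cap of 3:
	// the cap is what binds.
	fmt.Println(allowedCompactions(12, 4<<30, 1<<30, 10, 3))
}
```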

I know it seems unintuitive at first that we'd actually slow compactions when we observe more user writes, but we're just lowering write-amp by not duplicating too much effort.

My understanding is that low L0 read-amp shouldn't limit compaction throughput. It should only be limited by not having enough IO resources available.

I see what you mean, but I worry it might be too reactive. If we don't look at L0 read-amp and just schedule the max number of concurrent compactions always, and then we observe writes slow down, our only option is to pace down existing compactions, which (if there are a lot of them) are going to hold up new compactions as you can't conflict with running compactions. The advantage of reacting to high user writes as early as possible is that we can sidestep this conflict. Letting one compaction run at full speed and not start new ones is cleaner, more CPU/IO efficient, and blocks future compactions less.

Other stores aren't an issue, because the rate limiter will be shared amongst all the stores, just like the block/table caches.

Sharing the pacer(s) is probably not a good idea. The stores could all be on separate disks, or they could not. Each store should figure out its own pacing. We can share config variables, but we don't want a hot store to starve out writes from a cold store.

An additional thread of thought that we haven't brought up yet: measuring just how much we actually need to worry about reads. Can we look into iterator counts / stats to see how aggressively we need to react to high read-amp? This might be a whole new Pandora's box, so we can split that into a separate conversation / issue.

@bananabrick
Contributor

bananabrick commented Oct 28, 2021

I did some more benchmarking of ycsb/F/1024 while observing the IO latencies and throughputs.

1. ycsb/F/1024 with 10 concurrent workers.

[screenshot: disk metrics]

For some reason the benchmark starts off doing lots of writes with high w-amp, so we see the throughput of the ssd being saturated and the latencies spiking up to 100s of ms. Later the throughput decreases to about 50% of total capacity, and latencies become sub-millisecond.

The commit pipeline commit latency I was sampling was initially between 1-30 ms, but later drops below 5 ms.

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

[graph: pebble batch commit latencies]

2. ycsb/F/1024 with 25 concurrent workers.

[screenshot: disk metrics]

The benchmark starts off doing lots of writes with high w-amp, so we see the ssd being saturated and latencies spiking up again, but later the throughput decreases to ~80% of max throughput, and the latencies shown in the GCE console become lower.

After the initial increase in commit latencies, they consistently stay in the 0-5ms range.

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

[graph: pebble batch commit latencies]

3. ycsb/F/1024 with 100 concurrent workers.

[screenshot: disk metrics]

With 100 writers, we see the same initial increase in latency, and we see the throughput hitting the max throughput of the ssd, but the IO latencies are in the 1-4 ms range.

Since the ssd throughput is getting saturated, the commit pipeline commit latencies fluctuate between 0-30 ms, pretty randomly.

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

[graph: pebble batch commit latencies]

4. ycsb/F/1024 with 300 concurrent workers.

[screenshot: disk metrics]

With 300 writers, we see the same initial increase in latency, and we see the throughput hitting the max throughput of the ssd, but the IO latencies are still in the 1-4 ms range.

Since the ssd throughput is getting saturated, the commit pipeline commit latencies still fluctuate between 0-30+ ms, pretty randomly.

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

[graph: pebble batch commit latencies]

Note that the initial spike in latencies seems to correspond with a higher KB/IO_op in the graph. I'll also be plotting the commit latencies against time/concurrent workers, to see if there are any other patterns there.

@bananabrick
Contributor

bananabrick commented Oct 28, 2021

Turns out that commit latencies just spike up when there are a lot of writes happening in the system, but they don't accurately reflect any ssd-level bottlenecks. Maybe there are some variables there we can eliminate?

  1. ycsb/F/1024 with 300 concurrent workers, but with an ssd which can handle the load.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

[graph: pebble batch commit latencies]

There are no IO-level bottlenecks here, as seen from the throughput numbers and the fsync latencies, which are in the 0-2ms range. But the commit latencies are massive, which makes me think that there are other bottlenecks to batch commit latencies.

The fluctuation in the batch commit latencies can be explained by the constant write stalls, I think.

@itsbilal
Contributor

Your last question can likely be answered by time spent in the pending queue: if a lot of writes are happening but the ssd can handle them, I can see the synchronization in the commit pipeline slowing things down more than the actual disk writes / syncs. If you want to leave the mechanism of the commit pipeline behind and just measure write performance, you could try measuring latencies of the actual WAL write (not sync) that happens in prepare - that might yield some interesting results. You could also normalize it by the bytes being written in that Write call, which should be easy given the buffer is right there.
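A hypothetical instrumentation sketch of that suggestion, timing the raw WAL write (not the sync) and normalizing by bytes written; the wrapper and the stand-in writer are made up for illustration:

```go
package main

import (
	"fmt"
	"io"
	"time"
)

// timedWriter is a hypothetical wrapper around the WAL's underlying file
// writer that records per-call write latency, normalized by bytes written.
type timedWriter struct {
	w io.Writer
}

func (t *timedWriter) Write(p []byte) (int, error) {
	start := time.Now()
	n, err := t.w.Write(p)
	elapsed := time.Since(start)
	if n > 0 {
		fmt.Printf("wal write: %d bytes in %v (%v/byte)\n", n, elapsed, elapsed/time.Duration(n))
	}
	return n, err
}

// discard is a stand-in for the real WAL file.
type discard struct{}

func (discard) Write(p []byte) (int, error) { return len(p), nil }

func main() {
	tw := &timedWriter{w: discard{}}
	if _, err := tw.Write(make([]byte, 32<<10)); err != nil {
		panic(err)
	}
}
```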

The other question is whether a "too many writes are happening" signal is useful on its own too, and I'm not sure if it is - we're more worried about resource starvation. But maybe someone else has other thoughts.

@bananabrick
Contributor

@itsbilal thanks, I'll look into those aspects.

I ran ycsb/F/1024 with 300 concurrent workers again, after removing the pebble-level write stall bottleneck (by increasing compaction concurrency), to see what the latencies look like.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

[graph: pebble batch commit latencies]

I don't think commit pipeline latencies are usable, but the specific write/sync latencies in the pipeline should be usable.

@bananabrick
Contributor

bananabrick commented Nov 4, 2021

Summarizing the direction this is headed, and also summarizing the discussion in this doc: https://docs.google.com/document/d/1wYqLFBqfqX18FzC1f4rEkK-VDtB1ggNQBdeQwrCLrI4/edit.

The high level idea is that we want to use low IO latencies along the pebble write paths to determine if it's okay to increase the compaction concurrency. So, we'll be auto tuning compaction concurrency based on write path IO latencies.

The graphs/benchmarks in this issue were used to observe IO latencies under various ssd loads.

Some of the concerns are:

  1. Sync/write latencies depend on the size of the data being written.
    • Not sure what the solution is here, but in the worst case we just won't increase compactions if sync latencies are too high.
  2. If we hardcode any latency thresholds, then they won't work perfectly across different ssds with different latencies.
    • They'll still work on faster ssds, and we just won't auto tune for slower devices.
  3. We want to make sure that increasing compactions/doing more IO won't slow down read performance.
    • Need to run experiments for this.
  4. What write/read latencies are "good enough" for pebble, where "good enough" latencies won't bottleneck the rest of pebble?

@bananabrick
Contributor

bananabrick commented Nov 5, 2021

Posting some results with 64 byte values.

ycsb/F/64 with 10 concurrent workers, on a single local ssd.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

ycsb/F/64 with 25 concurrent workers, on a single local ssd.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

ycsb/F/64 with 50 concurrent workers, on a single local ssd.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

ycsb/F/64 with 100 concurrent workers, on a single local ssd.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

ycsb/F/64 with 300 concurrent workers, on a single local ssd.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

ycsb/F/64 with 500 concurrent workers, on a single local ssd.

[screenshot: disk metrics]

[graph: pebble ops/sec]

[graph: pebble fsync latencies]

@bananabrick
Contributor

bananabrick commented Nov 8, 2021

I thought a bit about the ways to get past the issue of sync latencies depending on the size of the data being synced, and different ssds/file systems having different performance characteristics. This problem needs to be solved, as it prevents auto tuning from working on a majority of the machines.

Proposal:
Assuming that there is only one cockroach process running on a single machine, cockroach will call into a pebble function preOpen before calling Open even once. The preOpen function will issue some commands to the file system and sample latencies, and return a struct S which contains the latency information. The struct S will then be passed in as a pebble option when Open is called, to any Pebble engine which is created from cockroach.

The struct S will contain latency information for the hardware under low (is this guaranteed?) resource utilization, because preOpen samples latencies before any Pebble engine is created.

preOpen will sample sync latencies for different data sizes. For example, we can sample latencies for sizes of 10KB, 100KB, 250KB, 500KB, 1MB. This will give us a baseline of how fast the machine is, and will also give us a baseline for sync latencies at different data sizes.

The tradeoff here is that the preOpen function will take some time t to run, which slows down the startup time of cockroach by t. We could also call preOpen in parallel with the rest of the cockroach startup. What values of t are acceptable? I think t = 10 seconds could work, but I haven't tested this. Maybe even t = 1 second of sampling will work, but I haven't tested it yet.
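A rough sketch of what preOpen's sampling could look like, with the function name and the returned struct being the hypothetical ones from the proposal above (a real version would take multiple samples per size and bound the total time spent):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// BaselineSyncLatencies is the hypothetical struct S from the proposal:
// per-size write+sync latencies measured before any Pebble engine is open.
type BaselineSyncLatencies map[int]time.Duration

// preOpen samples write+sync latency for a few payload sizes in dir and
// returns the baseline.
func preOpen(dir string) (BaselineSyncLatencies, error) {
	sizes := []int{10 << 10, 100 << 10, 250 << 10, 500 << 10, 1 << 20}
	out := BaselineSyncLatencies{}
	f, err := os.Create(filepath.Join(dir, "pre-open-probe"))
	if err != nil {
		return nil, err
	}
	defer f.Close()
	defer os.Remove(f.Name())
	for _, sz := range sizes {
		buf := make([]byte, sz)
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			return nil, err
		}
		if err := f.Sync(); err != nil {
			return nil, err
		}
		out[sz] = time.Since(start)
	}
	return out, nil
}

func main() {
	s, err := preOpen(os.TempDir())
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```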

@sumeerbhola
Collaborator Author

Regarding auto tuning, I agree this is a hard problem. And I don't necessarily think we need auto tuning for the first version. But if we do decide to do something here, here is a rough summary of my current thinking on various signal options (including latency) and how suitable they are for auto tuning.

  • WALMinSyncInterval set to 0, but could be increased by a CockroachDB cluster setting. Want a signal that is robust to different settings of this, under the assumption that the frequency of altering this setting in a running system is extremely low.

  • Assume the workload in terms of the size of batches and the frequency of batches requesting sync changes slowly over time, and we don't need to worry about it when considering signal values within the last 1min.

  • Signal options:

    • Measure latency of write+sync in LogWriter.flushPending. We need to accumulate these latency values across many samples. Difficulties:
      • Samples may vary in size. If queue is building up due to slowness the byte size of these samples will also increase, so they are not simply a function of the user workload.
      • Some samples sync and some don't, which affects the latency.
    • Write throughput measured over time interval T in LogWriter.flushLoop by only timing the parts where work is being done. Ignore whether work is syncing or not. Difficulties:
      • Write throughput may not immediately decline due to sync latency increase since the LogWriter can increase the batch and amortize the sync over larger number of bytes.
  • What is the baseline "low load" value for the signal? Difficulties:

    • Noisy neighbor: don't know when it is present, so don't know the baseline.
    • Don't know if reducing compaction load will actually improve the signal. Maybe it is already as good as it can be because something has changed in the hardware etc.
  • Can we expect users to provide a threshold that represents the boundary of what is acceptable? Seems hard. If their workload is typically running with write+sync of B bytes taking 100ms, and they configure the threshold to 500ms, then we can decrease their commit throughput by 5x without realizing we have done something wrong. Usually users are not running close to the cliff in terms of no idle time in the LogWriter, but users typically don't know if they are, until something goes bad.

  • Another alternative would be measure work time/total time in the LogWriter.flushLoop. This would represent "utilization". If utilization were say < 80% we could increment compaction concurrency. We could measure utilization over 5s intervals and do exponential smoothing, so we don't react too quickly to a short write burst. This has the benefit that we are (a) directly measuring whether user writes are suffering, and (b) don't need a baseline low load value. Deficiencies:

    • Doesn't tell us whether reducing compactions will actually make these user writes faster. It is possible that the answer is no and the hardware provisioning is good enough that we should run with more compactions to absorb these writes.
    • Utilization can be equally high if we are writing and syncing 10 bytes in every loop, versus writing and syncing 100 bytes in every loop. That is, the fixed overhead is not well accounted for.

    We should consider measuring this utilization metric with some experiments with moderate to high load and see whether it is worth pursuing. I think the deficiencies could be addressed by using latency or throughput as a secondary signal (and can be addressed in future improvements).
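A sketch of that utilization signal, assuming we can time each piece of work done by the flush loop; the interval length, smoothing factor, and 80% threshold are the placeholders mentioned above:

```go
package main

import (
	"fmt"
	"time"
)

// walUtilization tracks (time spent doing flush work) / (elapsed time) over
// fixed intervals and exponentially smooths it.
type walUtilization struct {
	intervalStart time.Time
	busy          time.Duration
	smoothed      float64
}

// recordWork is called with the duration of each write+sync done by the
// flush loop.
func (u *walUtilization) recordWork(d time.Duration) { u.busy += d }

// tick closes out an interval and returns the smoothed utilization.
func (u *walUtilization) tick(now time.Time) float64 {
	elapsed := now.Sub(u.intervalStart)
	if elapsed <= 0 {
		return u.smoothed
	}
	util := float64(u.busy) / float64(elapsed)
	const alpha = 0.5 // exponential smoothing weight
	u.smoothed = alpha*util + (1-alpha)*u.smoothed
	u.intervalStart, u.busy = now, 0
	return u.smoothed
}

func main() {
	u := &walUtilization{intervalStart: time.Now()}
	u.recordWork(2 * time.Second)                        // pretend the loop was busy for 2s...
	util := u.tick(u.intervalStart.Add(5 * time.Second)) // ...within a 5s interval
	fmt.Printf("smoothed utilization: %.2f\n", util)
	if util < 0.8 {
		fmt.Println("headroom: could increment compaction concurrency")
	}
}
```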

@sumeerbhola
Collaborator Author

measure work time/total time in the LogWriter.flushLoop. This would represent "utilization".

We can scale this time utilization, computed over an interval T, by the mean value of bytes written by flushPending/(blockSize * 16), which is the buffer utilization. This overall utilization number should not suffer from the second deficiency I mentioned above.
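In code form, the combined signal would look something like this (again just a sketch; blockSize here stands for the WAL block size):

```go
package main

import "fmt"

// overallUtilization combines the flush-loop time utilization with buffer
// utilization (mean bytes written per flushPending call relative to the
// 16-block buffer), as described above.
func overallUtilization(timeUtil, meanFlushBytes float64, blockSize int) float64 {
	bufferUtil := meanFlushBytes / float64(blockSize*16)
	if bufferUtil > 1 {
		bufferUtil = 1
	}
	return timeUtil * bufferUtil
}

func main() {
	// e.g. 90% busy, but each flush writes only ~128KB of a 16 x 32KB buffer.
	fmt.Printf("%.2f\n", overallUtilization(0.9, 128<<10, 32<<10))
}
```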

@sumeerbhola
Collaborator Author

The remaining weakness with this WAL utilization signal is:

Doesn't tell us whether reducing compactions will actually make these user writes faster. It is possible that the answer is no and the hardware provisioning is good enough that we should run with more compactions to absorb these writes.

I don't think this will actually be a problem -- here is a cleaner version of a claim from a side conversation with @bananabrick (and look at this with a critical eye, since there is a likelihood of error in reasoning):

  • Consider a scenario where the WAL utilization is high. The concern is that this could be high even if there is no resource bottleneck, because we have single-threaded WAL writing, and that we will needlessly not increase compaction concurrency (from our baseline of 3) just when we need that extra compaction concurrency.

  • For the WAL utilization to be high, the buffer utilization needs to be high. So we will be writing a lower bound of blockSize*16 for each sync. So at least 512KB per sync.

  • Consider a write amplification of W. For every byte written to the WAL we need to compact W to keep up.

  • Throughput of a single threaded compaction vs single threaded WAL write (LogWriter.flushLoop):

    • CPU: The cpu per byte written will be lower for the WAL write.
    • IO: The main difference is the syncs. Compactions also do sync_file_range every 512KB (Options.BytesPerSync), so the sync costs may not differ much.

    Let R be the ratio of WAL byte throughput to a single compaction write throughput, when there is no resource bottleneck (and so syncs are not slower than usual).

  • For compactions to keep up with the WAL at high WAL utilization, we need R*W concurrent compactions. Being pessimistic about WAL throughput and optimistic about write amplification, say W=20 and R=0.5. That is 10 concurrent compactions. If high WAL utilization caps the auto-tuned compaction concurrency at 10, even though we are not resource bottlenecked, I would consider that a big win over our current cap of 3.
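The back-of-envelope from the last bullet, spelled out with the pessimistic/optimistic guesses above:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// W: write amplification; every WAL byte eventually costs ~W bytes of
	// compaction writes. R: WAL byte throughput relative to a single
	// compaction's write throughput.
	const W, R = 20.0, 0.5
	required := math.Ceil(R * W) // concurrent compactions needed to keep up
	fmt.Printf("compactions needed to keep up with a saturated WAL: %.0f\n", required)
}
```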

@nicktrav
Contributor

I did a re-read of the conversation so far, and I'm adding my thoughts on the implementation and the operational implications of our work here.

Regarding auto tuning, I agree this is a hard problem.

Agreed. I even wonder whether this is worth pursuing at the Pebble layer at all.

My own experience elsewhere with systems that auto-tune is that the operator will likely end up in a situation where they are "fighting" against the "intelligent system" that is trying to shift performance in one direction, but the operator wants it to do something different. There's inevitably a bug in the control loop, and debugging such a thing in the midst of an outage is painful.

Add in the fact that we have admission control at the Cockroach layer. Adding a layer of auto-tuning below this feels like we might be multiplying the complexity of the system. There's some irony here in that this issue is about reducing complexity by decreasing the number of "knobs". Instead we're keeping the knobs and letting some code fiddle with them in real-time.

A thought / suggestion: can we expose an API that allows the caller to a) query Pebble's observed view of its own performance (various IO-related metrics, and eventually CPU-bound metrics for compute heavy tasks), and b) alter the various knobs. For example, alter compaction concurrency, WAL sync interval, etc. My understanding is that many of these params can be set, but it's not clear to me whether altering them at runtime has any effect, or would be safe to do.

Presumably we'd then bake the "sentience" into admission control in Cockroach, which would call into this Pebble API and twiddle the knobs from there. An operator could also completely opt out of admission control and tune the knobs directly. Though presumably, this would be for very advanced and knowledgeable operators.
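Purely to make the suggestion concrete, a hypothetical shape for such an API; none of these names exist in Pebble today:

```go
package main

// PerfSnapshot is a hypothetical read-only view of Pebble's own assessment of
// its performance that a caller like admission control could poll.
type PerfSnapshot struct {
	WALUtilization      float64 // fraction of time the WAL flush loop is busy
	FsyncP99            float64 // seconds
	L0Sublevels         int
	CompactionDebtBytes uint64
}

// DynamicTuner is a hypothetical control surface: the caller observes the
// snapshot and adjusts the knobs it cares about, with Pebble applying the
// changes at runtime.
type DynamicTuner interface {
	Snapshot() PerfSnapshot
	SetMaxConcurrentCompactions(n int)
	SetWALMinSyncInterval(seconds float64)
}

func main() {}
```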

Separately, I was also thinking about whether it would be possible to directly query the operating system for the performance of the filesystem (and other things). On Linux, there's eBPF which gives you a means of hooking into various structures in kernel-land. This is like a looking glass, through which we'd be able to see at a very low level what is going on.

In exploring this idea, I think I convinced myself that this would be unsavory (and probably dangerous), for a few reasons - it's not portable (even if not using eBPF, we'd need multiple system-specific ways of querying for these metrics); it's a leaky abstraction (the metrics are system-wide and not specific to Pebble - e.g. should Pebble slow down if another process is eating up IO bandwidth?); it's wildly complex.

With that said, it seems like we're already agreed in the approach of using Pebble-specific metrics to expose and gauge how it is performing. I'll leave the above as just food for thought.

tldr: I'm not sold on auto-tuning in Pebble. Can we instead expose an API for dynamic control? Using metrics specific to Pebble seems like a decent approach of inferring Pebble's own performance / progress, rather than using raw FS / OS metrics.

@sumeerbhola
Collaborator Author

My own experience elsewhere with systems that auto-tune is that the operator will likely end up in a situation where they are "fighting" against the "intelligent system" that is trying to shift performance in one direction, but the operator wants it to do something different.

Agreed that this happens, though rarely. Adding a temporary override is an easy way to fix it.
But without auto-tuning, in my experience, one is worse off.

And exposing an "API for dynamic control" externalizes the complexity to a piece of code who may not have the requisite signals to make the right decision, and will need to understand the various control points in Pebble. For instance, the "WAL utilization" signal (which I am very hopeful about) mentioned in #1329 (comment) has to be gathered inside Pebble.

btw, I don't think this will interfere with admission control. The background work tuning will not try to come anywhere close to 100% utilization, while admission control is trying to do queueing and relative prioritization of work when utilization is close to 100%.

@itsbilal
Contributor

Ah, this conversation has progressed quite a bit since I last engaged in it.

One additional thing on the "too much automation can fight against the operator" worry: while I agree with it as a general concern, we've narrowly scoped this project to just the divvying up of additional IO resources when we know it'd likely be both a good idea and a safe one to use them. In particular, our focus seems to be on tuning compaction concurrency. We will still fall back to the same baseline as before under any sort of IO-constrained situation, so finishing this project still gets us a net win.

All the additional knobs we're talking about just control how aggressive or conservative we are with assuming we have excess resources available, but our baseline should still be "good enough" to get us out of any hairy resource-constrained situation.

It's also generally true that we have very few user-documented or user-visible knobs in Pebble to start, so whether we add more or remove more doesn't make too big a difference. If anything, I interpreted the higher level goal of this project to be "have Pebble make better decisions on its own, and tune Pebble's existing knobs better by default so they don't leave performance on the table".

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Jun 13, 2022
The first commit is from 82440

We assume that:
- There is a provisioned known limit on the sum of read and write
  bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes
  also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since
  we do not have observability into what reads missed the OS page cache.
  That is, we don't know how much of the reads were due to incoming reads
  (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when
  it affects actual writes in the system, since compaction backlog can
  build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal
  queues that can build up due to resource overload. For instance,
  different examples of bandwidth saturation exhibit wildly different
  latency effects, presumably because the queue buildup is different. So it
  is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite
complicated (and incomplete), we adopt a goal of simplicity of design, and strong
abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be
  evolved independently.
- The approach uses easy to understand additive increase and multiplicative
  decrease, (unlike what we do for flush and compaction tokens, where we
  try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior,
we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by
  introducing a work-class (in addition to admissionpb.WorkPriority), or by
  implying a work-class from the priority (e.g. priorities < NormalPri are
  deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a
  ceiling on number of regular concurrent compactions, and if it needs more
  it can request resources for additional (optional) compactions. These
  latter compactions can be limited by this approach. See
  cockroachdb/pebble/issues/1329 for motivation.

The reader should start with disk_bandwidth.go, consisting of
- diskLoadWatcher: which computes load levels.
- compactionLimiter: which tracks all compaction slots and limits
  optional compactions.
- diskBandwidthLimiter: It composes the previous two objects and
  uses load information to limit write tokens for elastic writes
  and limit compactions.

There is significant refactoring and changes in granter.go and
work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for
  bytes into L0 and tokens for bytes into the LSM (the former being a subset
  of the latter).
- Elastic work is in a different WorkQueue than regular work, but they
  are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across
multiple requesters, via multiple child granters. A number of interfaces
are adjusted to make this viable. In general, the GrantCoordinator
is now slightly dumber and some of that logic is moved into the granters.

For the former (two kinds of tokens), I considered adding multiple
resource dimensions to the granter-requester interaction but found it
too complicated. Instead we rely on the observation that we can request
tokens based on the total incoming bytes of the request (not just L0),
and when the request is completed, can tell the granter how many bytes
went into L0. The latter allows us to return tokens to L0. There was
also the (unrelated) realization that we can use the information
of the size of the batch in the call to AdmittedWorkDone and fix
estimation that we had to make pre-evaluation. This resulted in a
bunch of changes to how we do estimation to adjust the tokens consumed:
we now estimate how much we need to compensate what is being asked
for at (a) admission time, (b) work done time, for the bytes added
to the LSM, (c) work done time, for the bytes added to L0. Since we
are asking for tokens at admission time based on the full incoming
bytes, the estimation for what fraction of an ingest goes into L0 is
eliminated. This had the consequence of simplifying some of the
estimation logic that was distinguishing writes from ingests.

There are no tests, so this code is probably littered with bugs.

Next steps:
- Unit tests
- Pebble changes for IntervalCompactionInfo
- CockroachDB changes for IntervalDiskLoadInfo
- Experimental evaluation and tuning
- Separate into multiple PRs for review
- KV and storage package plumbing for properly populating
  StoreWriteWorkInfo.{WriteBytes,IngestRequest} for ingestions and
  StoreWorkDoneInfo.{ActualBytes,ActualBytesIntoL0} for writes and
  ingestions.

Release note: None
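A minimal sketch of the additive-increase/multiplicative-decrease shaping described in the commit message above; the load-level enum and the constants are illustrative, not the actual admission-control code:

```go
package main

import "fmt"

type diskLoadLevel int

const (
	loadLow diskLoadLevel = iota
	loadModerate
	loadHigh
	loadOverload
)

// adjustElasticTokens applies additive increase / multiplicative decrease to
// the per-interval byte budget for elastic writes and optional compactions,
// based on the observed disk load level.
func adjustElasticTokens(tokens int64, level diskLoadLevel) int64 {
	switch level {
	case loadLow:
		tokens += 16 << 20 // additive increase: +16MB per interval
	case loadHigh, loadOverload:
		tokens /= 2 // multiplicative decrease
	}
	if tokens < 1<<20 {
		tokens = 1 << 20 // keep a small floor so elastic work is not fully starved
	}
	return tokens
}

func main() {
	t := int64(64 << 20)
	for _, lvl := range []diskLoadLevel{loadLow, loadLow, loadHigh, loadModerate} {
		t = adjustElasticTokens(t, lvl)
		fmt.Println(t)
	}
}
```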
@itsbilal itsbilal changed the title db: reduce tuning knobs that control resource utilization by background work db: automatically tune compaction concurrency based on available CPU/disk headroom and read-amp Jun 8, 2023
@jbowens jbowens moved this to 24.2 candidates in [Deprecated] Storage Jun 4, 2024
itsbilal added a commit to itsbilal/pebble that referenced this issue Dec 4, 2024
This change adds an interface that users of Pebble can implement
to integrate a compaction limiter that gets information about
compaction bytes read and bytes written. This change also threads
that interface (and a simple no-op implementation by default)
through the compaction and flush code paths.

Informs cockroachdb#1329.
itsbilal added a commit that referenced this issue Dec 6, 2024
This change adds an interface that users of Pebble can implement
to integrate a compaction limiter that gets information about
compaction bytes read and bytes written. This change also threads
that interface (and a simple no-op implementation by default)
through the compaction and flush code paths.

Informs #1329.
itsbilal added a commit to itsbilal/cockroach that referenced this issue Jan 8, 2025
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 27, 2025
CompactionScheduler is an interface that encompasses (a) our current
compaction scheduling behavior, (b) compaction scheduling in a multi-store
setting that adds a per-node limit in addition to the per-store limit, and
prioritizes across stores, (c) compaction scheduling that includes (b) plus
is aware of resource usage and can prioritize across stores and across
other long-lived work in addition to compactions (e.g. range snapshot
reception).

CompactionScheduler calls into DB and the DB calls into the
CompactionScheduler. This requires some care in specification of the
synchronization expectations, to avoid deadlock. For the most part, the
complexity is borne by the CompactionScheduler -- see the code comments
for details.

ConcurrencyLimitScheduler is an implementation for (a), and is paired with
a single DB. It has no knowledge of delete-only compactions, so we have
redefined the meaning of Options.MaxConcurrentCompactions, as discussed
in the code comment.

CompactionScheduler has some related interfaces/structs:
- CompactionGrantHandle is used to report the start and end of the
  compaction, and frequently report the written bytes, and CPU consumption.
  In the implementation of CompactionGrantHandle provided by CockroachDB's
  AC component, the CPU consumption will use the grunning package.
- WaitingCompaction is a struct used to prioritize the DB's compaction
  relative to other long-lived work (including compactions by other DBs).
  makeWaitingCompaction is a helper that constructs this struct.

For integrating the CompactionScheduler with DB, there are a number of
changes:
- The entry paths to ask to schedule a compaction are reduced to 1, by
  removing DB.maybeScheduleCompactionPicker. The only path is
  DB.maybeScheduleCompaction.
- versionSet.{curCompactionConcurrency,pickedCompactionCache} are added
  to satisfy the interface expected by CompactionScheduler. Specifically,
  pickedCompactionCache allows us to safely cache a pickedCompaction
  that cannot be run. There is some commentary on the worst-case waste
  in compaction picking -- with the default ConcurrencyLimitScheduler
  on average there should be no wastage.
- versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two
  atomic variables introduced to implement DB.GetAllowedWithoutPermission,
  which allows the DB to adjust what minimum compaction concurrency it
  desires based on the backlog of automatic and manual compactions. The
  encoded logic is meant to be equivalent to our current logic.

The CompactionSlot and CompactionLimiter introduced in a recent PR are
deleted. CompactionGrantHandle is analogous to CompactionSlot, and allows for
measuring of CPU usage since the implementation of CompactionScheduler in AC
will explicitly monitor usage and capacity. CompactionScheduler is analogous to
CompactionLimiter. CompactionLimiter had a non-queueing interface in
that it never called into the DB. This worked since the only events that
allowed another compaction to run were also ones that caused another
call to maybeScheduleCompaction. This is not true when a
CompactionScheduler is scheduling across multiple DBs, or managing a
compaction and other long-lived work (snapshot reception), since something
unrelated to the DB can cause resources to become available to run a
compaction.

There is a partial implementation of a resource aware scheduler in
https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs cockroachdb#3813, cockroachdb/cockroach#74697, cockroachdb#1329
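A very rough, hypothetical sketch of the two-way relationship described in that commit message; the actual interfaces in the PR differ in names and details:

```go
package main

// compactionScheduler is a hypothetical two-way interface: the DB asks for
// permission to run a compaction, and the scheduler can later call back into
// the DB when resources free up.
type compactionScheduler interface {
	// TrySchedule asks whether a compaction may start now; a non-nil grant
	// handle means the DB runs the compaction and reports progress and
	// completion on the handle.
	TrySchedule(waiting waitingCompaction) grantHandle
}

// waitingCompaction carries the information used to prioritize this DB's
// compaction relative to other long-lived work.
type waitingCompaction struct {
	priority  int
	debtBytes uint64
}

// grantHandle is how a running compaction reports written bytes and CPU
// usage, and signals completion.
type grantHandle interface {
	ReportProgress(writtenBytes, cpuNanos int64)
	Done()
}

// schedulableDB is the callback surface the scheduler uses to offer a
// previously-denied DB a chance to run a compaction.
type schedulableDB interface {
	Schedule(h grantHandle) (ran bool)
}

func main() {}
```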
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 27, 2025
CompactionScheduler is an interface that encompasses (a) our current
compaction scheduling behavior, (b) compaction scheduling in a multi-store
setting that adds a per-node limit in addition to the per-store limit, and
prioritizes across stores, (c) compaction scheduling that includes (b) plus
is aware of resource usage and can prioritize across stores and across
other long-lived work in addition to compactions (e.g. range snapshot
reception).

CompactionScheduler calls into DB and the DB calls into the
CompactionScheduler. This requires some care in specification of the
synchronization expectations, to avoid deadlock. For the most part, the
complexity is borne by the CompactionScheduler -- see the code comments
for details.

ConcurrencyLimitScheduler is an implementation for (a), and is paired with
a single DB. It has no knowledge of delete-only compactions, so we have
redefined the meaning of Options.MaxConcurrentCompactions, as discussed
in the code comment.

CompactionScheduler has some related interfaces/structs:
- CompactionGrantHandle is used to report the start and end of the
  compaction, and frequently report the written bytes, and CPU consumption.
  In the implementation of CompactionGrantHandle provided by CockroachDB's
  AC component, the CPU consumption will use the grunning package.
- WaitingCompaction is a struct used to prioritize the DB's compaction
  relative to other long-lived work (including compactions by other DBs).
  makeWaitingCompaction is a helper that constructs this struct.

For integrating the CompactionScheduler with DB, there are a number of
changes:
- The entry paths to ask to schedule a compaction are reduced to 1, by
  removing DB.maybeScheduleCompactionPicker. The only path is
  DB.maybeScheduleCompaction.
- versionSet.{curCompactionConcurrency,pickedCompactionCache} are added
  to satisfy the interface expected by CompactionScheduler. Specifically,
  pickedCompactionCache allows us to safely cache a pickedCompaction
  that cannot be run. There is some commentary on the worst-case waste
  in compaction picking -- with the default ConcurrencyLimitScheduler
  on average there should be no wastage.
- versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two
  atomic variables introduced to implement DB.GetAllowedWithoutPermission,
  which allows the DB to adjust what minimum compaction concurrency it
  desires based on the backlog of automatic and manual compactions. The
  encoded logic is meant to be equivalent to our current logic.

The CompactionSlot and CompactionLimiter introduced in a recent PR are
deleted. CompactionGrantHandle is analogous to CompactionSlot, and allows for
measuring of CPU usage since the implementation of CompactionScheduler in AC
will explicitly monitor usage and capacity. CompactionScheduler is analogous to
CompactionLimiter. CompactionLimiter had a non-queueing interface in
that it never called into the DB. This worked since the only events that
allowed another compaction to run were also ones that caused another
call to maybeScheduleCompaction. This is not true when a
CompactionScheduler is scheduling across multiple DBs, or managing a
compaction and other long-lived work (snapshot reception), since something
unrelated to the DB can cause resources to become available to run a
compaction.

There is a partial implementation of a resource-aware scheduler in
https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.
Informs cockroachdb#3813, cockroachdb/cockroach#74697, cockroachdb#1329
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 5, 2025
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 5, 2025
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 5, 2025
@sumeerbhola sumeerbhola self-assigned this Feb 14, 2025
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Feb 28, 2025
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Mar 2, 2025