db: automatically tune compaction concurrency based on available CPU/disk headroom and read-amp #1329
Comments
DeleteRangeFlushDelay and MinDeletionRate have default values of 0, and if we're not resetting those in cockroach (unless I'm missing where these are set), then we're not currently using the functionality they provide.
We are setting these here: https://github.com/cockroachdb/cockroach/blob/9208567a280d61e0621c154eff6e39ed900f77c0/pkg/storage/pebble.go#L346-L353
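For context, a minimal sketch of how those knobs are wired up when opening a Pebble store. The values are approximate and the field placement (under Experimental) reflects the Pebble version in use at the time; these fields may have moved or been renamed in newer Pebble releases:

```go
package main

import (
	"log"
	"time"

	"github.com/cockroachdb/pebble"
)

func main() {
	opts := &pebble.Options{}
	// Roughly what the linked CockroachDB code does (values approximate):
	// delay flushes forced by range deletions, and pace obsolete-file
	// deletion instead of deleting everything at once.
	opts.Experimental.DeleteRangeFlushDelay = 10 * time.Second
	opts.Experimental.MinDeletionRate = 128 << 20 // ~128 MB/s

	db, err := pebble.Open("demo-db", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```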
The very high-level idea is to have some kind of centralized scheduler/pacer that is consulted before starting some background work, and at periodic intervals during that work. This could also be invoked for other background work like table stats and table validation. But the hard issues here are in working out the details.
I ran some file system level benchmarks to see whether sampling sync latencies could be a way to detect high IO utilization, and whether we could rate limit based on that (by throttling writes when utilization is high).

Machine used:
Methodology:
Code used to run the benchmark: #1344
Results:
The latencies increase quite a bit, but this is what they look like when the benchmark is running.
The latencies vary quite a bit, but they remain high. Any thoughts on other benchmarks I could run to determine if sync latencies are a good metric to use? I'm thinking I could use other kinds of background ops to determine if sync latencies are always increasing. I'm also going to look into Pebble's batch commit latencies to see if that's a good metric to use.
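For illustration of the kind of sampling involved (this is not the benchmark code from #1344), a minimal sketch that measures fsync latency while appending to a file:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	f, err := os.OpenFile("synctest.dat", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 64<<10) // 64 KB per write, an arbitrary choice
	for i := 0; i < 100; i++ {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		start := time.Now()
		if err := f.Sync(); err != nil {
			panic(err)
		}
		// The sampled sync latency; a real pacer would feed this into a
		// moving average or histogram rather than printing it.
		fmt.Printf("sync %d: %v\n", i, time.Since(start))
	}
}
```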
That sounds like an interesting experiment, thanks for running it! I would expect batch commit latencies to be an even better metric to measure user-observed write latencies for two reasons:
It's trickier to decide what to do when write latencies go up. Do we want compaction concurrency to go down to temporarily allow write-amp to go down, while also having a second mechanism to dial it up again if L0 read-amp goes unreasonably high? Or is admission control going to reduce the need for that second mechanism?
Thanks for the response, since it's forcing me to think about what exactly my plan is here. My worry with the batch commit latency is that it won't reflect actual IO utilization, which is my goal. It might, if I normalize the batch commit latency by the batch size, but I haven't looked at the code here yet, so there might be more complexities with using batch commit latencies.
I was thinking this would be a bit more generic, so I haven't thought about compactions specifically. My plan was to have confidence that any metric we use actually reflects our IO utilization, rather than looking at metrics like read-amp or write-amp. Here's my high-level idea, which I've yet to explore: assume we have some latency metric that we sample at regular intervals. The rate limiter keeps a variable controlling how much background write work it admits, along with a max threshold for the sampled latency. For compactions specifically, in this approach, we won't limit compactions at all while the sampled latency indicates we have headroom.
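As a purely hypothetical sketch of the shape of such a limiter (all names and constants here are made up, not Pebble APIs): sample the latency metric each interval, and grow or shrink a budget for background writes depending on whether the sample exceeds the threshold.

```go
package main

import (
	"fmt"
	"time"
)

// ioPacer is a hypothetical sketch; none of these names are Pebble APIs.
type ioPacer struct {
	maxLatency   time.Duration // threshold above which IO is considered saturated
	allowedBytes int64         // per-interval budget for background writes
	minBytes     int64
	maxBytes     int64
}

// update is called once per sampling interval with the observed latency
// (e.g. a smoothed fsync or WAL-write latency).
func (p *ioPacer) update(sampled time.Duration) {
	if sampled > p.maxLatency {
		// IO looks saturated: halve the background-write budget.
		p.allowedBytes /= 2
		if p.allowedBytes < p.minBytes {
			p.allowedBytes = p.minBytes
		}
		return
	}
	// Headroom available: additively grow the budget back toward the max.
	p.allowedBytes += 4 << 20
	if p.allowedBytes > p.maxBytes {
		p.allowedBytes = p.maxBytes
	}
}

func main() {
	p := &ioPacer{maxLatency: 10 * time.Millisecond, allowedBytes: 64 << 20, minBytes: 1 << 20, maxBytes: 256 << 20}
	p.update(2 * time.Millisecond)  // under threshold: budget grows to 68 MB
	p.update(25 * time.Millisecond) // over threshold: budget halves to 34 MB
	fmt.Println(p.allowedBytes>>20, "MB budget for background writes")
}
```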
We both ask a similar question at the end about what to actually throttle when the problematic metric worsens, and I think the story there is going to be incomplete without admission control. We can do (or at least explore) two different types of throttles:
You could change the code so the first compaction with no other compactions running is always scheduled, bypassing the pacer(s) completely. We could pace subsequent compactions, and that pacing can be mathematically determined by 1) how high the IO latency is, and 2) how low the L0 read-amp is. So high write-latency, low L0 read-amp gets throttled the most; low write-latency, high L0 read-amp gets the most concurrent compactions; and somewhere in between gets a moderate amount of pacing. (Side note: the current pacer(s) only look at compaction debt, not L0 read-amp, so this might be a quick change to try out if you're experimenting with pacers.) The general idea is that we're trying to compact a bit more aggressively when we have the resources to do so and can easily spend that cost without affecting user writes, and we want the ability to let read-amp fall behind to favour user writes when those are affected.

As for the choice of the metric, my concern was that syncs on files had the possibility of only measuring part of what we actually want to reduce, which is "are writes having to wait due to starved-out IO". If a theoretical batch commit metric goes up because the average batch has to wait longer to apply due to batches ahead of it, that is an even earlier indicator of us wanting to become more write-friendly than actually letting IO starve out and then affect syncs. All of the above is rough because I do expect it to be tuned by / guided by experimentation, so the experimentation you're doing is very valuable in helping make that decision. If fsync latency ends up being a solid indicator, I'm happy to go with it too.
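To illustrate the two-signal pacing idea, here is a toy policy function; the thresholds and breakpoints are made up for the sketch and would need to come out of experimentation:

```go
package main

import (
	"fmt"
	"time"
)

// targetCompactionConcurrency is a hypothetical policy function, not Pebble code.
// maxConcurrency corresponds to something like MaxConcurrentCompactions.
func targetCompactionConcurrency(writeLatency time.Duration, l0ReadAmp, maxConcurrency int) int {
	// Always allow at least one compaction so we never stop compacting entirely.
	target := 1
	// High L0 read-amp argues for more concurrency; these breakpoints are arbitrary.
	switch {
	case l0ReadAmp > 20:
		target = maxConcurrency
	case l0ReadAmp > 10:
		target = (maxConcurrency + 1) / 2
	}
	// High write latency argues for less concurrency, unless read-amp is dire.
	if writeLatency > 20*time.Millisecond && l0ReadAmp <= 20 {
		target = 1
	}
	return target
}

func main() {
	fmt.Println(targetCompactionConcurrency(2*time.Millisecond, 25, 8))  // 8: L0 needs help, IO has headroom
	fmt.Println(targetCompactionConcurrency(50*time.Millisecond, 5, 8))  // 1: writes suffering, L0 fine
	fmt.Println(targetCompactionConcurrency(50*time.Millisecond, 25, 8)) // 8: L0 dire, compact anyway
}
```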
On further thought: we probably never want to throttle writes to memtables (as that'll just get more KV operations to queue up in memory and increase the chances of OOM), and we could probably have flushes remain unpaced; there can only be one flush at a time anyway. The better place for control of incoming write traffic is going to be admission control. Other than compaction concurrency and compaction pacing, I'm not sure we have control over that many throughput valves in Pebble. The other thing that I'm trying to stay away from is having variables (whether automatic or manually set) around provisioned IOPS or throughput in MB/s. We might see noise from other stores, other processes, and other "noisy neighbours" (which we see from time to time when benchmarking on AWS), and we need to be able to react to that instead of trying to use up a theoretical target throughput.
Thanks for the feedback! My idea was only at a high level, so it will be difficult to give concrete answers without further experimentation.
I want to throttle or increase background write throughput; I'm not trying to auto-tune the compaction concurrency, although maybe that's something we should consider.
I don't think we should be throttling foreground traffic using this rate limiter.
I was thinking of making this rate limiter centre around the
My understanding is that low L0 read-amp shouldn't limit compaction throughput. It should only be limited by not having enough IO resources available.
This should happen naturally using the mechanism I described. If we don't have IO resources, then background write throughput will be limited. I'm guessing eventually, we'll stall user writes as the memtables fill up, which will free up some IO resources.
I agree that file sync latency isn't guaranteed to work, but I'll be doing more benchmarking for that. I haven't looked at the batch commit code/that entire pipeline, but to me, batch commit latency depends on pebble level bottlenecks too, and it seems like we aren't directly measuring the filesystem/ssd level bottlenecks, which is what we want. I'll have to read more about our batch commit mechanisms. I'm not against the idea at all. It'll probably be simpler to use that if it works anyway.
If you're referring to noise from other stores, that isn't an issue, because the rate limiter will be shared amongst all the stores, just like the block/table caches.
The principle that Sumeer mentioned in the standup, and something that I'm also thinking along the lines of, is twofold:
The definition of what "writes far from saturation" means is something that's going to come from your experimentation and from what you're describing. The reason why I keep bringing up compaction concurrency is that it's the most direct (and maybe the only?) way of actually increasing compaction throughput: do more of them at the same time! The change that Peter made to reduce compaction concurrency in some situations is relevant here. I know it seems unintuitive at first that we'd actually slow compactions when we observe more user writes, but we're just lowering write-amp by not duplicating too much effort.
I see what you mean, but I worry it might be too reactive. If we don't look at L0 read-amp and just schedule the max number of concurrent compactions always, and then we observe writes slow down, our only option is to pace down existing compactions, which (if there are a lot of them) are going to hold up new compactions as you can't conflict with running compactions. The advantage of reacting to high user writes as early as possible is that we can sidestep this conflict. Letting one compaction run at full speed and not start new ones is cleaner, more CPU/IO efficient, and blocks future compactions less.
Sharing the pacer(s) is probably not a good idea. The stores could all be on separate disks, or they might not be. Each store should figure out its own pacing. We can share config variables, but we don't want a hot store to starve out writes from a cold store. An additional thread of thought that we haven't brought up yet: measuring just how much we actually need to worry about reads. Can we look into iterator counts / stats to see how aggressively we need to react to high read-amp? This might be a whole new Pandora's box, so we can split that into a separate conversation / issue.
I did some more benchmarking of ycsb/F/1024 while observing the IO latencies and throughputs.

1. ycsb/F/1024 with 10 concurrent workers. For some reason the benchmark starts off doing lots of writes with high w-amps, so we see the throughput of the ssd being saturated and the latencies spiking up to 100s of ms, but later the throughput decreases to about 50% of total capacity and latencies become sub-millisecond. The commit pipeline commit latency I was sampling was initially between 1-30 ms, but later becomes sub 5 ms.

2. ycsb/F/1024 with 25 concurrent workers. The benchmark starts doing lots of writes with high w-amps, so we see the ssd being saturated and latencies spiking up again, but later the throughput decreases to ~80% of max throughput, and the latencies shown through the GCE console become lower. After the initial increase in commit latencies, they consistently stay in the 0-5 ms range.

3. ycsb/F/1024 with 100 concurrent workers. With 100 writers, we see the same initial increase in latency, and the throughput hits the max throughput of the ssd, but the IO latencies are in the 1-4 ms range. Since the ssd throughput is getting saturated, the commit pipeline commit latencies fluctuate between 0-30 ms, pretty randomly.

4. ycsb/F/1024 with 300 concurrent workers. With 300 writers, we see the same initial increase in latency, and the throughput hits the max throughput of the ssd, but the IO latencies are still in the 1-4 ms range. Since the ssd throughput is getting saturated, the commit pipeline commit latencies still fluctuate between 0-30+ ms, pretty randomly.

Note that the initial spike in latencies seems to correspond with a higher KB/IO_op in the graph. I'll also be plotting the commit latencies against time/concurrent workers, to see if there are any other patterns there.
Turns out that commit latencies just spike up when there are a lot of writes happening in the system, but they don't accurately reflect any ssd-level bottlenecks. Maybe there are some variables there we can eliminate?
There are no IO-level bottlenecks here, as seen from the throughput numbers and the fsync latencies, which are in the 0-2 ms range. But the commit latencies are massive, which makes me think that there are other bottlenecks to batch commit latencies. The fluctuation in the batch commit latencies can be explained by the constant write stalls, I think.
Your last question can likely be answered by measuring where the time is spent. The other question is whether a "too many writes are happening" signal is useful on its own too, and I'm not sure if it is - we're more worried about resource starvation. But maybe someone else has other thoughts.
@itsbilal thanks, I'll look into those aspects. I ran ycsb/F/1024 with 300 concurrent workers again, after removing the Pebble-level write stall bottleneck (by increasing compaction concurrency), to see what the latencies look like. I don't think commit pipeline latencies are usable, but the specific write/sync latencies in the pipeline should be usable.
Summarizing the direction this is headed, and also summarizing the discussion in this doc: https://docs.google.com/document/d/1wYqLFBqfqX18FzC1f4rEkK-VDtB1ggNQBdeQwrCLrI4/edit. The high-level idea is that we want to use low IO latencies along the Pebble write paths to determine if it's okay to increase the compaction concurrency. So, we'll be auto-tuning compaction concurrency based on write-path IO latencies. The graphs/benchmarks in this issue were used to observe IO latencies under various ssd loads. Some of the concerns are:
Posting some results with 64 byte values.
- ycsb/F/64 with 10 concurrent workers, on a single local ssd.
- ycsb/F/64 with 25 concurrent workers, on a single local ssd.
- ycsb/F/64 with 50 concurrent workers, on a single local ssd.
- ycsb/F/64 with 100 concurrent workers, on a single local ssd.
- ycsb/F/64 with 300 concurrent workers, on a single local ssd.
- ycsb/F/64 with 500 concurrent workers, on a single local ssd.
I thought a bit about ways to get past the issue of sync latencies depending on the size of the data being synced, and of different ssds/file systems having different performance characteristics. This problem needs to be solved, as it prevents auto-tuning from working on a majority of machines. Proposal: track the number of bytes being synced alongside each sync's duration, so the latency signal can be normalized.
There is a tradeoff here.
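A rough sketch of one way to do the normalization, smoothing sync latency per MB synced with an exponentially weighted moving average; the struct and its fields are assumptions, not the struct proposed above:

```go
package main

import (
	"fmt"
	"time"
)

// syncLatencyTracker keeps a smoothed estimate of sync latency per MB synced,
// so that large syncs on a slow disk and small syncs on a fast disk become
// roughly comparable. Purely illustrative.
type syncLatencyTracker struct {
	perMB float64 // EWMA of milliseconds per MB synced
	alpha float64 // smoothing factor in (0, 1]
}

func (t *syncLatencyTracker) record(bytes int64, d time.Duration) {
	if bytes == 0 {
		return
	}
	mb := float64(bytes) / (1 << 20)
	sample := float64(d.Milliseconds()) / mb
	if t.perMB == 0 {
		t.perMB = sample
		return
	}
	t.perMB = t.alpha*sample + (1-t.alpha)*t.perMB
}

func main() {
	t := &syncLatencyTracker{alpha: 0.2}
	t.record(4<<20, 8*time.Millisecond) // 2 ms/MB sample
	t.record(1<<20, 4*time.Millisecond) // 4 ms/MB sample, smoothed in
	fmt.Printf("smoothed sync latency: %.2f ms/MB\n", t.perMB)
}
```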
Regarding auto tuning, I agree this is a hard problem. And I don't necessarily think we need auto tuning for the first version. But if we do decide to do something here, here is a rough summary of my current thinking on various signal options (including latency) and how suitable they are for auto tuning.
We can scale this time utilization, computed over an interval T, by the mean value of |
The remaining weakness with this WAL utilization signal is:
I don't think this will actually be a problem -- here is a cleaner version of a claim from a side conversation with @bananabrick (and look at this with a critical eye, since there is a likelihood of error in reasoning):
I did a re-read of the conversation so far, and am adding my thoughts on the implementation and the operational implications of our work here.
Agreed. I even wonder whether this is worthwhile pursuing at the Pebble layer at all. My own experience elsewhere with systems that auto-tune is that the operator will likely end up in a situation where they are "fighting" against the "intelligent system" that is trying to shift performance in one direction, but the operator wants it to do something different. There's inevitably a bug in the control loop, and debugging such a thing in the midst of an outage is painful.

Add in the fact that we have admission control at the Cockroach layer. Adding a layer of auto-tuning below this feels like we might be multiplying the complexity of the system. There's some irony here in that this issue is about reducing complexity by decreasing the number of "knobs". Instead we're keeping the knobs and letting some code fiddle with them in real time.

A thought / suggestion: can we expose an API that allows the caller to a) query Pebble's observed view of its own performance (various IO-related metrics, and eventually CPU-bound metrics for compute-heavy tasks), and b) alter the various knobs, for example compaction concurrency, WAL sync interval, etc.? My understanding is that many of these params can be set, but it's not clear to me whether altering them at runtime has any effect, or would be safe to do. Presumably we'd then bake the "sentience" into admission control in Cockroach, which would call into this Pebble API and twiddle the knobs from there. An operator could also completely opt out of admission control and tune the knobs directly, though presumably this would be for very advanced and knowledgeable operators.

Separately, I was also thinking about whether it would be possible to directly query the operating system for the performance of the filesystem (and other things). On Linux, there's eBPF, which gives you a means of hooking into various structures in kernel-land. This is like a looking glass through which we'd be able to see at a very low level what is going on. In exploring this idea, I think I convinced myself that this would be unsavory (and probably dangerous), for a few reasons: it's not portable (even if not using eBPF, we'd need multiple system-specific ways of querying for these metrics); it's a leaky abstraction (the metrics are system-wide and not specific to Pebble - e.g. should Pebble slow down if another process is eating up IO bandwidth?); and it's wildly complex. With that said, it seems like we're already agreed on the approach of using Pebble-specific metrics to expose and gauge how it is performing. I'll leave the above as just food for thought.

tldr: I'm not sold on auto-tuning in Pebble. Can we instead expose an API for dynamic control? Using metrics specific to Pebble seems like a decent approach to inferring Pebble's own performance / progress, rather than using raw FS / OS metrics.
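To illustrate the suggestion, purely as a sketch: none of these types exist in Pebble, and the chosen metrics and knobs are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// PerfSnapshot is a hypothetical read-only view of Pebble's self-observed performance.
type PerfSnapshot struct {
	WALSyncP99          time.Duration
	L0Sublevels         int
	CompactionDebtBytes int64
	RunningCompactions  int
}

// Controller is a hypothetical interface a caller (e.g. admission control in
// CockroachDB) could use to observe Pebble and twiddle its knobs at runtime.
type Controller interface {
	Snapshot() PerfSnapshot
	SetMaxConcurrentCompactions(n int)
}

// noopController is a stand-in implementation for illustration.
type noopController struct{ maxCompactions int }

func (c *noopController) Snapshot() PerfSnapshot            { return PerfSnapshot{} }
func (c *noopController) SetMaxConcurrentCompactions(n int) { c.maxCompactions = n }

func main() {
	var ctrl Controller = &noopController{maxCompactions: 3}
	if snap := ctrl.Snapshot(); snap.L0Sublevels < 10 {
		// Caller-side policy: plenty of L0 headroom, allow more compactions.
		ctrl.SetMaxConcurrentCompactions(6)
	}
	fmt.Println("knobs updated")
}
```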
Agreed that this happens, though rarely. Adding a temporary override is an easy way to fix it. And exposing an "API for dynamic control" externalizes the complexity to a piece of code that may not have the requisite signals to make the right decision, and that will need to understand the various control points in Pebble. For instance, the "WAL utilization" signal (which I am very hopeful about) mentioned in #1329 (comment) has to be gathered inside Pebble. Btw, I don't think this will interfere with admission control. The background work tuning will not try to come anywhere close to 100% utilization, while admission control is trying to do queueing and relative prioritization of work when utilization is close to 100%.
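To make the "gathered inside Pebble" point concrete, a rough sketch of how a WAL time-utilization signal could be accumulated around the WAL sync call; the names and sampling scheme here are assumptions, not Pebble code:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// walUtilization accumulates the total time spent inside WAL sync calls, so a
// background goroutine can compute busy-time / interval-length per interval.
type walUtilization struct {
	busyNanos atomic.Int64
}

// timeSync wraps a WAL sync and records how long it took.
func (w *walUtilization) timeSync(sync func() error) error {
	start := time.Now()
	err := sync()
	w.busyNanos.Add(time.Since(start).Nanoseconds())
	return err
}

// utilization returns the fraction of the interval the WAL spent syncing and
// resets the counter for the next interval.
func (w *walUtilization) utilization(interval time.Duration) float64 {
	busy := w.busyNanos.Swap(0)
	return float64(busy) / float64(interval.Nanoseconds())
}

func main() {
	w := &walUtilization{}
	// Simulate a few syncs within a 100ms interval.
	for i := 0; i < 5; i++ {
		_ = w.timeSync(func() error { time.Sleep(2 * time.Millisecond); return nil })
	}
	fmt.Printf("WAL time utilization: %.2f\n", w.utilization(100*time.Millisecond))
}
```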
Ah, this conversation has progressed quite a bit since I last engaged in it. One additional thing on the "too much automation can fight against the operator" worry: while I agree with it as a general concern, we've narrowly scoped this project to just be the divvying up of additional IO resources when we know it'd likely be a good and safe idea to use them. In particular, our focus seems to be on tuning compaction concurrency. We will still fall back to the same baseline as before under any sort of IO-constrained situation, so that still gets us a net win from finishing this project. All the additional knobs we're talking about just control how aggressive or conservative we are in assuming we have excess resources available, but our baseline should still be "good enough" to get us out of any hairy resource-constrained situation. It's also generally true that we have very few user-documented or user-visible knobs in Pebble to start with, so whether we add more or remove more doesn't make too big a difference. If anything, I interpreted the higher-level goal of this project to be "have Pebble make better decisions on its own, and tune Pebble's existing knobs better by default so they don't leave performance on the table".
The first commit is from 82440.

We assume that:
- There is a provisioned known limit on the sum of read and write bandwidth. This limit is allowed to change.
- Admission control can only shape the rate of admission of writes. Writes also cause reads, since compactions do reads and writes.

There are multiple challenges:
- We are unable to precisely track the causes of disk read bandwidth, since we do not have observability into what reads missed the OS page cache. That is, we don't know how much of the reads were due to incoming reads (that we don't shape) and how much due to compaction read bandwidth.
- We don't shape incoming reads.
- There can be a large time lag between the shaping of incoming writes, and when it affects actual writes in the system, since compaction backlog can build up in various levels of the LSM store.
- Signals of overload are coarse, since we cannot view all the internal queues that can build up due to resource overload. For instance, different examples of bandwidth saturation exhibit wildly different latency effects, presumably because the queue buildup is different. So it is non-trivial to approach full utilization without risking high latency.

Due to these challenges, and previous design attempts that were quite complicated (and incomplete), we adopt a goal of simplicity of design, and strong abstraction boundaries.
- The disk load is abstracted using an enum. The diskLoadWatcher can be evolved independently.
- The approach uses easy-to-understand additive increase and multiplicative decrease (unlike what we do for flush and compaction tokens, where we try to more precisely calculate the sustainable rates).

Since we are using a simple approach that is somewhat coarse in its behavior, we start by limiting its application to two kinds of writes:
- Incoming writes that are deemed "elastic": This can be done by introducing a work-class (in addition to admissionpb.WorkPriority), or by implying a work-class from the priority (e.g. priorities < NormalPri are deemed elastic). This prototype does the latter.
- Optional compactions: We assume that the LSM store is configured with a ceiling on the number of regular concurrent compactions, and if it needs more it can request resources for additional (optional) compactions. These latter compactions can be limited by this approach. See cockroachdb/pebble/issues/1329 for motivation.

The reader should start with disk_bandwidth.go, consisting of:
- diskLoadWatcher: computes load levels.
- compactionLimiter: tracks all compaction slots and limits optional compactions.
- diskBandwidthLimiter: composes the previous two objects and uses load information to limit write tokens for elastic writes and to limit compactions.

There is significant refactoring and changes in granter.go and work_queue.go. This is driven by the fact that:
- Previously the tokens were for L0 and now we need to support tokens for bytes into L0 and tokens for bytes into the LSM (the former being a subset of the latter).
- Elastic work is in a different WorkQueue than regular work, but they are competing for the same tokens.

The latter is handled by allowing kvSlotGranter to multiplex across multiple requesters, via multiple child granters. A number of interfaces are adjusted to make this viable. In general, the GrantCoordinator is now slightly dumber and some of that logic is moved into the granters.

For the former (two kinds of tokens), I considered adding multiple resource dimensions to the granter-requester interaction but found it too complicated. Instead we rely on the observation that we can request tokens based on the total incoming bytes of the request (not just L0), and when the request is completed, can tell the granter how many bytes went into L0. The latter allows us to return tokens to L0.

There was also the (unrelated) realization that we can use the information of the size of the batch in the call to AdmittedWorkDone and fix estimation that we had to make pre-evaluation. This resulted in a bunch of changes to how we do estimation to adjust the tokens consumed: we now estimate how much we need to compensate what is being asked for at (a) admission time, (b) work done time, for the bytes added to the LSM, and (c) work done time, for the bytes added to L0. Since we are asking for tokens at admission time based on the full incoming bytes, the estimation for what fraction of an ingest goes into L0 is eliminated. This had the consequence of simplifying some of the estimation logic that was distinguishing writes from ingests.

There are no tests, so this code is probably littered with bugs.

Next steps:
- Unit tests
- Pebble changes for IntervalCompactionInfo
- CockroachDB changes for IntervalDiskLoadInfo
- Experimental evaluation and tuning
- Separate into multiple PRs for review
- KV and storage package plumbing for properly populating StoreWriteWorkInfo.{WriteBytes,IngestRequest} for ingestions and StoreWorkDoneInfo.{ActualBytes,ActualBytesIntoL0} for writes and ingestions.

Release note: None
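To illustrate the additive-increase/multiplicative-decrease shaping the prototype describes, a toy sketch follows; the load levels and constants are assumptions, not the prototype's code:

```go
package main

import "fmt"

// diskLoadLevel abstracts disk load, mirroring the idea of an enum that a
// diskLoadWatcher could compute from bandwidth/latency observations.
type diskLoadLevel int

const (
	loadLow diskLoadLevel = iota
	loadModerate
	loadHigh
	loadOverload
)

// adjustElasticTokens applies AIMD to the per-interval byte tokens granted to
// elastic (low-priority) writes and optional compactions.
func adjustElasticTokens(tokens int64, load diskLoadLevel) int64 {
	const (
		minTokens = 1 << 20   // never go below 1 MB per interval
		maxTokens = 512 << 20 // cap at 512 MB per interval
		step      = 8 << 20   // additive increase of 8 MB
	)
	switch load {
	case loadOverload:
		tokens /= 2 // multiplicative decrease under overload
	case loadHigh:
		// hold steady
	default:
		tokens += step // additive increase when there is headroom
	}
	if tokens < minTokens {
		tokens = minTokens
	}
	if tokens > maxTokens {
		tokens = maxTokens
	}
	return tokens
}

func main() {
	t := int64(64 << 20)
	t = adjustElasticTokens(t, loadLow)      // grows to 72 MB
	t = adjustElasticTokens(t, loadOverload) // halves to 36 MB
	fmt.Println(t>>20, "MB of elastic tokens")
}
```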
This change adds an interface that users of Pebble can implement to integrate a compaction limiter that gets information about compaction bytes read and bytes written. This change also threads that interface (and a simple no-op implementation by default) through the compaction and flush code paths. Informs cockroachdb#1329.
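A rough sketch of what such an interface could look like; the method set below is a guess for illustration, not the interface the linked change actually adds:

```go
package main

import "fmt"

// CompactionLimiter is a hypothetical interface a Pebble user could implement
// to observe (and eventually limit) compaction and flush IO.
type CompactionLimiter interface {
	// CompactionStarted is called when a compaction or flush begins.
	CompactionStarted()
	// BytesProcessed reports bytes read and written by the ongoing work.
	BytesProcessed(read, written int64)
	// CompactionFinished is called when the work completes.
	CompactionFinished()
}

// noopLimiter mirrors the "simple no-op implementation by default".
type noopLimiter struct{}

func (noopLimiter) CompactionStarted()                 {}
func (noopLimiter) BytesProcessed(read, written int64) {}
func (noopLimiter) CompactionFinished()                {}

func main() {
	var l CompactionLimiter = noopLimiter{}
	l.CompactionStarted()
	l.BytesProcessed(64<<20, 48<<20)
	l.CompactionFinished()
	fmt.Println("no-op limiter exercised")
}
```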
TODO Informs cockroachdb/pebble#1329 Release note: None
CompactionScheduler is an interface that encompasses (a) our current compaction scheduling behavior, (b) compaction scheduling in a multi-store setting that adds a per-node limit in addition to the per-store limit, and prioritizes across stores, (c) compaction scheduling that includes (b) plus is aware of resource usage and can prioritize across stores and across other long-lived work in addition to compactions (e.g. range snapshot reception).

CompactionScheduler calls into DB and the DB calls into the CompactionScheduler. This requires some care in specification of the synchronization expectations, to avoid deadlock. For the most part, the complexity is borne by the CompactionScheduler -- see the code comments for details. ConcurrencyLimitScheduler is an implementation for (a), and is paired with a single DB. It has no knowledge of delete-only compactions, so we have redefined the meaning of Options.MaxConcurrentCompactions, as discussed in the code comment.

CompactionScheduler has some related interfaces/structs:
- CompactionGrantHandle is used to report the start and end of the compaction, and frequently report the written bytes, and CPU consumption. In the implementation of CompactionGrantHandle provided by CockroachDB's AC component, the CPU consumption will use the grunning package.
- WaitingCompaction is a struct used to prioritize the DB's compaction relative to other long-lived work (including compactions by other DBs). makeWaitingCompaction is a helper that constructs this struct.

For integrating the CompactionScheduler with DB, there are a number of changes:
- The entry paths to ask to schedule a compaction are reduced to 1, by removing DB.maybeScheduleCompactionPicker. The only path is DB.maybeScheduleCompaction.
- versionSet.{curCompactionConcurrency,pickedCompactionCache} are added to satisfy the interface expected by CompactionScheduler. Specifically, pickedCompactionCache allows us to safely cache a pickedCompaction that cannot be run. There is some commentary on the worst-case waste in compaction picking -- with the default ConcurrencyLimitScheduler on average there should be no wastage.
- versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two atomic variables introduced to implement DB.GetAllowedWithoutPermission, which allows the DB to adjust what minimum compaction concurrency it desires based on the backlog of automatic and manual compactions. The encoded logic is meant to be equivalent to our current logic.

The CompactionSlot and CompactionLimiter introduced in a recent PR are deleted. CompactionGrantHandle is analogous to CompactionSlot, and allows for measuring of CPU usage since the implementation of CompactionScheduler in AC will explicitly monitor usage and capacity. CompactionScheduler is analogous to CompactionLimiter. CompactionLimiter had a non-queueing interface in that it never called into the DB. This worked since the only events that allowed another compaction to run were also ones that caused another call to maybeScheduleCompaction. This is not true when a CompactionScheduler is scheduling across multiple DBs, or managing a compaction and other long-lived work (snapshot reception), since something unrelated to the DB can cause resources to become available to run a compaction.

There is a partial implementation of a resource aware scheduler in https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs cockroachdb#3813, cockroachdb/cockroach#74697, cockroachdb#1329
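For a concrete feel of the scheduler/DB interaction described above, here is a toy sketch; the method names and signatures are simplified approximations, not the actual Pebble interfaces:

```go
package main

import "fmt"

// CompactionGrantHandle lets a running compaction report progress back to the
// scheduler (written bytes, CPU time) and signal completion.
type CompactionGrantHandle interface {
	MeasureCPU()
	CumulativeStats(writtenBytes int64)
	Done()
}

// CompactionScheduler decides when a DB may run a compaction. The DB asks for
// permission; the scheduler may also call back into the DB when capacity frees up.
type CompactionScheduler interface {
	// TrySchedule returns a handle if the DB may start a compaction now.
	TrySchedule() (CompactionGrantHandle, bool)
}

// concurrencyLimitScheduler is a toy analogue of ConcurrencyLimitScheduler:
// a plain per-store concurrency cap with no cross-store awareness. Not
// thread-safe; purely illustrative.
type concurrencyLimitScheduler struct {
	running, max int
}

type grant struct{ s *concurrencyLimitScheduler }

func (g *grant) MeasureCPU()                        {}
func (g *grant) CumulativeStats(writtenBytes int64) {}
func (g *grant) Done()                              { g.s.running-- }

func (s *concurrencyLimitScheduler) TrySchedule() (CompactionGrantHandle, bool) {
	if s.running >= s.max {
		return nil, false
	}
	s.running++
	return &grant{s: s}, true
}

func main() {
	s := &concurrencyLimitScheduler{max: 2}
	if h, ok := s.TrySchedule(); ok {
		h.CumulativeStats(128 << 20)
		h.Done()
	}
	fmt.Println("scheduled one compaction")
}
```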
This issue was originally about the need to reduce tuning knobs in Pebble that govern performance more generally. Over time, some of these have been adjusted, or have had adjustments considered. The one major one that remains is MaxConcurrentCompactions, which gets set to min(NumCPU(), 3). That is too few concurrent compactions for many beefy nodes with fast NVMe drives. Pebble, along with Cockroach's admission control, should be able to schedule additional compactions in the presence of greater CPU and disk IO headroom, without necessitating manual operator intervention to unlock additional performance.

There's a similar case to be made for increasing compaction concurrency even in the presence of heavy foreground write traffic, as it reduces the work necessary to incorporate future foreground write traffic into a well-formed LSM. Even in this case, adaptive increases in compaction concurrency will yield better foreground write performance in the longer run, and can be considered even though it might seem instantaneously unintuitive to take away more disk/CPU bandwidth for a background operation.
Original text of the issue follows below the horizontal line.
Pebble has various tuning knobs that affect its CPU and disk IOPS/bandwidth utilization. Examples include L0CompactionConcurrency, CompactionDebtConcurrency, MaxConcurrentCompactions, DeleteRangeFlushDelay, MinDeletionRate.
Some of these, like MinDeletionRate and DeleteRangeFlushDelay, affect how fast we reclaim disk space, so are possibly not very interesting in most circumstances -- it is relatively easier to properly provision disk byte capacity.
The compaction ones affect how many concurrent compactions we can run, and being able to run more concurrent compactions would allow Pebble to handle a higher write throughput without having a misshapen LSM. CockroachDB accepts the default value for these knobs except for MaxConcurrentCompactions, which is set to min(numCPU, 3). It is likely that these knob settings are sub-optimal in most settings: (a) the limit of 3 is likely to be too low on large machines, (b) even 3 compactions could be too many on a large machine if there is a spike in read-only user-facing traffic that wants to consume all CPU, or if disk IOPS/bandwidth are saturated due to user-facing traffic.

We have discussed ways to speed up compactions in the past by parallelizing them into multiple sub-compactions that partition the key span of the compaction. More recently, there are ideas to parallelize the compression/decompression of sstable blocks within a compaction.
These compaction-related ideas have the following need in common: detect the current resource utilization (or overload of a resource), and use that to adjust the amount of concurrent background work, so as to not affect foreground traffic. It is possible that such adjustments would need to be fine-grained and slow down an already started compaction.
Detecting resource utilization or overload for CPU is relatively easier -- the CockroachDB context already has hooks that look at goroutine scheduler state. Disk IOPs/bandwidth is harder: log commit latency is one signal, but it may not always be a good indicator, so we would need to run some overload experiments to understand this better.
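As a sketch of the easier (CPU) half, a caller could gate an extra compaction on a crude headroom check like the one below; the reserve constant and the use of GOMAXPROCS are assumptions, and CockroachDB's actual hooks look at goroutine scheduler state instead:

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuHasHeadroom is a toy heuristic: allow extra background work only if the
// number of currently running compactions plus a reserve stays well under the
// available CPUs. A real implementation would consult goroutine scheduling
// latency rather than a static count.
func cpuHasHeadroom(runningCompactions int) bool {
	cpus := runtime.GOMAXPROCS(0)
	const reserveForForeground = 2 // assumed reserve for user-facing work
	return runningCompactions+reserveForForeground < cpus
}

func main() {
	fmt.Println("extra compaction allowed:", cpuHasHeadroom(3))
}
```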
Jira issue: PEBBLE-119