turn Timeline::layers into tokio::sync::RwLock #4441
Conversation
1012 tests run: 971 passed, 0 failed, 41 skipped (full report). The comment gets automatically updated with the latest test results: a4c4d96 at 2023-06-13T13:43:30.758Z :recycle:
... by switching the internal RwLock to a OnceCell. This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). See #4462 (comment) for more context. fixes #4471
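For illustration, here is a minimal sketch of the `RwLock`-to-`OnceCell` switch described above, using the `once_cell` crate; the `LayerMapInfo`/`Inner` names and fields are made up for this example, not the actual pageserver types. The idea, roughly, is that a value which is written exactly once and then only read does not need a lock guard at all.

```rust
use once_cell::sync::OnceCell;

// Hypothetical stand-in; previously this might have been an RwLock<Option<...>>.
pub struct LayerMapInfo {
    pub num_layers: usize,
}

pub struct Inner {
    // Write-once, read-many: no guard has to be taken (or held) for reads.
    layer_map_info: OnceCell<LayerMapInfo>,
}

impl Inner {
    pub fn init(&self, info: LayerMapInfo) {
        // set() succeeds only for the first caller; later calls are rejected.
        let _ = self.layer_map_info.set(info);
    }

    pub fn num_layers(&self) -> Option<usize> {
        self.layer_map_info.get().map(|info| info.num_layers)
    }
}
```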
…ers (#4476)

This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). There, we want to switch `Timeline::layers` to be a `tokio::sync::RwLock`. That will require `TimelineWriter` to become async, which in turn requires `freeze_inmem_layer` to become async. So, inside `check_checkpoint_distance`, we will have `freeze_inmem_layer().await`. But current rustc isn't smart enough to understand that we `drop(layers)` earlier, and hence complains about the `!Send` `layers` guard being held across the `freeze_inmem_layer().await` point. This patch puts the guard into a scope, so rustc will shut up in the next patch, where we make the transition for `TimelineWriter`.

Obsoletes #4474.
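Below is a self-contained sketch of the scoping pattern the commit message describes; `Timeline`, its layer-map contents, and the method bodies are simplified stand-ins, not the real pageserver code.

```rust
use std::sync::RwLock;

pub struct Timeline {
    // Still a std RwLock at this point in the PR stack.
    layers: RwLock<Vec<String>>,
}

impl Timeline {
    async fn freeze_inmem_layer(&self, _layer: String) {
        // stand-in for the real (soon-to-be async) freeze
    }

    pub async fn check_checkpoint_distance(&self) {
        // Keep the !Send read guard inside its own scope so it is provably
        // dropped before the await point below, keeping the future Send.
        let to_freeze = {
            let layers = self.layers.read().unwrap();
            layers.last().cloned()
        }; // guard dropped here
        if let Some(layer) = to_freeze {
            self.freeze_inmem_layer(layer).await;
        }
    }
}
```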
Asserting `TimelineWriter<'_>: Send` fails with:

```
error[E0277]: `std::sync::MutexGuard<'_, ()>` cannot be sent between threads safely
    --> pageserver/src/tenant/timeline.rs:4644:20
     |
4644 |     _assert_send::<TimelineWriter<'_>>();
     |                    ^^^^^^^^^^^^^^^^^^ `std::sync::MutexGuard<'_, ()>` cannot be sent between threads safely
     |
     = help: within `tenant::timeline::TimelineWriter<'_>`, the trait `std::marker::Send` is not implemented for `std::sync::MutexGuard<'_, ()>`
note: required because it appears within the type `TimelineWriter<'_>`
    --> pageserver/src/tenant/timeline.rs:4596:12
     |
4596 | pub struct TimelineWriter<'a> {
     |            ^^^^^^^^^^^^^^
note: required by a bound in `tenant::timeline::is_send::_assert_send`
    --> pageserver/src/tenant/timeline.rs:4643:24
     |
4643 |     fn _assert_send<T: Send>() {}
     |                        ^^^^ required by this bound in `_assert_send`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `pageserver` due to previous error
```
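For reference, a minimal, self-contained reproduction of the compile-time `Send` assertion from the error above; the struct body is reduced to just the offending field. The later fix (#4477) makes `TimelineWriter` `Send` by using a `tokio::sync` mutex internally, whose guard is `Send`.

```rust
use std::sync::MutexGuard;

#[allow(dead_code)]
pub struct TimelineWriter<'a> {
    // A std::sync::MutexGuard field makes the whole struct !Send ...
    write_guard: MutexGuard<'a, ()>,
}

fn _assert_send<T: Send>() {}

fn _send_assertion() {
    // ... so uncommenting the next line reproduces error E0277 from above:
    // _assert_send::<TimelineWriter<'_>>();
}
```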
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). Stacked on top of #4477. Thanks to the countless preliminary PRs, this conversion is relatively straightforward.

Note that this commit partially reverts "pgserver: spawn_blocking in compaction (#4265)" 4e359db, because, if we didn't revert it, we'd have to use `Timeline::layers.blocking_read()` inside `compact_level0_phase1`. That would use up a thread in the `spawn_blocking` thread pool. I considered wrapping the code that follows the second `layers.read().await` into `spawn_blocking`, but there are lifetime issues with `deltas_to_compact`.
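To make the trade-off concrete, here is a hedged sketch (the layer-map type is a stand-in): awaiting the tokio `RwLock` from async code merely suspends the task, whereas `blocking_read()` from inside `spawn_blocking` parks a thread of the hard-capped blocking pool while it waits.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

pub async fn compact_sketch(layers: Arc<RwLock<Vec<String>>>) {
    // Async acquisition: waiting only suspends this task, no thread is tied up.
    let n = layers.read().await.len();
    println!("{n} layers");

    // If the surrounding code ran inside spawn_blocking instead, we would need
    // blocking_read(), which occupies a blocking-pool thread while waiting.
    let layers2 = Arc::clone(&layers);
    let n = tokio::task::spawn_blocking(move || layers2.blocking_read().len())
        .await
        .expect("blocking task panicked");
    println!("{n} layers");
}
```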
Timeline::layers to async RwLock atop #4364
LGTM except the fsync changes. Some of them should be kept as-is because the compaction spawn_blocking PR also fixes some bugs that should not be reverted... 🤪 (See comment at timeline.rs L3269)
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). Or more specifically, #4441, where we turn Timeline::layers into a tokio::sync::RwLock. By using try_write() here, we can avoid turning init_empty_layer_map async, which is nice because much of its transitive call(er) graph isn't async.
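A minimal sketch of the `try_write()` approach described above, under the stated assumption that nothing else can touch the lock during timeline init; the layer-map contents here are a stand-in.

```rust
use tokio::sync::RwLock;

pub struct Timeline {
    layers: RwLock<Vec<String>>, // stand-in for the real layer map
}

impl Timeline {
    // Stays a plain sync fn: during init no other task can hold the lock,
    // so try_write() is expected to always succeed and we never have to await.
    pub fn init_empty_layer_map(&self) {
        let mut layers = self
            .layers
            .try_write()
            .expect("no concurrent access to Timeline::layers during init");
        layers.clear();
    }
}
```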
Ok, so, I re-cherry-picked @skyzh's fsync changes. To the best of my knowledge, all that it reverts now is the `spawn_blocking`-ification of `compact_level0_phase1`. So, we're doing sync IO again inside the async `compact_level0_phase1`. But what remains is the CPU work of the kmerge.
Scoped this out. I don't think it's a good idea to wrap just the kmerge in `spawn_blocking`. So, I'm leaning toward just taking the regression here. @skyzh, do you have a benchmark / numbers that you ran when you implemented the compaction `spawn_blocking` change?
Unluckily no :( We can look at the daily benchmark numbers after merging this PR.
LGTM
Discussed follow-ups verbally.
```diff
-        if result.is_ok() && (is_rel_block_key(key) || is_slru_block_key(key)) {
-            result = writer.put(key, self.lsn, value);
-            false
+        let mut retained_pending_updates = HashMap::new();
```
use reserve to avoid reallocations
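For context, a hypothetical sketch of the hand-rolled `retain` that the new line belongs to (key/value types and the predicate are made up): the map is drained into a pre-sized replacement so the async `put` can be awaited, which `HashMap::retain`'s sync closure cannot do.

```rust
use std::collections::HashMap;

// Made-up stand-ins for the real key predicate and writer.
fn is_block_key(key: u64) -> bool {
    key % 2 == 0
}

pub struct TimelineWriter;

impl TimelineWriter {
    pub async fn put(&mut self, _key: u64, _value: Vec<u8>) {}
}

pub async fn flush(pending_updates: &mut HashMap<u64, Vec<u8>>, writer: &mut TimelineWriter) {
    // Pre-size the replacement map to avoid reallocations (the suggestion above).
    let mut retained_pending_updates = HashMap::with_capacity(pending_updates.len());
    for (key, value) in pending_updates.drain() {
        if is_block_key(key) {
            writer.put(key, value).await; // now async, hence no HashMap::retain closure
        } else {
            retained_pending_updates.insert(key, value);
        }
    }
    *pending_updates = retained_pending_updates;
}
```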
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).

# Full Stack Of Preliminary PRs

Thanks to the countless preliminary PRs, this conversion is relatively straightforward.

1. Clean-ups
   * #4316
   * #4317
   * #4318
   * #4319
   * #4321
   * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan where we would need to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485

# Significant Changes In This PR

## `compact_level0_phase1` & `create_delta_layer`

This commit partially reverts "pgserver: spawn_blocking in compaction (#4265)" 4e359db. Specifically, it reverts the `spawn_blocking`-ification of `compact_level0_phase1`.

If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()` inside `compact_level0_phase1`. That would use up a thread in the `spawn_blocking` thread pool, which is hard-capped. I considered wrapping the code that follows the second `layers.read().await` into `spawn_blocking`, but there are lifetime issues with `deltas_to_compact`.

Also, this PR switches the `create_delta_layer` _function_ back to async, and uses `spawn_blocking` inside to run the code that does sync IO, while keeping the code that needs to lock `Timeline::layers` async.

## `LayerIter` and `LayerKeyIter` `Send` bounds

I had to add a `Send` bound on the `dyn` type that `LayerIter` and `LayerKeyIter` wrap. Why? Because we now have the second `layers.read().await` inside `compact_level0_phase1`, and these iterator instances are held across that await-point. More background: #4462 (comment)

## `DatadirModification::flush`

Needed to replace the `HashMap::retain` with a hand-rolled variant because `TimelineWriter::put` is now async.
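A rough illustration of the `create_delta_layer` shape described above (all types, names, and paths are made up, not the actual pageserver code): the sync IO runs on the blocking pool, while the layer-map registration keeps using the async lock.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

pub async fn create_delta_layer(
    layers: Arc<RwLock<Vec<String>>>, // stand-in for Timeline::layers
    frozen_bytes: Vec<u8>,
) -> std::io::Result<()> {
    // The blocking filesystem work goes to the spawn_blocking pool.
    let path = tokio::task::spawn_blocking(move || -> std::io::Result<String> {
        let path = "delta-layer.tmp".to_string(); // made-up path
        std::fs::write(&path, &frozen_bytes)?; // stand-in for write_to_disk()
        Ok(path)
    })
    .await
    .expect("spawn_blocking task panicked")?;

    // Registering the new layer uses the async tokio RwLock; no thread is blocked here.
    layers.write().await.push(path);
    Ok(())
}
```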
We already do it inside `frozen_layer.write_to_disk()`. Context: #4441 (comment)
This PR concludes the "async `Layer::get_value_reconstruct_data`" project.

The problem we're solving is that, before this patch, we'd execute `Layer::get_value_reconstruct_data` on the tokio executor threads. This function is IO- and/or CPU-intensive. The IO is using VirtualFile / std::fs; hence it's blocking. This results in unfairness towards other tokio tasks, especially under (disk) load. Some context can be found at #4154, where I suspect (but can't prove) load spikes of logical size calculation to cause heavy eviction skew. Sadly we don't have tokio runtime/scheduler metrics to quantify the unfairness. But generally, we know blocking the executor threads on std::fs IO is bad. So, let's have this change and watch out for severe perf regressions in staging & during rollout.

## Changes

* rename `Layer::get_value_reconstruct_data` to `Layer::get_value_reconstruct_data_blocking`
* add a new blanket-impl'd `Layer::get_value_reconstruct_data` `async_trait` method that runs `get_value_reconstruct_data_blocking` inside `spawn_blocking`.
  * `spawn_blocking` requires `'static` lifetime of the captured variables; hence I had to change the data flow to _move_ the `ValueReconstructState` into and back out of `get_value_reconstruct_data` instead of passing a reference. It's a small struct, so I don't expect a big performance penalty.

## Performance

Fundamentally, the code changes cause the following performance-relevant changes:

* Latency & allocations: each `get_value_reconstruct_data` call now makes a short-lived allocation, because `async_trait` is just sugar for boxed futures under the hood.
* Latency: `spawn_blocking` adds some latency because it needs to move the work to a thread pool.
  * Using `spawn_blocking` plus the existing synchronous code inside is probably more efficient than switching all the synchronous code to tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the async executor thread pool. Hence, as long as the disks can keep up, which they should according to AWS specs, we will be able to deliver higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get more throughput. Not a problem, because the disks in prod are currently under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.

Slightly higher latency under regular load is acceptable given the throughput gains and expected better fairness during disk load peaks, such as the logical size calculation peaks uncovered in #4154.

## Full Stack Of Preliminary PRs

This PR builds on top of the following preliminary PRs:

1. Clean-ups
   * #4316
   * #4317
   * #4318
   * #4319
   * #4321
   * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan where we would need to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485
7. #4441
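A simplified sketch of the wrapper pattern described above: the real change uses an `async_trait` blanket method, while this version uses a free function and made-up struct contents so it stays self-contained; the key idea, moving the `ValueReconstructState` into and back out of `spawn_blocking`, is the same.

```rust
use std::sync::Arc;

// Stand-in for the real ValueReconstructState; the important part is that it is
// moved into spawn_blocking and handed back, so no borrow has to live for 'static.
pub struct ValueReconstructState {
    pub records: Vec<Vec<u8>>,
}

pub trait Layer: Send + Sync + 'static {
    // The existing synchronous implementation, renamed as described above.
    fn get_value_reconstruct_data_blocking(
        &self,
        state: ValueReconstructState,
    ) -> ValueReconstructState;
}

// Async entry point: push the blocking IO/CPU work onto the spawn_blocking pool
// so it no longer runs on the tokio executor threads.
pub async fn get_value_reconstruct_data(
    layer: Arc<dyn Layer>,
    state: ValueReconstructState,
) -> ValueReconstructState {
    tokio::task::spawn_blocking(move || layer.get_value_reconstruct_data_blocking(state))
        .await
        .expect("spawn_blocking task panicked")
}
```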