turn Timeline::layers into tokio::sync::RwLock #4441
Conversation
1012 tests run: 971 passed, 0 failed, 41 skipped (full report). The comment gets automatically updated with the latest test results: a4c4d96 at 2023-06-13T13:43:30.758Z :recycle:
... by switching the internal RwLock to a OnceCell. This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). See #4462 (comment) for more context. fixes #4471
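For illustration, here is a minimal sketch of the `RwLock`-to-`OnceCell` switch described above, using the `once_cell` crate; the `LayerMapInfo`/`Inner` names and fields are made up for this example, not the actual pageserver types. The idea, roughly, is that a value which is written exactly once and then only read does not need a lock guard at all.

```rust
use once_cell::sync::OnceCell;

// Hypothetical stand-in; previously this might have been an RwLock<Option<...>>.
pub struct LayerMapInfo {
    pub num_layers: usize,
}

pub struct Inner {
    // Write-once, read-many: no guard has to be taken (or held) for reads.
    layer_map_info: OnceCell<LayerMapInfo>,
}

impl Inner {
    pub fn init(&self, info: LayerMapInfo) {
        // set() succeeds only for the first caller; later calls are rejected.
        let _ = self.layer_map_info.set(info);
    }

    pub fn num_layers(&self) -> Option<usize> {
        self.layer_map_info.get().map(|info| info.num_layers)
    }
}
```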
…ers (#4476)

This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). There, we want to switch `Timeline::layers` to be a `tokio::sync::RwLock`. That will require `TimelineWriter` to become async, which in turn requires `freeze_inmem_layer` to become async. So, inside `check_checkpoint_distance`, we will have `freeze_inmem_layer().await`. But current rustc isn't smart enough to understand that we `drop(layers)` earlier, and hence complains about the `!Send` `layers` guard being held across the `freeze_inmem_layer().await` point. This patch puts the guard into a scope, so rustc will shut up in the next patch, where we make the transition for `TimelineWriter`.

Obsoletes #4474.
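Below is a self-contained sketch of the scoping pattern the commit message describes; `Timeline`, its layer-map contents, and the method bodies are simplified stand-ins, not the real pageserver code.

```rust
use std::sync::RwLock;

pub struct Timeline {
    // Still a std RwLock at this point in the PR stack.
    layers: RwLock<Vec<String>>,
}

impl Timeline {
    async fn freeze_inmem_layer(&self, _layer: String) {
        // stand-in for the real (soon-to-be async) freeze
    }

    pub async fn check_checkpoint_distance(&self) {
        // Keep the !Send read guard inside its own scope so it is provably
        // dropped before the await point below, keeping the future Send.
        let to_freeze = {
            let layers = self.layers.read().unwrap();
            layers.last().cloned()
        }; // guard dropped here
        if let Some(layer) = to_freeze {
            self.freeze_inmem_layer(layer).await;
        }
    }
}
```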
Asserting `TimelineWriter<'_>: Send` fails with:

```
error[E0277]: `std::sync::MutexGuard<'_, ()>` cannot be sent between threads safely
    --> pageserver/src/tenant/timeline.rs:4644:20
     |
4644 |     _assert_send::<TimelineWriter<'_>>();
     |                    ^^^^^^^^^^^^^^^^^^ `std::sync::MutexGuard<'_, ()>` cannot be sent between threads safely
     |
     = help: within `tenant::timeline::TimelineWriter<'_>`, the trait `std::marker::Send` is not implemented for `std::sync::MutexGuard<'_, ()>`
note: required because it appears within the type `TimelineWriter<'_>`
    --> pageserver/src/tenant/timeline.rs:4596:12
     |
4596 | pub struct TimelineWriter<'a> {
     |            ^^^^^^^^^^^^^^
note: required by a bound in `tenant::timeline::is_send::_assert_send`
    --> pageserver/src/tenant/timeline.rs:4643:24
     |
4643 |     fn _assert_send<T: Send>() {}
     |                        ^^^^ required by this bound in `_assert_send`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `pageserver` due to previous error
```
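For reference, a minimal, self-contained reproduction of the compile-time `Send` assertion from the error above; the struct body is reduced to just the offending field. The later fix (#4477) makes `TimelineWriter` `Send` by using a `tokio::sync` mutex internally, whose guard is `Send`.

```rust
use std::sync::MutexGuard;

#[allow(dead_code)]
pub struct TimelineWriter<'a> {
    // A std::sync::MutexGuard field makes the whole struct !Send ...
    write_guard: MutexGuard<'a, ()>,
}

fn _assert_send<T: Send>() {}

fn _send_assertion() {
    // ... so uncommenting the next line reproduces error E0277 from above:
    // _assert_send::<TimelineWriter<'_>>();
}
```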
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). Stacked on top of #4477. Thanks to the countless preliminary PRs, this conversion is relatively straightforward.

Note that this commit partially reverts "pgserver: spawn_blocking in compaction (#4265)" 4e359db, because, if we didn't revert it, we'd have to use `Timeline::layers.blocking_read()` inside `compact_level0_phase1`. That would use up a thread in the `spawn_blocking` thread pool. I considered wrapping the code that follows the second `layers.read().await` into `spawn_blocking`, but there are lifetime issues with `deltas_to_compact`.
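To make the trade-off concrete, here is a hedged sketch (the layer-map type is a stand-in): awaiting the tokio `RwLock` from async code merely suspends the task, whereas `blocking_read()` from inside `spawn_blocking` parks a thread of the hard-capped blocking pool while it waits.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

pub async fn compact_sketch(layers: Arc<RwLock<Vec<String>>>) {
    // Async acquisition: waiting only suspends this task, no thread is tied up.
    let n = layers.read().await.len();
    println!("{n} layers");

    // If the surrounding code ran inside spawn_blocking instead, we would need
    // blocking_read(), which occupies a blocking-pool thread while waiting.
    let layers2 = Arc::clone(&layers);
    let n = tokio::task::spawn_blocking(move || layers2.blocking_read().len())
        .await
        .expect("blocking task panicked");
    println!("{n} layers");
}
```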
Timeline::layers to async RwLock atop #4364
LGTM except the fsync changes. Some of them should be kept as-is because the compaction spawn_blocking PR also fixes some bugs that should not be reverted... 🤪 (See comment at timeline.rs L3269)
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). Or more specifically, #4441, where we turn Timeline::layers into a tokio::sync::RwLock. By using try_write() here, we can avoid turning init_empty_layer_map async, which is nice because much of its transitive call(er) graph isn't async.
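A minimal sketch of the `try_write()` approach described above, under the stated assumption that nothing else can touch the lock during timeline init; the layer-map contents here are a stand-in.

```rust
use tokio::sync::RwLock;

pub struct Timeline {
    layers: RwLock<Vec<String>>, // stand-in for the real layer map
}

impl Timeline {
    // Stays a plain sync fn: during init no other task can hold the lock,
    // so try_write() is expected to always succeed and we never have to await.
    pub fn init_empty_layer_map(&self) {
        let mut layers = self
            .layers
            .try_write()
            .expect("no concurrent access to Timeline::layers during init");
        layers.clear();
    }
}
```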
Ok, so, I re-cherry-picked @skyzh's fsync changes. To the best of my knowledge, all that it reverts now is the `spawn_blocking`-ification of `compact_level0_phase1`. So, we're doing sync IO again inside the async `compact_level0_phase1`. But what remains is the CPU work of the kmerge.
Scoped this out. I don't think it's a good idea to wrap just the kmerge in `spawn_blocking`. So, I'm leaning toward just taking the regression here. @skyzh, do you have a benchmark / numbers that you ran when you implemented the compaction `spawn_blocking` change?
Unluckily no :( We can look at the daily benchmark numbers after merging this PR.
LGTM
Discussed follow-ups verbally.
```diff
-        if result.is_ok() && (is_rel_block_key(key) || is_slru_block_key(key)) {
-            result = writer.put(key, self.lsn, value);
-            false
+        let mut retained_pending_updates = HashMap::new();
```
use reserve to avoid reallocations
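For context, a hypothetical sketch of the hand-rolled `retain` that the new line belongs to (key/value types and the predicate are made up): the map is drained into a pre-sized replacement so the async `put` can be awaited, which `HashMap::retain`'s sync closure cannot do.

```rust
use std::collections::HashMap;

// Made-up stand-ins for the real key predicate and writer.
fn is_block_key(key: u64) -> bool {
    key % 2 == 0
}

pub struct TimelineWriter;

impl TimelineWriter {
    pub async fn put(&mut self, _key: u64, _value: Vec<u8>) {}
}

pub async fn flush(pending_updates: &mut HashMap<u64, Vec<u8>>, writer: &mut TimelineWriter) {
    // Pre-size the replacement map to avoid reallocations (the suggestion above).
    let mut retained_pending_updates = HashMap::with_capacity(pending_updates.len());
    for (key, value) in pending_updates.drain() {
        if is_block_key(key) {
            writer.put(key, value).await; // now async, hence no HashMap::retain closure
        } else {
            retained_pending_updates.insert(key, value);
        }
    }
    *pending_updates = retained_pending_updates;
}
```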
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).

# Full Stack Of Preliminary PRs

Thanks to the countless preliminary PRs, this conversion is relatively straightforward.

1. Clean-ups
   * #4316
   * #4317
   * #4318
   * #4319
   * #4321
   * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan where we would need to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485

# Significant Changes In This PR

## `compact_level0_phase1` & `create_delta_layer`

This commit partially reverts "pgserver: spawn_blocking in compaction (#4265)" 4e359db. Specifically, it reverts the `spawn_blocking`-ification of `compact_level0_phase1`.

If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()` inside `compact_level0_phase1`. That would use up a thread in the `spawn_blocking` thread pool, which is hard-capped. I considered wrapping the code that follows the second `layers.read().await` into `spawn_blocking`, but there are lifetime issues with `deltas_to_compact`.

Also, this PR switches the `create_delta_layer` _function_ back to async, and uses `spawn_blocking` inside to run the code that does sync IO, while keeping the code that needs to lock `Timeline::layers` async.

## `LayerIter` and `LayerKeyIter` `Send` bounds

I had to add a `Send` bound on the `dyn` type that `LayerIter` and `LayerKeyIter` wrap. Why? Because we now have the second `layers.read().await` inside `compact_level0_phase1`, and these iterator instances are held across that await-point. More background: #4462 (comment)

## `DatadirModification::flush`

Needed to replace the `HashMap::retain` with a hand-rolled variant because `TimelineWriter::put` is now async.
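A rough illustration of the `create_delta_layer` shape described above (all types, names, and paths are made up, not the actual pageserver code): the sync IO runs on the blocking pool, while the layer-map registration keeps using the async lock.

```rust
use std::sync::Arc;
use tokio::sync::RwLock;

pub async fn create_delta_layer(
    layers: Arc<RwLock<Vec<String>>>, // stand-in for Timeline::layers
    frozen_bytes: Vec<u8>,
) -> std::io::Result<()> {
    // The blocking filesystem work goes to the spawn_blocking pool.
    let path = tokio::task::spawn_blocking(move || -> std::io::Result<String> {
        let path = "delta-layer.tmp".to_string(); // made-up path
        std::fs::write(&path, &frozen_bytes)?; // stand-in for write_to_disk()
        Ok(path)
    })
    .await
    .expect("spawn_blocking task panicked")?;

    // Registering the new layer uses the async tokio RwLock; no thread is blocked here.
    layers.write().await.push(path);
    Ok(())
}
```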
We already do it inside `frozen_layer.write_to_disk()`. Context: #4441 (comment)
This PR concludes the "async `Layer::get_value_reconstruct_data`" project.

The problem we're solving is that, before this patch, we'd execute `Layer::get_value_reconstruct_data` on the tokio executor threads. This function is IO- and/or CPU-intensive. The IO is using VirtualFile / std::fs; hence it's blocking. This results in unfairness towards other tokio tasks, especially under (disk) load. Some context can be found at #4154, where I suspect (but can't prove) load spikes of logical size calculation to cause heavy eviction skew. Sadly we don't have tokio runtime/scheduler metrics to quantify the unfairness. But generally, we know blocking the executor threads on std::fs IO is bad. So, let's have this change and watch out for severe perf regressions in staging & during rollout.

## Changes

* rename `Layer::get_value_reconstruct_data` to `Layer::get_value_reconstruct_data_blocking`
* add a new blanket-impl'd `Layer::get_value_reconstruct_data` `async_trait` method that runs `get_value_reconstruct_data_blocking` inside `spawn_blocking`.
  * `spawn_blocking` requires `'static` lifetime of the captured variables; hence I had to change the data flow to _move_ the `ValueReconstructState` into and back out of `get_value_reconstruct_data` instead of passing a reference. It's a small struct, so I don't expect a big performance penalty.

## Performance

Fundamentally, the code changes cause the following performance-relevant changes:

* Latency & allocations: each `get_value_reconstruct_data` call now makes a short-lived allocation, because `async_trait` is just sugar for boxed futures under the hood.
* Latency: `spawn_blocking` adds some latency because it needs to move the work to a thread pool.
  * Using `spawn_blocking` plus the existing synchronous code inside is probably more efficient than switching all the synchronous code to tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the async executor thread pool. Hence, as long as the disks can keep up, which they should according to AWS specs, we will be able to deliver higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get more throughput. Not a problem, because the disks in prod are currently under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.

Slightly higher latency under regular load is acceptable given the throughput gains and expected better fairness during disk load peaks, such as the logical size calculation peaks uncovered in #4154.

## Full Stack Of Preliminary PRs

This PR builds on top of the following preliminary PRs:

1. Clean-ups
   * #4316
   * #4317
   * #4318
   * #4319
   * #4321
   * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan where we would need to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485
7. #4441
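A simplified sketch of the wrapper pattern described above: the real change uses an `async_trait` blanket method, while this version uses a free function and made-up struct contents so it stays self-contained; the key idea, moving the `ValueReconstructState` into and back out of `spawn_blocking`, is the same.

```rust
use std::sync::Arc;

// Stand-in for the real ValueReconstructState; the important part is that it is
// moved into spawn_blocking and handed back, so no borrow has to live for 'static.
pub struct ValueReconstructState {
    pub records: Vec<Vec<u8>>,
}

pub trait Layer: Send + Sync + 'static {
    // The existing synchronous implementation, renamed as described above.
    fn get_value_reconstruct_data_blocking(
        &self,
        state: ValueReconstructState,
    ) -> ValueReconstructState;
}

// Async entry point: push the blocking IO/CPU work onto the spawn_blocking pool
// so it no longer runs on the tokio executor threads.
pub async fn get_value_reconstruct_data(
    layer: Arc<dyn Layer>,
    state: ValueReconstructState,
) -> ValueReconstructState {
    tokio::task::spawn_blocking(move || layer.get_value_reconstruct_data_blocking(state))
        .await
        .expect("spawn_blocking task panicked")
}
```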