
turn Timeline::layers into tokio::sync::RwLock #4441

Merged

Conversation

@problame (Contributor) commented Jun 7, 2023

This is preliminary work for/from #4220 (async Layer::get_value_reconstruct_data).

Full Stack Of Preliminary PRs

Thanks to the countless preliminary PRs, this conversion is relatively straight-forward.

  1. Clean-ups: #4316, #4317, #4318, #4319, #4321
     (these were mostly to find an alternative to #4291, which I thought we'd need in my original plan of converting Tenant::timelines into an async locking primitive (#4333); in reviews, we walked away from that, but the cleanups were still quite useful)
  2. Don't hold timelines lock over load layer map #4364
  3. make Delta{Value,Key}Iter Send #4472
  4. refactor check_checkpoint_distance to prepare for async Timeline::layers #4476
  5. make TimelineWriter Send by using tokio::sync Mutex internally #4477
  6. init_empty_layer_map: use try_write #4485

Significant Changes In This PR

compact_level0_phase1 & create_delta_layer

This commit partially reverts

"pgserver: spawn_blocking in compaction (#4265)"
4e359db.

Specifically, it reverts the spawn_blocking-ification of compact_level0_phase1.
If we didn't revert it, we'd have to use Timeline::layers.blocking_read() inside compact_level0_phase1.
That would use up a thread in the spawn_blocking thread pool, which is hard-capped.

I considered wrapping the code that follows the second layers.read().await
into spawn_blocking, but there are lifetime issues with deltas_to_compact.

Also, this PR switches the create_delta_layer function back to async, and uses spawn_blocking inside to run the code that does sync IO, while keeping the code
that needs to lock Timeline::layers async.

LayerIter and LayerKeyIter Send bounds

I had to add a Send bound on the dyn type that LayerIter and LayerKeyIter wrap.
Why? Because we now have the second layers.read().await inside compact_level0_phase1,
and these iterator instances are held across that await-point.

More background: #4462 (comment)

DatadirModification::flush

Needed to replace the HashMap::retain with a hand-rolled variant because TimelineWriter::put is now async.

@problame problame changed the base branch from main to dont-hold-timelines-lock-over-load_layer_map June 7, 2023 12:48
@problame problame force-pushed the problame/async-timeline-get/timeline-layers-tokio-sync-atop-4364 branch 2 times, most recently from 8f6c8ed to a5daa1a Compare June 7, 2023 14:03
@github-actions bot commented Jun 7, 2023

1012 tests run: 971 passed, 0 failed, 41 skipped (full report)


Flaky tests (1):

  • Postgres 15: test_delete_timeline_post_rm_failure[local_fs]: ✅ release

The comment gets automatically updated with the latest test results (a4c4d96 at 2023-06-13T13:43:30.758Z).

@problame problame force-pushed the dont-hold-timelines-lock-over-load_layer_map branch from 14e21e2 to 6fe36bf Compare June 7, 2023 15:24
@problame problame force-pushed the problame/async-timeline-get/timeline-layers-tokio-sync-atop-4364 branch from a5daa1a to 2a80ef1 Compare June 7, 2023 15:39

@problame problame force-pushed the dont-hold-timelines-lock-over-load_layer_map branch from 6fe36bf to a07d172 Compare June 7, 2023 15:47
@problame problame force-pushed the problame/async-timeline-get/timeline-layers-tokio-sync-atop-4364 branch from 2a80ef1 to 928e245 Compare June 7, 2023 15:48

@problame problame force-pushed the dont-hold-timelines-lock-over-load_layer_map branch from a07d172 to bc5ade7 Compare June 7, 2023 16:01
@problame problame force-pushed the problame/async-timeline-get/timeline-layers-tokio-sync-atop-4364 branch from 928e245 to acd8996 Compare June 7, 2023 16:01
@problame problame force-pushed the dont-hold-timelines-lock-over-load_layer_map branch from bc5ade7 to d46abaa Compare June 12, 2023 10:18
Base automatically changed from dont-hold-timelines-lock-over-load_layer_map to main June 12, 2023 10:56
@problame problame force-pushed the problame/async-timeline-get/timeline-layers-tokio-sync-atop-4364 branch from acd8996 to cc272c2 Compare June 12, 2023 12:09
problame and others added 5 commits June 12, 2023 18:13
... by switching the internal RwLock to a OnceCell.

This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).

See #4462 (comment)
for more context.

fixes #4471
…ers (#4476)

This is preliminary work for/from #4220 (async
`Layer::get_value_reconstruct_data`).

There, we want to switch `Timeline::layers` to be a
`tokio::sync::RwLock`.

That will require the `TimelineWriter` to become async.

That will require `freeze_inmem_layer` to become async.

So, inside check_checkpoint_distance, we will have
`freeze_inmem_layer().await`.

But current rustc isn't smart enough to understand that we
`drop(layers)` earlier, and hence, will complain about the `!Send`
`layers` being held across the `freeze_inmem_layer().await`-point.

This patch puts the guard into a scope, so rustc will shut up in the
next patch where we make the transition for `TimelineWriter`.

obsoletes #4474
fails with:

```
error[E0277]: `std::sync::MutexGuard<'_, ()>` cannot be sent between threads safely
    --> pageserver/src/tenant/timeline.rs:4644:20
     |
4644 |     _assert_send::<TimelineWriter<'_>>();
     |                    ^^^^^^^^^^^^^^^^^^ `std::sync::MutexGuard<'_, ()>` cannot be sent between threads safely
     |
     = help: within `tenant::timeline::TimelineWriter<'_>`, the trait `std::marker::Send` is not implemented for `std::sync::MutexGuard<'_, ()>`
note: required because it appears within the type `TimelineWriter<'_>`
    --> pageserver/src/tenant/timeline.rs:4596:12
     |
4596 | pub struct TimelineWriter<'a> {
     |            ^^^^^^^^^^^^^^
note: required by a bound in `tenant::timeline::is_send::_assert_send`
    --> pageserver/src/tenant/timeline.rs:4643:24
     |
4643 |     fn _assert_send<T: Send>() {}
     |                        ^^^^ required by this bound in `_assert_send`

For more information about this error, try `rustc --explain E0277`.
error: could not compile `pageserver` due to previous error
```
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).
Stacked on top of  #4477.

Thanks to the countless preliminary PRs, this conversion is relatively straight-forward.

Note that this commit partially reverts

   "pgserver: spawn_blocking in compaction (#4265)"
    4e359db.

because, if we didn't revert it, we'd have to use
`Timeline::layers.blocking_read()` inside compact_level0_phase1.
That would use up a thread in the spawn_blocking thread pool.

I considered wrapping the code that follows the second `layers.read().await`
into `spawn_blocking`, but there are lifetime issues with `deltas_to_compact`.
@problame problame force-pushed the problame/async-timeline-get/timeline-layers-tokio-sync-atop-4364 branch from cc272c2 to a05df95 Compare June 12, 2023 16:35
@problame problame changed the base branch from main to problame/async-timeline-get/make-timelinewriter-send June 12, 2023 16:35
@problame problame changed the title convert Timeline::layers to async RwLock atop #4364 turn Timeline::layers into tokio::sync::RwLock Jun 12, 2023
@problame problame requested a review from skyzh June 12, 2023 16:36
@problame problame requested review from LizardWizzard and shanyp June 12, 2023 16:45
@skyzh (Member) left a comment:
LGTM except the fsync changes. Some of them should be kept as-is because the compaction spawn_blocking PR also fixes some bugs that should not be reverted... 🤪 (See comment at timeline.rs L3269)

Base automatically changed from problame/async-timeline-get/make-timelinewriter-send to main June 13, 2023 08:15
problame added a commit that referenced this pull request Jun 13, 2023
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).
Or more specifically, #4441, where we turn Timeline::layers into a tokio::sync::RwLock.

By using try_write() here, we can avoid turning init_empty_layer_map async,
which is nice because much of its transitive call(er) graph isn't async.
@problame (Contributor, Author) commented:
Ok, so, I re-cherry-picked @skyzh's pgserver: spawn_blocking in compaction (#4265).

To the best of my knowledge, all that it reverts now is the spawn_blocking-ification of compact_level0_phase1.

So, we're doing sync IO again inside the async compact_level0_phase1.
I'll try to eliminate that by wrapping the writes into block_in_place or spawn_blocking.
Will push such a commit later.

But what remains is the CPU work of the kmerge.
The lifetimes around that are non-trivial; I think it's better to fix that up / make it spawn_blocking in a later PR.

@problame (Contributor, Author) commented:

I'll try to eliminate that by wrapping the writes into block_in_place or spawn_blocking.
Will push such a commit later

Scoped this out. I don't think it's a good idea to wrap just the put_value call into a spawn_blocking, because we only do IO (or CPU-"intensive" prefix compression) in a small fraction of the overall put_value calls.

So, I'm leaning toward just taking the regression here.
Let's quantify the regression.

@skyzh do you have a benchmark / numbers that you ran when you implemented the spawn_blocking-ification?

@skyzh (Member) commented Jun 13, 2023:

do you have a benchmark / numbers that you ran when you implemented the spawn_blocking-ification?

Unluckily no :( We can look at the daily benchmark number after merging this PR.

@LizardWizzard (Contributor) left a comment:
LGTM

Discussed follow-ups verbally.

@problame problame merged commit 3693d1f into main Jun 13, 2023
@problame problame deleted the problame/async-timeline-get/timeline-layers-tokio-sync-atop-4364 branch June 13, 2023 16:38
Review comment (Contributor) on the hand-rolled retain in `DatadirModification::flush`, diff context:

```rust
if result.is_ok() && (is_rel_block_key(key) || is_slru_block_key(key)) {
    result = writer.put(key, self.lsn, value);
    false
let mut retained_pending_updates = HashMap::new();
```

use reserve to avoid reallocations

awestover pushed a commit that referenced this pull request Jun 14, 2023
awestover pushed a commit that referenced this pull request Jun 14, 2023
problame added a commit that referenced this pull request Jun 16, 2023
We already do it inside `frozen_layer.write_to_disk()`.

Context:
#4441 (comment)
problame added a commit that referenced this pull request Jun 26, 2023
This PR concludes the "async `Layer::get_value_reconstruct_data`"
project.

The problem we're solving is that, before this patch, we'd execute
`Layer::get_value_reconstruct_data` on the tokio executor threads.
This function is IO- and/or CPU-intensive.
The IO is using VirtualFile / std::fs; hence it's blocking.
This results in unfairness towards other tokio tasks, especially under
(disk) load.

Some context can be found at
#4154
where I suspect (but can't prove) load spikes of logical size
calculation to
cause heavy eviction skew.

Sadly we don't have tokio runtime/scheduler metrics to quantify the
unfairness.
But generally, we know blocking the executor threads on std::fs IO is
bad.
So, let's have this change and watch out for severe perf regressions in
staging & during rollout.

## Changes

* rename `Layer::get_value_reconstruct_data` to
`Layer::get_value_reconstruct_data_blocking`
* add a new blanket impl'd `Layer::get_value_reconstruct_data`
`async_trait` method that runs `get_value_reconstruct_data_blocking`
inside `spawn_blocking`.
* The `spawn_blocking` requires `'static` lifetime of the captured
variables; hence I had to change the data flow to _move_ the
`ValueReconstructState` into and back out of get_value_reconstruct_data
instead of passing a reference. It's a small struct, so I don't expect a
big performance penalty.

## Performance

Fundamentally, the code changes cause the following performance-relevant
changes:

* Latency & allocations: each `get_value_reconstruct_data` call now
makes a short-lived allocation because `async_trait` is just sugar for
boxed futures under the hood
* Latency: `spawn_blocking` adds some latency because it needs to move
the work to a thread pool
* using `spawn_blocking` plus the existing synchronous code inside is
probably more efficient than switching all the synchronous code to
tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under
the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the
async executor thread pool. Hence, as long as the disks can keep up,
which they should according to AWS specs, we will be able to deliver
higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get
more throughput. Not a problem because the disks in prod are currently
under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.

Slightly higher latency under regular load is acceptable given the
throughput gains and expected better fairness during disk load peaks,
such as logical size calculation peaks uncovered in #4154.


## Full Stack Of Preliminary PRs

This PR builds on top of the following preliminary PRs

1. Clean-ups
  * #4316
  * #4317
  * #4318
  * #4319
  * #4321
* Note: these were mostly to find an alternative to #4291, which I
thought we'd need in my original plan where we would need to convert
`Tenant::timelines` into an async locking primitive (#4333). In reviews,
we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485
7. #4441