Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert Timeline::layers to async RwLock atop #4321 #4360

Conversation

problame
Copy link
Contributor

@problame problame commented May 26, 2023

convert Timeline::layers to async RwLock atop #4321

This implements @hlinnaka 's suggestion from #4333 (comment)

problame added 30 commits May 23, 2023 19:25
This tightens up the API a little.
Byproduct of some refactoring work that I'm doing right now.
(This is prep work to make `Timeline::activate()` infallible.)

The current possibility for failure in `Timeline::activate()`
is the broker client's presence / absence. It should be an assert, but
we're careful with these. So, I'm planning to pass in the broker
client to activate(), thereby eliminating the possiblity of its absence.

In the unit tests, we don't have a broker client. So, I thought I'd be
in trouble because the unit tests also called `activate()` before this
PR.

However, closer inspection reveals a long-standing FIXME about this,
which is addressed by this patch.

It turns out that the unit tests don't actually need the background
loops to be running. They just need the state value to be `Active`.
So, for the tests, we just set it to that value but don't spawn
the background loops.

We'll need to revisit this if we ever do more Rust unit tests in the
future. But right now, this refactoring improves the code, so, let's
revisit when we get there.
(This is prep work to make `Timeline::activate` infallible.)

This patch removes the global storage_broker client instance from the
pageserver codebase.

Instead, pageserver startup instantiates it and passes it down to
the `Timeline::activate` function, which in turn passes it to
the WalReceiver, which is the entity that actually uses it.
Timeline::activate() was only fallible because `launch_wal_receiver` was.

`launch_wal_receiver` was fallible only because of some preliminary
checks in `WalReceiver::start`.

Turns out these checks can be shifted to the type system by delaying
creatinon of the `WalReceiver` struct to the point where we activate the timeline.

The changes in this PR were enabled by my previous refactoring that funneled
the broker_client from pageserver startup to the activate() call sites.
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…imeline-activate/2-pushup-tenant-and-timeline-activation
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
It's a wrapper around an inner Arc anyways

Also, this gets rid of the OnceCell
…imeline-activate/3-funnel-storage-broker-client
…ivate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible
not_broken_timelines is an iterator, doesn't have `len()`.

This reverts commit 4001f44.
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…imeline-activate/3-funnel-storage-broker-client
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
problame added 19 commits May 25, 2023 14:53
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…/ spawn_attach

See the Mermaid diagram in the doc comment for the now-possible state transitions.

The two core insights / changes are:
- spawn_load and spawn_attach own the tenant state until they're done
- once load()/attach() calls are done
    - if they failed, transition them to Broken directly (we know
      that there's no background activity because we didn't call activate yet)
    - if they succeed, call activate. We can make it infallible. How? Later.

- set_broken() and set_stopping() are changed to wait for spawn_load() /
  spawn_attach() to finish. This sounds scary because it might hinder
  detach or shutdown, but actually, concurrent attach+detach, or
  attach+shutdown, or load+shutdown, or attach+shutdown were just racy.
  With this change, they're not anymore.
  We can add a CancellationToken stored in Tenant for load/attach and cancel
  it from set_stopping() or set_broken() if necessary in the future.

So, why can activate() be infallible now: because we declare that
spawn_load and spawn_attach own the tenant state until they're done.
And we enforce that ownership using the wait_for at the start of
set_stopping and set_broken.
…ne-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…ne-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…also info!()

This should fix the faile regress tests that barked on allowed_errors
…_map

This patch inlines `initialize_with_lock` and then reorganizes the code
such that we can `load_layer_map` without holding the
`Tenant::timelines` lock.

As a nice aside, we can get rid of the dummy() uninit mark, which has
always been a terrible hack.
…ct_level0_phase1

Without this, the seocnd read().unwrap() becomes an await point,
which makes the future not-Send, but, we require it to be Send
because it runs inside task_mgr::spawn, which requires the Fut's to be Send

(cherry picked from commit a1ae23b)
The checkpointer loop isn't running anyway, so, there's no risk of
blocking it through the pre-lock.

(cherry picked from commit 1b26633)
(cherry picked from commit a1680b1)
…utex internally

fails with

cs@devvm:[~/src/neon]: cargo check -p pageserver  --features testing
    Checking pageserver v0.1.0 (/home/cs/src/neon/pageserver)
error: future cannot be sent between threads safely
   --> pageserver/src/tenant/timeline/walreceiver/connection_manager.rs:426:33
    |
426 |         let connection_handle = TaskHandle::spawn(move |events_sender, cancellation| {
    |                                 ^^^^^^^^^^^^^^^^^ future created by async block is not `Send`
    |
    = help: within `Instrumented<[async block@pageserver/src/tenant/timeline/walreceiver/connection_manager.rs:427:13: 439:14]>`, the trait `std::marker::Send` is not implemented for `std::sync::RwLockReadGuard<'_, LayerMap<dyn PersistentLayer>>`
note: future is not `Send` as this value is used across an await
   --> pageserver/src/tenant/timeline.rs:872:46
    |
850 |         let layers = self.layers.read().unwrap();
    |             ------ has type `std::sync::RwLockReadGuard<'_, LayerMap<dyn PersistentLayer>>` which is not `Send`
...
872 |                 self.freeze_inmem_layer(true).await;
    |                                              ^^^^^^ await occurs here, with `layers` maybe used later
...
881 |     }
    |     - `layers` is later dropped here
note: required by a bound in `TaskHandle::<E>::spawn`
   --> pageserver/src/tenant/timeline/walreceiver.rs:196:52
    |
192 |     fn spawn<Fut>(
    |        ----- required by a bound in this
...
196 |         Fut: Future<Output = anyhow::Result<()>> + Send,
    |                                                    ^^^^ required by this bound in `TaskHandle::<E>::spawn`

error: could not compile `pageserver` due to previous error
(cherry picked from commit 7de3799)
@problame problame changed the base branch from main to problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify May 26, 2023 18:20
let mut level0_deltas = layers.get_level0_deltas()?;
drop(layers);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to understand why this needed to be dropped.

Copy link
Member

@koivunej koivunej May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thread in re: 24df184.

Oki the problem becomes all_keys_iter and all_values_iter are not Send. And they are not Send because they hold the internal to delta layer RwLockGuard.

I don't think this is so controversial.

The send requirement has to be at task_mgr::spawn because it's at tokio::spawn. Theoretically we could use a LocalSet but this is sort of "on the way" to make Layer internals handle evictedness, so I think this is fine.

Also: rust type system saves the day.

@github-actions
Copy link

1004 tests run: 961 passed, 2 failed, 41 skipped (full report)


Failures on Posgres 15

  • test_close_on_connections_exit: debug
# Run failed on Postgres 15 tests locally:
DEFAULT_PG_VERSION=15 scripts/pytest -k "test_close_on_connections_exit[debug-pg15]"

Failures on Posgres 14

  • test_remote_timeline_client_calls_started_metric[local_fs]: release
# Run failed on Postgres 14 tests locally:
DEFAULT_PG_VERSION=14 scripts/pytest -k "test_remote_timeline_client_calls_started_metric[release-pg14-local_fs]"
The comment gets automatically updated with the latest test results
218af17 at 2023-05-26T18:41:57.293Z :recycle:

Base automatically changed from problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify to main May 29, 2023 14:52
@problame
Copy link
Contributor Author

problame commented Jun 7, 2023

Heikki extracted a few preliminary commits in this PR into #4364 , then added stuff on top.

I then rebased the remaining changes from this PR onto #4364 , resulting in #4441

So, this PR is obsolete now in favor of #4441

@problame problame closed this Jun 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants