-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
convert Timeline::layers
to async RwLock atop #4321
#4360
convert Timeline::layers
to async RwLock atop #4321
#4360
Conversation
This tightens up the API a little. Byproduct of some refactoring work that I'm doing right now.
(This is prep work to make `Timeline::activate()` infallible.) The current possibility for failure in `Timeline::activate()` is the broker client's presence / absence. It should be an assert, but we're careful with these. So, I'm planning to pass in the broker client to activate(), thereby eliminating the possiblity of its absence. In the unit tests, we don't have a broker client. So, I thought I'd be in trouble because the unit tests also called `activate()` before this PR. However, closer inspection reveals a long-standing FIXME about this, which is addressed by this patch. It turns out that the unit tests don't actually need the background loops to be running. They just need the state value to be `Active`. So, for the tests, we just set it to that value but don't spawn the background loops. We'll need to revisit this if we ever do more Rust unit tests in the future. But right now, this refactoring improves the code, so, let's revisit when we get there.
(This is prep work to make `Timeline::activate` infallible.) This patch removes the global storage_broker client instance from the pageserver codebase. Instead, pageserver startup instantiates it and passes it down to the `Timeline::activate` function, which in turn passes it to the WalReceiver, which is the entity that actually uses it.
Timeline::activate() was only fallible because `launch_wal_receiver` was. `launch_wal_receiver` was fallible only because of some preliminary checks in `WalReceiver::start`. Turns out these checks can be shifted to the type system by delaying creatinon of the `WalReceiver` struct to the point where we activate the timeline. The changes in this PR were enabled by my previous refactoring that funneled the broker_client from pageserver startup to the activate() call sites.
…s lock inside Tenant::activate
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…imeline-activate/2-pushup-tenant-and-timeline-activation
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
It's a wrapper around an inner Arc anyways Also, this gets rid of the OnceCell
…imeline-activate/3-funnel-storage-broker-client
…ivate/3-funnel-storage-broker-client' into problame/infallible-timeline-activate/4-make-infallible
…where it's produced
not_broken_timelines is an iterator, doesn't have `len()`. This reverts commit 4001f44.
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…imeline-activate/3-funnel-storage-broker-client
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
This reverts commit fe4ef12.
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
This reverts commit eaf270c.
…/ spawn_attach See the Mermaid diagram in the doc comment for the now-possible state transitions. The two core insights / changes are: - spawn_load and spawn_attach own the tenant state until they're done - once load()/attach() calls are done - if they failed, transition them to Broken directly (we know that there's no background activity because we didn't call activate yet) - if they succeed, call activate. We can make it infallible. How? Later. - set_broken() and set_stopping() are changed to wait for spawn_load() / spawn_attach() to finish. This sounds scary because it might hinder detach or shutdown, but actually, concurrent attach+detach, or attach+shutdown, or load+shutdown, or attach+shutdown were just racy. With this change, they're not anymore. We can add a CancellationToken stored in Tenant for load/attach and cancel it from set_stopping() or set_broken() if necessary in the future. So, why can activate() be infallible now: because we declare that spawn_load and spawn_attach own the tenant state until they're done. And we enforce that ownership using the wait_for at the start of set_stopping and set_broken.
…ne-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…ne-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
…also info!() This should fix the faile regress tests that barked on allowed_errors
…_map This patch inlines `initialize_with_lock` and then reorganizes the code such that we can `load_layer_map` without holding the `Tenant::timelines` lock. As a nice aside, we can get rid of the dummy() uninit mark, which has always been a terrible hack.
…:timelines held code
…ct_level0_phase1 Without this, the seocnd read().unwrap() becomes an await point, which makes the future not-Send, but, we require it to be Send because it runs inside task_mgr::spawn, which requires the Fut's to be Send (cherry picked from commit a1ae23b)
…utex internally fails with cs@devvm:[~/src/neon]: cargo check -p pageserver --features testing Checking pageserver v0.1.0 (/home/cs/src/neon/pageserver) error: future cannot be sent between threads safely --> pageserver/src/tenant/timeline/walreceiver/connection_manager.rs:426:33 | 426 | let connection_handle = TaskHandle::spawn(move |events_sender, cancellation| { | ^^^^^^^^^^^^^^^^^ future created by async block is not `Send` | = help: within `Instrumented<[async block@pageserver/src/tenant/timeline/walreceiver/connection_manager.rs:427:13: 439:14]>`, the trait `std::marker::Send` is not implemented for `std::sync::RwLockReadGuard<'_, LayerMap<dyn PersistentLayer>>` note: future is not `Send` as this value is used across an await --> pageserver/src/tenant/timeline.rs:872:46 | 850 | let layers = self.layers.read().unwrap(); | ------ has type `std::sync::RwLockReadGuard<'_, LayerMap<dyn PersistentLayer>>` which is not `Send` ... 872 | self.freeze_inmem_layer(true).await; | ^^^^^^ await occurs here, with `layers` maybe used later ... 881 | } | - `layers` is later dropped here note: required by a bound in `TaskHandle::<E>::spawn` --> pageserver/src/tenant/timeline/walreceiver.rs:196:52 | 192 | fn spawn<Fut>( | ----- required by a bound in this ... 196 | Fut: Future<Output = anyhow::Result<()>> + Send, | ^^^^ required by this bound in `TaskHandle::<E>::spawn` error: could not compile `pageserver` due to previous error (cherry picked from commit 7de3799)
(cherry picked from commit 2001c31)
let mut level0_deltas = layers.get_level0_deltas()?; | ||
drop(layers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll need to understand why this needed to be dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This thread in re: 24df184.
Oki the problem becomes all_keys_iter
and all_values_iter
are not Send. And they are not Send because they hold the internal to delta layer RwLockGuard.
I don't think this is so controversial.
The send requirement has to be at task_mgr::spawn
because it's at tokio::spawn
. Theoretically we could use a LocalSet but this is sort of "on the way" to make Layer internals handle evictedness, so I think this is fine.
Also: rust type system saves the day.
1004 tests run: 961 passed, 2 failed, 41 skipped (full report)Failures on Posgres 15
Failures on Posgres 14
The comment gets automatically updated with the latest test results
218af17 at 2023-05-26T18:41:57.293Z :recycle: |
convert Timeline::layers to async RwLock atop #4321
This implements @hlinnaka 's suggestion from #4333 (comment)