
refactor TenantState transitions #4321

Conversation


@problame problame commented May 23, 2023

Motivation

This is preliminary work for/from #4220 (async Layer::get_value_reconstruct_data).
The motivation is to avoid locking Tenant::timelines in places that can't be async, because in #4333 we want to convert Tenant::timelines from std::sync::Mutex to tokio::sync::Mutex.

But, the changes here are useful in general because they clean up & document tenant state transitions.
That also paves the way for #4350, which is an alternative to #4333 that refactors the pageserver code so that we can keep the Tenant::timelines mutex sync.

Changes

This patch consists of the following core insights and changes:

  • spawn_load and spawn_attach own the tenant state until they're done
  • once the load()/attach() calls are done ...
    • if they failed, transition the tenant to Broken directly (we know there's no background activity because we didn't call activate yet)
    • if they succeed, call activate. We can make it infallible. How? Later.
    • set_broken() and set_stopping() are changed to wait for spawn_load() / spawn_attach() to finish.
    • This sounds scary because it might hinder detach or shutdown, but concurrent attach+detach, attach+shutdown, or load+shutdown were simply racy before this PR.
      So, with this change, they aren't anymore.
      In the future, we can add a CancellationToken stored in Tenant to cancel load and attach faster, i.e., make spawn_load / spawn_attach transition the tenant to the Broken state sooner.

See the doc comments on TenantState for the state transitions that are now possible.
It might seem scary, but this patch actually reduces the set of possible state transitions.

We introduce a new state TenantState::Activating to avoid grabbing the Tenant::timelines lock inside the send_modify closure.
These were the humble beginnings of this PR (see the Motivation section), and I think having this Activating state is still the right call, even if we decide against an async Tenant::timelines mutex. The reason: send_modify locks internally, so by moving the locking of Tenant::timelines out of the closure, the internal lock of send_modify becomes a leaf of the lock graph, and we eliminate the deadlock risk.
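To make the lock-graph argument concrete, here is a minimal sketch of the pattern (hypothetical, heavily simplified types; the real ones live in the pageserver crate):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::watch;

// Hypothetical, simplified stand-ins for the real pageserver types.
#[derive(Clone, Debug)]
enum TenantState {
    Loading,
    Activating,
    Active,
}

struct Tenant {
    state: watch::Sender<TenantState>,
    timelines: Mutex<HashMap<u64, ()>>, // simplified
}

impl Tenant {
    fn activate(&self) {
        // 1. Flip to Activating first. The closure only touches the state
        //    value, so send_modify's internal lock stays a leaf of the
        //    lock graph.
        self.state.send_modify(|s| *s = TenantState::Activating);

        // 2. Do the work that needs Tenant::timelines *outside* the
        //    send_modify closure.
        {
            let timelines = self.timelines.lock().unwrap();
            for (_id, _timeline) in timelines.iter() {
                // activate each timeline here ...
            }
        }

        // 3. Only then announce Active.
        self.state.send_modify(|s| *s = TenantState::Active);
    }
}
```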

Patch Stack

This PR sits on top of

because otherwise we'd need to deal with the fallibility of Timeline::activate() inside the `if activating {` branch (e.g., stopping already-started background loops, etc).

problame added 7 commits May 23, 2023 19:25
This tightens up the API a little.
Byproduct of some refactoring work that I'm doing right now.
(This is prep work to make `Timeline::activate()` infallible.)

The current possibility for failure in `Timeline::activate()`
is the broker client's presence / absence. It should be an assert, but
we're careful with these. So, I'm planning to pass in the broker
client to activate(), thereby eliminating the possibility of its absence.

In the unit tests, we don't have a broker client. So, I thought I'd be
in trouble because the unit tests also called `activate()` before this
PR.

However, closer inspection reveals a long-standing FIXME about this,
which is addressed by this patch.

It turns out that the unit tests don't actually need the background
loops to be running. They just need the state value to be `Active`.
So, for the tests, we just set it to that value but don't spawn
the background loops.

We'll need to revisit this if we ever do more Rust unit tests in the
future. But right now, this refactoring improves the code, so, let's
revisit when we get there.
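For illustration, a rough sketch of that test-only shortcut (the helper name and the watch-channel representation are assumptions, not the actual pageserver code):

```rust
use tokio::sync::watch;

// Hypothetical, simplified stand-in for the real timeline state.
#[derive(Clone, Debug)]
enum TimelineState {
    Loading,
    Active,
}

// Test-only helper: the unit tests just need the observable state to be
// Active; they deliberately do not spawn the background loops.
fn set_active_for_tests(state: &watch::Sender<TimelineState>) {
    state.send_modify(|s| *s = TimelineState::Active);
}
```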
(This is prep work to make `Timeline::activate` infallible.)

This patch removes the global storage_broker client instance from the
pageserver codebase.

Instead, pageserver startup instantiates it and passes it down to
the `Timeline::activate` function, which in turn passes it to
the WalReceiver, which is the entity that actually uses it.
Timeline::activate() was only fallible because `launch_wal_receiver` was.

`launch_wal_receiver` was fallible only because of some preliminary
checks in `WalReceiver::start`.

Turns out these checks can be shifted to the type system by delaying
creation of the `WalReceiver` struct to the point where we activate the timeline.

The changes in this PR were enabled by my previous refactoring that funneled
the broker_client from pageserver startup to the activate() call sites.
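Here's a hedged sketch of that "shift the checks to the type system" idea; the struct and method shapes below are simplified stand-ins, not the actual definitions:

```rust
// Hypothetical, simplified stand-ins; the real types live in the pageserver
// and walreceiver modules.
struct BrokerClient;
struct WalReceiver {
    broker_client: BrokerClient,
}

struct Timeline {
    walreceiver: Option<WalReceiver>,
}

impl Timeline {
    // Infallible: everything the wal receiver needs is a parameter, so a
    // missing broker client is a compile error at the call site rather than
    // a runtime check inside WalReceiver::start.
    fn activate(&mut self, broker_client: BrokerClient) {
        self.walreceiver = Some(WalReceiver { broker_client });
        // launch the wal receiver task using self.walreceiver here ...
    }
}
```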

github-actions bot commented May 23, 2023

1004 tests run: 963 passed, 0 failed, 41 skipped (full report)


Flaky tests (1)

Postgres 15

The comment gets automatically updated with the latest test results
d6d0c74 at 2023-05-29T11:22:49.845Z :recycle:

problame added 4 commits May 23, 2023 20:42
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
@problame problame marked this pull request as ready for review May 23, 2023 19:02
@problame problame requested review from a team as code owners May 23, 2023 19:02
@problame problame requested review from save-buffer, skyzh, koivunej and a team and removed request for a team, save-buffer and skyzh May 23, 2023 19:02
pageserver/src/tenant.rs — outdated review thread (resolved)
problame added 3 commits May 24, 2023 11:31
…imeline-activate/2-pushup-tenant-and-timeline-activation
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
@koivunej koivunej dismissed their stale review May 26, 2023 16:56

stale

@problame
Contributor Author

Again, I would recommend https://github.com/orgs/neondatabase/projects/38/views/6 for a view on all the in-flight PRs.

koivunej added 4 commits May 26, 2023 20:56
the panic hook will include the current span, so we will not lose it.
report the total of observed panics as a number if any.
because that's where the span is.
@koivunej koivunej force-pushed the problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify branch from 745e3d4 to fbf94c0 Compare May 26, 2023 17:56
@problame problame left a comment

Changes on top LGTM. @koivunej please rerun tests to pass and merge, or get another review from a third party on Monday.

@koivunej
Member

Yep, thanks. I don't think those latest changes are interesting enough. I'll approve next.


koivunej commented May 29, 2023

It seems there's an actual test failure:

test_timeline_resurrection_on_attach[True-local_fs]: debug with:

AssertionError: assert not ['2023-05-29T07:51:54.527416Z ERROR request{method=GET path=/v1/tenant/bdc883cb95af183d3c72095b938eeb95 request_id=2098afa1-e542-4213-acae-94ff27cfd86f}:tenant_status_handler{tenant=bdc883cb95af183d3c72095b938eeb95}:panic{thread=mgmt request worker location=libs/pageserver_api/src/models.rs:106:33}: not yet implemented\n']

It does seem this was changed in this PR; I think we need a better way for this. Link to conversation: #4321 (comment). Also noted on issue #4344 that we shouldn't merge this and cause havoc for the tests.

Confirmed that the flakiness is just because load can take a long time; we get to the tenant status query before load completes, so the todo!() was hit. Fixing this soon regardless.

Fixed with d6d0c74

koivunej added 2 commits May 29, 2023 13:34
it was never used (`"".parse::<TenantState>()`) but it would have performed
poorly; it uses Default::default on any contained fields within
variants.
this is in line with the idea of the state being transparent or RwLock-like,
technical in nature.
@koivunej koivunej merged commit f4f3007 into main May 29, 2023
@koivunej koivunej deleted the problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify branch May 29, 2023 14:52
problame added a commit that referenced this pull request Jun 13, 2023
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).

# Full Stack Of Preliminary PRs

Thanks to the countless preliminary PRs, this conversion is relatively
straight-forward.

1. Clean-ups
  * #4316
  * #4317
  * #4318
  * #4319
  * #4321
* Note: these were mostly to find an alternative to #4291, which I
   thought we'd need in my original plan where we would need to convert
   `Tenant::timelines` into an async locking primitive (#4333). In reviews,
   we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485

# Significant Changes In This PR

## `compact_level0_phase1` & `create_delta_layer`

This commit partially reverts

   "pgserver: spawn_blocking in compaction (#4265)"
    4e359db.

Specifically, it reverts the `spawn_blocking`-ification of
`compact_level0_phase1`.
If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()`
inside `compact_level0_phase1`. That would use up a thread in the
`spawn_blocking` thread pool, which is hard-capped.

I considered wrapping the code that follows the second
`layers.read().await` into `spawn_blocking`, but there are lifetime
issues with `deltas_to_compact`.

Also, this PR switches the `create_delta_layer` _function_ back to
async, and uses `spawn_blocking` inside to run the code that does sync
IO, while keeping the code that needs to lock `Timeline::layers` async.
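Roughly, the new shape of `create_delta_layer` looks like this sketch (types and helper names are simplified stand-ins, not the actual pageserver definitions):

```rust
use std::sync::Arc;
use tokio::sync::RwLock;
use tokio::task::spawn_blocking;

// Hypothetical stand-ins for the layer map and layer types.
struct LayerMap;
struct InMemoryLayer;
struct DeltaLayer;

impl InMemoryLayer {
    fn write_to_disk(&self) -> std::io::Result<DeltaLayer> {
        // synchronous VirtualFile / std::fs IO happens here
        Ok(DeltaLayer)
    }
}

// The function itself stays async so it can take the layer-map lock with
// .await, while the sync IO runs on the spawn_blocking pool.
async fn create_delta_layer(
    layers: &RwLock<LayerMap>,
    frozen: Arc<InMemoryLayer>,
) -> std::io::Result<DeltaLayer> {
    let new_delta = spawn_blocking(move || frozen.write_to_disk())
        .await
        .expect("spawn_blocking task panicked")?;

    // keep the Timeline::layers lock acquisition async
    let _layers = layers.write().await;
    // insert new_delta into the layer map here ...
    Ok(new_delta)
}
```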

## `LayerIter` and `LayerKeyIter` `Send` bounds

I had to add a `Send` bound on the `dyn` type that `LayerIter`
and `LayerKeyIter` wrap. Why? Because we now have the second
`layers.read().await` inside `compact_level0_phase1`, and these
iterator instances are held across that await-point.
More background:
#4462 (comment)
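In other words, something like the following (the item type is a placeholder; the real iterators yield key/LSN data from the layer files):

```rust
// The compaction future is held across the second layers.read().await and is
// spawned onto a multi-threaded runtime, so everything alive across that
// await-point, including this boxed iterator, must be Send.
pub struct LayerIter<'a>(pub Box<dyn Iterator<Item = u64> + Send + 'a>);
```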

## `DatadirModification::flush`

Needed to replace the `HashMap::retain` with a hand-rolled variant
because `TimelineWriter::put` is now async.
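Sketch of that hand-rolled variant (the writer type and the flush predicate are hypothetical stand-ins):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real writer and pending-updates map.
struct TimelineWriter;
impl TimelineWriter {
    async fn put(&self, _key: u64, _value: Vec<u8>) {
        // the real put is now async, so it cannot be called from a
        // synchronous retain() closure
    }
}

// HashMap::retain takes a synchronous closure, so the equivalent here is:
// drain the map, await put() for the entries we want to flush, and
// re-insert the rest.
async fn flush_pending(
    pending: &mut HashMap<u64, Vec<u8>>,
    writer: &TimelineWriter,
    should_flush: impl Fn(&u64) -> bool,
) {
    let mut keep = HashMap::new();
    for (key, value) in pending.drain() {
        if should_flush(&key) {
            writer.put(key, value).await;
        } else {
            keep.insert(key, value);
        }
    }
    *pending = keep;
}
```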
awestover pushed a commit that referenced this pull request Jun 14, 2023
problame added a commit that referenced this pull request Jun 26, 2023
This PR concludes the "async `Layer::get_value_reconstruct_data`"
project.

The problem we're solving is that, before this patch, we'd execute
`Layer::get_value_reconstruct_data` on the tokio executor threads.
This function is IO- and/or CPU-intensive.
The IO is using VirtualFile / std::fs; hence it's blocking.
This results in unfairness towards other tokio tasks, especially under
(disk) load.

Some context can be found at #4154, where I suspect (but can't prove) that
load spikes of logical size calculation cause heavy eviction skew.

Sadly we don't have tokio runtime/scheduler metrics to quantify the
unfairness.
But generally, we know blocking the executor threads on std::fs IO is
bad.
So, let's have this change and watch out for severe perf regressions in
staging & during rollout.

## Changes

* rename `Layer::get_value_reconstruct_data` to
`Layer::get_value_reconstruct_data_blocking`
* add a new blanket impl'd `Layer::get_value_reconstruct_data`
`async_trait` method that runs `get_value_reconstruct_data_blocking`
inside `spawn_blocking`.
* `spawn_blocking` requires a `'static` lifetime for the captured
variables; hence I had to change the data flow to _move_ the
`ValueReconstructState` into and back out of get_value_reconstruct_data
instead of passing a reference. It's a small struct, so I don't expect a
big performance penalty (see the sketch below).
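A minimal sketch of the resulting shape, using a concrete struct and a free function instead of the real blanket `async_trait` method (all names here are simplified stand-ins):

```rust
use std::sync::Arc;
use tokio::task::spawn_blocking;

// Hypothetical, simplified stand-ins for the real types.
#[derive(Default)]
struct ValueReconstructState;

struct DeltaLayer;

impl DeltaLayer {
    // The pre-existing synchronous code, renamed with the _blocking suffix.
    fn get_value_reconstruct_data_blocking(
        &self,
        state: ValueReconstructState,
    ) -> ValueReconstructState {
        // ... VirtualFile / std::fs reads happen here ...
        state
    }
}

// Async wrapper: spawn_blocking needs 'static captures, so the layer is an
// owned Arc and the ValueReconstructState is moved in and back out instead
// of being passed by reference.
async fn get_value_reconstruct_data(
    layer: Arc<DeltaLayer>,
    state: ValueReconstructState,
) -> ValueReconstructState {
    spawn_blocking(move || layer.get_value_reconstruct_data_blocking(state))
        .await
        .expect("spawn_blocking task panicked")
}
```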

## Performance

Fundamentally, the code changes cause the following performance-relevant
changes:

* Latency & allocations: each `get_value_reconstruct_data` call now
makes a short-lived allocation because `async_trait` is just sugar for
boxed futures under the hood
* Latency: `spawn_blocking` adds some latency because it needs to move
the work to a thread pool
* using `spawn_blocking` plus the existing synchronous code inside is
probably more efficient than switching all the synchronous code
to tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under
the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the
async executor thread pool. Hence, as long as the disks can keep up,
which they should according to AWS specs, we will be able to deliver
higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get
more throughput. Not a problem because the disks in prod are currently
under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.

Slightly higher latency under regular load is acceptable given the
throughput gains and expected better fairness during disk load peaks,
such as logical size calculation peaks uncovered in #4154.


## Full Stack Of Preliminary PRs

This PR builds on top of the following preliminary PRs

1. Clean-ups
  * #4316
  * #4317
  * #4318
  * #4319
  * #4321
* Note: these were mostly to find an alternative to #4291, which I
thought we'd need in my original plan where we would need to convert
`Tenant::timelines` into an async locking primitive (#4333). In reviews,
we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485
7. #4441