refactor TenantState transitions #4321
Conversation
This tightens up the API a little. Byproduct of some refactoring work that I'm doing right now.
(This is prep work to make `Timeline::activate()` infallible.)

The current possibility for failure in `Timeline::activate()` is the broker client's presence / absence. It should be an assert, but we're careful with those. So, I'm planning to pass the broker client into `activate()`, thereby eliminating the possibility of its absence.

In the unit tests, we don't have a broker client. So, I thought I'd be in trouble, because the unit tests also called `activate()` before this PR. However, closer inspection reveals a long-standing FIXME about this, which is addressed by this patch. It turns out that the unit tests don't actually need the background loops to be running; they just need the state value to be `Active`. So, for the tests, we just set it to that value but don't spawn the background loops.

We'll need to revisit this if we ever write more Rust unit tests in the future. But right now, this refactoring improves the code, so let's revisit when we get there.
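The test-only shortcut described above can be sketched as follows. This is a minimal, hypothetical model, not the real pageserver types: `Tenant`, `set_state_for_tests`, and the boolean flag are illustrative stand-ins for the actual state machinery and background tasks.

```rust
// Simplified model: tests only need the observable state to be `Active`;
// the background loops are irrelevant to them, so a test-only helper sets
// the state directly without spawning anything.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TenantState {
    Loading,
    Active,
    Broken,
}

struct Tenant {
    state: TenantState,
    background_loops_running: bool,
}

impl Tenant {
    fn new() -> Self {
        Tenant {
            state: TenantState::Loading,
            background_loops_running: false,
        }
    }

    /// Production path: spawns background loops, then flips the state.
    fn activate(&mut self) {
        self.background_loops_running = true; // stand-in for spawning tasks
        self.state = TenantState::Active;
    }

    /// Test-only path: just set the state, don't spawn anything.
    fn set_state_for_tests(&mut self, state: TenantState) {
        self.state = state;
    }
}

fn main() {
    let mut t = Tenant::new();
    t.set_state_for_tests(TenantState::Active);
    assert_eq!(t.state, TenantState::Active);
    assert!(!t.background_loops_running);

    let mut prod = Tenant::new();
    prod.activate();
    assert!(prod.background_loops_running);
}
```

The design point is that both paths converge on the same observable `state` value, so code under test cannot tell the difference.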
(This is prep work to make `Timeline::activate` infallible.) This patch removes the global storage_broker client instance from the pageserver codebase. Instead, pageserver startup instantiates it and passes it down to the `Timeline::activate` function, which in turn passes it to the WalReceiver, which is the entity that actually uses it.
`Timeline::activate()` was only fallible because `launch_wal_receiver` was. `launch_wal_receiver` was fallible only because of some preliminary checks in `WalReceiver::start`. It turns out these checks can be shifted to the type system by delaying creation of the `WalReceiver` struct to the point where we activate the timeline. The changes in this PR were enabled by my previous refactoring that funneled the broker_client from pageserver startup to the `activate()` call sites.
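The "shift the check into the type system" move can be illustrated with a hedged sketch. The struct names and fields here are simplified stand-ins, not the real pageserver definitions:

```rust
// A stand-in for the storage_broker client.
struct BrokerClient;

// Before: the client might be absent, so `start` must check at runtime
// and is therefore fallible.
struct WalReceiverBefore {
    broker: Option<BrokerClient>,
}

impl WalReceiverBefore {
    fn start(&self) -> Result<(), &'static str> {
        let _client = self.broker.as_ref().ok_or("no broker client")?;
        Ok(())
    }
}

// After: the struct can only be constructed with a client in hand
// (at timeline activation), so absence is unrepresentable and `start`
// cannot fail for that reason.
struct WalReceiver {
    broker: BrokerClient,
}

impl WalReceiver {
    fn new(broker: BrokerClient) -> Self {
        WalReceiver { broker }
    }

    fn start(&self) {
        let _client = &self.broker; // always present by construction
    }
}

fn main() {
    let before = WalReceiverBefore { broker: None };
    assert!(before.start().is_err());

    let wr = WalReceiver::new(BrokerClient);
    wr.start(); // infallible
}
```

Delaying construction to the point where all inputs exist is what lets the runtime `Result` disappear.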
1004 tests run: 963 passed, 0 failed, 41 skipped (full report)

The comment gets automatically updated with the latest test results: d6d0c74 at 2023-05-29T11:22:49.845Z :recycle:
Again, I would recommend https://github.com/orgs/neondatabase/projects/38/views/6 for a view on all the in-flight PRs.
The panic hook will include the current span, so we will not lose it. Report the total of observed panics as a number, if any.
because that's where the span is.
Force-pushed from 745e3d4 to fbf94c0
Changes on top LGTM. @koivunej please rerun tests to pass and merge, or get another review from a third party on Monday.
Yep, thanks. I don't think those latest changes are interesting enough. I'll approve next.
It seems there's an actual test failure: test_timeline_resurrection_on_attach[True-local_fs]: debug with:
It does seem it was changed in this PR; I think we need a better way for this. Link to conversation: #4321 (comment). Also noted on issue #4344 that we shouldn't merge this and cause havoc for tests. Confirmed that the flakiness is just because load can take a long time; we do get to the tenant query before that. Fixed with d6d0c74.
It was never used (`"".parse::<TenantState>()`), but it would have performed poorly; it uses `Default::default` on any contained fields within variants.
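The problem with that kind of `FromStr` can be shown with a hedged sketch. The enum below is illustrative only, not the real `TenantState` definition; the point is that defaulting variant fields means a parsed value can never carry real information:

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq)]
enum TenantState {
    Active,
    Broken { reason: String },
}

impl FromStr for TenantState {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "Active" => Ok(TenantState::Active),
            // The variant's field is filled with Default::default():
            // the parsed value can never carry an actual reason.
            "Broken" => Ok(TenantState::Broken {
                reason: String::default(),
            }),
            other => Err(format!("unknown state: {other}")),
        }
    }
}

fn main() {
    let s: TenantState = "Broken".parse().unwrap();
    // Information-free result: the reason is always empty.
    assert_eq!(s, TenantState::Broken { reason: String::new() });
    assert!("".parse::<TenantState>().is_err());
}
```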
This is in line with the idea of the state being transparent, or technical in nature like an `RwLock`.
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).

# Full Stack Of Preliminary PRs

Thanks to the countless preliminary PRs, this conversion is relatively straightforward.

1. Clean-ups
   * #4316
   * #4317
   * #4318
   * #4319
   * #4321
   * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan, where we would have needed to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485

# Significant Changes In This PR

## `compact_level0_phase1` & `create_delta_layer`

This commit partially reverts "pgserver: spawn_blocking in compaction (#4265)" 4e359db. Specifically, it reverts the `spawn_blocking`-ification of `compact_level0_phase1`. If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()` inside `compact_level0_phase1`. That would use up a thread in the `spawn_blocking` thread pool, which is hard-capped. I considered wrapping the code that follows the second `layers.read().await` into `spawn_blocking`, but there are lifetime issues with `deltas_to_compact`.

Also, this PR switches the `create_delta_layer` _function_ back to async, and uses `spawn_blocking` inside to run the code that does sync IO, while keeping the code that needs to lock `Timeline::layers` async.

## `LayerIter` and `LayerKeyIter` `Send` bounds

I had to add a `Send` bound on the `dyn` type that `LayerIter` and `LayerKeyIter` wrap. Why? Because we now have the second `layers.read().await` inside `compact_level0_phase1`, and these iterator instances are held across that await point. More background: #4462 (comment)

## `DatadirModification::flush`

Needed to replace the `HashMap::retain` with a hand-rolled variant, because `TimelineWriter::put` is now async.
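The `Send`-bound requirement can be demonstrated with a small sketch. `LayerIter` here is a simplified stand-in (wrapping `u64`s rather than real layer data), and the multi-threaded executor's await point is modeled with `std::thread::spawn`, which imposes the same `Send` requirement on captured values:

```rust
// A newtype wrapping a boxed dyn iterator, as the real LayerIter does.
// Without `+ Send`, the value could not be moved to (or held across a
// suspension point on) another thread.
struct LayerIter(Box<dyn Iterator<Item = u64> + Send>);

impl Iterator for LayerIter {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        self.0.next()
    }
}

fn sum_on_another_thread(iter: LayerIter) -> u64 {
    // `thread::spawn` requires the captured value to be `Send`, just as a
    // work-stealing async executor requires values held across `.await`
    // to be `Send`. Removing `+ Send` above makes this line a compile error.
    std::thread::spawn(move || iter.sum()).join().unwrap()
}

fn main() {
    let it = LayerIter(Box::new(1..=4));
    assert_eq!(sum_on_another_thread(it), 10);
}
```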
This PR concludes the "async `Layer::get_value_reconstruct_data`" project.

The problem we're solving is that, before this patch, we'd execute `Layer::get_value_reconstruct_data` on the tokio executor threads. This function is IO- and/or CPU-intensive. The IO uses VirtualFile / std::fs; hence it's blocking. This results in unfairness towards other tokio tasks, especially under (disk) load. Some context can be found at #4154, where I suspect (but can't prove) that load spikes of logical size calculation cause heavy eviction skew. Sadly, we don't have tokio runtime/scheduler metrics to quantify the unfairness. But generally, we know that blocking the executor threads on std::fs IO is bad. So, let's have this change and watch out for severe perf regressions in staging & during rollout.

## Changes

* rename `Layer::get_value_reconstruct_data` to `Layer::get_value_reconstruct_data_blocking`
* add a new blanket-impl'd `Layer::get_value_reconstruct_data` `async_trait` method that runs `get_value_reconstruct_data_blocking` inside `spawn_blocking`.
  * `spawn_blocking` requires `'static` lifetime of the captured variables; hence I had to change the data flow to _move_ the `ValueReconstructState` into and back out of `get_value_reconstruct_data` instead of passing a reference. It's a small struct, so I don't expect a big performance penalty.

## Performance

Fundamentally, the code changes cause the following performance-relevant changes:

* Latency & allocations: each `get_value_reconstruct_data` call now makes a short-lived allocation, because `async_trait` is just sugar for boxed futures under the hood.
* Latency: `spawn_blocking` adds some latency because it needs to move the work to a thread pool.
  * Using `spawn_blocking` plus the existing synchronous code inside is probably more efficient than switching all the synchronous code to tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the async executor thread pool. Hence, as long as the disks can keep up, which they should according to AWS specs, we will be able to deliver higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get more throughput. Not a problem, because the disks in prod are currently under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.

Slightly higher latency under regular load is acceptable given the throughput gains and the expected better fairness during disk load peaks, such as the logical size calculation peaks uncovered in #4154.

## Full Stack Of Preliminary PRs

This PR builds on top of the following preliminary PRs:

1. Clean-ups
   * #4316
   * #4317
   * #4318
   * #4319
   * #4321
   * Note: these were mostly to find an alternative to #4291, which I thought we'd need in my original plan, where we would have needed to convert `Tenant::timelines` into an async locking primitive (#4333). In reviews, we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485
7. #4441
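The move-in/move-out data flow forced by `spawn_blocking`'s `'static` requirement can be sketched as follows. This is a hedged, simplified model: the types are stand-ins, and `tokio::task::spawn_blocking` is modeled with `std::thread::spawn`, which imposes the same `'static` + `Send` requirements on captures:

```rust
// A stand-in for the small reconstruct-state struct that is now moved
// into and back out of the call instead of passed by reference.
#[derive(Default, Debug)]
struct ValueReconstructState {
    records: Vec<u8>,
}

// The renamed blocking function: takes the state by value, mutates it,
// and returns it, so no borrowed reference crosses the thread boundary.
fn get_value_reconstruct_data_blocking(
    mut state: ValueReconstructState,
) -> ValueReconstructState {
    state.records.push(42); // stand-in for the sync IO work
    state
}

// The async-facing wrapper. In the real code this would be
// `tokio::task::spawn_blocking(...).await`; `std::thread::spawn` has the
// same 'static requirement, which is why the state must be moved, not
// borrowed.
fn get_value_reconstruct_data(state: ValueReconstructState) -> ValueReconstructState {
    std::thread::spawn(move || get_value_reconstruct_data_blocking(state))
        .join()
        .unwrap()
}

fn main() {
    let state = get_value_reconstruct_data(ValueReconstructState::default());
    assert_eq!(state.records, vec![42]);
}
```

Passing ownership through rather than a `&mut` reference is what satisfies the `'static` bound; since the struct is small, the move is cheap.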
# Motivation

This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`). The motivation is to avoid locking `Tenant::timelines` in places that can't be `async`, because in #4333 we want to convert `Tenant::timelines` from `std::sync::Mutex` to `tokio::sync::Mutex`.

But the changes here are useful in general, because they clean up & document tenant state transitions. That also paves the way for #4350, which is an alternative to #4333 that refactors the pageserver code so that we can keep the `Tenant::timelines` mutex sync.

# Changes

This patch consists of the following core insights and changes:

So, with this change, they're not anymore.

In the future, we can add a `CancellationToken` stored in `Tenant` to cancel `load` and `attach` faster, i.e., make `spawn_load`/`spawn_attach` transition them to Broken state sooner.

See the doc comments on `TenantState` for the state transitions that are now possible. It might seem scary, but actually, this patch reduces the possible state transitions.

We introduce a new state, `TenantState::Activating`, to avoid grabbing the `Tenant::timelines` lock inside the `send_modify` closure. These were the humble beginnings of this PR (see the Motivation section), and I think it's still the right thing to have this `Activating` state, even if we decide against an async `Tenant::timelines` mutex. The reason is that `send_modify` locks internally, and by moving the locking of `Tenant::timelines` out of the closure, the internal locking of `send_modify` becomes a leaf of the lock graph, and so we eliminate deadlock risk.

# Patch Stack

This PR sits on top of

because otherwise we'd need to deal with fallibility of `Timeline::activate()` inside the `if activating {` branch (e.g., stopping already-started background loops, etc.).
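The lock-graph argument can be sketched with std-only types. This is a hedged model: the real code uses `tokio::sync::watch::Sender::send_modify`, whose internal lock is modeled here as an explicit `Mutex` around the state, and the state/timelines contents are illustrative stand-ins:

```rust
use std::sync::Mutex;

struct Tenant {
    // Stand-in for the watch channel's internal lock held during send_modify.
    state: Mutex<&'static str>,
    timelines: Mutex<Vec<u32>>,
}

impl Tenant {
    fn activate(&self) {
        // Take the `timelines` lock first, *outside* the state-update
        // critical section ...
        let n = self.timelines.lock().unwrap().len();
        // ... so that while the state lock (send_modify's internal lock)
        // is held, no other lock is acquired: it stays a leaf of the lock
        // graph and cannot participate in a deadlock cycle.
        let mut state = self.state.lock().unwrap();
        *state = if n == 0 { "Active (no timelines)" } else { "Active" };
    }
}

fn main() {
    let t = Tenant {
        state: Mutex::new("Activating"),
        timelines: Mutex::new(vec![1, 2]),
    };
    t.activate();
    assert_eq!(*t.state.lock().unwrap(), "Active");
}
```

Had `timelines` been locked inside the critical section instead, the state lock would gain an outgoing edge in the lock graph, and any code path taking the two locks in the opposite order could deadlock.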