
refactor TenantState transitions #4321

Conversation


@problame problame commented May 23, 2023

Motivation

This is preliminary work for/from #4220 (async Layer::get_value_reconstruct_data).
The motivation is to avoid locking Tenant::timelines in places that can't be async, because in #4333 we want to convert Tenant::timelines from std::sync::Mutex to tokio::sync::Mutex.

But, the changes here are useful in general because they clean up & document tenant state transitions.
That also paves the way for #4350, which is an alternative to #4333 that refactors the pageserver code so that we can keep the Tenant::timelines mutex sync.

Changes

This patch consists of the following core insights and changes:

  • spawn_load and spawn_attach own the tenant state until they're done
  • once the load()/attach() calls are done ...
    • if they failed, transition the tenant to Broken directly (we know there's no background activity because we didn't call activate yet)
    • if they succeed, call activate. We can make it infallible. How? Later.
    • set_broken() and set_stopping() are changed to wait for spawn_load() / spawn_attach() to finish.
    • This sounds scary because it might hinder detach or shutdown, but concurrent attach+detach, attach+shutdown, or load+shutdown were simply racy before this PR.
      So, with this change, they aren't anymore.
      In the future, we can add a CancellationToken stored in Tenant to cancel load and attach faster, i.e., make spawn_load / spawn_attach transition the tenant to the Broken state sooner.

See the doc comments on TenantState for the state transitions that are now possible.
It might seem scary, but this patch actually reduces the set of possible state transitions.

We introduce a new state TenantState::Activating to avoid grabbing the Tenant::timelines lock inside the send_modify closure.
These were the humble beginnings of this PR (see the Motivation section), and I think having this Activating state is still the right call, even if we decide against an async Tenant::timelines mutex. The reason: send_modify locks internally, so by moving the locking of Tenant::timelines out of the closure, the internal lock of send_modify becomes a leaf of the lock graph, and we eliminate the deadlock risk.
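To make the lock-graph argument concrete, here is a minimal sketch of the pattern (hypothetical, heavily simplified types; the real ones live in the pageserver crate):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::watch;

// Hypothetical, simplified stand-ins for the real pageserver types.
#[derive(Clone, Debug)]
enum TenantState {
    Loading,
    Activating,
    Active,
}

struct Tenant {
    state: watch::Sender<TenantState>,
    timelines: Mutex<HashMap<u64, ()>>, // simplified
}

impl Tenant {
    fn activate(&self) {
        // 1. Flip to Activating first. The closure only touches the state
        //    value, so send_modify's internal lock stays a leaf of the
        //    lock graph.
        self.state.send_modify(|s| *s = TenantState::Activating);

        // 2. Do the work that needs Tenant::timelines *outside* the
        //    send_modify closure.
        {
            let timelines = self.timelines.lock().unwrap();
            for (_id, _timeline) in timelines.iter() {
                // activate each timeline here ...
            }
        }

        // 3. Only then announce Active.
        self.state.send_modify(|s| *s = TenantState::Active);
    }
}
```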

Patch Stack

This PR sits on top of

because otherwise we'd need to deal with the fallibility of Timeline::activate() inside the `if activating {` branch (e.g., stopping already-started background loops, etc).

problame added 7 commits May 23, 2023 19:25
This tightens up the API a little.
Byproduct of some refactoring work that I'm doing right now.
(This is prep work to make `Timeline::activate()` infallible.)

The current possibility for failure in `Timeline::activate()`
is the broker client's presence / absence. It should be an assert, but
we're careful with these. So, I'm planning to pass in the broker
client to activate(), thereby eliminating the possibility of its absence.

In the unit tests, we don't have a broker client. So, I thought I'd be
in trouble because the unit tests also called `activate()` before this
PR.

However, closer inspection reveals a long-standing FIXME about this,
which is addressed by this patch.

It turns out that the unit tests don't actually need the background
loops to be running. They just need the state value to be `Active`.
So, for the tests, we just set it to that value but don't spawn
the background loops.

We'll need to revisit this if we ever do more Rust unit tests in the
future. But right now, this refactoring improves the code, so, let's
revisit when we get there.
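For illustration, a rough sketch of that test-only shortcut (the helper name and the watch-channel representation are assumptions, not the actual pageserver code):

```rust
use tokio::sync::watch;

// Hypothetical, simplified stand-in for the real timeline state.
#[derive(Clone, Debug)]
enum TimelineState {
    Loading,
    Active,
}

// Test-only helper: the unit tests just need the observable state to be
// Active; they deliberately do not spawn the background loops.
fn set_active_for_tests(state: &watch::Sender<TimelineState>) {
    state.send_modify(|s| *s = TimelineState::Active);
}
```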
(This is prep work to make `Timeline::activate` infallible.)

This patch removes the global storage_broker client instance from the
pageserver codebase.

Instead, pageserver startup instantiates it and passes it down to
the `Timeline::activate` function, which in turn passes it to
the WalReceiver, which is the entity that actually uses it.
Timeline::activate() was only fallible because `launch_wal_receiver` was.

`launch_wal_receiver` was fallible only because of some preliminary
checks in `WalReceiver::start`.

Turns out these checks can be shifted to the type system by delaying
creation of the `WalReceiver` struct to the point where we activate the timeline.

The changes in this PR were enabled by my previous refactoring that funneled
the broker_client from pageserver startup to the activate() call sites.
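Here's a hedged sketch of that "shift the checks to the type system" idea; the struct and method shapes below are simplified stand-ins, not the actual definitions:

```rust
// Hypothetical, simplified stand-ins; the real types live in the pageserver
// and walreceiver modules.
struct BrokerClient;
struct WalReceiver {
    broker_client: BrokerClient,
}

struct Timeline {
    walreceiver: Option<WalReceiver>,
}

impl Timeline {
    // Infallible: everything the wal receiver needs is a parameter, so a
    // missing broker client is a compile error at the call site rather than
    // a runtime check inside WalReceiver::start.
    fn activate(&mut self, broker_client: BrokerClient) {
        self.walreceiver = Some(WalReceiver { broker_client });
        // launch the wal receiver task using self.walreceiver here ...
    }
}
```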

github-actions bot commented May 23, 2023

1004 tests run: 963 passed, 0 failed, 41 skipped (full report)


Flaky tests (1)

Postgres 15

The comment gets automatically updated with the latest test results
d6d0c74 at 2023-05-29T11:22:49.845Z :recycle:

problame added 4 commits May 23, 2023 20:42
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
…broker-client' into problame/infallible-timeline-activate/4-make-infallible
…' into problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify
@problame problame marked this pull request as ready for review May 23, 2023 19:02
@problame problame requested review from a team as code owners May 23, 2023 19:02
@problame problame requested review from save-buffer, skyzh, koivunej and a team and removed request for a team, save-buffer and skyzh May 23, 2023 19:02
pageserver/src/tenant.rs — outdated review thread (resolved)
problame added 3 commits May 24, 2023 11:31
…imeline-activate/2-pushup-tenant-and-timeline-activation
…nd-timeline-activation' into problame/infallible-timeline-activate/3-funnel-storage-broker-client
@koivunej koivunej dismissed their stale review May 26, 2023 16:56

stale

@problame
Contributor Author

Again, I would recommend https://github.com/orgs/neondatabase/projects/38/views/6 for a view on all the in-flight PRs.

koivunej added 4 commits May 26, 2023 20:56
the panic hook will include the current span, so we will not lose it.
report the total of observed panics as a number if any.
because that's where the span is.
@koivunej koivunej force-pushed the problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify branch from 745e3d4 to fbf94c0 Compare May 26, 2023 17:56
@problame problame left a comment

Changes on top LGTM. @koivunej please rerun tests to pass and merge, or get another review from a third party on Monday.

@koivunej
Member

Yep, thanks. I don't think those latest changes are interesting enough. I'll approve next.


koivunej commented May 29, 2023

It seems there's an actual test failure:

test_timeline_resurrection_on_attach[True-local_fs]: debug with:

AssertionError: assert not ['2023-05-29T07:51:54.527416Z ERROR request{method=GET path=/v1/tenant/bdc883cb95af183d3c72095b938eeb95 request_id=2098afa1-e542-4213-acae-94ff27cfd86f}:tenant_status_handler{tenant=bdc883cb95af183d3c72095b938eeb95}:panic{thread=mgmt request worker location=libs/pageserver_api/src/models.rs:106:33}: not yet implemented\n']

It does seem this was changed in this PR; I think we need a better way for this. Link to conversation: #4321 (comment). Also noted on issue #4344 that we shouldn't merge this and cause havoc for the tests.

Confirmed that the flakiness is just because load can take a long time; we get to the tenant status query before load completes, so the todo!() was hit. Fixing this soon regardless.

Fixed with d6d0c74

koivunej added 2 commits May 29, 2023 13:34
it was never used (`"".parse::<TenantState>()`) but it would have performed
poorly; it uses Default::default on any contained fields within
variants.
this is in line with the idea of the state being transparent or RwLock-like,
technical in nature.
@koivunej koivunej merged commit f4f3007 into main May 29, 2023
@koivunej koivunej deleted the problame/async-timeline-get/dont-hold-timelines-lock-inside-tenant-state-send-modify branch May 29, 2023 14:52
problame added a commit that referenced this pull request Jun 13, 2023
This is preliminary work for/from #4220 (async `Layer::get_value_reconstruct_data`).

# Full Stack Of Preliminary PRs

Thanks to the countless preliminary PRs, this conversion is relatively
straight-forward.

1. Clean-ups
  * #4316
  * #4317
  * #4318
  * #4319
  * #4321
* Note: these were mostly to find an alternative to #4291, which I
   thought we'd need in my original plan where we would need to convert
   `Tenant::timelines` into an async locking primitive (#4333). In reviews,
   we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485

# Significant Changes In This PR

## `compact_level0_phase1` & `create_delta_layer`

This commit partially reverts

   "pgserver: spawn_blocking in compaction (#4265)"
    4e359db.

Specifically, it reverts the `spawn_blocking`-ification of
`compact_level0_phase1`.
If we didn't revert it, we'd have to use `Timeline::layers.blocking_read()`
inside `compact_level0_phase1`. That would use up a thread in the
`spawn_blocking` thread pool, which is hard-capped.

I considered wrapping the code that follows the second
`layers.read().await` into `spawn_blocking`, but there are lifetime
issues with `deltas_to_compact`.

Also, this PR switches the `create_delta_layer` _function_ back to
async, and uses `spawn_blocking` inside to run the code that does sync
IO, while keeping the code that needs to lock `Timeline::layers` async.
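Roughly, the new shape of `create_delta_layer` looks like this sketch (types and helper names are simplified stand-ins, not the actual pageserver definitions):

```rust
use std::sync::Arc;
use tokio::sync::RwLock;
use tokio::task::spawn_blocking;

// Hypothetical stand-ins for the layer map and layer types.
struct LayerMap;
struct InMemoryLayer;
struct DeltaLayer;

impl InMemoryLayer {
    fn write_to_disk(&self) -> std::io::Result<DeltaLayer> {
        // synchronous VirtualFile / std::fs IO happens here
        Ok(DeltaLayer)
    }
}

// The function itself stays async so it can take the layer-map lock with
// .await, while the sync IO runs on the spawn_blocking pool.
async fn create_delta_layer(
    layers: &RwLock<LayerMap>,
    frozen: Arc<InMemoryLayer>,
) -> std::io::Result<DeltaLayer> {
    let new_delta = spawn_blocking(move || frozen.write_to_disk())
        .await
        .expect("spawn_blocking task panicked")?;

    // keep the Timeline::layers lock acquisition async
    let _layers = layers.write().await;
    // insert new_delta into the layer map here ...
    Ok(new_delta)
}
```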

## `LayerIter` and `LayerKeyIter` `Send` bounds

I had to add a `Send` bound on the `dyn` type that `LayerIter`
and `LayerKeyIter` wrap. Why? Because we now have the second
`layers.read().await` inside `compact_level0_phase1`, and these
iterator instances are held across that await-point.
More background:
#4462 (comment)
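In other words, something like the following (the item type is a placeholder; the real iterators yield key/LSN data from the layer files):

```rust
// The compaction future is held across the second layers.read().await and is
// spawned onto a multi-threaded runtime, so everything alive across that
// await-point, including this boxed iterator, must be Send.
pub struct LayerIter<'a>(pub Box<dyn Iterator<Item = u64> + Send + 'a>);
```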

## `DatadirModification::flush`

Needed to replace the `HashMap::retain` with a hand-rolled variant
because `TimelineWriter::put` is now async.
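Sketch of that hand-rolled variant (the writer type and the flush predicate are hypothetical stand-ins):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real writer and pending-updates map.
struct TimelineWriter;
impl TimelineWriter {
    async fn put(&self, _key: u64, _value: Vec<u8>) {
        // the real put is now async, so it cannot be called from a
        // synchronous retain() closure
    }
}

// HashMap::retain takes a synchronous closure, so the equivalent here is:
// drain the map, await put() for the entries we want to flush, and
// re-insert the rest.
async fn flush_pending(
    pending: &mut HashMap<u64, Vec<u8>>,
    writer: &TimelineWriter,
    should_flush: impl Fn(&u64) -> bool,
) {
    let mut keep = HashMap::new();
    for (key, value) in pending.drain() {
        if should_flush(&key) {
            writer.put(key, value).await;
        } else {
            keep.insert(key, value);
        }
    }
    *pending = keep;
}
```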
awestover pushed a commit that referenced this pull request Jun 14, 2023
problame added a commit that referenced this pull request Jun 26, 2023
This PR concludes the "async `Layer::get_value_reconstruct_data`"
project.

The problem we're solving is that, before this patch, we'd execute
`Layer::get_value_reconstruct_data` on the tokio executor threads.
This function is IO- and/or CPU-intensive.
The IO is using VirtualFile / std::fs; hence it's blocking.
This results in unfairness towards other tokio tasks, especially under
(disk) load.

Some context can be found at #4154, where I suspect (but can't prove) that
load spikes of logical size calculation cause heavy eviction skew.

Sadly we don't have tokio runtime/scheduler metrics to quantify the
unfairness.
But generally, we know blocking the executor threads on std::fs IO is
bad.
So, let's have this change and watch out for severe perf regressions in
staging & during rollout.

## Changes

* rename `Layer::get_value_reconstruct_data` to
`Layer::get_value_reconstruct_data_blocking`
* add a new blanket impl'd `Layer::get_value_reconstruct_data`
`async_trait` method that runs `get_value_reconstruct_data_blocking`
inside `spawn_blocking`.
* `spawn_blocking` requires a `'static` lifetime for the captured
variables; hence I had to change the data flow to _move_ the
`ValueReconstructState` into and back out of get_value_reconstruct_data
instead of passing a reference. It's a small struct, so I don't expect a
big performance penalty (see the sketch below).
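A minimal sketch of the resulting shape, using a concrete struct and a free function instead of the real blanket `async_trait` method (all names here are simplified stand-ins):

```rust
use std::sync::Arc;
use tokio::task::spawn_blocking;

// Hypothetical, simplified stand-ins for the real types.
#[derive(Default)]
struct ValueReconstructState;

struct DeltaLayer;

impl DeltaLayer {
    // The pre-existing synchronous code, renamed with the _blocking suffix.
    fn get_value_reconstruct_data_blocking(
        &self,
        state: ValueReconstructState,
    ) -> ValueReconstructState {
        // ... VirtualFile / std::fs reads happen here ...
        state
    }
}

// Async wrapper: spawn_blocking needs 'static captures, so the layer is an
// owned Arc and the ValueReconstructState is moved in and back out instead
// of being passed by reference.
async fn get_value_reconstruct_data(
    layer: Arc<DeltaLayer>,
    state: ValueReconstructState,
) -> ValueReconstructState {
    spawn_blocking(move || layer.get_value_reconstruct_data_blocking(state))
        .await
        .expect("spawn_blocking task panicked")
}
```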

## Performance

Fundamentally, the code changes cause the following performance-relevant
changes:

* Latency & allocations: each `get_value_reconstruct_data` call now
makes a short-lived allocation because `async_trait` is just sugar for
boxed futures under the hood
* Latency: `spawn_blocking` adds some latency because it needs to move
the work to a thread pool
* using `spawn_blocking` plus the existing synchronous code inside is
probably more efficient than switching all the synchronous code
to tokio::fs, because _each_ tokio::fs call does `spawn_blocking` under
the hood.
* Throughput: the `spawn_blocking` thread pool is much larger than the
async executor thread pool. Hence, as long as the disks can keep up,
which they should according to AWS specs, we will be able to deliver
higher `get_value_reconstruct_data` throughput.
* Disk IOPS utilization: we will see higher disk utilization if we get
more throughput. Not a problem because the disks in prod are currently
under-utilized, according to node_exporter metrics & the AWS specs.
* CPU utilization: at higher throughput, CPU utilization will be higher.

Slightly higher latency under regular load is acceptable given the
throughput gains and expected better fairness during disk load peaks,
such as logical size calculation peaks uncovered in #4154.


## Full Stack Of Preliminary PRs

This PR builds on top of the following preliminary PRs

1. Clean-ups
  * #4316
  * #4317
  * #4318
  * #4319
  * #4321
* Note: these were mostly to find an alternative to #4291, which I
thought we'd need in my original plan where we would need to convert
`Tenant::timelines` into an async locking primitive (#4333). In reviews,
we walked away from that, but these cleanups were still quite useful.
2. #4364
3. #4472
4. #4476
5. #4477
6. #4485
7. #4441