refactor TenantState transitions #4321

Merged
Changes from 47 commits
51 commits
cc96a51
tenant_map_insert: don't expose the vacant entry to the closure
problame May 23, 2023
d5337e6
refactor responsibility for tenant/timeline activation
problame May 23, 2023
17b081d
refactor: eliminate global storage_broker client state
problame May 23, 2023
8bcb542
refactor: make timeline activation infallible
problame May 23, 2023
3e604ea
refactor: introduce TenantState::Activating to avoid holding timeline…
problame May 23, 2023
ee22e81
don't hold timelines lock inside set_stopping()
problame May 23, 2023
feb2e80
tests were failing because activate() was outside of a span with tena…
problame May 23, 2023
4f586ac
Merge branch 'problame/infallible-timeline-activate/2-pushup-tenant-a…
problame May 23, 2023
a55d224
tests would fail because broker client needs to be launched on a toki…
problame May 23, 2023
94f30f0
Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-…
problame May 23, 2023
b2e0c58
Merge branch 'problame/infallible-timeline-activate/4-make-infallible…
problame May 23, 2023
32c85fa
Merge remote-tracking branch 'origin/main' into problame/infallible-t…
problame May 24, 2023
bdf03ea
Merge branch 'problame/infallible-timeline-activate/2-pushup-tenant-a…
problame May 24, 2023
75c3c43
don't unwrap() the `activate()` result in spawn_load / spawn_attach
problame May 24, 2023
07da786
apply joonas's suggestion to use parent: None + follows_from
problame May 24, 2023
def5eb8
Merge branch 'problame/infallible-timeline-activate/2-pushup-tenant-a…
problame May 24, 2023
b54431b
pass the BrokerClientChannel by value & clone it as necessary
problame May 24, 2023
732f603
Merge remote-tracking branch 'origin/main' into problame/infallible-t…
problame May 24, 2023
8606b6a
Merge remote-tracking branch 'origin/problame/infallible-timeline-act…
problame May 24, 2023
ef956c4
make it clear that `walreceiver_status` is always used in the branch …
problame May 24, 2023
4001f44
activate_timelines counter is now == not_broken_timelines.len()
problame May 24, 2023
2c424c8
Revert "activate_timelines counter is now == not_broken_timelines.len()"
problame May 24, 2023
69cfa9f
launch_wal_receiver: apply joonas's review suggestion (visibility + d…
problame May 24, 2023
b345f32
Merge branch 'problame/infallible-timeline-activate/4-make-infallible…
problame May 24, 2023
413598b
fix merge fallout (?)
problame May 24, 2023
641ca99
assert_eq suggestion
problame May 25, 2023
fe4ef12
use tokio::sync::watch::Receiver::wait_for
problame May 25, 2023
2fee8c8
Merge remote-tracking branch 'origin/main' into problame/infallible-t…
problame May 25, 2023
da6573f
Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-…
problame May 25, 2023
cf8ff7e
explainer comment on storage_broker::connect async weirdness
problame May 25, 2023
96c5502
apply heikki's comment suggestion
problame May 25, 2023
ddad092
Merge branch 'problame/infallible-timeline-activate/3-funnel-storage-…
problame May 25, 2023
eaf270c
Revert "use tokio::sync::watch::Receiver::wait_for"
problame May 25, 2023
05a2fe0
Merge branch 'problame/infallible-timeline-activate/4-make-infallible…
problame May 25, 2023
f18d9f5
Revert "Revert "use tokio::sync::watch::Receiver::wait_for""
problame May 25, 2023
de780d2
make TenantState::{Loading,Attaching,Activating} owned by spawn_load …
problame May 25, 2023
dd0f5c4
Merge remote-tracking branch 'origin/main' into problame/async-timeli…
problame May 25, 2023
1367e2b
improve TenantState doc comments, repeating what's in the Mermaid dia…
problame May 26, 2023
b09beaa
log while waiting for tenant to finish activation
problame May 26, 2023
13d3f4c
set_stopping(): report in result if not transitioning to Stopping
problame May 26, 2023
e7c4ef9
don't hold TENANTS lock while waiting for set_stopping()
problame May 26, 2023
72159ee
Merge remote-tracking branch 'origin/main' into problame/async-timeli…
problame May 26, 2023
9a4789e
demote warn line to info-level, as the log line in set_stopping() is …
problame May 26, 2023
6bee7df
fix: report tenant_id with the spawned set_stopping
koivunej May 26, 2023
71fa6d9
fix: add spans for shutdown_pageserver and shutdown_all_tenants
koivunej May 26, 2023
f46121e
fix: use join_set for freeze_and_flush
koivunej May 26, 2023
fbf94c0
refactor: log within the spawned task
koivunej May 26, 2023
a295ea4
doc: minor typo fixes
koivunej May 26, 2023
7c507d4
empty for new ci run
koivunej May 29, 2023
c9ec933
fix: drop unused strum::EnumString on TenantState
koivunej May 29, 2023
d6d0c74
fix: solve todo by using attachment_status of earlier states
koivunej May 29, 2023
58 changes: 51 additions & 7 deletions libs/pageserver_api/src/models.rs
@@ -18,7 +18,29 @@ use crate::reltag::RelTag;
 use anyhow::bail;
 use bytes::{BufMut, Bytes, BytesMut};

-/// A state of a tenant in pageserver's memory.
+/// The state of a tenant in this pageserver.
+///
+/// ```mermaid
+/// stateDiagram-v2
+///
+/// [*] --> Loading: spawn_load()
+/// [*] --> Attaching: spawn_attach()
+///
+/// Loading --> Activating: activate()
+/// Attaching --> Activating: activate()
+/// Activating --> Active: infallible
+///
+/// Loading --> Broken: load() failure
+/// Attaching --> Broken: attach() failure
+///
+/// Active --> Stopping: set_stopping(), part of shutdown & detach
+/// Stopping --> Broken: late error in remove_tenant_from_memory
+///
+/// Broken --> [*]: ignore / detach / shutdown
+/// Stopping --> [*]: remove_from_memory complete
+///
+/// Active --> Broken: cfg(testing)-only tenant break point
+/// ```
 #[derive(
     Clone,
     PartialEq,
@@ -33,17 +55,38 @@ use bytes::{BufMut, Bytes, BytesMut};
 )]
 #[serde(tag = "slug", content = "data")]
 pub enum TenantState {
-    /// This tenant is being loaded from local disk
+    /// This tenant is being loaded from local disk.
+    ///
+    /// `set_stopping()` and `set_broken()` do not work in this state and wait for it to pass.
     Loading,
-    /// This tenant is being downloaded from cloud storage.
+    /// This tenant is being attached to the pageserver.
+    ///
+    /// `set_stopping()` and `set_broken()` do not work in this state and wait for it to pass.
     Attaching,
-    /// Tenant is fully operational
+    /// The tenant is transitioning from Loading/Attaching to Active.
+    ///
+    /// While in this state, the individual timelines are being activated.
+    ///
+    /// `set_stopping()` and `set_broken()` do not work in this state and wait for it to pass.
+    Activating,
+    /// The tenant has finished activating and is open for business.
+    ///
+    /// Transitions out of this state are possible through `set_stopping()` and `set_broken()`.
     Active,
-    /// A tenant is recognized by pageserver, but it is being detached or the
+    /// The tenant is recognized by pageserver, but it is being detached or the
     /// system is being shut down.
+    ///
+    /// Transitions out of this state are possible through `set_broken()`.
     Stopping,
-    /// A tenant is recognized by the pageserver, but can no longer be used for
-    /// any operations, because it failed to be activated.
+    /// The tenant is recognized by the pageserver, but can no longer be used for
+    /// any operations.
+    ///
+    /// If the tenant fails to load or attach, it will transition to this state
+    /// and it is guaranteed that no background tasks are running in its name.
+    ///
+    /// The other way to transition into this state is from `Stopping` state
+    /// through `set_broken()` called from `remove_tenant_from_memory()`. That happens
+    /// if the cleanup future executed by `remove_tenant_from_memory()` fails.
     Broken { reason: String, backtrace: String },
 }

@@ -60,6 +103,7 @@ impl TenantState {
             // tenant mgr startup distinguishes attaching from loading via marker file.
             // If it's loading, there is no attach marker file, i.e., attach had finished in the past.
             Self::Loading => Attached,
+            Self::Activating => todo!(),
             // We only reach Active after successful load / attach.
             // So, call attachment status Attached.
             Self::Active => Attached,
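For readers who prefer code to the Mermaid diagram, the transitions documented on `TenantState` can be written out as a small table. This is only a sketch, not code from this PR: `State` is a simplified stand-in for `TenantState` (the real `Broken` variant carries `reason` and `backtrace`), and `is_allowed` is a hypothetical helper. The diagram's entry and exit edges (`[*]`) are not modeled.

```rust
/// Simplified stand-in for `TenantState`; illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Loading,
    Attaching,
    Activating,
    Active,
    Stopping,
    Broken,
}

/// Hypothetical helper: true if the doc-comment diagram has an edge `from --> to`.
fn is_allowed(from: State, to: State) -> bool {
    use State::*;
    matches!(
        (from, to),
        (Loading, Activating)        // activate()
            | (Attaching, Activating) // activate()
            | (Activating, Active)    // infallible
            | (Loading, Broken)       // load() failure
            | (Attaching, Broken)     // attach() failure
            | (Active, Stopping)      // set_stopping()
            | (Stopping, Broken)      // late error during removal from memory
            | (Active, Broken)        // cfg(testing)-only break point
    )
}
```

Note that the table has no edge out of `Activating` other than to `Active`, which matches the "infallible" label in the diagram.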
2 changes: 1 addition & 1 deletion pageserver/src/http/routes.rs
@@ -859,7 +859,7 @@ async fn handle_tenant_break(r: Request<Body>) -> Result<Response<Body>, ApiErro
         .await
         .map_err(|_| ApiError::Conflict(String::from("no active tenant found")))?;

-    tenant.set_broken("broken from test".to_owned());
+    tenant.set_broken("broken from test".to_owned()).await;

     json_response(StatusCode::OK, ())
 }
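The added `.await` follows from the new doc comments: `set_stopping()` and `set_broken()` wait for `Loading`, `Attaching`, and `Activating` to pass before acting. The commit list indicates the PR does this with `tokio::sync::watch::Receiver::wait_for`. The sketch below shows the same wait-then-transition idea using only the standard library (`Mutex` plus `Condvar`), so it is a swapped-in technique, not the PR's implementation. `State` and `StateCell` are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};

/// Simplified stand-in for `TenantState`; illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum State {
    Loading,
    Attaching,
    Activating,
    Active,
    Stopping,
    Broken,
}

/// Hypothetical shared state cell; a std-only stand-in for a watch channel.
struct StateCell {
    state: Mutex<State>,
    changed: Condvar,
}

impl StateCell {
    fn new(initial: State) -> Self {
        StateCell { state: Mutex::new(initial), changed: Condvar::new() }
    }

    fn get(&self) -> State {
        *self.state.lock().unwrap()
    }

    /// Stores a new state and wakes every waiter.
    fn set(&self, next: State) {
        *self.state.lock().unwrap() = next;
        self.changed.notify_all();
    }

    /// Blocks while the state is Loading/Attaching/Activating, then moves
    /// Active -> Stopping. If the state after the wait is anything else,
    /// no transition happens and that state is reported in the error.
    fn set_stopping(&self) -> Result<(), State> {
        let guard = self.state.lock().unwrap();
        let mut guard = self
            .changed
            .wait_while(guard, |s| {
                matches!(*s, State::Loading | State::Attaching | State::Activating)
            })
            .unwrap();
        let current = *guard;
        match current {
            State::Active => {
                *guard = State::Stopping;
                drop(guard);
                self.changed.notify_all();
                Ok(())
            }
            other => Err(other),
        }
    }
}
```

A caller that invokes `set_stopping()` while the cell is still `Activating` blocks until another thread calls `set(State::Active)`, and a cell that is already `Broken` returns `Err(State::Broken)` without changing state.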
1 change: 1 addition & 0 deletions pageserver/src/lib.rs
@@ -45,6 +45,7 @@ static ZERO_PAGE: bytes::Bytes = bytes::Bytes::from_static(&[0u8; 8192]);

 pub use crate::metrics::preinitialize_metrics;

+#[tracing::instrument]
 pub async fn shutdown_pageserver(exit_code: i32) {
     // Shut down the libpq endpoint task. This prevents new connections from
     // being accepted.