Pageserver allegedly takes a long time to restart when there are a lot of tenants #4183

Closed
kelvich opened this issue May 9, 2023 · 6 comments
Labels
c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

Comments

@kelvich
Contributor

kelvich commented May 9, 2023

It's more of an observation, so it should be verified first. Staging has pageservers with 40k+ tenants.

@kelvich kelvich added the t/bug (Issue Type: Bug) label May 9, 2023
@kelvich
Contributor Author

kelvich commented May 9, 2023

cc @hlinnaka

@koivunej
Member

koivunej commented May 11, 2023

With 40k+ tenants we probably do not get metrics anymore?

This is most likely related to #4025.

koivunej added a commit that referenced this issue May 29, 2023
Startup can take a long time. We suspect it's the initial logical size
calculations. The long-term solution is to not block the tokio executors but
to do most of the I/O in spawn_blocking.

See: #4025, #4183

Short-term solutions to the above:

- Delay global background tasks until initial tenant loads complete
- Limit how many initial logical size calculations we can have at the
same time to `cores / 2` (sketched below)

This PR is for trying in staging.
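
For illustration, a minimal sketch of the short-term concurrency cap, assuming tokio's `Semaphore`; the calculation function and timeline ids are hypothetical stand-ins, not the pageserver's actual code:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Hypothetical stand-in for the real initial logical size calculation.
async fn calculate_initial_logical_size(timeline_id: u32) {
    // In the pageserver this would inspect layer files; here we just simulate work.
    tokio::time::sleep(std::time::Duration::from_millis(50)).await;
    println!("timeline {timeline_id}: initial logical size done");
}

#[tokio::main]
async fn main() {
    // Cap concurrency at roughly half the available cores, as proposed above.
    let cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(2);
    let semaphore = Arc::new(Semaphore::new((cores / 2).max(1)));

    let mut handles = Vec::new();
    for timeline_id in 0..32u32 {
        // Each calculation holds a permit, so at most `cores / 2` run concurrently.
        let permit = Arc::clone(&semaphore)
            .acquire_owned()
            .await
            .expect("semaphore is never closed");
        handles.push(tokio::spawn(async move {
            calculate_initial_logical_size(timeline_id).await;
            drop(permit); // release the permit once the calculation is done
        }));
    }
    for handle in handles {
        handle.await.unwrap();
    }
}
```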
@LizardWizzard
Contributor

Discussion happens in this long thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1685012031795059

@koivunej
Member

koivunej commented May 30, 2023

I posted earlier attempts (#4366, plus its revert) on #4366. After #4372 it looks a bit more promising without overly intrusive changes.

After deploying #4366 on staging:

  • ps-0.eu-west-1 (10k): 100s => 37s, 6s
  • ps-1.eu-west-1 (8k): 73s => 5s, 5.5s
  • ps-99.us-east-2 (<2k?): 2.8s => 2.3s, 2s

So I think this looks at least not bad.

But I haven't been able to re-test these results yet. I suspect that the remaining problem is the blocking of the background runtime for initial logical size AND repartitioning. The "page_service connection pressure" idea has been brought up as a way to lower the activation time for timelines which are being re-connected to.

Designing and implementing such a prioritization system might not be straightforward. Basically it would have to act as a semaphore, but upon getting a notification of a page_service connection, it should allow those instances to jump the queue (rough sketch below). But what would this prioritization protect? The first initial logical size calculations?

Perhaps an easier step is to delay the initial repartitioning + compaction and garbage collection until we've attempted all initial logical size calculations. This should probably delay the timeline's eviction task as well, just to be sure. I'm unsure if this is the right path, because we might end up in a situation where some timelines never get an active walreceiver connection, and so their initial logical size calculation would never happen.
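
To make the queue-jumping idea concrete, here is a rough sketch only (not a design proposal and not existing pageserver code): a dispatcher that hands out semaphore permits, preferring a high-priority queue via a biased `select!`. All names are hypothetical, and tokio is assumed.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot, Semaphore};

/// A permit request: the dispatcher answers by sending an owned permit back.
type Request = oneshot::Sender<tokio::sync::OwnedSemaphorePermit>;

/// Hands out permits one by one, always draining the high-priority queue
/// (timelines that just got a page_service connection) before the normal one.
async fn dispatcher(
    semaphore: Arc<Semaphore>,
    mut high: mpsc::UnboundedReceiver<Request>,
    mut normal: mpsc::UnboundedReceiver<Request>,
) {
    loop {
        let permit = match semaphore.clone().acquire_owned().await {
            Ok(permit) => permit,
            Err(_) => return, // semaphore closed, shut down
        };
        // `biased` makes select! check the high-priority queue first whenever
        // both queues have waiters, which is the "jump the queue" behaviour.
        let request = tokio::select! {
            biased;
            Some(req) = high.recv() => req,
            Some(req) = normal.recv() => req,
            else => return, // both queues closed
        };
        // If the requester went away, the permit is dropped and thus released.
        let _ = request.send(permit);
    }
}

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(2));
    let (high_tx, high_rx) = mpsc::unbounded_channel();
    let (normal_tx, normal_rx) = mpsc::unbounded_channel();
    tokio::spawn(dispatcher(Arc::clone(&semaphore), high_rx, normal_rx));

    // A startup-queued initial logical size calculation asks normally...
    let (tx, rx) = oneshot::channel();
    normal_tx.send(tx).unwrap();
    let _permit = rx.await.unwrap();
    println!("queued calculation got a permit");

    // ...while a timeline that just received a page_service connection
    // goes through the high-priority queue and jumps ahead.
    let (tx, rx) = oneshot::channel();
    high_tx.send(tx).unwrap();
    let _permit2 = rx.await.unwrap();
    println!("prioritized calculation got a permit");
}
```

The biased `select!` only decides the order when both queues have waiters at the moment a permit frees up, which is exactly the prioritization being discussed.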

@shanyp shanyp added the c/storage/pageserver (Component: storage: pageserver) and triaged (bugs that were already triaged) labels Jun 1, 2023
@koivunej
Member

koivunej commented Jun 5, 2023

With #4397 staging startup times:

  • ps-0.eu-west-1 (8k): 4.6s, 4.0s
  • ps-1.eu-west-1 (8k): 3.4s, 3.5s
  • ps-99.us-east-2 (<2k?): 2.1s, 2.3s

These are not really comparable anymore, because ps-0 lost 2k tenants. However, the earlier high values are no longer expected.

#4399 would further help by delaying all initial logical size calculations to a phase which runs after we've completed activating all tenants. There will be no background jobs running until the timeout (10s by default). It is assumed that the 10s would be spent efficiently on the many queued-up initial logical size calculations before compactions are allowed to start.
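
A minimal sketch of that idea, i.e. spending a bounded budget on queued initial logical size calculations before any background jobs start; the work queue and the calculation function here are hypothetical simplifications (assuming tokio), not how the pageserver actually drives this work:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

/// Hypothetical stand-in for one queued initial logical size calculation.
async fn initial_logical_size(timeline_id: u32) {
    tokio::time::sleep(Duration::from_millis(20)).await;
    println!("timeline {timeline_id}: initial logical size calculated");
}

#[tokio::main]
async fn main() {
    // Queue filled while tenants are activated (ids here are made up).
    let (tx, mut rx) = mpsc::unbounded_channel::<u32>();
    for id in 0..100u32 {
        tx.send(id).unwrap();
    }
    drop(tx); // activation finished, nothing more will be queued

    // Spend at most 10s draining the queue before any background jobs start.
    let drained = timeout(Duration::from_secs(10), async {
        while let Some(id) = rx.recv().await {
            initial_logical_size(id).await;
        }
    })
    .await
    .is_ok();

    println!(
        "initial sizes {}; releasing compaction / gc / eviction now",
        if drained { "all done" } else { "timed out" }
    );
}
```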

koivunej added a commit that referenced this issue Jun 7, 2023
Initial logical size calculation could still hinder our fast startup
efforts in #4397. See #4183. In the deployment of 2023-06-06,
about 200 initial logical sizes were calculated on the hosts which
took the longest to complete the initial load (12s).

Implements the three step/tier initialization ordering described in
#4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks

Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size
calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after
timeout or completion of initial logical size calculation

The timeout is there just as a safeguard in case a legitimate, non-broken
timeline's initial logical size calculation runs long. The timeout is
configurable, 10s by default, which I think would be fine for production
systems. In the test cases I've been looking at, these steps complete
as fast as possible.

Co-authored-by: Christian Schwarz <[email protected]>
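
For reference, a rough approximation of this gating, with `tokio::sync::watch` standing in for `utils::completion::Barrier` and the phases simulated; it sketches only the ordering, not the real `pageserver/src/bin/pageserver.rs` code:

```rust
use std::time::Duration;
use tokio::sync::watch;

/// Rough stand-in for waiting on `utils::completion::Barrier`: a background
/// task parks here until the startup code signals the release.
async fn wait_for_release(mut rx: watch::Receiver<bool>) {
    while !*rx.borrow() {
        if rx.changed().await.is_err() {
            break; // sender dropped; treat as released
        }
    }
}

#[tokio::main]
async fn main() {
    let (release_tx, release_rx) = watch::channel(false);

    // Step 3 tasks are spawned early but gated behind the barrier.
    for task in ["compaction", "gc", "eviction"] {
        let rx = release_rx.clone();
        tokio::spawn(async move {
            wait_for_release(rx).await;
            println!("{task} loop starting");
        });
    }

    // Step 1: load local tenants (simulated).
    println!("tenants loaded");

    // Step 2: initial logical size calculations, bounded by a timeout so a
    // single slow calculation cannot hold background jobs back forever.
    let initial_sizes = async {
        tokio::time::sleep(Duration::from_secs(1)).await; // simulated work
    };
    let _ = tokio::time::timeout(Duration::from_secs(10), initial_sizes).await;

    // Release background jobs after completion or timeout, whichever came first.
    release_tx.send(true).expect("a receiver is still held above");
    tokio::time::sleep(Duration::from_millis(100)).await; // let the tasks print
}
```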
awestover pushed a commit that referenced this issue Jun 14, 2023
@koivunej
Member

koivunej commented Dec 7, 2023

I'll just close this, because after these changes the remaining slowness has different causes. Originally this work helped.

@koivunej koivunej closed this as completed Dec 7, 2023