Pageserver allegedly takes a long time to restart when there are a lot of tenants #4183
Comments
cc @hlinnaka
With 40k+ tenants we probably do not get metrics anymore? This is most likely related to #4025.
Startup can take a long time. We suspect it's the initial logical size calculations. The long-term solution is to not block the tokio executors and instead do most of the I/O in `spawn_blocking`. See: #4025, #4183

Short-term solution to the above:
- Delay global background tasks until initial tenant loads complete
- Limit how many initial logical size calculations can run at the same time to `cores / 2`

This PR is for trying in staging.
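The `cores / 2` limit above can be sketched with a plain counting semaphore. This is a minimal std-only illustration, not the pageserver's code: `Semaphore`, `Permit`, `default_limit`, and `run_limited` are hypothetical names, threads stand in for async tasks, and a short sleep stands in for the actual size calculation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (hypothetical stand-in for the real limiter).
pub struct Semaphore {
    permits: Mutex<usize>,
    cond: Condvar,
}

/// Returns its permit to the semaphore when dropped.
pub struct Permit<'a> {
    sem: &'a Semaphore,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cond: Condvar::new() }
    }

    /// Blocks until a permit is available.
    pub fn acquire(&self) -> Permit<'_> {
        let mut available = self.permits.lock().unwrap();
        while *available == 0 {
            available = self.cond.wait(available).unwrap();
        }
        *available -= 1;
        Permit { sem: self }
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.cond.notify_one();
    }
}

/// Half of the available cores, but never less than one.
pub fn default_limit() -> usize {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    (cores / 2).max(1)
}

/// Runs `jobs` placeholder calculations with at most `limit` in flight,
/// and returns the peak concurrency that was observed.
pub fn run_limited(jobs: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let (sem, running, peak) = (sem.clone(), running.clone(), peak.clone());
            thread::spawn(move || {
                let _permit = sem.acquire();
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // placeholder for the calculation
                running.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Calling `run_limited(n, default_limit())` would cap the number of concurrent placeholder calculations at half the cores, which is the effect the short-term fix aims for.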
Discussion happens in this long thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1685012031795059
I posted earlier attempts (#4366, revert) on #4366. After #4372 it looks a bit more promising without too intrusive changes: after deploying #4366 on staging:
so I think this looks at least not bad. But I haven't been able to retry these results yet. I suspect that the remaining problem is the blocking of the background runtime for initial logical size AND repartitioning.

The "page_service connection pressure" has been brought up as an idea to lower the activation time for timelines which are being re-connected to. Designing and implementing such a prioritization system might not be straightforward. Basically it would have to act as a semaphore, but upon getting a notification of a page_service connection, it should allow these instances to jump the queue. But what would this prioritization protect? The first Perhaps an easier step is to delay
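To make the "jump the queue" idea concrete, here is a rough sketch of only the ordering part, without the semaphore or any async machinery. Everything in it is an assumption for illustration: `ActivationQueue` and its methods are hypothetical, and timelines are represented by plain `u32` ids.

```rust
use std::collections::VecDeque;

/// FIFO of timelines waiting for activation, where a page_service
/// connection can move a waiting timeline to the front (hypothetical).
pub struct ActivationQueue {
    waiting: VecDeque<u32>,
}

impl ActivationQueue {
    pub fn new() -> Self {
        ActivationQueue { waiting: VecDeque::new() }
    }

    /// Adds a timeline at the back of the queue.
    pub fn enqueue(&mut self, timeline: u32) {
        self.waiting.push_back(timeline);
    }

    /// Moves `timeline` to the front if it is still waiting.
    /// Returns false if it was not in the queue (for example, already activated).
    pub fn prioritize(&mut self, timeline: u32) -> bool {
        match self.waiting.iter().position(|&t| t == timeline) {
            Some(pos) => {
                let t = self.waiting.remove(pos).unwrap();
                self.waiting.push_front(t);
                true
            }
            None => false,
        }
    }

    /// Hands out the next timeline to activate, if any.
    pub fn next(&mut self) -> Option<u32> {
        self.waiting.pop_front()
    }
}
```

A real implementation would also need to wake whichever task is blocked on the limiter, which is where most of the design difficulty mentioned above would likely sit.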
With #4397 staging startup times:
Not really comparable anymore, because ps-0 lost 2k tenants. However, the high values are no longer expected. #4399 would further help this by delaying all initial logical size calculations to a phase which runs after we've completed activating all tenants. There will be no background jobs running until the timeout (10s by default). It is assumed that the 10s would be spent efficiently doing many queued-up initial logical size calculations before letting the compactions start.
Initial logical size calculation could still hinder our fast startup efforts in #4397. See #4183. In the deployment of 2023-06-06, about 200 initial logical sizes were calculated on hosts which took the longest to complete initial load (12s).

Implements the three step/tier initialization ordering described in #4397:
1. load local tenants
2. do initial logical sizes per walreceivers for 10s
3. background tasks

Ordering is controlled by:
- waiting on `utils::completion::Barrier`s on background tasks
- having one attempt for each Timeline to do initial logical size calculation
- `pageserver/src/bin/pageserver.rs` releasing background jobs after timeout or completion of initial logical size calculation

The timeout is there just to safeguard in case a legitimate non-broken timeline initial logical size calculation goes long. The timeout is configurable, by default 10s, which I think would be fine for production systems. In the test cases I've been looking at, it seems that these steps are completed as fast as possible.

Co-authored-by: Christian Schwarz <[email protected]>
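The "release background jobs after timeout or completion" rule can be sketched as follows. This is a std-only analog written for illustration, not the actual `utils::completion` API: `Completion`, `Release`, and `wait_for_initial_sizes` are hypothetical names, threads stand in for async tasks, and a sleep stands in for each calculation. Each calculation holds a token, and the waiter is released when every token has been dropped or when the timeout expires, whichever comes first.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Why the background jobs were released.
#[derive(Debug, PartialEq)]
pub enum Release {
    AllCompleted,
    TimedOut,
}

/// Token held by one initial logical size calculation (hypothetical).
/// Dropping the last token disconnects the channel and wakes the waiter.
pub struct Completion {
    _tx: mpsc::Sender<()>,
}

/// Starts one placeholder calculation per duration, then waits until all of
/// them have finished or `timeout` has elapsed, whichever happens first.
pub fn wait_for_initial_sizes(calc_durations: Vec<Duration>, timeout: Duration) -> Release {
    let (tx, rx) = mpsc::channel::<()>();
    for duration in calc_durations {
        let token = Completion { _tx: tx.clone() };
        thread::spawn(move || {
            thread::sleep(duration); // placeholder for the calculation
            drop(token);
        });
    }
    // Drop the original sender so only the in-flight calculations keep the channel open.
    drop(tx);
    match rx.recv_timeout(timeout) {
        Err(RecvTimeoutError::Disconnected) => Release::AllCompleted,
        Err(RecvTimeoutError::Timeout) => Release::TimedOut,
        Ok(()) => unreachable!("nothing is ever sent on this channel"),
    }
}
```

In this sketch the background tasks of step 3 would start once `wait_for_initial_sizes` returns, regardless of which variant it returns; a calculation that outlives the timeout simply keeps running alongside them.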
I'll just close this, because after the changes we now have different causes. Originally this helped.
It's more of an observation, so should be verified first. Staging has pageservers with 40k+ tenants