-
Notifications
You must be signed in to change notification settings - Fork 491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
try: startup speedup #4366
try: startup speedup #4366
Conversation
Woops, late drafting; let's see about the tests before reviewing. |
pageserver/src/tenant/timeline/walreceiver/connection_manager.rs
Outdated
Show resolved
Hide resolved
this is rather big change from always eager connections.
Regress tests passed, ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to try on staging
753 tests run: 722 passed, 0 failed, 31 skipped (full report)The comment gets automatically updated with the latest test results
793536a at 2023-05-29T17:54:53.553Z :recycle: |
Very difficult to say if this helped or not. I forgot that we have quite the error margins. ps-0.eu-west: 132s or 127s before, after 206s, 118s, 108s, 117s. I think this might be more about warming up the s3 on repeat attempts. I'll revert the limiting part for proper measurements. Noted:
|
added in #4366. revert for testing without it; it may have unintenteded side-effects, and it's very difficult to understand the results from the 10k load testing environments. earlier results: #4366 (comment)
After #4368 it seems the results are the same, confirming my suspicion that the absence of |
Startup continues to be slow, work towards to alleviate it. Summary of changes: - pretty the functional improvements from #4366 into `utils::completion::{Completion, Barrier}` - extend "initial load completion" usage up to tenant background tasks - previously only global background tasks - spawn_blocking the tenant load directory traversal - demote some logging - remove some unwraps - propagate some spans to `spawn_blocking` Runtime effects should be major speedup to loading, but after that, the `BACKGROUND_RUNTIME` will be blocked for a long time (minutes). Possible follow-ups: - complete initial tenant sizes before allowing background tasks to block the `BACKGROUND_RUNTIME`
Startup can take a long time. We suspect it's the initial logical size calculations. Long term solution is to not block the tokio executors but do most of I/O in spawn_blocking.
See: #4025, #4183
Short-term solution to above:
Do not consider safekeepers without additional WAL worthy of connecting toThis will delay the initial logical size calculation so it will most likely be started by metrics collection firstcores / 2
This PR is to try. I think the additions to initial tenant loading are useful regardless. We can however test this out on staging, to see if there is any positive effect.