
try: startup speedup #4366

Merged — merged 14 commits into main from try_startup_speedup on May 29, 2023
Conversation

@koivunej (Member) commented May 29, 2023

Startup can take a long time; we suspect the initial logical size calculations. The long-term solution is to stop blocking the tokio executors and do most of the I/O in spawn_blocking.

See: #4025, #4183

Short-term solutions to the above:

  • Delay global background tasks until the initial tenant loads complete
    • This works off the assumption that tenant-specific or timeline-specific tasks already have randomized starts
  • Do not consider safekeepers that have no additional WAL worth connecting to
  • Limit how many initial logical size calculations we can run at the same time to cores / 2 (see the semaphore sketch after this list)

This PR is an experiment. I think the additions to initial tenant loading are useful regardless; we can test this out on staging to see whether there is any positive effect.

@koivunej koivunej requested review from a team as code owners May 29, 2023 11:30
@koivunej koivunej marked this pull request as draft May 29, 2023 11:31
@koivunej (Member, Author) commented:

Whoops, marked as draft late; let's see about the tests before reviewing.

@koivunej koivunej marked this pull request as ready for review May 29, 2023 13:04
@koivunej (Member, Author) commented:

Regress tests passed, ready for review.

@LizardWizzard (Contributor) left a comment:


Looks good to try on staging

@koivunej koivunej enabled auto-merge (squash) May 29, 2023 13:16
@koivunej koivunej disabled auto-merge May 29, 2023 13:17
@koivunej koivunej enabled auto-merge (squash) May 29, 2023 13:27
@koivunej koivunej disabled auto-merge May 29, 2023 13:48
@koivunej (Member, Author) commented May 29, 2023

Received a NAK from @arssher for:

The second part is not good, as safekeepers will never know that remote_consistent_lsn is up to date and will push the timeline to the broker forever (the timeline will stay active at the safekeepers).

Actions taken: a80bf2f

@koivunej koivunej requested a review from LizardWizzard May 29, 2023 14:28
@github-actions bot commented May 29, 2023

753 tests run: 722 passed, 0 failed, 31 skipped (full report)


The comment gets automatically updated with the latest test results
793536a at 2023-05-29T17:54:53.553Z

@koivunej koivunej merged commit cb83495 into main May 29, 2023
@koivunej koivunej deleted the try_startup_speedup branch May 29, 2023 18:48
@koivunej (Member, Author) commented May 29, 2023

Very difficult to say whether this helped or not; I forgot that we have quite large error margins.

ps-0.eu-west: 132s or 127s before, after 206s, 118s, 108s, 117s.
ps-1.eu-west: 91s or 93s before, after 84s, 64s, 84s.
ps-99.us-east: no change around 2s but difficult to tell, actively used for testing.

I think this might be more about S3 warming up on repeat attempts. I'll revert the limiting part for proper measurements.

Noted:

  • 3s "time difference" when comparing since_creation_millis to elapsed_millis below
  • Not an issue; they are not entirely comparable
  • More importantly, these two were logged quite close to each other in time (0.01s apart)
2023-05-30T06:50:09.748448Z  INFO load{tenant_id=498d4021c9c525280ba808ac5d1c0022}: activation attempt finished since_creation_millis=81878 tenant_id=498d4021c9c525280ba808ac5d1c0022 activated_timelines=1 total_timelines=1 post_state="Active"
2023-05-30T06:50:09.753052Z  INFO Initial load completed. elapsed_millis=84217

koivunej added a commit that referenced this pull request May 30, 2023
koivunej added a commit that referenced this pull request May 30, 2023
added in #4366. revert for testing without it; it may have unintended
side-effects, and it's very difficult to understand the results from the
10k load testing environments. earlier results:
#4366 (comment)
@koivunej koivunej mentioned this pull request May 30, 2023
@koivunej (Member, Author) commented:

After #4368 it seems the results are the same, confirming my suspicion that the absence of spawn_blocking on the load path is to blame.

koivunej added a commit that referenced this pull request May 30, 2023
Startup continues to be slow; this is work towards alleviating it.

Summary of changes:

- move the functional improvements from #4366 into
`utils::completion::{Completion, Barrier}` (sketched after this commit message)
- extend "initial load completion" usage up to tenant background tasks
    - previously only global background tasks
- spawn_blocking the tenant load directory traversal
- demote some logging
- remove some unwraps
- propagate some spans to `spawn_blocking`

The runtime effect should be a major speedup to loading, but after that the
`BACKGROUND_RUNTIME` will be blocked for a long time (minutes). Possible
follow-ups:
- complete initial tenant sizes before allowing background tasks to
block the `BACKGROUND_RUNTIME`
koivunej added a commit that referenced this pull request May 30, 2023