Add idle compute restart time test #1514
Conversation
I'd propose the following scenario: run some continuous workload that generates a lot of WAL until backpressure steps in (this should correspond to the maximum WAL lag between PS and SKs), kill the compute under load, and immediately start it again. In that case safekeepers syncing should take more time. And as far as I followed down the way from …
I'll give it a try. Though it's a scenario that will only happen on compute failure. It's good to address, but in the short term I'm more interested in cutting down the 2-3 sec startup time that every client will notice every time they use neon. If a local restart of an idle compute node can be done in 0.2 sec, then basebackup is not part of the problem, and we should focus on pooling compute nodes and maybe caching data if needed :)
Not only on compute failure: it can also happen if a pod under load gets relocated to another k8s node. That can occur for various reasons: an EC2 spot instance is taken back from us, CPU/RAM/disk pressure on the running k8s node, etc. But yeah, I didn't mean that it should be the baseline workload; it's rather a worst-case scenario example (a rough sketch follows below).
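For illustration, a pytest-style sketch of that worst-case scenario could look like the following; all fixture and helper names here (`neon_env`, `start_compute`, `endpoint.stop/start`) are hypothetical placeholders rather than the actual test-framework API:

```python
import time

# Hypothetical sketch only: generate a WAL backlog, kill the compute under
# load, restart it, and time how long the restart takes.
def test_restart_under_wal_backlog(neon_env):
    endpoint = neon_env.start_compute()

    # Write enough data that backpressure kicks in, i.e. we approach the
    # maximum allowed WAL lag between the pageserver and the safekeepers.
    # In a real test the write workload would keep running in the background
    # so the compute is killed while still under load.
    with endpoint.connect().cursor() as cur:
        cur.execute("CREATE TABLE t (x int, payload text)")
        cur.execute(
            "INSERT INTO t SELECT g, repeat('x', 1000) FROM generate_series(1, 10000000) g"
        )

    # Kill the compute and start it again right away; safekeeper syncing
    # should now dominate the startup time.
    endpoint.stop(immediate=True)
    started_at = time.monotonic()
    endpoint.start()
    print(f"restart under WAL backlog: {time.monotonic() - started_at:.3f}s")
```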
Just realized that currently on startup we verify that the Postgres schema (roles/users and databases) is the same as in the console database. We do it by selecting from pg_roles and pg_database. So in terms of your test it's rather restart_with_data + read_after_restart = 0.7s. If we add some latency on each step, from safekeepers syncing to GetPage@LSN, then the current ~2 secs looks reasonable. And personally I find this test representative to compare with.
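As a side note, a minimal sketch of that kind of check, assuming a plain psycopg2 connection (the connection parameters below are made up), might look like this:

```python
import psycopg2  # assumes psycopg2 is available

# Illustrative connection string only.
conn = psycopg2.connect("host=127.0.0.1 port=55432 dbname=postgres user=cloud_admin")
with conn.cursor() as cur:
    cur.execute("SELECT rolname FROM pg_roles")
    roles = {row[0] for row in cur.fetchall()}
    cur.execute("SELECT datname FROM pg_database")
    databases = {row[0] for row in cur.fetchall()}

# ...then compare `roles` and `databases` with the desired state stored in the
# console database and reconcile any differences.
```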
Also, here is an example of a startup log:
2022-03-25 18:34:52.356 UTC [main] INFO: starting cluster #autumn-feather-883732, operation #84386dfd-7a7b-48a1-8af3-0e34a1ecec35
2022-03-25 18:34:52.357 UTC [main] INFO: starting safekeepers syncing
...
2022-03-25 18:34:53.142 UTC [main] INFO: getting basebackup@0/169BBE0 from pageserver host=172.32.24.61 port=6400
...
2022-03-25 18:34:53.327 GMT [14] LOG: starting PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
...
2022-03-25 18:34:54.619 UTC [main] INFO: finished configuration of cluster #autumn-feather-883732
As you can see, in the real env basebackup + safekeepers sync alone can take around 1 sec.
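For reference, reading the durations off the timestamps above: safekeepers syncing runs from 52.357 to 53.142, i.e. about 0.79 s; from requesting the basebackup at 53.142 to Postgres logging "starting" at 53.327 is about 0.19 s; and from Postgres start to "finished configuration" at 54.619 is about 1.29 s, so roughly 2.26 s end to end.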
Also I wonder why these two numbers are so different?
With current …
TIL startup time is fast with this test locally (only 0.212 s) because the python test framework doesn't use … I'm not familiar with the …
I think …
@ololobus You're right :) I'm trying to reproduce the 1 sec p9 config latencies locally. With an emulated delay of 2ms (which I think is more than what we have on prod) I get: … But I shouldn't need a 2ms emulated latency to see this behavior; it should happen with a smaller number (maybe 0.2?). Is there any difference in the typical config workload in testing and in prod? In tests we use neon/control_plane/src/endpoint.rs (line 464 in df3bae2).
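As an aside, one common way to emulate this kind of delay locally is a netem qdisc on the loopback interface. A rough sketch (requires root on Linux, delays all loopback traffic, and assumes every component talks over 127.0.0.1):

```python
import subprocess
import time

# Add a 2ms artificial delay to loopback traffic.
subprocess.run(["tc", "qdisc", "add", "dev", "lo", "root", "netem", "delay", "2ms"], check=True)
try:
    started_at = time.monotonic()
    # ... restart the idle compute here and run the first query ...
    print(f"restart with emulated 2ms delay: {time.monotonic() - started_at:.3f}s")
finally:
    # Remove the artificial delay again.
    subprocess.run(["tc", "qdisc", "del", "dev", "lo", "root", "netem"], check=True)
```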
Local results:
Is there any workload on which you'd expect slower startup locally? Or is it safe to assume it's always gonna be at most 0.2 sec?
I'm aware that in prod we'd also have pod creation and network overheads, but those are somewhat separate problems.