
Add idle compute restart time test #1514

Merged: 2 commits merged into main on Apr 22, 2022

Conversation

@bojanserafimov (Contributor) commented Apr 15, 2022

Local results:

test_startup.startup_time: 0.138 s
test_startup.restart_time: 0.195 s
test_startup.read_time: 1.024 s
test_startup.second_read_time: 0.433 s
test_startup.restart_with_data: 0.212 s
test_startup.read_after_restart: 0.535 s

Is there any workload on which you'd expect slower startup locally? Or is it safe to assume it's always gonna be at most 0.2 sec?

I'm aware that in prod we'd also have pod creation and network overheads, but those are somewhat separate problems.
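
For a rough picture of what this test measures, here is a minimal sketch in the Python test framework (not the PR's actual code). env.postgres.create_start is mentioned later in this thread; the pg.stop()/pg.start()/pg.cursor() helpers and the fixture argument are assumptions.

import time

# Illustrative sketch only: start a compute, restart it while idle, and time
# the restart plus the first read afterwards. The fixture and helper methods
# are assumed and may differ from the real test framework.
def test_idle_restart_time(env):
    pg = env.postgres.create_start("test_startup")

    t0 = time.monotonic()
    pg.stop()
    pg.start()
    restart_time = time.monotonic() - t0

    t0 = time.monotonic()
    with pg.cursor() as cur:
        # Any query works here; it just forces pages to be fetched again.
        cur.execute("SELECT count(*) FROM pg_class")
    read_after_restart = time.monotonic() - t0

    print(f"restart_time={restart_time:.3f}s read_after_restart={read_after_restart:.3f}s")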

@ololobus (Member) commented:

Is there any workload on which you'd expect slower startup locally?

I'd propose the following scenario: run some continuous workload that generates a lot of WAL until backpressure kicks in (this should correspond to the maximum WAL lag between the pageserver and safekeepers), kill the compute under load, and immediately start it again. In that case safekeeper syncing should take more time. And as far as I can tell from following the code path from env.postgres.create_start, safekeeper syncing is involved here.

@bojanserafimov (Contributor, Author) commented:

I'd propose the following scenario: run some continuous workload that generates a lot of WAL until backpressure kicks in (this should correspond to the maximum WAL lag between the pageserver and safekeepers), kill the compute under load, and immediately start it again. In that case safekeeper syncing should take more time. And as far as I can tell from following the code path from env.postgres.create_start, safekeeper syncing is involved here.

I'll give it a try. Though it's a scenario that will only happen on compute failure. It's good to address but in the short term I'm more interested in cutting down the 2-3 sec startup time that every client will notice every time they use neon. If local restart of an idle compute node can be done in 0.2 sec, then basebackup is not part of the problem, and we should focus on pooling compute nodes and maybe caching data if needed :)
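
A rough sketch of that scenario, under the same caveats as above (illustrative only; pg.stop(immediate=True) and a fixed amount of writes stand in for killing the compute once backpressure kicks in):

import time

# Illustrative sketch of the worst-case scenario proposed above: generate a lot
# of WAL, kill the compute while it is still under load, and time the restart
# while the safekeepers resync.
def test_restart_under_wal_lag(env):
    pg = env.postgres.create_start("test_restart_under_load")

    with pg.cursor() as cur:
        cur.execute("CREATE TABLE wal_spam (id bigserial, payload text)")
        # A fixed amount of writes as a stand-in for "until backpressure kicks in".
        for _ in range(100):
            cur.execute(
                "INSERT INTO wal_spam (payload) "
                "SELECT repeat('x', 1000) FROM generate_series(1, 10000)"
            )

    pg.stop(immediate=True)  # assumed way to approximate killing the compute under load

    t0 = time.monotonic()
    pg.start()
    print(f"restart_with_wal_lag={time.monotonic() - t0:.3f}s")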

@bojanserafimov bojanserafimov changed the title Add pg startup time test Add idle compute restart time test Apr 15, 2022
@ololobus (Member) commented Apr 15, 2022

Though it's a scenario that will only happen on compute failure.

Not only; it also happens if a pod under load gets relocated to another k8s node. That can happen for various reasons: the EC2 spot instance was taken back from us, CPU/RAM/disk pressure occurred on the running k8s node, etc.

But yeah, I didn't mean that it should be the baseline workload. It's rather a worst-case scenario example.

@ololobus (Member) left a comment

Just realized that currently on startup we verify that the Postgres schema (roles/users and databases) is the same as in the console database. We do it by selecting from pg_roles and pg_database. So in terms of your test it's rather restart_with_data + read_after_restart = ~0.7s. If we add some latency to each step, from safekeeper syncing to GetPage@LSN, then the current ~2 secs looks reasonable. And personally I find this test representative to compare against.
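
For reference, that check boils down to a couple of catalog queries. A sketch using psycopg2 for illustration; the real startup code does more than this:

import psycopg2

# Illustrative only: read the role and database lists from the system catalogs,
# so they can be compared with what the console database expects.
def fetch_schema_snapshot(dsn):
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT rolname FROM pg_roles")
            roles = {row[0] for row in cur.fetchall()}
            cur.execute("SELECT datname FROM pg_database WHERE NOT datistemplate")
            databases = {row[0] for row in cur.fetchall()}
    return roles, databases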

Also here is an example of startup log

2022-03-25 18:34:52.356 UTC [main] INFO: starting cluster #autumn-feather-883732, operation #84386dfd-7a7b-48a1-8af3-0e34a1ecec35
2022-03-25 18:34:52.357 UTC [main] INFO: starting safekeepers syncing
...
2022-03-25 18:34:53.142 UTC [main] INFO: getting basebackup@0/169BBE0 from pageserver host=172.32.24.61 port=6400
...
2022-03-25 18:34:53.327 GMT [14] LOG:  starting PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
...
2022-03-25 18:34:54.619 UTC [main] INFO: finished configuration of cluster #autumn-feather-883732

As you can see, in the real env basebackup + safekeeper sync alone may take around 1 sec.

@ololobus (Member) commented:

Also, I wonder why these two numbers are so different:

test_startup.read_time: 1.024 s
test_startup.second_read_time: 0.433 s

With the current shared_buffers=1MB you should have to read most of this test's 30 MB in both cases. And I suppose the just-inserted data should be in RAM on the pageserver anyway.
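
For what it's worth, a sketch of what the two numbers measure (not the PR's exact code): the same scan timed twice in a row, which is why any speedup on the second read has to come from caching outside Postgres shared buffers. The table name and fixture API are illustrative.

import time

# Illustrative sketch of what read_time and second_read_time measure: the same
# scan timed twice in a row. With shared_buffers=1MB most of the ~30 MB of data
# cannot stay in shared buffers, so a faster second read points at caching
# elsewhere in the stack (e.g. on the pageserver side).
def time_reads(pg):
    def one_read():
        t0 = time.monotonic()
        with pg.cursor() as cur:
            cur.execute("SELECT count(*) FROM huge")  # table name is illustrative
        return time.monotonic() - t0

    read_time = one_read()
    second_read_time = one_read()
    print(f"read_time={read_time:.3f}s second_read_time={second_read_time:.3f}s")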

@bojanserafimov bojanserafimov marked this pull request as ready for review April 22, 2022 14:45
@bojanserafimov bojanserafimov merged commit 867aede into main Apr 22, 2022
@bojanserafimov bojanserafimov deleted the bojan-test-startup branch April 22, 2022 14:45

@bojanserafimov (Contributor, Author) commented:

TIL: startup time is fast with this test locally (only 0.212 s) because the Python test framework doesn't use compute_ctl at all. @ololobus @hlinnaka

I'm not familiar with compute_ctl. Would it be feasible to actually use it, or should I delete the test, set up the cloud repo, and run some e2e tests?

@ololobus (Member) commented Jun 6, 2023

I'm not familiar with compute_ctl. Would it be feasible to actually use it

I think compute_ctl should be used in tests after #3886

@bojanserafimov (Contributor, Author) commented Jun 6, 2023

@ololobus You're right :)

I'm trying to reproduce the 1 sec p9 config latencies locally. With an emulated delay of 2ms (which I think is more than what we have in prod) I get: startup time 838ms, config time 621ms.

But I shouldn't need 2ms of emulated latency to see this behavior; it should happen with a smaller number (maybe 0.2?). Is there any difference between the typical config workload in testing and in prod? In tests we use delta_operations: None, which seems unlikely to match prod:

delta_operations: None,
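
A back-of-envelope way to read those numbers (my own arithmetic, assuming the config time is dominated by per-round-trip latency):

# Illustrative arithmetic only: if 621 ms of config time is dominated by network
# round trips, a 2 ms emulated delay implies roughly 310 round trips, and the
# same path at a 0.2 ms delay would cost roughly 62 ms.
emulated_delay_ms = 2.0
config_time_ms = 621.0
round_trips = config_time_ms / emulated_delay_ms        # ~310
projected_config_ms = round_trips * 0.2                 # ~62 ms at 0.2 ms delay
print(f"~{round_trips:.0f} round trips, ~{projected_config_ms:.0f} ms at 0.2 ms delay")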
