Add idle compute restart time test #1514
Conversation
I'd propose the following scenario: run some continuous workload that generates a lot of WAL until backpressure steps in (this should correspond to the maximum WAL lag between PS and SKs), kill the compute under load, and immediately start it again. In that case safekeepers syncing should take more time. And as far as I followed down the way from …
I'll give it a try. Though it's a scenario that will only happen on compute failure. It's good to address, but in the short term I'm more interested in cutting down the 2-3 sec startup time that every client will notice every time they use neon. If a local restart of an idle compute node can be done in 0.2 sec, then basebackup is not part of the problem, and we should focus on pooling compute nodes and maybe caching data if needed :)
Not only on compute failure: it can also happen if a pod under load gets relocated to another k8s node. That can occur for various reasons: an EC2 spot instance is taken back from us, CPU/RAM/disk pressure on the running k8s node, etc. But yeah, I didn't mean that it should be the baseline workload; it's rather a worst-case scenario example (a rough sketch follows below).
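For illustration, a pytest-style sketch of that worst-case scenario could look like the following; all fixture and helper names here (`neon_env`, `start_compute`, `endpoint.stop/start`) are hypothetical placeholders rather than the actual test-framework API:

```python
import time

# Hypothetical sketch only: generate a WAL backlog, kill the compute under
# load, restart it, and time how long the restart takes.
def test_restart_under_wal_backlog(neon_env):
    endpoint = neon_env.start_compute()

    # Write enough data that backpressure kicks in, i.e. we approach the
    # maximum allowed WAL lag between the pageserver and the safekeepers.
    # In a real test the write workload would keep running in the background
    # so the compute is killed while still under load.
    with endpoint.connect().cursor() as cur:
        cur.execute("CREATE TABLE t (x int, payload text)")
        cur.execute(
            "INSERT INTO t SELECT g, repeat('x', 1000) FROM generate_series(1, 10000000) g"
        )

    # Kill the compute and start it again right away; safekeeper syncing
    # should now dominate the startup time.
    endpoint.stop(immediate=True)
    started_at = time.monotonic()
    endpoint.start()
    print(f"restart under WAL backlog: {time.monotonic() - started_at:.3f}s")
```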
Just realized that currently on startup we verify that the Postgres schema (roles/users and databases) is the same as in the console database. We do it by selecting from pg_roles and pg_database. So in terms of your test it's rather restart_with_data + read_after_restart = 0.7s. If we add some latency on each step, from safekeepers syncing to GetPage@LSN, then the current ~2 secs looks reasonable. And personally I find this test representative to compare with.
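As a side note, a minimal sketch of that kind of check, assuming a plain psycopg2 connection (the connection parameters below are made up), might look like this:

```python
import psycopg2  # assumes psycopg2 is available

# Illustrative connection string only.
conn = psycopg2.connect("host=127.0.0.1 port=55432 dbname=postgres user=cloud_admin")
with conn.cursor() as cur:
    cur.execute("SELECT rolname FROM pg_roles")
    roles = {row[0] for row in cur.fetchall()}
    cur.execute("SELECT datname FROM pg_database")
    databases = {row[0] for row in cur.fetchall()}

# ...then compare `roles` and `databases` with the desired state stored in the
# console database and reconcile any differences.
```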
Also, here is an example of a startup log:
2022-03-25 18:34:52.356 UTC [main] INFO: starting cluster #autumn-feather-883732, operation #84386dfd-7a7b-48a1-8af3-0e34a1ecec35
2022-03-25 18:34:52.357 UTC [main] INFO: starting safekeepers syncing
...
2022-03-25 18:34:53.142 UTC [main] INFO: getting basebackup@0/169BBE0 from pageserver host=172.32.24.61 port=6400
...
2022-03-25 18:34:53.327 GMT [14] LOG: starting PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
...
2022-03-25 18:34:54.619 UTC [main] INFO: finished configuration of cluster #autumn-feather-883732
As you can see, in the real env basebackup + safekeepers sync alone can take around 1 sec.
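For reference, reading the durations off the timestamps above: safekeepers syncing runs from 52.357 to 53.142, i.e. about 0.79 s; from requesting the basebackup at 53.142 to Postgres logging "starting" at 53.327 is about 0.19 s; and from Postgres start to "finished configuration" at 54.619 is about 1.29 s, so roughly 2.26 s end to end.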
Also I wonder why these two numbers are so different?
With current …
TIL startup time is fast with this test locally (only 0.212 s) because the python test framework doesn't use … I'm not familiar with the …
I think …
@ololobus You're right :) I'm trying to reproduce the 1 sec p9 config latencies locally. With an emulated delay of 2ms (which I think is more than what we have on prod) I get: … But I shouldn't need a 2ms emulated latency to see this behavior; it should happen with a smaller number (maybe 0.2?). Is there any difference in the typical config workload in testing and in prod? In tests we use neon/control_plane/src/endpoint.rs (line 464 in df3bae2).
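As an aside, one common way to emulate this kind of delay locally is a netem qdisc on the loopback interface. A rough sketch (requires root on Linux, delays all loopback traffic, and assumes every component talks over 127.0.0.1):

```python
import subprocess
import time

# Add a 2ms artificial delay to loopback traffic.
subprocess.run(["tc", "qdisc", "add", "dev", "lo", "root", "netem", "delay", "2ms"], check=True)
try:
    started_at = time.monotonic()
    # ... restart the idle compute here and run the first query ...
    print(f"restart with emulated 2ms delay: {time.monotonic() - started_at:.3f}s")
finally:
    # Remove the artificial delay again.
    subprocess.run(["tc", "qdisc", "del", "dev", "lo", "root", "netem"], check=True)
```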
Local results:
Is there any workload on which you'd expect slower startup locally? Or is it safe to assume it's always gonna be at most 0.2 sec?
I'm aware that in prod we'd also have pod creation and network overheads, but those are somewhat separate problems.