Epic: Create the pool of pre-allocated VMs #115

Closed
2 of 3 tasks
vadim2404 opened this issue Mar 31, 2023 · 7 comments

vadim2404 commented Mar 31, 2023

Motivation

To reduce start-up time for VMs (currently, it's ~4 seconds), we need to have several pre-warmed VMs and be able to change min/max boundaries at runtime.

DoD

Start-up time for 99% of cases is < 1 sec

Tasks:

Related epics/tasks/PRs:

@vadim2404 vadim2404 added the t/Epic Issue type: Epic label Mar 31, 2023
@vadim2404 vadim2404 added this to the 2023/04 milestone Mar 31, 2023
@vadim2404 vadim2404 modified the milestones: 2023/04, 2023/05 Mar 31, 2023
ololobus (Member) commented Mar 31, 2023

I don't think that we have to do it specifically for VMs only. There are at least two things that prevent doing it only on the autoscaler / neonvm side.

  1. Compute versions: they are picked based on storage node versions, so only the control-plane knows which version of the pre-created compute is needed. After a storage release there will always be pre-created computes left with a stale version, so the control-plane needs to tear them down and spawn fresh ones.

  2. compute_ctl needs a JWT and an id (compute_id) to be able to get the spec from the control-plane. These two are also known only to the control-plane, as is the future compute_id<->endpoint_id binding.

Thus, as I imagined it and discussed primarily with @tychoish and @kelvich, the pool of pre-created 'whatever' (pods, VMs, or any other custom resource in the future) is maintained by the control-plane. From the compute we only need an interface to 'wake it up' and notify it that it now serves some particular timeline. I added some PoC compute_ctl code in this PR: neondatabase/neon#3923
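
For illustration only, here is a minimal Go sketch of that 'wake up' flow from the control-plane side. The endpoint path, port, headers, and payload fields are assumptions made for the sketch, not the actual compute_ctl API from the PoC PR:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// attachRequest is a hypothetical payload the control-plane could send to a
// pre-created compute to bind it to a concrete endpoint / timeline. The field
// names are illustrative, not the real compute_ctl API.
type attachRequest struct {
	ComputeID  string `json:"compute_id"`
	EndpointID string `json:"endpoint_id"`
	TimelineID string `json:"timeline_id"`
}

// wakeUpCompute notifies a pre-created compute that it now serves a particular
// timeline. The "/attach" path and the use of a bearer JWT are assumptions.
func wakeUpCompute(computeAddr, jwt string, req attachRequest) error {
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	httpReq, err := http.NewRequest(http.MethodPost,
		fmt.Sprintf("http://%s/attach", computeAddr), bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Authorization", "Bearer "+jwt)
	httpReq.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status from compute: %s", resp.Status)
	}
	return nil
}

func main() {
	// Placeholder values; in reality the control-plane would pick a free
	// compute from the pool and use its real address, JWT, and ids.
	if err := wakeUpCompute("10.0.0.5:3080", "<jwt>", attachRequest{
		ComputeID:  "compute-abc",
		EndpointID: "ep-xyz",
		TimelineID: "1234",
	}); err != nil {
		fmt.Println("wake-up failed:", err)
	}
}
```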

So the only required part from the autoscaler perspective is:

be able to change min/max boundaries at runtime

And as we discussed with @sharnoff, we will likely need to switch to using labels / annotations for that, since those are usually the only things one can change on k8s objects without re-creation / restart.
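
A minimal sketch of that approach with client-go, assuming an in-cluster config; the NeonVM GroupVersionResource and the annotation keys below are placeholders, not the project's actual names:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster (e.g. in an operator or
	// control-plane pod); use clientcmd with a kubeconfig otherwise.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GroupVersionResource for the NeonVM custom resource; adjust to
	// whatever the real CRD registers.
	gvr := schema.GroupVersionResource{
		Group:    "vm.neon.tech",
		Version:  "v1",
		Resource: "virtualmachines",
	}

	// Hypothetical annotation keys for the min/max bounds; the real keys would
	// be whatever the autoscaler agrees to watch. A merge patch on
	// metadata.annotations does not recreate or restart the object.
	patch := []byte(`{"metadata":{"annotations":{"autoscaling.example/min-cu":"1","autoscaling.example/max-cu":"4"}}}`)

	_, err = client.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "example-vm", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("updated autoscaling bounds without restarting the VM")
}
```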

As for the DoD

Start-up time for 99% of cases is < 1 sec

I don't think it should be in the Epic, but rather in the PReq, as it depends on way too many code paths in all components. E.g. we already have some p99 outliers shortly after pageserver restarts. The control-plane could be sloppy and do many sub-optimal movements. And so on.

That said, I don't think that this target is realistic for a first iteration. Something like p90 or even p80 is more realistic, as any really huge spike from Hasura will likely hit the k8s node autoscaler and p99 will be ~20s-1min.

@vadim2404 (Author)

We discussed this item with you, and it requires some work on our side as well; therefore the epic was created.

be able to change min/max boundaries at runtime
If that's enough, then who would be against it? :)

That said, I don't think that this target is realistic for a first iteration. Something like p90 or even p80 is more realistic, as any really huge spike from Hasura will likely hit the k8s node autoscaler and p99 will be ~20s-1min.

I don't think that we need to stop after the first iteration. The ultimate goal is to have a really quick start-up for computes, so I don't want to discount the ultimate goal from the very beginning.

@sharnoff (Member)

Was going to unassign myself because it seemed like the remaining work was in https://github.com/neondatabase/cloud/pull/4741, but given this:

That said, I don't think that this target is realistic for a first iteration. Something like p90 or even p80 is more realistic, as any really huge spike from Hasura will likely hit the k8s node autoscaler and p99 will be ~20s-1min.

I don't think that we need to stop after the first iteration. The ultimate goal is to have a really quick start-up for computes, so I don't want to discount the ultimate goal from the very beginning.

I added #77 to the task list, and am keeping myself assigned :)

However: I think the DoD does not match the issue title, so one of them should probably change.

@ololobus (Member)

@sharnoff am I right that we can now change autoscaling limits on NeonVM without restart?

@sharnoff (Member)

In theory we can, but there are still some bugs that make it unusable in practice (#249, #252). Happy to prioritize those if it'd be useful. So far, my understanding was that the compute pool would just be creating VMs with their target sizes for now.

@ololobus (Member)

the compute pool would just be creating VMs with their target sizes for now

If we do a pool matrix for all (or even some) combinations of min/max, there will be just too many dimensions. I think that we will end up with the standard free-tier size (1/4) for the beginning. We can gather stats on the most common compute sizes for pods (obviously 1/4) and VMs (not sure).

So I think that we absolutely need this for moving everyone to VMs (that's where start time is much worse), but in terms of min/max, VMs are not worse than pods now (we cannot patch pods without a restart either), so I don't think it's urgent.

cc @tychoish @nikitakalyanov just in case

@sharnoff (Member)

Closing as completed because it's (broadly) been implemented on the control plane side, which was the last remaining item here. See also https://github.com/neondatabase/cloud/pull/4741#issuecomment-1644267920

@stepashka stepashka added the c/compute Component: compute label Jun 21, 2024