Epic: Create the pool of pre-allocated VMs #115
Comments
I don't think that we have to do it specifically for VMs only. There are at least two things that prevent doing it only on the autoscaler / neonvm side.
Thus, as I imagined it and discussed primarily with @tychoish and @kelvich, the pool of pre-created 'whatever' (pods, VMs, or any other custom resource in the future) is maintained by the control plane. From the compute side we only need an interface to 'wake it up' and notify it that it now serves some particular timeline. I added some PoC. So the only required part from the autoscaler perspective is:
And as we discussed with @sharnoff, we will likely need to switch to using labels / annotations for that, since those are usually the only things one can change on k8s objects without re-creation / restart. As for the DoD:
I don't think it should be in the Epic, but rather in the PReq, as it depends on way too many code paths in all components. E.g., we already have some p99 outliers shortly after pageserver restarts; the control plane could be sloppy and do many sub-optimal movements; and so on. That said, I don't think that this target is realistic for a first iteration. Something like p90 or even p80 is more realistic, as any really huge spike from Hasura will likely hit the k8s node autoscaler, and p99 will be ~20s-1min.
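On the labels/annotations point above: a minimal sketch, assuming a client-go dynamic client, of changing an annotation on a NeonVM object in place. A merge patch that touches only `metadata.annotations` is applied by the API server without recreating the object, which is what makes annotations attractive for runtime changes. The GroupVersionResource, namespace, VM name, and annotation key here are all assumptions for illustration, not the project's actual code; check the CRD manifest for the real values.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (out-of-cluster, for a demo).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Hypothetical GVR for the NeonVM CRD.
	gvr := schema.GroupVersionResource{Group: "vm.neon.tech", Version: "v1", Resource: "virtualmachines"}

	// Merge patch touching only metadata.annotations; the rest of the object
	// (and the running VM) is left untouched, so no restart is triggered.
	patch := []byte(`{"metadata":{"annotations":{"autoscaling.neon.tech/bounds":"{\"min\":1,\"max\":4}"}}}`)

	_, err = client.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "compute-pool-vm-0", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotation updated in place")
}
```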
We discussed this item with you, and it requires some work on our side as well; therefore, this epic was created.
I don't think that we need to stop after the first iteration. The ultimate goal is to have a really quick start-up for computes, so I don't want to discount the ultimate goal from the very beginning.
Was going to unassign myself because it seemed like the remaining work is in https://github.com/neondatabase/cloud/pull/4741, but given this:
I added #77 to the task list, and am keeping myself assigned :) However, I think the DoD does not match the issue title, so one of them should probably change.
@sharnoff, am I right that we can now change autoscaling limits on a NeonVM without a restart?
If we build a pool matrix for all (or even some) combinations of min/max, there will be just too many dimensions (a rough count is sketched below). I think that we will end up with a standard free-tier size (1/4) for the beginning. We can gather stats on the most common compute sizes for pods (obviously 1/4) and VMs (not sure). So I think that we absolutely need this for moving everyone to VMs (that's where start time is much worse), but in terms of min/max, VMs are not worse than pods now (we cannot patch them without a restart either), so I don't think it's urgent. cc @tychoish @nikitakalyanov just in case
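To put a rough number on "too many dimensions": if there are k allowed compute sizes and each (min, max) pair with min ≤ max needs its own pre-warmed pool, that is k(k+1)/2 pools. With, say, 7 sizes (an assumed range starting at 1/4), that is already 28 separate pools to keep stocked, which is why starting with a single standard free-tier size is far more tractable.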
Closing as completed, because it has (broadly) been implemented on the control plane side, which was the last remaining item here. See also https://github.com/neondatabase/cloud/pull/4741#issuecomment-1644267920
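For readers of the thread, a rough sketch of the mechanism this epic describes: the control plane keeps a buffer of pre-created computes and, on a start request, pops one and "wakes" it by telling it which timeline it now serves. Every name here (`Pool`, `WakeUp`, the `/configure` endpoint and its payload) is an illustrative assumption, not the actual control-plane code.

```go
package pool

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

// PooledCompute identifies one pre-created pod/VM waiting in the pool.
type PooledCompute struct {
	Name string // k8s object name
	Addr string // address of the compute's control endpoint
}

// Pool is a trivial LIFO buffer of ready computes, refilled in the background.
type Pool struct {
	mu    sync.Mutex
	ready []PooledCompute
}

// Acquire hands out a pre-created compute; on an empty pool the caller falls
// back to the slow cold-start path (creating the pod/VM from scratch).
func (p *Pool) Acquire() (PooledCompute, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.ready) == 0 {
		return PooledCompute{}, false
	}
	c := p.ready[len(p.ready)-1]
	p.ready = p.ready[:len(p.ready)-1]
	return c, true
}

// WakeUp notifies the compute which timeline it now serves. The endpoint path
// and payload are invented for illustration only.
func WakeUp(c PooledCompute, tenantID, timelineID string) error {
	body, _ := json.Marshal(map[string]string{
		"tenant_id":   tenantID,
		"timeline_id": timelineID,
	})
	resp, err := http.Post("http://"+c.Addr+"/configure", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("wake-up failed: %s", resp.Status)
	}
	return nil
}
```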
Motivation
To reduce start-up time for VMs (currently ~4 seconds), we need to have several pre-warmed VMs and be able to change min/max boundaries at runtime.
DoD
Start-up time for 99% of cases is < 1 sec
Tasks:
Related epics/tasks/PRs: