Managed Cluster Topology and cluster-autoscaler #8217

Open
MaxFedotov opened this issue Mar 2, 2023 · 28 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@MaxFedotov
Contributor

Hi @elmiko and @sbueringer.

I want to come back to this issue now that #7990 has been merged.
If I understand correctly, it is now possible to add a new machineDeployment to a managed Cluster and specify cluster-autoscaler annotations in the spec, like:

    workers:
      machineDeployments:
      - class: default-worker
        metadata:
          annotations:
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "4"
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "2"
        name: md-1

and the defaulting webhook will set the replicas field of the created machineDeployment to the value specified in the cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size annotation.

But if I then update this value, for example by setting cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "3" in the Cluster spec, the defaulting webhook will not update the replicas field, due to this check:

func calculateMachineDeploymentReplicas(ctx context.Context, oldMD *MachineDeployment, newMD *MachineDeployment, dryRun bool) (int32, error) {
	// If replicas is already set => Keep the current value.
	if newMD.Spec.Replicas != nil {
		return *newMD.Spec.Replicas, nil
	}

This completely prevents using cluster-autoscaler annotations with managed Cluster topologies once the machineDeployment has been created. Because the machineDeployments are managed by the topology controller, updating workers.machineDeployments[].metadata.annotations generates a patch that includes only the updated fields, so a user has no way to set newMD.Spec.Replicas back to nil.

Maybe it is possible to modify this check and exclude machineDeployments created by the topology controller from it?

Thanks!

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 2, 2023
@sbueringer
Member

sbueringer commented Mar 2, 2023

But if I then update this value, for example by setting cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "3" in the Cluster spec, the defaulting webhook will not update the replicas field, due to this check

In my opinion this is intended behavior and a limitation/feature of the autoscaler.

If you use a MachineDeployment without ClusterClass you would get the same effect.

Please also see my comment here: #7293 (comment)

@elmiko
Contributor

elmiko commented Mar 2, 2023

If you use a MachineDeployment without ClusterClass you would get the same effect.

i guess my question here is, if using MachineDeployment without ClusterClass, i could update the min size annotation and then update the replicas to the new value. how would i do that with a ClusterClass managing the MachineDeployment?

@sbueringer
Member

You would update the annotation, then set the replica field in Cluster.spec.topology and then unset it again.

@sbueringer
Member

Not ideal, but I think it's better than if we try to fix up the replica field on the MD based on the annotation and thus break the behavior that the autoscaler doesn't act when replicas is outside of the range

@sbueringer
Member

sbueringer commented Mar 2, 2023

or instead of setting/unsetting the replica field in Cluster.spec.topology, you can also just set it directly on the MD

@elmiko
Contributor

elmiko commented Mar 2, 2023

You would update the annotation, then set the replica field in Cluster.spec.topology and then unset it again.

i thought we had to leave this field unset if we are using it with the autoscaler? (sorry if i've got the details wrong here)

Not ideal, but I think it's better than if we try to fix up the replica field on the MD based on the annotation and thus break the behavior that the autoscaler doesn't act when replicas is outside of the range

this is my concern too

or instead of setting/unsetting the replica field in Cluster.spec.topology, you can also just set it directly on the MD

ack, i'm curious to hear what @MaxFedotov has to say as he is using this feature much more than i am currently.

@sbueringer
Member

sbueringer commented Mar 2, 2023

You would update the annotation, then set the replica field in Cluster.spec.topology and then unset it again.

i thought we had to leave this field unset if we are using it with the autoscaler? (sorry if i've got the details wrong here)

It has to be unset for the autoscaler to be able to control the field continuously (otherwise they would both try to set it continuously). But if the goal is to temporarily take control of the field to bring it into the min/max range, you can set it and directly unset it afterwards.

@elmiko
Contributor

elmiko commented Mar 2, 2023

But if the goal is to temporarily take control of the field to bring it into the min/max range, you can set it and directly unset it afterwards

that sounds like the user can at least work around this issue manually

@sbueringer
Member

sbueringer commented Mar 2, 2023

I think it's definitely not worse than with the standalone MachineDeployment.

In both cases they:

  1. change the min-size annotation
  2. set the replica field to bring it into range (with a Cluster topology they can do this either directly on the MD or by setting and unsetting it via Cluster.spec.topology)

@MaxFedotov
Contributor Author

MaxFedotov commented Mar 2, 2023

Not ideal, but I think it's better than if we try to fix up the replica field on the MD based on the annotation and thus break the behavior that the autoscaler doesn't act when replicas is outside of the range

Seems like I don't understand what can go wrong in this case :( If we set replicas on the MD equal to the autoscaler min annotation, how can it be outside of the range?

I think it's definitely not worse than with the standalone MachineDeployment.

I have to disagree with that. In the case of a standalone MD, a user needs a single atomic update operation (set annotations, remove replicas). In the case of a Cluster topology, a user needs at least 2 operations (really it will be 3: update/get/update), which can be hard to make atomic.

@sbueringer
Member

Seems like I don't understand what can go wrong in this case :( If we set replicas on the MD equal to the autoscaler min annotation, how can it be outside of the range?

Autoscaler today intentionally does not act when the value of the replica field is outside of the min/max range. If we want to preserve this behavior we can't implement something that always changes the replica field to be inside of the range.

I think it's definitely not worse than with the standalone MachineDeployment.

I have to disagree with that. In the case of a standalone MD, a user needs a single atomic update operation (set annotations, remove replicas). In the case of a Cluster topology, a user needs at least 2 operations (really it will be 3: update/get/update), which can be hard to make atomic.

That's a valid point

@MaxFedotov
Contributor Author

MaxFedotov commented Mar 3, 2023

Autoscaler today intentionally does not act when the value of the replica field is outside of the min/max range. If we want to preserve this behavior we can't implement something that always changes the replica field to be inside of the range.

But if the user intentionally specified, using the cluster-autoscaler annotation, that he wants the min number of nodes in the node group to be equal to some number, wouldn't the correct behavior be to set the current number of replicas to that number (because the autoscaler won't do it; it will only ensure that the number of replicas won't drop below it)?

@elmiko
Contributor

elmiko commented Mar 3, 2023

But if the user intentionally specified, using the cluster-autoscaler annotation, that he wants the min number of nodes in the node group to be equal to some number, wouldn't the correct behavior be to set the current number of replicas to that number (because the autoscaler won't do it; it will only ensure that the number of replicas won't drop below it)?

this is really the central question imo, if we decide to do this then we are diverging from the autoscaler's default behavior. at which point it becomes an expected cluster-api behavior. currently, the autoscaler has chosen not to adjust the cluster when the min/max limits are changed because there might be external factors for why this happened and it is expected that the user will take action. in this case, the proposal would be for the cluster-api controllers to automatically take the action that the user would. my next question would be, what if a user didn't want the replicas to be automatically adjusted, how would we allow that?

@MaxFedotov
Contributor Author

MaxFedotov commented Mar 3, 2023

my next question would be, what if a user didn't want the replicas to be automatically adjusted, how would we allow that?

That depends on the user's intention :) If he specified the min number of nodes to be equal to N, I think he wants them to be adjusted to N (and not to perform some other manual steps to achieve it); otherwise, he just won't do it.

@elmiko
Contributor

elmiko commented Mar 3, 2023

If he specified the min number of nodes to be equal to N, I think he wants them to be adjusted to N (and not to perform some other manual steps to achieve it); otherwise, he just won't do it.

that's generally what i would expect too, i'm just trying to think about corner cases where a user would adjust min value but not want the replicas automatically updated. it does sound bizarre when said aloud though XD

@MaxFedotov
Contributor Author

it does sound bizarre when said aloud though XD

Completely agree :)

@sbueringer
Member

sbueringer commented Mar 6, 2023

I agree that it feels surprising that when replicas is outside of the range it isn't brought into range.

But I think if this is the expected behavior then we should improve autoscaler to handle this case correctly. This will benefit not only the ClusterClass use case. It just feels off to implement a workaround/hack in ClusterClass/Cluster API because the behavior of the autoscaler is not what we expect it to be.

Essentially by setting the annotations we hand over control of the replica field to the autoscaler. Then the autoscaler should also be responsible to bring the replica field in range.

I think when using Cluster API with GitOps folks run into the exact same issue as with ClusterClass. They somehow have to bring replicas in the min/max range but they cannot continuously apply the replica field.

As far as I can tell the only scenario where the current behavior works well is when someone manually modifies a MachineDeployment or if someone implements some sort of "one time operation". But continuous reconcile doesn't work, which doesn't seem ideal.

@MaxFedotov
Contributor Author

@sbueringer thanks, understood!
@elmiko What do you think, would it be possible to handle this logic on the cluster-autoscaler side?

@elmiko
Contributor

elmiko commented Mar 6, 2023

@MaxFedotov i have a feeling this would be extremely difficult to change in the core autoscaler. this behavior has been long-standing there and i doubt they will want to change it.

it's possible that we could try to detect the annotation change in our cloud provider and then adjust the replicas from that side, but since we don't really have a controller there it might not work exactly as we want. we would need to do some research on the cloud-provider side. i have a feeling this won't work though since we won't know when a user has updated the min value.

my gut feeling after this conversation is that we should probably leave this behavior as it is and add more documentation to instruct users how to automate this. i'm willing to do some research about it, or happy to collaborate if you want to do some research as well, but i have a feeling the manual approach might be the most straightforward solution.

@MaxFedotov
Contributor Author

Seems like we are in a deadlock situation here.
@elmiko, @sbueringer, as you have more experience in both the autoscaler and CAPI: what do you think, are there any ways to resolve it? Or is it better to drop this discussion and stick to

set the replica field to bring it in range (with Cluster topology they can either do it also directly on the MD or by setting and unsetting via Cluster.spec.topology)

Although, to be honest, this looks a bit ugly from a design perspective to me :(

@fabriziopandini
Member

fabriziopandini commented Mar 15, 2023

Reading through this thread, it seems that there is common agreement on a couple of points:

  • The current autoscaler behavior is surprising for users, and more research is required to see if/how this can be changed
  • Recent work in CAPI mitigates this issue, but this workaround only covers CAPI handing management of replicas over to the autoscaler and back; we intentionally avoid CAPI taking any action while the autoscaler is responsible for managing the replica number, because this can easily lead to the controllers fighting.
  • The use case described in this issue falls into the gray zone between "the autoscaler is responsible for managing the replica number" and "the current autoscaler behavior is surprising for users"

Considering that, I think we should open an issue in the autoscaler repository, and possibly advocate for it at the autoscaler office hours. If we can trigger a discussion about the autoscaler acting when the number of replicas is out of the min/max boundary, and get some traction around it, we can solve this issue and also get rid of the CAPI workaround.

Documentation could be a stop-gap while this discussion goes on, but the primary work for this should happen in the autoscaler IMO.

@elmiko
Contributor

elmiko commented Mar 15, 2023

The use case described in this issue falls into the gray zone between "the autoscaler is responsible for managing the replica number" and "the current autoscaler behavior is surprising for users"

+1

Considering that, I think we should open an issue in the autoscaler repository, and possibly advocate for it at the autoscaler office hours. If we can trigger a discussion about the autoscaler acting when the number of replicas is out of the min/max boundary, and get some traction around it, we can solve this issue and also get rid of the CAPI workaround.

i think this would be a good discussion if only to help surface the issue we are facing and to see if the SIG might have other ideas as well. the behavior of the autoscaler with respect to min/max nodes has been in place for a long time, so i'm not quite sure how the SIG will react to a proposal to change that behavior, but i do think they will have some ideas to help move the conversation forward, whatever the result may be.

@sbueringer
Member

The behavior of the autoscaler with respect to min/max nodes has been in place for a long time, so i'm not quite sure how the SIG will react to a proposal to change that behavior, but i do think they will have some ideas to help move the conversation forward, whatever the result may be.

Not sure if it's a realistic option, but maybe this could also be an optional behavior of the autoscaler controlled via some sort of flag or annotation.

@elmiko
Contributor

elmiko commented Mar 16, 2023

Not sure if it's a realistic option, but maybe this could also be an optional behavior of the autoscaler controlled via some sort of flag or annotation.

it's certainly possible and might give us a way to break this jam, that seems like a reasonable path forward to me.

@fabriziopandini
Member

/triage accepted
given all the value in the above discussions
however, it seems that there is nothing actionable on the CAPI side, at least while we engage with the autoscaler community

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 21, 2023
@sbueringer
Member

sbueringer commented Dec 19, 2023

@MaxFedotov @elmiko Given that there is now a flag in the autoscaler to control this behavior (enforce-node-group-min-size, kubernetes/autoscaler#5267 (comment)), what do we think about this topic now?

(looks like this flag was merged ~ a year ago)

Do you think we should update our documentation? (https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling)

@fabriziopandini
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Apr 11, 2024
@fabriziopandini
Member

/assign
To update the documentation about the current state of the autoscaler
