Managed Cluster Topology and cluster-autoscaler #8217

Open
MaxFedotov opened this issue Mar 2, 2023 · 28 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@MaxFedotov
Contributor

Hi @elmiko and @sbueringer.

I want to come back to this issue now that #7990 has been merged.
If I understand correctly, it is now possible to add a new machineDeployment to a managed Cluster and specify cluster-autoscaler annotations in the spec, like:

    workers:
      machineDeployments:
      - class: default-worker
        metadata:
          annotations:
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "4"
            cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "2"
        name: md-1

and the defaulting webhook will set the replicas field of the created machineDeployment to the value specified in the cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size annotation.

But if I then update this value, for example by setting cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "3" in the Cluster spec, the defaulting webhook will not update the replicas field, due to this check:

func calculateMachineDeploymentReplicas(ctx context.Context, oldMD *MachineDeployment, newMD *MachineDeployment, dryRun bool) (int32, error) {
	// If replicas is already set => Keep the current value.
	if newMD.Spec.Replicas != nil {
		return *newMD.Spec.Replicas, nil
	}

This completely prevents using cluster-autoscaler annotations with managed Cluster topologies once the machineDeployment has been created. Because the machineDeployments are managed by the topology controller, updating workers.machineDeployments[].metadata.annotations generates a patch that includes only the updated fields, so a user has no way to set newMD.Spec.Replicas back to nil.

Maybe it is possible to modify this check and exclude machineDeployments created by the topology controller from it?

Thanks!

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 2, 2023
@sbueringer
Member

sbueringer commented Mar 2, 2023

But if I then update this value, for example by setting cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "3" in the Cluster spec, the defaulting webhook will not update the replicas field, due to this check

In my opinion this is intended behavior and a limitation/feature of the autoscaler.

If you use a MachineDeployment without ClusterClass you would get the same effect.

Please also see my comment here: #7293 (comment)

@elmiko
Contributor

elmiko commented Mar 2, 2023

If you use a MachineDeployment without ClusterClass you would get the same effect.

i guess my question here is, if using MachineDeployment without ClusterClass, i could update the min size annotation and then update the replicas to the new value. how would i do that with a ClusterClass managing the MachineDeployment?

@sbueringer
Member

You would update the annotation, then set the replica field in Cluster.spec.topology and then unset it again.

@sbueringer
Member

Not ideal, but I think it's better than if we try to fix up the replica field on the MD based on the annotation and thus break the behavior that the autoscaler doesn't act when replicas is outside of the range

@sbueringer
Member

sbueringer commented Mar 2, 2023

or instead of setting/unsetting the replica field in Cluster.spec.topology, you can also just set it directly on the MD

@elmiko
Contributor

elmiko commented Mar 2, 2023

You would update the annotation, then set the replica field in Cluster.spec.topology and then unset it again.

i thought we had to leave this field unset if we are using it with the autoscaler? (sorry if i've got the details wrong here)

Not ideal, but I think it's better than if we try to fix up the replica field on the MD based on the annotation and thus break the behavior that the autoscaler doesn't act when replicas is outside of the range

this is my concern too

or instead of setting/unsetting the replica field in Cluster.spec.topology, you can also just set it directly on the MD

ack, i'm curious to hear what @MaxFedotov has to say as he is using this feature much more than i am currently.

@sbueringer
Member

sbueringer commented Mar 2, 2023

You would update the annotation, then set the replica field in Cluster.spec.topology and then unset it again.

i thought we had to leave this field unset if we are using it with the autoscaler? (sorry if i've got the details wrong here)

It has to be unset for the autoscaler to be able to control the field continuously (otherwise they would both try to set it continuously). But if the goal is to temporarily take control of the field to bring it into the min/max range, you can set it and directly unset it afterwards.

@elmiko
Contributor

elmiko commented Mar 2, 2023

But if the goal is to temporarily take control of the field to bring it into the min/max range, you can set it and directly unset it afterwards

that sounds like the user can at least work around this issue manually

@sbueringer
Member

sbueringer commented Mar 2, 2023

I think it's definitely not worse than with the standalone MachineDeployment.

In both cases they:

  1. change the min-size annotation
  2. set the replica field to bring it into range (with a Cluster topology they can do this either directly on the MD or by setting and unsetting it via Cluster.spec.topology)

@MaxFedotov
Contributor Author

MaxFedotov commented Mar 2, 2023

Not ideal, but I think it's better than if we try to fix up the replica field on the MD based on the annotation and thus break the behavior that the autoscaler doesn't act when replicas is outside of the range

Seems like I don't understand what can go wrong in this case :( If we set replicas on the MD equal to the autoscaler min annotation, how can it be outside of the range?

I think it's definitely not worse than with the standalone MachineDeployment.

I have to disagree with that. In the case of a standalone MD, a user needs a single atomic update operation (set annotations, remove replicas). In the case of a Cluster topology, a user needs at least 2 operations (really it will be 3: update/get/update), which can be hard to make atomic.

@sbueringer
Member

Seems like I don't understand what can go wrong in this case :( If we set replicas on the MD equal to the autoscaler min annotation, how can it be outside of the range?

Autoscaler today intentionally does not act when the value of the replica field is outside of the min/max range. If we want to preserve this behavior we can't implement something that always changes the replica field to be inside of the range.

I think it's definitely not worse than with the standalone MachineDeployment.

I have to disagree with that. In the case of a standalone MD, a user needs a single atomic update operation (set annotations, remove replicas). In the case of a Cluster topology, a user needs at least 2 operations (really it will be 3: update/get/update), which can be hard to make atomic.

That's a valid point

@MaxFedotov
Contributor Author

MaxFedotov commented Mar 3, 2023

Autoscaler today intentionally does not act when the value of the replica field is outside of the min/max range. If we want to preserve this behavior we can't implement something that always changes the replica field to be inside of the range.

But if the user intentionally specified, using the cluster-autoscaler annotation, that he wants the min number of nodes in the node group to be equal to some number, wouldn't the correct behavior be to set the current number of replicas to that number (because the autoscaler won't do it; it will only ensure that the number of replicas won't drop below it)?

@elmiko
Contributor

elmiko commented Mar 3, 2023

But if the user intentionally specified, using the cluster-autoscaler annotation, that he wants the min number of nodes in the node group to be equal to some number, wouldn't the correct behavior be to set the current number of replicas to that number (because the autoscaler won't do it; it will only ensure that the number of replicas won't drop below it)?

this is really the central question imo, if we decide to do this then we are diverging from the autoscaler's default behavior. at which point it becomes an expected cluster-api behavior. currently, the autoscaler has chosen not to adjust the cluster when the min/max limits are changed because there might be external factors for why this happened and it is expected that the user will take action. in this case, the proposal would be for the cluster-api controllers to automatically take the action that the user would. my next question would be, what if a user didn't want the replicas to be automatically adjusted, how would we allow that?

@MaxFedotov
Contributor Author

MaxFedotov commented Mar 3, 2023

my next question would be, what if a user didn't want the replicas to be automatically adjusted, how would we allow that?

That depends on the user's intention :) If he specified the min number of nodes to be equal to N, I think he wants them to be adjusted to N (and not to perform some other manual steps to achieve it); otherwise, he just won't do it.

@elmiko
Contributor

elmiko commented Mar 3, 2023

If he specified the min number of nodes to be equal to N, I think he wants them to be adjusted to N (and not to perform some other manual steps to achieve it); otherwise, he just won't do it.

that's generally what i would expect too, i'm just trying to think about corner cases where a user would adjust min value but not want the replicas automatically updated. it does sound bizarre when said aloud though XD

@MaxFedotov
Contributor Author

it does sound bizarre when said aloud though XD

Completely agree :)

@sbueringer
Member

sbueringer commented Mar 6, 2023

I agree that it feels surprising that when replicas is outside of the range it isn't brought into range.

But I think if this is the expected behavior then we should improve autoscaler to handle this case correctly. This will benefit not only the ClusterClass use case. It just feels off to implement a workaround/hack in ClusterClass/Cluster API because the behavior of the autoscaler is not what we expect it to be.

Essentially by setting the annotations we hand over control of the replica field to the autoscaler. Then the autoscaler should also be responsible to bring the replica field in range.

I think when using Cluster API with GitOps folks run into the exact same issue as with ClusterClass. They somehow have to bring replicas in the min/max range but they cannot continuously apply the replica field.

As far as I can tell the only scenario where the current behavior works well is when someone manually modifies a MachineDeployment or if someone implements some sort of "one time operation". But continuous reconcile doesn't work, which doesn't seem ideal.

@MaxFedotov
Contributor Author

@sbueringer thanks, understood!
@elmiko What do you think, would it be possible to handle this logic on the cluster-autoscaler side?

@elmiko
Contributor

elmiko commented Mar 6, 2023

@MaxFedotov i have a feeling this would be extremely difficult to change in the core autoscaler. this behavior has been long-standing there and i doubt they will want to change it.

it's possible that we could try to detect the annotation change in our cloud provider and then adjust the replicas from that side, but since we don't really have a controller there it might not work exactly as we want. we would need to do some research on the cloud-provider side. i have a feeling this won't work though since we won't know when a user has updated the min value.

my gut feeling after this conversation is that we should probably leave this behavior as it is and add more documentation to instruct users how to automate this. i'm willing to do some research about it, or happy to collaborate if you want to do some research as well, but i have a feeling the manual approach might be the most straightforward solution.

@MaxFedotov
Contributor Author

Seems like we are in a deadlock situation here.
@elmiko, @sbueringer, as you have more experience in both the autoscaler and CAPI: what do you think, are there any ways to resolve it? Or is it better to drop this discussion and stick to

set the replica field to bring it in range (with Cluster topology they can either do it also directly on the MD or by setting and unsetting via Cluster.spec.topology)

Although, to be honest, this looks a bit ugly from a design perspective to me :(

@fabriziopandini
Member

fabriziopandini commented Mar 15, 2023

Reading through this thread, it seems that there is common agreement on a couple of points:

  • The current autoscaler behavior is surprising for users, and more research is required to see if/how this can be changed
  • Recent work in CAPI mitigates this issue, but this workaround only covers CAPI handing management of replicas over to the autoscaler and back; we intentionally avoid CAPI taking any action while the autoscaler is responsible for managing the replica number, because this can easily lead to the controllers fighting.
  • The use case described in this issue falls into the gray zone between "the autoscaler is responsible for managing the replica number" and "the current autoscaler behavior is surprising for users"

Considering that, I think we should open an issue in the autoscaler repository, and possibly advocate for it at the autoscaler office hours. If we can trigger a discussion about the autoscaler acting when the number of replicas is out of the min/max boundary, and get some traction around it, we can solve this issue and also get rid of the CAPI workaround.

Documentation could be a stop-gap while this discussion goes on, but the primary work for this should happen in the autoscaler IMO.

@elmiko
Contributor

elmiko commented Mar 15, 2023

The use case described in this issue falls into the gray zone between "the autoscaler is responsible for managing the replica number" and "the current autoscaler behavior is surprising for users"

+1

Considering that, I think we should open an issue in the autoscaler repository, and possibly advocate for it at the autoscaler office hours. If we can trigger a discussion about the autoscaler acting when the number of replicas is out of the min/max boundary, and get some traction around it, we can solve this issue and also get rid of the CAPI workaround.

i think this would be a good discussion if only to help surface the issue we are facing and to see if the SIG might have other ideas as well. the behavior of the autoscaler with respect to min/max nodes has been in place for a long time, so i'm not quite sure how the SIG will react to a proposal to change that behavior, but i do think they will have some ideas to help move the conversation forward, whatever the result may be.

@sbueringer
Member

The behavior of the autoscaler with respect to min/max nodes has been in place for a long time, so i'm not quite sure how the SIG will react to a proposal to change that behavior, but i do think they will have some ideas to help move the conversation forward, whatever the result may be.

Not sure if it's a realistic option, but maybe this could also be an optional behavior of the autoscaler controlled via some sort of flag or annotation.

@elmiko
Contributor

elmiko commented Mar 16, 2023

Not sure if it's a realistic option, but maybe this could also be an optional behavior of the autoscaler controlled via some sort of flag or annotation.

it's certainly possible and might give us a way to break this jam, that seems like a reasonable path forward to me.

@fabriziopandini
Member

/triage accepted
given all the value in the above discussions
however, it seems that there is nothing actionable on the CAPI side, at least while we engage with the autoscaler community

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 21, 2023
@sbueringer
Member

sbueringer commented Dec 19, 2023

@MaxFedotov @elmiko Given that there is now a flag in the autoscaler to control this behavior (enforce-node-group-min-size, kubernetes/autoscaler#5267 (comment)), what do we think about this topic now?

(looks like this flag was merged ~ a year ago)

Do you think we should update our documentation? (https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/autoscaling)

@fabriziopandini
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Apr 11, 2024
@fabriziopandini
Member

/assign
To update the documentation about the current state of the autoscaler
