Scale up: ignore existing nodes #1021
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Add a --scale-up-cloud-provider-template option, defaulting to false. If enabled, the autoscaler will always create template nodes using the cloud provider configuration, ignoring existing nodes.
If an ASG uses multiple availability zones, select a random one when constructing a template node instead of the first one. This allows the autoscaler to make progress on pods with specific zone requirements (e.g. due to node selectors or volume bindings), even if it takes a couple of attempts.
Rebased on master. Any updates? @aleksandra-malinowska
The reason we pick an existing node is that building a template in the cloudprovider is just guessing what the node will look like. There is a lot of logic that hard-codes some values or copy-pastes code from other repos.
To give you an example:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L268
This is pretty bad: it means we assume the kubelet doesn't reserve any resources for the system.
Also we have hard-coded kube-proxy request:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/utils.go#L36
And a whole lot of other stuff like that. The cloudprovider template is just an approximation we use for scale-from-0, because we have nothing else we can use.
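To make that concrete, here is a minimal illustrative sketch in Go (not the autoscaler's actual code; names and numbers are made up) of the approximation those links describe: the template's Allocatable is a straight copy of Capacity, so the guessed node always looks roomier than a real one whose kubelet reserves resources for the system.

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// templateNode is a hypothetical stand-in for a cloud-provider template
// builder that guesses node resources from static instance-type data.
func templateNode(cpuCores, memMiB int64) *apiv1.Node {
	capacity := apiv1.ResourceList{
		apiv1.ResourceCPU:    *resource.NewQuantity(cpuCores, resource.DecimalSI),
		apiv1.ResourceMemory: *resource.NewQuantity(memMiB*1024*1024, resource.BinarySI),
	}
	node := &apiv1.Node{}
	node.Status.Capacity = capacity
	// The problematic assumption: Allocatable == Capacity. On a real node the
	// kubelet subtracts system/kube reserved resources from Capacity, so this
	// template overestimates what pods can actually fit.
	node.Status.Allocatable = capacity
	return node
}

func main() {
	n := templateNode(4, 16*1024)
	// Prints "16Gi"; a real node of this size would report less.
	fmt.Println(n.Status.Allocatable.Memory())
}
```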
Recently I've seen a case where CA overestimated node memory when scaling from 0 and added a node to a group that was too small for the pending pod. After the node was added, CA figured out it was too small, added a node from a different node group, and finally removed the incorrectly added node. With your flag set to true it would keep adding more nodes to the wrong node group.
Also, implementing TemplateNodeInfo is optional for cloudproviders, and setting your flag to true would break CA for any provider that doesn't implement it.
So I recommend against what you're trying to do. That being said, I don't mind adding the flag as long as the default is false and the flag description clearly warns about the risk of using it.
@@ -214,7 +214,7 @@ func (m *AwsManager) getAsgTemplate(asg *asg) (*asgTemplate, error) {
 		return nil, fmt.Errorf("Unable to get first AvailabilityZone for %s", asg.Name)
 	}

-	az := asg.AvailabilityZones[0]
+	az := asg.AvailabilityZones[rand.Intn(len(asg.AvailabilityZones))]
Similar to picking a random node, this makes a multi-AZ ASG sort of work. Except we run predicates for one node and expect the result to always be exactly the same for every node in the node group. As soon as you use a scheduling config that is zone-specific (nodeSelector on the zone label, zone-level pod affinity/anti-affinity, etc.), CA will start behaving incorrectly.
You really should be using a separate ASG for each zone if you want CA to work correctly in every scenario.
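A toy Go sketch of that failure mode (illustrative only; the zone label and values are assumptions, and real CA runs full scheduler predicates rather than this simplified check):

```go
package main

import "fmt"

// fitsNodeSelector mimics the nodeSelector predicate: every selector entry
// must match a node label exactly.
func fitsNodeSelector(nodeLabels, selector map[string]string) bool {
	for k, v := range selector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Template node built from one randomly picked AZ of a multi-AZ ASG.
	templateLabels := map[string]string{"topology.kubernetes.io/zone": "us-east-1a"}
	// Pending pod pinned to a sibling zone the same ASG also covers.
	podSelector := map[string]string{"topology.kubernetes.io/zone": "us-east-1b"}
	// Prints "false": CA concludes the whole node group can't help this pod,
	// even though the ASG could deliver a us-east-1b node.
	fmt.Println(fitsNodeSelector(templateLabels, podSelector))
}
```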
 	expendablePodsPriorityCutoff = flag.Int("expendable-pods-priority-cutoff", -10, "Pods with priority below cutoff will be expendable. They can be killed without any consideration during scale down and they don't cause scale up. Pods with null priority (PodPriority disabled) are non expendable.")
 	regional = flag.Bool("regional", false, "Cluster is regional.")
+	scaleUpTemplateFromCloudProvider = flag.Bool("scale-up-cloud-provider-template", false,
+		"Should CA build the template nodes using up-to-date cloud provider configuration instead of a random existing node.")
I think the description is misleading. I'd prefer if it mentioned the risks I talk about in my other comment.
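One possible wording, offered as a suggestion only (this text is not from the PR), that surfaces the risks raised in the earlier review comment:

```go
package main

import "flag"

// Hedged suggestion for a more explicit flag description; the warning text
// paraphrases the risks from the review, not official documentation.
var scaleUpTemplateFromCloudProvider = flag.Bool("scale-up-cloud-provider-template", false,
	"Build template nodes from up-to-date cloud provider configuration instead of copying a "+
		"random existing node. WARNING: cloud provider templates are approximations; they may "+
		"misestimate node resources, repeatedly scale up an unsuitable node group, and fail on "+
		"cloud providers that do not implement TemplateNodeInfo.")

func main() { flag.Parse() }
```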
 	nodeInfos, err := GetNodeInfosForGroups(nodes, context.CloudProvider, context.ClientSet,

+	existingNodes := nodes
+	if context.ScaleUpTemplateFromCloudProvider {
This should happen inside GetNodeInfosForGroups. The way it is now relies on the fact that we only call that method here; if we ever need to call it elsewhere, we're likely to end up with inconsistent behavior. It also relies on how GetNodeInfosForGroups works internally. Given that we don't really test CA with all possible flag values (i.e. this branch is unlikely to be tested), it may easily break as soon as someone needs to change anything.
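A sketch of the refactor being asked for, under assumed names (the real GetNodeInfosForGroups signature differs, and buildNodeInfos is a hypothetical stand-in for the existing internals): the flag is consumed inside the function, so every caller gets the same behavior.

```go
// Sketch only: GetNodeInfosForGroups decides internally whether to ignore
// existing nodes, instead of relying on each caller to pre-filter the list.
func GetNodeInfosForGroups(nodes []*apiv1.Node, cloud cloudprovider.CloudProvider,
	client kubernetes.Interface, templateFromCloudProvider bool) (map[string]*schedulercache.NodeInfo, error) {
	if templateFromCloudProvider {
		// Drop existing nodes so every group falls through to the cloud
		// provider's TemplateNodeInfo(); a provider that doesn't implement it
		// surfaces an error here instead of misbehaving in a caller.
		nodes = nil
	}
	// ...unchanged internals: sample a real node per group when available,
	// otherwise fall back to the provider template...
	return buildNodeInfos(nodes, cloud, client)
}
```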
@@ -887,6 +887,118 @@ func TestScaleUpAutoprovisionedNodeGroup(t *testing.T) {
 	assert.Equal(t, "autoprovisioned-T1-1", getStringFromChan(expandedGroups))
 }

+func TestScaleUpAutoprovisionFavorExistingNodes(t *testing.T) {
Your change has nothing to do with autoprovisioning (which is our word for CA automatically creating new node groups, currently only working on GKE). You should use a non-autoprovisioning test as a starting point.
@MaciekPytel Sorry it took so long to reply, I was busy with other things. I didn't really look at the scale-up-from-zero code and expected it to handle everything correctly, but I now see how its current behavior could cause issues if I force it. I'd still like to solve the issues we have with the current faulty logic. The main two issues we want to solve are:
A combination of the current approach (use existing nodes as templates) and scale-up-from-0 logic could solve both of these, I believe, but it will require changing the
Comments/suggestions?
Sorry for the late reply, I was on vacation and completely missed your reply. First of all, there are a lot of implementations of cloudprovider (3 + 2 in progress in the main repo, and I know about at least 3 different providers that implemented their cloudprovider in a fork). We've been GA for a year now and it's effectively our API. I'm ok-ish with extending the interface with new optional methods, but I'm against any major changes (especially backward-incompatible ones) without an extremely good reason. Also, if we were to do such changes, we should probably go through a deprecation process, which makes me even less happy about it.

Fundamentally, a NodeGroup is a set of identical, interchangeable nodes. It's the central assumption in CA and we depend on it both in core code and in various cloudprovider-specific places. Just create a separate NodeGroup with a different instance type; CA works quite well with multiple NodeGroups. It's the same advice we give people who want multi-zone clusters: use a separate ASG in each zone instead of multi-zonal ASGs.

I'm more sympathetic to the labels and taints use case. I can see a lot of reasons for temporarily tainting a node, and CA should play better with it. The simplest solution would be to just say that any label or taint prefixed with
I still think that the current CA behaviour is a surprising and undocumented hack that's very painful when you actually run into it. Software should be as simple as possible and not try to magically guess for the user, which is what CA does now (badly!). I see your point about not wanting to change the provider interface. What about reconciling the difference in the CA code itself? You could still keep the provider interface the same as before, where CA asks it to provide a template node, and then merge it with the existing nodes:

This should make it possible to implement better logic while keeping the provider interface intact, even though it doesn't cover more complicated future improvements (e.g. node pools with different instance types, which would be possible with AWS Spot Fleets). For this you'd actually need to implement the scale-up-from-zero calculation properly.
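(The concrete merge steps were lost from the comment above; the following is a hedged Go sketch, consistent with how the follow-up comment summarizes the proposal, with hypothetical names: measured resources from a live node when one exists, labels and taints from the provider template, and a fallback when either side is missing.)

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// mergeTemplate is a hypothetical helper, not code from this PR.
func mergeTemplate(existing, template *apiv1.Node) *apiv1.Node {
	if existing == nil {
		return template // scale-from-0: only the provider's guess is available
	}
	if template == nil {
		return existing // provider doesn't implement TemplateNodeInfo
	}
	merged := existing.DeepCopy()
	// Desired config (labels, taints) comes from the provider template, so
	// ASG/MIG edits are picked up without waiting for a new node to appear.
	merged.Labels = template.Labels
	merged.Spec.Taints = template.Spec.Taints
	// Capacity/Allocatable stay as measured on the real node, keeping the
	// resource estimate accurate.
	return merged
}
```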
I think the problem once again boils down to the definition of NodeGroup. A NodeGroup is not an ASG; it is a scalable set of identical nodes. Nothing in the core logic expects it to be anything more, and I don't think we want to change that: the interface is meant to be a common denominator between clouds, and there are significant differences between ASG, MIG, VMSS, etc. I'd argue that once we assume the nodes in a NodeGroup are identical, using a copy of an existing node as a template is the simplest solution. I can explain it in a single sentence to someone who doesn't know CA, which is not the case for combining the allocatable of an existing node with taints and labels from TemplateNodeInfo, with some sort of fallback if one or the other is unavailable.

If we want to support cloud-provider-specific features like changing the instance type in an ASG, perhaps the right way to do this would be to have a cloudprovider implementation that handles it behind the scenes, representing machines with a different instance type as a separate NodeGroup. There is no requirement for NodeGroup to map 1-1 to an ASG, as long as we manage to implement things like TargetSize() in a consistent way.

Taints are a special case, because they can represent both node config (e.g. tainting a node with expensive resources to limit what pods can schedule on it) and state (some temporary condition of the node). If it's the latter we should discard the taint. The problem is how to distinguish the two; I dislike both the flag and the prefix approach proposed in #1350, but I don't have any better idea. We don't have any CRD or dynamic config that we could use to pass those, and we're not likely to add one in the foreseeable future.
Yes, but it's also a bad solution, unless you expect cluster operators to never change taints or labels. And the way it's completely automagical is even worse, because it will work most of the time, so people won't even read the documentation and will just rely on it, until they change a label or add a taint and their cluster suddenly stops scaling for a couple of pods. Once that happens, it's also impossible to fix without either getting rid of affinities/selectors, babysitting the cluster to manually scale it up, or bringing up nodes until you get lucky and the provider gives you a node with a name that sorts before the "broken" one. You now also have two code paths, one for

I would also argue that the assumption that all the nodes in the node pool are identical is dubious at best. Yes, it works for GKE, because you have tooling that allows you to treat node pools as immutable and migrate from one to another. Is every user of CA supposed to implement the same tooling just to achieve something as simple as changing the instance type? Or are users supposed to manually babysit the nodes, draining them one by one after creating the fiftieth copy of the same ASG, even though AWS easily allows changing the instance type of an existing ASG? The only real problem, as you've said, is the capacity estimation of new nodes, but there's no real requirement that all nodes spawned from an ASG look exactly the same as existing ones; they just have to look exactly the same as each other so the scale-up calculations can work correctly.
As @MaciekPytel already said, the fact that a node group is a set of identical nodes is a principal design constraint and it will not be changed. All the logic in CA depends on that assumption.
As we discussed during the SIG meeting, it seems like an option to influence how the next node shape is predicted for a given node group may come in handy. Given that, we suggest implementing this by adding a new

As an example of how we use

@MaciekPytel, do you want to add something?
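The processor option referenced here later shows up as the NodeInfo processor mentioned in the commit messages below. A sketch of what such a hook could look like; the interface shape is assumed from later autoscaler versions and every name is illustrative:

```go
// nodeShapeOverride shows where custom node-shape discovery could hook in:
// the processor receives the per-group NodeInfo map after it is built and may
// rewrite entries, e.g. swapping in NodeGroup.TemplateNodeInfo() output.
type nodeShapeOverride struct{}

func (p *nodeShapeOverride) Process(ctx *context.AutoscalingContext,
	nodeInfos map[string]*schedulernodeinfo.NodeInfo) (map[string]*schedulernodeinfo.NodeInfo, error) {
	for groupID, info := range nodeInfos {
		_ = groupID // per-group override logic would go here
		_ = info
	}
	return nodeInfos, nil
}

// CleanUp matches the assumed processor lifecycle.
func (p *nodeShapeOverride) CleanUp() {}
```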
No activity for almost 2 months, closing. Please reopen if needed.
This is a second attempt at kubernetes#1021 (might also provide an alternate solution for kubernetes#2892 and kubernetes#3608). Per the kubernetes#1021 discussion, a flag might be acceptable, if defaulting to false and describing the limitations (not all cloud providers) and the risk of using it (loss of accuracy, risk of upscaling unusable nodes or leaving pending pods). Some use cases include:
* Balance Similar when upscaling from zero (but I believe kubernetes#3608 is a better approach)
* Edited/updated ASG/MIG taints and labels
* Updated instance type
Per the previous discussion, the latter two cases could be covered in the long term by custom node-shape discovery Processors (open to discuss that option too).
This is a third attempt at kubernetes#1021 (might also provide an alternate solution for kubernetes#2892, kubernetes#3608 and kubernetes#3609). Some use cases include: Balance Similar when upscaling from zero, edited/updated ASG/MIG taints and labels, and updated instance type. Per the kubernetes#1021 discussion, a flag might be acceptable, if defaulting to false and describing the limitations (not all cloud providers) and the risk of using it (loss of accuracy, risk of upscaling unusable nodes or leaving pending pods). Per the kubernetes#3609 discussion, using a NodeInfo processor is preferred. The fate of `GetNodeInfosForGroups()` (and the opportunity to split NodeInfoProcessor into NodeInfoProcessor and NodeInfoDecoratorProcessor) is left open for discussion, as I'd like to collect feedback on refactoring plans here.
Add a new flag, `--scale-up-cloud-provider-template`, that forces the autoscaler to always construct the template node from the cloud provider data (e.g. the AWS launch configuration) instead of a random existing node. This fixes both the trivial issues where an instance type change is not taken into account until the user gets lucky with node ordering, and more complicated ones where pods using node affinities sometimes don't trigger scale-up (see comments in #789).

I think this should be the default behaviour, because the current one is non-intuitive and hard to debug. However, since users might rely on the existing logic and use node selectors without the corresponding tags on the ASGs, the flag defaults to `false`.

One more change I've made selects a random AZ when constructing a template node for a multi-AZ ASG in the AWS cloud provider. Previously, the autoscaler always used the first AZ, but since this was masked by the default logic of using a random existing node, it sort of worked when existing nodes were added or removed. Using a random AZ allows the autoscaler to make progress even in these non-ideal setups.