Scale up: ignore existing nodes #1021

Closed
wants to merge 2 commits into from

Conversation

@aermakov-zalando
Contributor

aermakov-zalando commented Jun 28, 2018

Add a new flag, --scale-up-cloud-provider-template, that forces the autoscaler to always construct the template node from the cloud provider data (e.g. AWS launch configuration) instead of from a random existing node. This fixes both the trivial issues, where an instance type change is not taken into account until the user gets lucky with node ordering, and more complicated ones, where pods using node affinities sometimes don't trigger a scale-up (see the comments in #789).

I think this should be the default behaviour, because the current one is non-intuitive and hard to debug. However, since users might rely on the existing logic and use node selectors without the corresponding tags on the ASGs, the flag defaults to false.

One more change I've made selects a random AZ when constructing a template node for a multi-AZ ASG in the AWS cloud provider. Previously, the autoscaler always used the first AZ, but since this was masked by the default logic of using a random existing node it sort of worked when existing nodes got added or removed. Using a random AZ allows the autoscaler to progress even in these non-ideal setups.

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple of minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please sign in with your organization's credentials at https://identity.linuxfoundation.org/projects/cncf to be authorized.
  • If you have done the above and are still having issues with the CLA being reported as unsigned, please email the CNCF helpdesk: [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no and size/L labels on Jun 28, 2018
@k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label on Jun 28, 2018
Add --scale-up-cloud-provider-template option, defaulting to false. If
enabled, the autoscaler will always create template nodes using the
cloud provider configuration, ignoring existing nodes.

If an ASG uses multiple availability zones, select a random one when
constructing a template node instead of the first one. This allows the
autoscaler to make progress on nodes with specific zone requirements
(e.g. due to node selectors or volume bindings), even if it takes a
couple of attempts.
@aermakov-zalando
Contributor Author

aermakov-zalando commented Jul 18, 2018

Rebased on master. Any updates? @aleksandra-malinowska

Contributor

@MaciekPytel left a comment


The reason we pick an existing node is that building a template in the cloudprovider is just guessing what the node will look like. There is a lot of logic that hard-codes values or copy-pastes code from other repos.

To give you an example:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L268
This is pretty bad: it means we assume the kubelet doesn't reserve any resources for the system.

We also have a hard-coded kube-proxy request:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/utils.go#L36

And a whole lot of other stuff like that. The cloudprovider template is just an approximation we use for scale-from-0, because we have nothing else to use.
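
For illustration, the kube-proxy approximation amounts to something like this (a simplified sketch, not the actual autoscaler code; the names and the 100m value are assumptions modeled on the code linked above):

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildApproximateKubeProxy fabricates a kube-proxy pod with a fixed CPU
// request. When building a template from cloud provider data there is no
// real node to observe, so the request is a hard-coded guess.
func buildApproximateKubeProxy(nodeName string) *apiv1.Pod {
	return &apiv1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "kube-proxy-" + nodeName,
			Namespace: "kube-system",
		},
		Spec: apiv1.PodSpec{
			NodeName: nodeName,
			Containers: []apiv1.Container{{
				Name: "kube-proxy",
				Resources: apiv1.ResourceRequirements{
					Requests: apiv1.ResourceList{
						// Assumed value; a real cluster's kube-proxy may differ.
						apiv1.ResourceCPU: resource.MustParse("100m"),
					},
				},
			}},
		},
	}
}

func main() {
	pod := buildApproximateKubeProxy("template-node")
	q := pod.Spec.Containers[0].Resources.Requests[apiv1.ResourceCPU]
	fmt.Println(q.String()) // "100m"
}
```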

Recently I've seen a case where CA overestimated node memory when scaling from 0 and added a node to a group that was too small for the pending pod. After the node was added, it figured out it was too small, added a node from a different node group, and finally removed the incorrectly added node. With your flag set to true it would keep adding more nodes to the wrong node group.

Also, implementing TemplateNodeInfo is optional for cloud providers, and setting your flag to true would break CA for any provider that doesn't implement it.

So I recommend against what you're trying to do. That being said, I don't mind adding the flag as long as the default is false and the flag description clearly warns about the risks of using it.

```diff
@@ -214,7 +214,7 @@ func (m *AwsManager) getAsgTemplate(asg *asg) (*asgTemplate, error) {
 		return nil, fmt.Errorf("Unable to get first AvailabilityZone for %s", asg.Name)
 	}
 
-	az := asg.AvailabilityZones[0]
+	az := asg.AvailabilityZones[rand.Intn(len(asg.AvailabilityZones))]
```

Similar to picking a random node, this makes a multi-AZ ASG sort of work, except that we run predicates for one node and expect the result to always be exactly the same for every node in the node group. As soon as you use a scheduling config that is zone-specific (nodeSelector on the zone label, zone-level pod affinity/anti-affinity, etc.), CA will start behaving incorrectly.

You really should be using a separate ASG for each zone if you want CA to work correctly in every scenario.

```diff
 	expendablePodsPriorityCutoff = flag.Int("expendable-pods-priority-cutoff", -10, "Pods with priority below cutoff will be expendable. They can be killed without any consideration during scale down and they don't cause scale up. Pods with null priority (PodPriority disabled) are non expendable.")
 	regional                     = flag.Bool("regional", false, "Cluster is regional.")
+	scaleUpTemplateFromCloudProvider = flag.Bool("scale-up-cloud-provider-template", false,
+		"Should CA build the template nodes using up-to-date cloud provider configuration instead of a random existing node.")
```

I think the description is misleading. I'd prefer if it mentioned the risks I talk about in my other comment.

```go
nodeInfos, err := GetNodeInfosForGroups(nodes, context.CloudProvider, context.ClientSet,

existingNodes := nodes
if context.ScaleUpTemplateFromCloudProvider {
```

This should happen inside GetNodeInfosForGroups. The way it is now relies on the fact that we only call that method here; if we ever need to call it elsewhere, we're likely to end up with inconsistent behavior. It also relies on how GetNodeInfosForGroups works internally. Given that we don't really test CA with all possible flag values (i.e. this branch is unlikely to be tested), it may easily break as soon as someone needs to change anything.

```diff
@@ -887,6 +887,118 @@ func TestScaleUpAutoprovisionedNodeGroup(t *testing.T) {
 	assert.Equal(t, "autoprovisioned-T1-1", getStringFromChan(expandedGroups))
 }
 
+func TestScaleUpAutoprovisionFavorExistingNodes(t *testing.T) {
```

Your change has nothing to do with autoprovisioning (which is our word for CA automatically creating new node groups, currently only working on GKE). You should use a non-autoprovisioning test as a starting point.

@aermakov-zalando
Contributor Author

@MaciekPytel Sorry it took so long to reply, I was busy with other things.

I didn't really look at the scale-up-from-zero code and expected it to handle everything correctly; I now see how its current behavior could cause issues if I force it. That said, I'd still like to solve the issues we have with the current faulty logic. The two main issues we want to solve are:

  1. template labels should be taken from the template, not from a random node
  2. instance type changes should be handled at least somehow; otherwise we won't be able to scale up the cluster if users suddenly want to deploy bigger pods than what's currently supported.

A combination of the current approach (use existing nodes as templates) and the scale-up-from-0 logic could solve both of these, I believe, but it will require changing the NodeGroup interface a bit, so please tell me if you're OK with that first. What I want to do is pass the existing nodes to TemplateNodeInfo, and then change the logic in the following way (a rough interface sketch follows the list):

  • If there are existing nodes launched from the current template (LaunchConfiguration/LaunchTemplate for AWS), use allocation information from one of them; otherwise fall back to the scale-up-from-0 logic, which guesstimates. This should give the autoscaler good enough information once at least one node exists, and the group will have at most one flapping node if there are any issues.
  • Reset the labels/taints on the template node to only the template ones (optional, since a lot of people might be relying on the current logic)
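
Roughly, this is the shape I have in mind (a hypothetical sketch of the changed interface, not the current one; type names are schematic):

```go
package cloudprovider

import (
	apiv1 "k8s.io/api/core/v1"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/schedulercache"
)

// NodeGroup here is a hypothetical variant of the real interface: today
// TemplateNodeInfo() takes no arguments, while this version would let an
// implementation consult the group's live nodes.
type NodeGroup interface {
	// ...existing methods elided...

	// TemplateNodeInfo receives the group's current nodes so an
	// implementation can copy allocatable from a node launched from the
	// current LaunchConfiguration/LaunchTemplate, and fall back to the
	// scale-up-from-0 estimation when no such node exists.
	TemplateNodeInfo(existing []*apiv1.Node) (*schedulercache.NodeInfo, error)
}
```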

Comments/suggestions?

@MaciekPytel
Contributor

Sorry for the late reply; I was on vacation and completely missed your comment.

First of all, there are a lot of implementations of cloudprovider (3, plus 2 in progress, in the main repo, and I know of at least 3 providers that implemented their cloudprovider in a fork). We've been GA for a year now and it's effectively our API. I'm ok-ish with extending the interface with new optional methods, but I'm against any major changes (especially backward-incompatible ones) without an extremely good reason. Also, if we were to make such changes we should probably go through a deprecation process, which makes me even less happy about it.

Fundamentally, a NodeGroup is a set of identical, interchangeable nodes. It's the central assumption in CA and we depend on it both in core code and in various cloudprovider-specific places. Just create a separate NodeGroup with a different instance type - CA works quite well with multiple NodeGroups. It's the same advice we give people who want multi-zone clusters: use a separate ASG in each zone instead of multi-zonal ASGs.

I'm more sympathetic to the labels and taints use case. I can see a lot of reasons for temporarily tainting a node, and CA should play better with it. The simplest solution would be to say that any label or taint prefixed with custom.cluster-autoscaler.kubernetes.io/ (or something like that) would be ignored when creating the template node. Would that help? If not, we can think about adding an optional method to the cloudprovider that filters out labels and taints, but that's likely to bring some complexity (in my experience an extra API call per NodeGroup per loop is likely to be a scalability problem, and caching makes things complex).
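
Something along these lines is what I mean (a sketch of the prefix idea only; the helper names and the example label are made up for illustration, and the prefix is a proposed convention, not an implemented one):

```go
package main

import (
	"fmt"
	"strings"

	apiv1 "k8s.io/api/core/v1"
)

// ignoredPrefix is the convention floated above; it is a proposal, not an
// implemented autoscaler contract.
const ignoredPrefix = "custom.cluster-autoscaler.kubernetes.io/"

// filterTemplateLabels drops operator-managed labels so they don't leak
// into the template node built from an existing node.
func filterTemplateLabels(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if !strings.HasPrefix(k, ignoredPrefix) {
			out[k] = v
		}
	}
	return out
}

// filterTemplateTaints drops transient taints the same way.
func filterTemplateTaints(taints []apiv1.Taint) []apiv1.Taint {
	var kept []apiv1.Taint
	for _, t := range taints {
		if !strings.HasPrefix(t.Key, ignoredPrefix) {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	labels := map[string]string{
		"kubernetes.io/hostname": "node-1",
		"custom.cluster-autoscaler.kubernetes.io/draining": "true",
	}
	fmt.Println(filterTemplateLabels(labels)) // only the hostname label survives
}
```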

@aermakov-zalando
Contributor Author

I still think that the current CA behaviour is a surprising and undocumented hack that's very painful when you actually run into it. Software should be as simple as possible and not try to magically guess for the user, which is what CA does now (badly!).

I see your point about not wanting to change the provider interface. What about reconciling the difference in the CA code itself? You could still keep the provider interface the same as before, where CA asks it to provide a template node, and then merge it with the existing nodes:

  • labels/taints from the existing node will be overwritten by the template node's labels/taints (if configured)
  • allocatable can be computed in a smarter way (e.g. get it from the latest existing node if the instance type is the same as the template's one, or just recompute it using the existing bad heuristics)

This should make it possible to implement better logic while keeping the provider interface intact, even though it doesn't cover more complicated future improvements (e.g. node pools with different instance types, which would be possible with AWS Spot Fleets, for example). For those you'd actually need to implement the scale-up-from-zero calculation properly.
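
As a sketch, the merge could look something like this (a hypothetical helper, not actual autoscaler code; it assumes the beta.kubernetes.io/instance-type label is the way to compare types):

```go
package nodetemplate

import (
	apiv1 "k8s.io/api/core/v1"
)

const instanceTypeLabel = "beta.kubernetes.io/instance-type"

// mergeTemplateNode builds a template by starting from a copy of an
// existing node, taking labels/taints from the cloud provider template,
// and keeping the observed allocatable only while the instance type
// still matches the template.
func mergeTemplateNode(existing, template *apiv1.Node) *apiv1.Node {
	merged := existing.DeepCopy()
	merged.Labels = template.Labels
	merged.Spec.Taints = template.Spec.Taints
	if existing.Labels[instanceTypeLabel] != template.Labels[instanceTypeLabel] {
		// The instance type changed: the observed capacity no longer
		// applies, so fall back to the template's (cruder) estimate.
		merged.Status.Capacity = template.Status.Capacity
		merged.Status.Allocatable = template.Status.Allocatable
	}
	return merged
}
```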

@MaciekPytel
Contributor

I think the problem once again boils down to the definition of NodeGroup. A NodeGroup is not an ASG; it is a scalable set of identical nodes. Nothing in the core logic expects it to be anything more, and I don't think we want to change that - the interface is meant to be a common denominator between clouds, and there are significant differences between ASG, MIG, VMSS, etc.

I'd argue that once we assume the nodes in a NodeGroup are identical, using a copy of an existing node as a template is the simplest solution. I can explain it in a single sentence to someone who doesn't know CA, which is not the case for combining the allocatable of an existing node with taints and labels from TemplateNodeInfo, with some sort of fallback if one or the other is unavailable.

If we want to support cloud-provider-specific features like changing the instance type in an ASG, perhaps the right way to do this would be to have a cloudprovider implementation that handles it behind the scenes, representing machines with a different instance type as a separate NodeGroup. There is no requirement for a NodeGroup to map 1:1 to an ASG, as long as we manage to implement things like TargetSize() in a consistent way.


Taints are a special case, because they can represent both node config (e.g. tainting a node with expensive resources to limit which pods can schedule on it) and state (some temporary condition of the node). If it's the latter, we should discard the taint. The problem is how to distinguish those - I dislike both the flag and the prefix approach proposed in #1350, but I don't have any better idea. We don't have any CRD or dynamic config that we could use to pass those, and we're not likely to add one in the foreseeable future.

@aermakov-zalando
Contributor Author

> I'd argue that once we assume the nodes in a NodeGroup are identical, using a copy of an existing node as a template is the simplest solution.

Yes, but it's also a bad solution, unless you expect cluster operators to never change taints or labels. And the way it's completely automagical is even worse, because it will work most of the time, so people won't even read the documentation and will just rely on it, until they change a label or add a taint and their cluster suddenly stops scaling for a couple of pods. Once that happens, it's also impossible to fix without either getting rid of affinities/selectors, babysitting the cluster to scale it up manually, or bringing up nodes until you get lucky and the provider gives you a node with a name that sorts before the "broken" one.

You now also have two code paths, one for scale-up-from-0 and another for magically-copy-from-existing-node. One of them might be exercised a lot less often, so the configuration getting out of sync could be noticed at the worst possible moment, triggering an incident. If CA used just one source of truth instead of trying to be smart, the error would be noticed immediately.

I would also argue that the assumption that all the nodes in the node pool are identical is dubious at best. Yes, it works for GKE because you have tooling that allows you to treat node pools as immutable and migrate from one to another. Is every user of CA supposed to implement the same tooling just to achieve something as simple as changing the instance type? Or are the users supposed to manually babysit the nodes, draining them one by one after creating the fiftieth copy of the same ASG, even though AWS easily allows changing the instance type of an existing ASG? The only real problem, as you've said, is the capacity estimation of new nodes; there's no real requirement that all nodes spawned from an ASG look exactly the same as existing ones, they just have to look exactly the same as each other so the scale-up calculations can work correctly.

@losipiuk
Contributor

losipiuk commented Dec 3, 2018

> I would also argue that the assumption that all the nodes in the node pool are identical is dubious at best. Yes, it works for GKE because you have tooling that allows you to treat node pools as immutable and migrate from one to another. Is every user of CA supposed to implement the same tooling just to achieve something as simple as changing the instance type? Or are the users supposed to manually babysit the nodes, draining them one by one after creating the fiftieth copy of the same ASG, even though AWS easily allows changing the instance type of an existing ASG? The only real problem, as you've said, is the capacity estimation of new nodes; there's no real requirement that all nodes spawned from an ASG look exactly the same as existing ones, they just have to look exactly the same as each other so the scale-up calculations can work correctly.

As @MaciekPytel already said, the fact that a node group is a set of identical nodes is a principal design constraint and it will not be changed. All the logic in CA depends on that assumption.
Yet it is important to note that a node group is a purely logical concept. The AWS cloud provider currently maps it 1:1 to an ASG, but that can be changed.
If an ASG may consist of multiple types of machines, you can expose it to CA as a set of node groups. Then NodeGroup.MaxSize() can be used to enforce that only one of the node groups (the one matching the current template) is up-scalable.
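
A toy sketch of that 1:N mapping (all types and helpers below are invented for illustration; this is not how the AWS provider is implemented):

```go
package main

import "fmt"

// asg stands in for a mixed-instance-type AWS ASG; the fields are
// invented for this sketch.
type asg struct {
	maxSize              int
	templateInstanceType string
	sizeByType           map[string]int
}

// nodeGroupView exposes one instance type of the ASG as its own logical
// node group.
type nodeGroupView struct {
	src          *asg
	instanceType string
}

// MaxSize freezes every logical group except the one matching the ASG's
// current template, so only that group is up-scalable.
func (v nodeGroupView) MaxSize() int {
	if v.instanceType == v.src.templateInstanceType {
		return v.src.maxSize
	}
	return v.src.sizeByType[v.instanceType]
}

func main() {
	a := &asg{
		maxSize:              10,
		templateInstanceType: "m5.xlarge",
		sizeByType:           map[string]int{"m4.xlarge": 3, "m5.xlarge": 2},
	}
	fmt.Println(nodeGroupView{a, "m4.xlarge"}.MaxSize()) // 3: frozen at current size
	fmt.Println(nodeGroupView{a, "m5.xlarge"}.MaxSize()) // 10: free to scale up
}
```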

@losipiuk
Contributor

losipiuk commented Dec 3, 2018

As we discussed during the SIG meeting, it seems like an option to influence how the next node shape is predicted for a given node group may come in handy.
On the other hand, it is not obvious that everyone needs the same logic.

Given that, we suggest adding a new *Processor interface which will be called wherever we need to determine the node shape for a node pool.
We can then have a default implementation of the processor which basically replicates the current logic of CA,
and another implementation which handles the Zalando case (preferably in as generic a way as possible, so it is not limited to Zalando only).
The choice of processor implementation, as well as some parametrization of the processor itself, would be done via CA program flags.

As an example of how we use Processors for pluggable logic, you may want to look at the NodeGroupSetProcessor introduced in 6f5e6aa
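
The shape of such a processor could be something like the sketch below, modeled on the Process/CleanUp pattern of the existing processors (this exact interface is a proposal, not code that exists today, and the imports are indicative):

```go
package nodeinfos

import (
	"k8s.io/autoscaler/cluster-autoscaler/context"
	schedulercache "k8s.io/kubernetes/pkg/scheduler/schedulercache"
)

// NodeInfoProcessor is a proposed extension point: it receives the node
// infos computed by the default logic (one template per node group, built
// from a random existing node) and may rewrite them, e.g. rebuilding
// selected groups' templates from cloud provider data instead.
type NodeInfoProcessor interface {
	Process(ctx *context.AutoscalingContext,
		nodeInfos map[string]*schedulercache.NodeInfo) (map[string]*schedulercache.NodeInfo, error)
	// CleanUp frees any resources held by the processor when CA shuts down.
	CleanUp()
}
```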

@MaciekPytel, do you want to add something?

@mwielgus
Contributor

mwielgus commented Feb 1, 2019

No activity for almost 2 months, closing. Please reopen once needed.

@mwielgus closed this Feb 1, 2019
@aermakov-zalando deleted the autoscaler-ignore-existing-nodes-upstream branch March 13, 2019 16:21
bpineau added a commit to DataDog/autoscaler that referenced this pull request Oct 13, 2020
This is a second attempt at kubernetes#1021 (might also provide an alternate
solution for kubernetes#2892 and kubernetes#3608).

Per kubernetes#1021 discussion, a flag might be acceptable, if defaulting
to false and describing the limitations (not all cloud providers)
and the risks of using it (loss of accuracy, risk of upscaling
unusable nodes or leaving pending pods).

Some use cases include:
* Balance Similar when upscaling from zero (but I believe kubernetes#3608 is a better approach)
* Edited/updated ASG/MIG taints and labels
* Updated instance type

Per previous discussion, the latter two cases could be covered in
the long term by custom node-shape discovery Processors (open to
discussing that option too).
bpineau added a commit to DataDog/autoscaler that referenced this pull request Oct 15, 2020 (same commit message as above)
bpineau added a commit to DataDog/autoscaler that referenced this pull request Jan 18, 2021 (same commit message as above)
bpineau added a commit to DataDog/autoscaler that referenced this pull request Jan 25, 2021 (same commit message as above)
bpineau added a commit to DataDog/autoscaler that referenced this pull request Apr 8, 2021
This is a third attempt at kubernetes#1021 (might also provide an alternate solution for kubernetes#2892, kubernetes#3608 and kubernetes#3609).
Some use cases include: Balance Similar when upscaling from zero, edited/updated ASG/MIG taints and labels, updated instance type.

Per kubernetes#1021 discussion, a flag might be acceptable, if defaulting to false and describing the limitations (not all cloud providers) and the risk of using it (loss of accuracy, risk of upscaling unusable nodes or leaving pending pods).

Per kubernetes#3609 discussion, using a NodeInfo processor is preferred.

`GetNodeInfosForGroups()` fate (and the opportunity to split NodeInfoProcessor into NodeInfoProcessor and NodeInfoDecoratorProcessor) is left open for discussion, as I'd like to collect feedback on refactoring plans here.
bpineau added a commit to DataDog/autoscaler that referenced this pull request Apr 8, 2021 (same commit message as above)
bpineau added a commit to DataDog/autoscaler that referenced this pull request Apr 8, 2021 (same commit message as above)
bpineau added a commit to DataDog/autoscaler that referenced this pull request Apr 9, 2021 (same commit message as above)
bpineau added a commit to DataDog/autoscaler that referenced this pull request Apr 19, 2021 (same commit message as above)