Scale up: ignore existing nodes #1021
Conversation
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Add a --scale-up-cloud-provider-template option, defaulting to false. If enabled, the autoscaler will always create template nodes using the cloud provider configuration, ignoring existing nodes.
If an ASG uses multiple availability zones, select a random one when constructing a template node instead of the first one. This allows the autoscaler to make progress on pods with specific zone requirements (e.g. due to node selectors or volume bindings), even if it takes a couple of attempts.
Rebased on master. Any updates? @aleksandra-malinowska
The reason we pick an existing node is that building a template in the cloudprovider is just guessing what the node will look like. There is a lot of logic that hard-codes some values or copy-pastes code from other repos.
To give you an example:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L268
This is pretty bad: it means we assume the kubelet doesn't reserve any resources for the system.
Also we have hard-coded kube-proxy request:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/utils.go#L36
And a whole lot of other stuff like that. The cloudprovider template is just an approximation we use for scale-from-0, because we have nothing else we can use.
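To make that concrete, here is a minimal illustrative sketch in Go (not the autoscaler's actual code; names and numbers are made up) of the approximation those links describe: the template's Allocatable is a straight copy of Capacity, so the guessed node always looks roomier than a real one whose kubelet reserves resources for the system.

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// templateNode is a hypothetical stand-in for a cloud-provider template
// builder that guesses node resources from static instance-type data.
func templateNode(cpuCores, memMiB int64) *apiv1.Node {
	capacity := apiv1.ResourceList{
		apiv1.ResourceCPU:    *resource.NewQuantity(cpuCores, resource.DecimalSI),
		apiv1.ResourceMemory: *resource.NewQuantity(memMiB*1024*1024, resource.BinarySI),
	}
	node := &apiv1.Node{}
	node.Status.Capacity = capacity
	// The problematic assumption: Allocatable == Capacity. On a real node the
	// kubelet subtracts system/kube reserved resources from Capacity, so this
	// template overestimates what pods can actually fit.
	node.Status.Allocatable = capacity
	return node
}

func main() {
	n := templateNode(4, 16*1024)
	// Prints "16Gi"; a real node of this size would report less.
	fmt.Println(n.Status.Allocatable.Memory())
}
```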
Recently I've seen a case where CA overestimated node memory when scaling from 0 and added a node to a group that was too small for the pending pod. After the node was added, CA figured out it was too small, added a node from a different node group, and finally removed the incorrectly added node. With your flag set to true it would keep adding more nodes to the wrong node group.
Also, implementing TemplateNodeInfo is optional for cloudproviders, and setting your flag to true would break CA for any provider that doesn't implement it.
So I recommend against what you're trying to do. That being said, I don't mind adding the flag as long as the default is false and the flag description clearly warns about the risk of using it.
@@ -214,7 +214,7 @@ func (m *AwsManager) getAsgTemplate(asg *asg) (*asgTemplate, error) {
 		return nil, fmt.Errorf("Unable to get first AvailabilityZone for %s", asg.Name)
 	}

-	az := asg.AvailabilityZones[0]
+	az := asg.AvailabilityZones[rand.Intn(len(asg.AvailabilityZones))]
Similar to picking a random node, this makes a multi-AZ ASG sort of work. Except we run predicates for one node and expect the result to always be exactly the same for every node in the node group. As soon as you use a scheduling config that is zone-specific (nodeSelector on the zone label, zone-level pod affinity/anti-affinity, etc.), CA will start behaving incorrectly.
You really should be using a separate ASG for each zone if you want CA to work correctly in every scenario.
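A toy Go sketch of that failure mode (illustrative only; the zone label and values are assumptions, and real CA runs full scheduler predicates rather than this simplified check):

```go
package main

import "fmt"

// fitsNodeSelector mimics the nodeSelector predicate: every selector entry
// must match a node label exactly.
func fitsNodeSelector(nodeLabels, selector map[string]string) bool {
	for k, v := range selector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	// Template node built from one randomly picked AZ of a multi-AZ ASG.
	templateLabels := map[string]string{"topology.kubernetes.io/zone": "us-east-1a"}
	// Pending pod pinned to a sibling zone the same ASG also covers.
	podSelector := map[string]string{"topology.kubernetes.io/zone": "us-east-1b"}
	// Prints "false": CA concludes the whole node group can't help this pod,
	// even though the ASG could deliver a us-east-1b node.
	fmt.Println(fitsNodeSelector(templateLabels, podSelector))
}
```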
 	expendablePodsPriorityCutoff = flag.Int("expendable-pods-priority-cutoff", -10, "Pods with priority below cutoff will be expendable. They can be killed without any consideration during scale down and they don't cause scale up. Pods with null priority (PodPriority disabled) are non expendable.")
 	regional = flag.Bool("regional", false, "Cluster is regional.")
+	scaleUpTemplateFromCloudProvider = flag.Bool("scale-up-cloud-provider-template", false,
+		"Should CA build the template nodes using up-to-date cloud provider configuration instead of a random existing node.")
I think the description is misleading. I'd prefer if it mentioned the risks I talk about in my other comment.
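One possible wording, offered as a suggestion only (this text is not from the PR), that surfaces the risks raised in the earlier review comment:

```go
package main

import "flag"

// Hedged suggestion for a more explicit flag description; the warning text
// paraphrases the risks from the review, not official documentation.
var scaleUpTemplateFromCloudProvider = flag.Bool("scale-up-cloud-provider-template", false,
	"Build template nodes from up-to-date cloud provider configuration instead of copying a "+
		"random existing node. WARNING: cloud provider templates are approximations; they may "+
		"misestimate node resources, repeatedly scale up an unsuitable node group, and fail on "+
		"cloud providers that do not implement TemplateNodeInfo.")

func main() { flag.Parse() }
```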
 	nodeInfos, err := GetNodeInfosForGroups(nodes, context.CloudProvider, context.ClientSet,

+	existingNodes := nodes
+	if context.ScaleUpTemplateFromCloudProvider {
This should happen inside GetNodeInfosForGroups. The way it is now relies on the fact that we only call that method here; if we ever need to call it elsewhere, we're likely to end up with inconsistent behavior. It also relies on how GetNodeInfosForGroups works internally. Given that we don't really test CA with all possible flag values (i.e. this branch is unlikely to be tested), it may easily break as soon as someone needs to change anything.
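A sketch of the refactor being asked for, under assumed names (the real GetNodeInfosForGroups signature differs, and buildNodeInfos is a hypothetical stand-in for the existing internals): the flag is consumed inside the function, so every caller gets the same behavior.

```go
// Sketch only: GetNodeInfosForGroups decides internally whether to ignore
// existing nodes, instead of relying on each caller to pre-filter the list.
func GetNodeInfosForGroups(nodes []*apiv1.Node, cloud cloudprovider.CloudProvider,
	client kubernetes.Interface, templateFromCloudProvider bool) (map[string]*schedulercache.NodeInfo, error) {
	if templateFromCloudProvider {
		// Drop existing nodes so every group falls through to the cloud
		// provider's TemplateNodeInfo(); a provider that doesn't implement it
		// surfaces an error here instead of misbehaving in a caller.
		nodes = nil
	}
	// ...unchanged internals: sample a real node per group when available,
	// otherwise fall back to the provider template...
	return buildNodeInfos(nodes, cloud, client)
}
```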
@@ -887,6 +887,118 @@ func TestScaleUpAutoprovisionedNodeGroup(t *testing.T) {
 	assert.Equal(t, "autoprovisioned-T1-1", getStringFromChan(expandedGroups))
 }

+func TestScaleUpAutoprovisionFavorExistingNodes(t *testing.T) {
Your change has nothing to do with autoprovisioning (which is our word for CA automatically creating new node groups, currently only working on GKE). You should use a non-autoprovisioning test as a starting point.
@MaciekPytel Sorry it took so long to reply, I was busy with other things. I didn't really look at the scale-up-from-zero code and expected it to handle everything correctly, but I now see how its current behavior could cause issues if I force it. I'd still like to solve the issues we have with the current faulty logic. The main two issues we want to solve are:
A combination of the current approach (use existing nodes as templates) and scale-up-from-0 logic could solve both of these, I believe, but it will require changing the
Comments/suggestions?
Sorry for the late reply, I was on vacation and completely missed your reply. First of all, there are a lot of implementations of cloudprovider (3 + 2 in progress in the main repo, and I know about at least 3 different providers that implemented their cloudprovider in a fork). We've been GA for a year now and it's effectively our API. I'm ok-ish with extending the interface with new optional methods, but I'm against any major changes (especially backward-incompatible ones) without an extremely good reason. Also, if we were to do such changes, we should probably go through a deprecation process, which makes me even less happy about it.

Fundamentally, a NodeGroup is a set of identical, interchangeable nodes. It's the central assumption in CA and we depend on it both in core code and in various cloudprovider-specific places. Just create a separate NodeGroup with a different instance type; CA works quite well with multiple NodeGroups. It's the same advice we give people who want multi-zone clusters: use a separate ASG in each zone instead of multi-zonal ASGs.

I'm more sympathetic to the labels and taints use case. I can see a lot of reasons for temporarily tainting a node, and CA should play better with it. The simplest solution would be to just say that any label or taint prefixed with
I still think that the current CA behaviour is a surprising and undocumented hack that's very painful when you actually run into it. Software should be as simple as possible and not try to magically guess for the user, which is what CA does now (badly!). I see your point about not wanting to change the provider interface. What about reconciling the difference in the CA code itself? You could still keep the provider interface the same as before, where CA asks it to provide a template node, and then merge it with the existing nodes:

This should make it possible to implement better logic while keeping the provider interface intact, even though it doesn't cover more complicated future improvements (e.g. node pools with different instance types, which would be possible with AWS Spot Fleets). For this you'd actually need to implement the scale-up-from-zero calculation properly.
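(The concrete merge steps were lost from the comment above; the following is a hedged Go sketch, consistent with how the follow-up comment summarizes the proposal, with hypothetical names: measured resources from a live node when one exists, labels and taints from the provider template, and a fallback when either side is missing.)

```go
package sketch

import apiv1 "k8s.io/api/core/v1"

// mergeTemplate is a hypothetical helper, not code from this PR.
func mergeTemplate(existing, template *apiv1.Node) *apiv1.Node {
	if existing == nil {
		return template // scale-from-0: only the provider's guess is available
	}
	if template == nil {
		return existing // provider doesn't implement TemplateNodeInfo
	}
	merged := existing.DeepCopy()
	// Desired config (labels, taints) comes from the provider template, so
	// ASG/MIG edits are picked up without waiting for a new node to appear.
	merged.Labels = template.Labels
	merged.Spec.Taints = template.Spec.Taints
	// Capacity/Allocatable stay as measured on the real node, keeping the
	// resource estimate accurate.
	return merged
}
```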
I think the problem once again boils down to the definition of NodeGroup. A NodeGroup is not an ASG; it is a scalable set of identical nodes. Nothing in the core logic expects it to be anything more, and I don't think we want to change that: the interface is meant to be a common denominator between clouds, and there are significant differences between ASG, MIG, VMSS, etc. I'd argue that once we assume the nodes in a NodeGroup are identical, using a copy of an existing node as a template is the simplest solution. I can explain it in a single sentence to someone who doesn't know CA, which is not the case for combining the allocatable of an existing node with taints and labels from TemplateNodeInfo, with some sort of fallback if one or the other is unavailable.

If we want to support cloud-provider-specific features like changing the instance type in an ASG, perhaps the right way to do this would be to have a cloudprovider implementation that handles it behind the scenes, representing machines with a different instance type as a separate NodeGroup. There is no requirement for NodeGroup to map 1-1 to an ASG, as long as we manage to implement things like TargetSize() in a consistent way.

Taints are a special case, because they can represent both node config (e.g. tainting a node with expensive resources to limit what pods can schedule on it) and state (some temporary condition of the node). If it's the latter we should discard the taint. The problem is how to distinguish the two; I dislike both the flag and the prefix approach proposed in #1350, but I don't have any better idea. We don't have any CRD or dynamic config that we could use to pass those, and we're not likely to add one in the foreseeable future.
Yes, but it's also a bad solution, unless you expect cluster operators to never change taints or labels. And the way it's completely automagical is even worse, because it will work most of the time, so people won't even read the documentation and will just rely on it, until they change a label or add a taint and their cluster suddenly stops scaling for a couple of pods. Once that happens, it's also impossible to fix without either getting rid of affinities/selectors, babysitting the cluster to manually scale it up, or bringing up nodes until you get lucky and the provider gives you a node with a name that sorts before the "broken" one. You now also have two code paths, one for

I would also argue that the assumption that all the nodes in the node pool are identical is dubious at best. Yes, it works for GKE, because you have tooling that allows you to treat node pools as immutable and migrate from one to another. Is every user of CA supposed to implement the same tooling just to achieve something as simple as changing the instance type? Or are users supposed to manually babysit the nodes, draining them one by one after creating the fiftieth copy of the same ASG, even though AWS easily allows changing the instance type of an existing ASG? The only real problem, as you've said, is the capacity estimation of new nodes, but there's no real requirement that all nodes spawned from an ASG look exactly the same as existing ones; they just have to look exactly the same as each other so the scale-up calculations can work correctly.
As @MaciekPytel already said, the fact that a node group is a set of identical nodes is a principal design constraint and it will not be changed. All the logic in CA depends on that assumption.
As we discussed during the SIG meeting, it seems like an option to influence how the next node shape is predicted for a given node group may come in handy. Given that, we suggest implementing this by adding a new

As an example of how we use

@MaciekPytel, do you want to add something?
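The processor option referenced here later shows up as the NodeInfo processor mentioned in the commit messages below. A sketch of what such a hook could look like; the interface shape is assumed from later autoscaler versions and every name is illustrative:

```go
// nodeShapeOverride shows where custom node-shape discovery could hook in:
// the processor receives the per-group NodeInfo map after it is built and may
// rewrite entries, e.g. swapping in NodeGroup.TemplateNodeInfo() output.
type nodeShapeOverride struct{}

func (p *nodeShapeOverride) Process(ctx *context.AutoscalingContext,
	nodeInfos map[string]*schedulernodeinfo.NodeInfo) (map[string]*schedulernodeinfo.NodeInfo, error) {
	for groupID, info := range nodeInfos {
		_ = groupID // per-group override logic would go here
		_ = info
	}
	return nodeInfos, nil
}

// CleanUp matches the assumed processor lifecycle.
func (p *nodeShapeOverride) CleanUp() {}
```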
No activity for almost 2 months, closing. Please reopen if needed.
This is a second attempt at kubernetes#1021 (might also provide an alternate solution for kubernetes#2892 and kubernetes#3608). Per the kubernetes#1021 discussion, a flag might be acceptable, if defaulting to false and describing the limitations (not all cloud providers) and the risk of using it (loss of accuracy, risk of upscaling unusable nodes or leaving pending pods). Some use cases include:
* Balance Similar when upscaling from zero (but I believe kubernetes#3608 is a better approach)
* Edited/updated ASG/MIG taints and labels
* Updated instance type
Per the previous discussion, the latter two cases could be covered in the long term by custom node-shape discovery Processors (open to discuss that option too).
This is a third attempt at kubernetes#1021 (might also provide an alternate solution for kubernetes#2892, kubernetes#3608 and kubernetes#3609). Some use cases include: Balance Similar when upscaling from zero, edited/updated ASG/MIG taints and labels, and updated instance type. Per the kubernetes#1021 discussion, a flag might be acceptable, if defaulting to false and describing the limitations (not all cloud providers) and the risk of using it (loss of accuracy, risk of upscaling unusable nodes or leaving pending pods). Per the kubernetes#3609 discussion, using a NodeInfo processor is preferred. The fate of `GetNodeInfosForGroups()` (and the opportunity to split NodeInfoProcessor into NodeInfoProcessor and NodeInfoDecoratorProcessor) is left open for discussion, as I'd like to collect feedback on refactoring plans here.
Add a new flag, `--scale-up-cloud-provider-template`, that forces the autoscaler to always construct the template node from the cloud provider data (e.g. the AWS launch configuration) instead of a random existing node. This fixes both the trivial issues where an instance type change is not taken into account until the user gets lucky with node ordering, and more complicated ones where pods using node affinities sometimes don't trigger scale-up (see comments in #789).

I think this should be the default behaviour, because the current one is non-intuitive and hard to debug. However, since users might rely on the existing logic and use node selectors without the corresponding tags on the ASGs, the flag defaults to `false`.

One more change I've made selects a random AZ when constructing a template node for a multi-AZ ASG in the AWS cloud provider. Previously, the autoscaler always used the first AZ, but since this was masked by the default logic of using a random existing node, it sort of worked when existing nodes were added or removed. Using a random AZ allows the autoscaler to make progress even in these non-ideal setups.