
Fix BalanceSimilarNodeGroups when scaling from zero #3608

Closed

Conversation

@bpineau (Contributor) commented Oct 13, 2020

Use real-world nodeInfo from similar groups when BalanceSimilarNodeGroups is enabled.

Where available, the CA evaluates node groups' capacities and labels using real-world nodes, for optimal accuracy (accounting for kubelet and system reserved resources, for instance).

It falls back to synthetic nodeInfos inferred from ASG/MIG/VMSS templates when no real nodes are available yet (i.e. when upscaling from zero / an empty node group).

That asymmetric evaluation can prevent BalanceSimilarNodeGroups from working reliably when upscaling from zero:

  • The first upscaled nodeGroup will get a node (its capacity initially evaluated from the ASG/MIG/VMSS template, then from the first real node, with runtime labels)
  • Other empty nodeGroups will likely be considered dissimilar to the first one, since we're comparing ASG templates with real-world nodeInfo
  • With the least-waste expander, for instance, the CA might prefer the node group evaluated from a real-world node (whose accounting of reserved resources makes it look like the best usage) over the others

This change set implements [Maciek Pytel's suggestion (from a previous attempt at fixing this issue)](kubernetes#2892 (comment)): we try to use real-world node examples from ~similar node groups, before a last-resort fallback to synthetic nodeInfos built from ASG templates.

We compare nodeInfos using the `FindSimilarNodeGroups` processor primitives in order to leverage per-cloud-provider comparators (e.g. label exclusions).

We look for similar node groups using synthetic nodeInfos from ASG/MIG templates (where available): comparing their nodeInfos from real-world nodes to a synthetic nodeInfo from the empty node group wouldn't work (that asymmetry is the root cause of this issue). When ~similar node groups are found that way, we use their nodeInfos built from real-world nodes as models for the empty node groups (but keep the original location-related labels).

Tested with AWS ASGs, GCP MIGs and Azure VMSS.
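For illustration, here is a rough sketch of that selection flow, using simplified stand-ins for the Cluster Autoscaler types (the `NodeGroup`/`NodeInfo` structs and the `similar` comparator below are a minimal model of the logic described above, not the real CA types):

```go
package balancing

// Simplified stand-ins for the Cluster Autoscaler types involved.
type NodeInfo struct {
	Labels map[string]string // node labels, including runtime-assigned ones
}

type NodeGroup struct {
	ID       string
	Template *NodeInfo // synthetic nodeInfo built from the ASG/MIG/VMSS template
	Real     *NodeInfo // nodeInfo built from a real-world node; nil when scaled to zero
}

// Comparator plays the role of the per-provider NodeInfo comparators
// (which handle e.g. label exclusions).
type Comparator func(a, b *NodeInfo) bool

// Location-related labels kept from the empty group's own template.
var locationLabels = []string{
	"topology.kubernetes.io/region",
	"topology.kubernetes.io/zone",
	"kubernetes.io/hostname",
}

// nodeInfoForEmptyGroup picks the best available model for an empty node
// group: a real-world nodeInfo borrowed from a ~similar group when one
// exists (comparing template to template, since template-to-real
// comparisons are exactly what fails), else the group's own template.
func nodeInfoForEmptyGroup(empty *NodeGroup, all []*NodeGroup, similar Comparator) *NodeInfo {
	for _, other := range all {
		if other.ID == empty.ID || other.Real == nil {
			continue
		}
		if !similar(empty.Template, other.Template) {
			continue
		}
		// Copy the similar group's real-world labels, then restore the
		// empty group's own location-related labels.
		model := &NodeInfo{Labels: make(map[string]string, len(other.Real.Labels))}
		for k, v := range other.Real.Labels {
			model.Labels[k] = v
		}
		for _, k := range locationLabels {
			if v, ok := empty.Template.Labels[k]; ok {
				model.Labels[k] = v
			}
		}
		return model
	}
	return empty.Template // last-resort fallback to the synthetic nodeInfo
}
```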

@k8s-ci-robot added the labels cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) on Oct 13, 2020
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign feiskyer after the PR has been reviewed.
You can assign the PR to them by writing /assign @feiskyer in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bpineau added a commit to DataDog/autoscaler that referenced this pull request Oct 13, 2020
This is a second attempt at kubernetes#1021 (it might also provide an alternate
solution for kubernetes#2892 and kubernetes#3608).

Per kubernetes#1021 discussion, a flag might be acceptable, if defaulting
to false and describing the limitations (not all cloud providers)
and the risks of using it (loss of accuracy, risk of upscaling
unusable nodes or leaving pending pods).

Some use cases include:
* Balancing similar node groups when upscaling from zero (but I believe kubernetes#3608 is a better approach)
* Edited/updated ASG/MIG taints and labels
* An updated instance type

Per previous discussion, the latter two cases could be covered in
the long term by custom node-shape discovery Processors (open to
discussing that option too).
@MaciekPytel (Contributor) commented

I think this should be implemented as a processor:

  • A custom implementation of NodeGroupSetProcessor (which may internally call into BalancingNodeGroupSetProcessor or reuse some of its code).
  • Or, possibly, a NodeInfoProcessor (if you actually want to use the template node from a 'similar' group for binpacking and not just balancing; a rough sketch follows below).

Those processors were added specifically to allow using custom logic to replace balancing (NodeGroupSetProcessor) and to allow custom overrides to the NodeInfo map produced by GetNodeInfosForGroups (NodeInfoProcessor). My reasons are:

  • The current implementation adds even more complexity to what are already the most complex pieces of code (and it spreads the logic across multiple parts of the code). Encapsulating all this logic in one place would improve readability.
  • Core logic should not assume any particular implementation of a processor. You can replace the whole balancing logic with something completely different by switching NodeGroupSetProcessor (e.g. balance all nodegroups that have a certain label or, for a more concrete example, balance all MIGs in a GKE nodepool).
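A minimal sketch of the NodeInfoProcessor route, assuming an interface shaped roughly like the one in cluster-autoscaler's processors package at the time (the types and names below are illustrative stand-ins, not the exact upstream signatures):

```go
package nodeinfos

// Stand-ins for the CA autoscaling context and scheduler NodeInfo types.
type AutoscalingContext struct{}
type NodeInfo struct{}

// NodeInfoProcessor approximates the processor contract: it receives the
// nodeGroupID -> NodeInfo map produced by GetNodeInfosForGroups and may
// return a modified map before core logic (binpacking, balancing) sees it.
type NodeInfoProcessor interface {
	Process(ctx *AutoscalingContext, nodeInfos map[string]*NodeInfo) (map[string]*NodeInfo, error)
	CleanUp()
}

// similarGroupsNodeInfoProcessor is a hypothetical processor encapsulating
// this PR's logic: upgrade template-generated NodeInfos using real-world
// NodeInfos borrowed from similar node groups.
type similarGroupsNodeInfoProcessor struct{}

func (p *similarGroupsNodeInfoProcessor) Process(ctx *AutoscalingContext, nodeInfos map[string]*NodeInfo) (map[string]*NodeInfo, error) {
	// 1. Spot entries built from ASG/MIG/VMSS templates (scale-from-zero groups).
	// 2. Find similar node groups via template-to-template comparison.
	// 3. Replace each template NodeInfo with a copy of the similar group's
	//    real-node NodeInfo, preserving location-related labels.
	return nodeInfos, nil
}

func (p *similarGroupsNodeInfoProcessor) CleanUp() {}
```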

@mwielgus (Contributor) commented

Ping. Please respond to @MaciekPytel's comment.

@bpineau (Contributor, Author) commented Nov 20, 2020

@mwielgus: Maciek's comment makes perfect sense; I want to secure a bit of time to try implementing the suggested options (I suspect I'll come back with questions, if you don't mind).

bpineau added a commit to DataDog/autoscaler that referenced this pull request Dec 14, 2020
Comparing synthetic NodeInfos obtained from a nodegroup's TemplateInfo()
to NodeInfos obtained from real-world nodes is bound to fail, even with
kube reservations provided through nodegroup labels/annotations
(for instance: kernel memory reservation is hard to predict).

This makes `balance-similar-node-groups` likely to misbehave when
`scale-up-from-zero` is enabled (and a first nodegroup gets a real
node), for instance.

Following [Maciek Pytel's suggestion](kubernetes#3608 (comment))
(from discussions on a previous attempt at solving this), we can
implement a NodeInfo Processor that improves template-generated
NodeInfos whenever a node was created off a similar nodegroup.

We're storing the node's virtual origin through machineid, which works
fine but is a bit ugly (suggestions welcome).

Tested that this solves balance-similar-node-groups + scale-up-from-zero,
with various instance types on AWS and GCP.

Previous attempts to solve that issue/discussions:
* kubernetes#2892 (comment)
* kubernetes#3608 (comment)
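The machineid trick mentioned above amounts to stamping the synthetic node with its source group. A minimal sketch of that idea, with hypothetical helper names (`Node.Status.NodeInfo.MachineID` is a real Kubernetes field, normally carrying a machine identifier):

```go
package nodeinfos

import corev1 "k8s.io/api/core/v1"

// markVirtualOrigin records which node group a template-generated node was
// modeled from, by stashing the source group ID in the MachineID field.
// Only synthetic nodes are stamped this way; a real node's MachineID
// carries an actual machine identifier.
func markVirtualOrigin(node *corev1.Node, sourceGroupID string) {
	node.Status.NodeInfo.MachineID = sourceGroupID
}

// virtualOrigin returns the recorded source group ID (empty, or a genuine
// machine ID, means the node was not stamped).
func virtualOrigin(node *corev1.Node) string {
	return node.Status.NodeInfo.MachineID
}
```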
@bpineau (Contributor, Author) commented Dec 14, 2020

Implemented @MaciekPytel's suggestion in PR #3761; closing in favour of the new PR.

@bpineau bpineau closed this Dec 14, 2020
bpineau added a commit to DataDog/autoscaler that referenced this pull request Apr 8, 2021
This is a third attempt at kubernetes#1021 (it might also provide an alternate solution for kubernetes#2892, kubernetes#3608 and kubernetes#3609).
Some use cases include: balancing similar node groups when upscaling from zero, edited/updated ASG/MIG taints and labels, and an updated instance type.

Per kubernetes#1021 discussion, a flag might be acceptable, if defaulting to false and describing the limitations (not all cloud providers) and the risks of using it (loss of accuracy, risk of upscaling unusable nodes or leaving pending pods).

Per kubernetes#3609 discussion, using a NodeInfo processor is preferred.

`GetNodeInfosForGroups()`'s fate (and the opportunity to split NodeInfoProcessor into NodeInfoProcessor and NodeInfoDecoratorProcessor) is left open for discussion, as I'd like to collect feedback on refactoring plans here.