
Fix BalanceSimilarNodeGroups when scaling from zero #3608

Closed

Conversation

@bpineau (Contributor) commented Oct 13, 2020

Use real-world nodeInfo from similar groups when BalanceSimilarNodeGroups is enabled.

Where available, the CA evaluates node groups' capacities and labels using real-world nodes, for optimal accuracy (accounting for kubelet and system reserved resources, for instance).

It falls back to synthetic nodeInfos inferred from ASG/MIG/VMSS templates when no real nodes are available yet (i.e. when upscaling from zero / an empty node group).

That asymmetric evaluation can prevent BalanceSimilarNodeGroups from working reliably when upscaling from zero:

  • The first upscaled nodeGroup will get a node (its capacity initially evaluated from the ASG/MIG/VMSS template, then from the first real node, with runtime labels)
  • Other empty nodeGroups will likely be considered dissimilar to the first one, since we're comparing ASG templates with real-world nodeInfo
  • With the least-waste expander, for instance, the CA might prefer the node group evaluated from a real-world node (whose accounting of reserved resources makes it look like the best usage) over the others

This change set implements [Maciek Pytel's suggestion (from a previous attempt at fixing this issue)](kubernetes#2892 (comment)): we try to use real-world node examples from ~similar node groups, before a last-resort fallback to synthetic nodeInfos built from ASG templates.

We compare nodeInfos using the `FindSimilarNodeGroups` processor primitives in order to leverage per-cloud-provider comparators (e.g. label exclusions).

We look for similar node groups using synthetic nodeInfos from ASG/MIG templates (where available): comparing their nodeInfos from real-world nodes to a synthetic nodeInfo from the empty node group wouldn't work (that asymmetry is the root cause of this issue). When ~similar node groups are found that way, we use their nodeInfos built from real-world nodes as models for the empty node groups (but keep the original location-related labels).

Tested with AWS ASGs, GCP MIGs and Azure VMSS.
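For illustration, here is a rough sketch of that selection flow, using simplified stand-ins for the Cluster Autoscaler types (the `NodeGroup`/`NodeInfo` structs and the `similar` comparator below are a minimal model of the logic described above, not the real CA types):

```go
package balancing

// Simplified stand-ins for the Cluster Autoscaler types involved.
type NodeInfo struct {
	Labels map[string]string // node labels, including runtime-assigned ones
}

type NodeGroup struct {
	ID       string
	Template *NodeInfo // synthetic nodeInfo built from the ASG/MIG/VMSS template
	Real     *NodeInfo // nodeInfo built from a real-world node; nil when scaled to zero
}

// Comparator plays the role of the per-provider NodeInfo comparators
// (which handle e.g. label exclusions).
type Comparator func(a, b *NodeInfo) bool

// Location-related labels kept from the empty group's own template.
var locationLabels = []string{
	"topology.kubernetes.io/region",
	"topology.kubernetes.io/zone",
	"kubernetes.io/hostname",
}

// nodeInfoForEmptyGroup picks the best available model for an empty node
// group: a real-world nodeInfo borrowed from a ~similar group when one
// exists (comparing template to template, since template-to-real
// comparisons are exactly what fails), else the group's own template.
func nodeInfoForEmptyGroup(empty *NodeGroup, all []*NodeGroup, similar Comparator) *NodeInfo {
	for _, other := range all {
		if other.ID == empty.ID || other.Real == nil {
			continue
		}
		if !similar(empty.Template, other.Template) {
			continue
		}
		// Copy the similar group's real-world labels, then restore the
		// empty group's own location-related labels.
		model := &NodeInfo{Labels: make(map[string]string, len(other.Real.Labels))}
		for k, v := range other.Real.Labels {
			model.Labels[k] = v
		}
		for _, k := range locationLabels {
			if v, ok := empty.Template.Labels[k]; ok {
				model.Labels[k] = v
			}
		}
		return model
	}
	return empty.Template // last-resort fallback to the synthetic nodeInfo
}
```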

@k8s-ci-robot added the labels cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files) on Oct 13, 2020
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign feiskyer after the PR has been reviewed.
You can assign the PR to them by writing /assign @feiskyer in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bpineau added a commit to DataDog/autoscaler that referenced this pull request Oct 13, 2020
This is a second attempt at kubernetes#1021 (it might also provide an alternate
solution for kubernetes#2892 and kubernetes#3608).

Per kubernetes#1021 discussion, a flag might be acceptable, if defaulting
to false and describing the limitations (not all cloud providers)
and the risks of using it (loss of accuracy, risk of upscaling
unusable nodes or leaving pending pods).

Some use cases include:
* Balancing similar node groups when upscaling from zero (but I believe kubernetes#3608 is a better approach)
* Edited/updated ASG/MIG taints and labels
* An updated instance type

Per previous discussion, the latter two cases could be covered in
the long term by custom node-shape discovery Processors (open to
discussing that option too).
@MaciekPytel (Contributor) commented

I think this should be implemented as a processor:

  • A custom implementation of NodeGroupSetProcessor (which may internally call into BalancingNodeGroupSetProcessor or reuse some of its code).
  • Or, possibly, a NodeInfoProcessor (if you actually want to use the template node from a 'similar' group for binpacking and not just balancing; a rough sketch follows below).

Those processors were added specifically to allow using custom logic to replace balancing (NodeGroupSetProcessor) and to allow custom overrides to the NodeInfo map produced by GetNodeInfosForGroups (NodeInfoProcessor). My reasons are:

  • The current implementation adds even more complexity to what are already the most complex pieces of code (and it spreads the logic across multiple parts of the code). Encapsulating all this logic in one place would improve readability.
  • Core logic should not assume any particular implementation of a processor. You can replace the whole balancing logic with something completely different by switching NodeGroupSetProcessor (e.g. balance all nodegroups that have a certain label or, for a more concrete example, balance all MIGs in a GKE nodepool).
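A minimal sketch of the NodeInfoProcessor route, assuming an interface shaped roughly like the one in cluster-autoscaler's processors package at the time (the types and names below are illustrative stand-ins, not the exact upstream signatures):

```go
package nodeinfos

// Stand-ins for the CA autoscaling context and scheduler NodeInfo types.
type AutoscalingContext struct{}
type NodeInfo struct{}

// NodeInfoProcessor approximates the processor contract: it receives the
// nodeGroupID -> NodeInfo map produced by GetNodeInfosForGroups and may
// return a modified map before core logic (binpacking, balancing) sees it.
type NodeInfoProcessor interface {
	Process(ctx *AutoscalingContext, nodeInfos map[string]*NodeInfo) (map[string]*NodeInfo, error)
	CleanUp()
}

// similarGroupsNodeInfoProcessor is a hypothetical processor encapsulating
// this PR's logic: upgrade template-generated NodeInfos using real-world
// NodeInfos borrowed from similar node groups.
type similarGroupsNodeInfoProcessor struct{}

func (p *similarGroupsNodeInfoProcessor) Process(ctx *AutoscalingContext, nodeInfos map[string]*NodeInfo) (map[string]*NodeInfo, error) {
	// 1. Spot entries built from ASG/MIG/VMSS templates (scale-from-zero groups).
	// 2. Find similar node groups via template-to-template comparison.
	// 3. Replace each template NodeInfo with a copy of the similar group's
	//    real-node NodeInfo, preserving location-related labels.
	return nodeInfos, nil
}

func (p *similarGroupsNodeInfoProcessor) CleanUp() {}
```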

@mwielgus (Contributor) commented

Ping. Please respond to @MaciekPytel's comment.

@bpineau (Contributor, Author) commented Nov 20, 2020

@mwielgus: Maciek's comment makes perfect sense; I want to secure a bit of time to try implementing the suggested options (I suspect I'll come back with questions, if you don't mind).

bpineau added a commit to DataDog/autoscaler that referenced this pull request Dec 14, 2020
Comparing synthetic NodeInfos obtained from a nodegroup's TemplateInfo()
to NodeInfos obtained from real-world nodes is bound to fail, even with
kube reservations provided through nodegroup labels/annotations
(for instance: kernel memory reservation is hard to predict).

This makes `balance-similar-node-groups` likely to misbehave when
`scale-up-from-zero` is enabled (and a first nodegroup gets a real
node), for instance.

Following [Maciek Pytel's suggestion](kubernetes#3608 (comment))
(from discussions on a previous attempt at solving this), we can
implement a NodeInfo Processor that improves template-generated
NodeInfos whenever a node was created off a similar nodegroup.

We're storing the node's virtual origin through machineid, which works
fine but is a bit ugly (suggestions welcome).

Tested that this solves balance-similar-node-groups + scale-up-from-zero,
with various instance types on AWS and GCP.

Previous attempts to solve that issue/discussions:
* kubernetes#2892 (comment)
* kubernetes#3608 (comment)
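The machineid trick mentioned above amounts to stamping the synthetic node with its source group. A minimal sketch of that idea, with hypothetical helper names (`Node.Status.NodeInfo.MachineID` is a real Kubernetes field, normally carrying a machine identifier):

```go
package nodeinfos

import corev1 "k8s.io/api/core/v1"

// markVirtualOrigin records which node group a template-generated node was
// modeled from, by stashing the source group ID in the MachineID field.
// Only synthetic nodes are stamped this way; a real node's MachineID
// carries an actual machine identifier.
func markVirtualOrigin(node *corev1.Node, sourceGroupID string) {
	node.Status.NodeInfo.MachineID = sourceGroupID
}

// virtualOrigin returns the recorded source group ID (empty, or a genuine
// machine ID, means the node was not stamped).
func virtualOrigin(node *corev1.Node) string {
	return node.Status.NodeInfo.MachineID
}
```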
@bpineau (Contributor, Author) commented Dec 14, 2020

Implemented @MaciekPytel's suggestion in PR #3761; closing in favour of the new PR.

@bpineau bpineau closed this Dec 14, 2020
bpineau added a commit to DataDog/autoscaler that referenced this pull request Apr 8, 2021
This is a third attempt at kubernetes#1021 (it might also provide an alternate solution for kubernetes#2892, kubernetes#3608 and kubernetes#3609).
Some use cases include: balancing similar node groups when upscaling from zero, edited/updated ASG/MIG taints and labels, and an updated instance type.

Per kubernetes#1021 discussion, a flag might be acceptable, if defaulting to false and describing the limitations (not all cloud providers) and the risks of using it (loss of accuracy, risk of upscaling unusable nodes or leaving pending pods).

Per kubernetes#3609 discussion, using a NodeInfo processor is preferred.

`GetNodeInfosForGroups()`'s fate (and the opportunity to split NodeInfoProcessor into NodeInfoProcessor and NodeInfoDecoratorProcessor) is left open for discussion, as I'd like to collect feedback on refactoring plans here.