Subtract toBeDeleted nodes from number of upcoming nodes; cleanup toBeDeleted taints from all nodes, not only ready ones #4211
Conversation
Can you add some unit tests that would fail without this change?
@x13n Added a unit test in
I read the original issue and while I understand the need to clean up taints on all nodes instead of just ready ones, I don't understand why we're reducing the number of nodes that should come up by subtracting the nodes that are being deleted. Suppose:
That's not the case here: this is not being subtracted from the number of nodes that should come up; it's subtracted from the number of nodes that are supposedly coming up right now (that cluster-autoscaler thinks are coming up right now).
The number of nodes that are supposedly coming up right now will later impact the number of nodes that should come up: CA will think that there's no need to create more nodes, since they are coming up already.
Exactly
My PR is reducing the number of nodes that CA thinks are coming up by subtracting the number of nodes being deleted; otherwise CA thinks these nodes that are being deleted are actually coming up right now, which is not true. This actually increases (not decreases) the number of nodes that should come up.
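To make the arithmetic being discussed concrete, here is a minimal sketch (illustrative names, not the exact clusterstate fields or code) of how the "upcoming" count behaves with and without subtracting deleted nodes:

```go
package sketch

// readiness mirrors, in simplified form, the per-node-group counters the
// conversation refers to; these are illustrative names, not the exact
// clusterstate fields.
type readiness struct {
	Ready, Unready, LongUnregistered, Deleted int
}

// upcomingNodes: nodes that are within the target size but not yet counted as
// Ready/Unready/LongUnregistered are treated as "coming up". A node tainted
// ToBeDeleted currently falls into none of those buckets, so it inflates this
// count; the PR's proposal corresponds to subtractDeleted = true.
func upcomingNodes(targetSize int, r readiness, subtractDeleted bool) int {
	upcoming := targetSize - (r.Ready + r.Unready + r.LongUnregistered)
	if subtractDeleted {
		upcoming -= r.Deleted
	}
	if upcoming < 0 {
		upcoming = 0
	}
	return upcoming
}
```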
Hm, you're right, this will increase, not decrease, the number of upcoming nodes. That brings up another concern though: if some nodes are tainted for deletion but don't actually end up deleted, we may exceed the max size. Nodes that are being deleted can end up staying around, e.g. if CA is restarted in the meantime.
That's the other part of the PR, where deletion taints are properly removed on CA startup.
Proper removal of deletion taints is not my concern here. We may still exceed max # of nodes in the following scenario:
Now we have both X and Y up and running even though they might be exceeding the limit.
I don't see how this behavior could be triggered by my PR. My PR doesn't touch the scale-up logic; again, it only changes the number of nodes CA thinks are coming up right now, it doesn't change the number of nodes it thinks are there. I don't see how this would cause exceeding the max number of nodes in a NG.
@x13n I think I found what you mean. Here: autoscaler/cluster-autoscaler/core/scale_up.go, lines 476 to 478 in 86068ba. There, the number of upcoming nodes is considered to check whether the maximum number of nodes in the whole cluster would be exceeded. If you want, I can add/subtract the deleted nodes here as well; then it's the same behavior as before. However, I'm not sure what the best course of action would actually be here (allow exceeding the max while nodes are being deleted vs. risk that scale-up is not possible when the max would be exceeded). As for the per-node-group max number of nodes, this shouldn't be affected by my change, as that check considers only the actual number of nodes in the node group (independent of the state of the node in k8s): autoscaler/cluster-autoscaler/processors/nodegroupset/balancing_processor.go, lines 87 to 113 in 86068ba.
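For reference, the cluster-wide guard being pointed at has roughly this shape (paraphrased sketch, not the exact code at commit 86068ba):

```go
package sketch

// maxTotalNodesReached sketches the cluster-wide guard referenced above: if
// the existing nodes plus the "upcoming" nodes already reach the configured
// maximum, the scale-up is skipped. If nodes that are being deleted are still
// counted as "upcoming", this guard can fire even though those nodes are
// about to disappear.
func maxTotalNodesReached(existingNodes, upcomingCount, maxNodesTotal int) bool {
	return maxNodesTotal > 0 && existingNodes+upcomingCount >= maxNodesTotal
}
```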
Apologies for the late response. Yes, I meant the logic in autoscaler/cluster-autoscaler/core/scale_up.go, lines 476 to 478 in 86068ba.
I think deleted nodes should also be considered here. In edge cases where deletion takes an arbitrarily long time, we could easily exceed the maximum if deleted nodes are not accounted for in the scale-up logic.
I tried to somehow get the number of deleted nodes at this point in the code to add/subtract them, so I went through the call graph. One thing upfront that I found:
In the function where the node lists are obtained, allNodes and readyNodes are retrieved (allNodes includes unready and cordoned nodes, readyNodes doesn't).
In the same function, readyNodes is then used three times:
- the first time to derive node group infos from the ready nodes,
- the second time for another check that the max number of nodes in the cluster is not yet reached,
- the third time to finally call the scale-up function.
I believe it would be more correct to pass allNodes instead of readyNodes here.
Any thoughts?
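For readers less familiar with this part of the code, a rough sketch of the allNodes vs. readyNodes distinction described above (illustration only, not the autoscaler's actual node-listing helper):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// filterReadyNodes illustrates the allNodes vs. readyNodes distinction:
// readyNodes drops nodes that are cordoned (unschedulable) or whose Ready
// condition is not True, while allNodes keeps them.
func filterReadyNodes(allNodes []*apiv1.Node) []*apiv1.Node {
	var ready []*apiv1.Node
	for _, node := range allNodes {
		if node.Spec.Unschedulable {
			continue // cordoned nodes are excluded from readyNodes
		}
		for _, cond := range node.Status.Conditions {
			if cond.Type == apiv1.NodeReady && cond.Status == apiv1.ConditionTrue {
				ready = append(ready, node)
				break
			}
		}
	}
	return ready
}
```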
Apologies again for the super slow response. I had to take some time to go through the existing code. I think the proposal makes sense. I'm just a bit worried that I didn't think about some edge case, though (historically, clusterstate changes were a source of "interesting" bugs), and would prefer to keep this change behind a feature flag because of this. It can be on by default and can be removed in a few versions if there are no issues, but it might be safer to have a quick rollback mechanism. WDYT?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Maybe a feature flag is overkill, but I would appreciate another pair of eyes on this. @MaciekPytel - can you take a look?
Sorry for missing this for so long. I understand the problem being solved. I'm worried that this PR may fix it, but at the same time introduce an 'opposite' issue.
The description of Deleted in the Readiness struct makes it pretty clear to me that the intent of a 'Deleted' node in Readiness is case 2 above. And I think the current implementation addresses this problem correctly. If Deleted nodes are defined as the nodes that no longer exist on the provider side, then we shouldn't subtract them from TargetSize. I think the bug is that we're incorrectly identifying nodes as Deleted, and the way to fix both problems 1 and 2 would be to fix that. Unfortunately that would probably require a slightly bigger code change: I think we should check whether a particular instance exists on the cloudprovider side as part of the readiness calculation. This is very similar to how we identify unregistered nodes, just doing the check in the other direction (check if there is a matching instance for a given node). WDYT @alfredkrohmer @x13n?
Hi @MaciekPytel, thank you for taking a look! I wouldn't want to introduce new issues into the autoscaler, but I'm concerned that the readiness calculation isn't aware of nodes being drained and deleted. To be exact, the readiness calculation doesn't consider taints that may exist on the nodes. If a node has the ToBeDeleted taint, it has an effect of NoSchedule. Pods can't be scheduled onto the node, but the autoscaler believes that enough resources exist to handle the pending pods. Unfortunately this blocks scale-ups indefinitely while the node is being drained and deleted. This can be observed in the logs attached to bug #4456.
I have found that scaling up the pods enough to exceed the available capacity/resources of the tainted node will result in a new node being added. Given that the Deleted field is the count of nodes with the ToBeDeleted taint, I assumed that this value could only be zero or 1 in typical scenarios. One alternative that I've been exploring is to have a separate controller monitor for the ToBeDeleted taint and trigger overprovisioning of a new node with pause pods, to prevent scale-ups from being blocked. This moves the burden of overprovisioning outside of the autoscaler instead of having the readiness check adjusted to remove upcoming nodes that shouldn't be considered viable. Please let me know your thoughts! Thanks!
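For context, the situation described above boils down to a check like the following sketch (assuming CA's ToBeDeletedByClusterAutoscaler taint key; the helper itself is only an illustration, not CA's deletetaint code):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// ToBeDeletedTaint is the taint key cluster-autoscaler places on a node it is
// draining (assumed here; the exact constant lives in CA's deletetaint package).
const ToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// hasToBeDeletedTaint shows the situation described above: a node being
// drained carries a NoSchedule taint, so pending pods cannot be scheduled
// onto it, yet the readiness calculation still treats the node as usable
// capacity.
func hasToBeDeletedTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == ToBeDeletedTaint && taint.Effect == apiv1.TaintEffectNoSchedule {
			return true
		}
	}
	return false
}
```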
I think this is the problem. It's not intended to be the number of nodes with the ToBeDeleted taint. It's intended to be the number of nodes that don't have a matching VM on the cloudprovider side, because the VM has already been deleted. My guess is that looking at a taint seemed like an easy proxy for nodes that have already been deleted by CA 1. As we can see, it isn't.
I think the right way to fix this issue without introducing new problems is to stop identifying nodes that are being deleted as already deleted. This should fix your scenario, as the node that is being deleted would not be counted as Deleted, meaning that it would be counted as either Ready or Unready, both of which are subtracted from upcoming nodes. It also means that nodes that have actually been deleted (in the sense that the VM no longer exists and it's just a Kubernetes object that hasn't been cleaned up yet) would still not be included in the upcoming nodes calculation. This part is correct - those nodes no longer count towards NodeGroup.TargetSize() and so should not be subtracted from this number.
We already get the list of cloudprovider instances in UpdateNodes() before the updateReadinessStats() call. I think we could use this data in updateReadinessStats() to only count a node as Deleted if there is no corresponding instance. The potentially tricky part would be understanding whether there is any negative interaction with node_instances_cache (do we need to call InvalidateNodeInstancesCacheEntry() on scale-down? I'm guessing yes, but I haven't really dived into the code deeply enough to know).
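A minimal sketch of the proposed counting (the ProviderID-keyed lookup is an assumption for this sketch; the real data would come from the instance listing UpdateNodes() already performs):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// countDeleted sketches the proposed counting: a node is counted as Deleted
// only when there is no matching instance on the cloudprovider side, instead
// of whenever it carries the ToBeDeleted taint.
func countDeleted(nodes []*apiv1.Node, cloudInstances map[string]bool) int {
	deleted := 0
	for _, node := range nodes {
		if !cloudInstances[node.Spec.ProviderID] {
			// The Node object still exists in Kubernetes, but its VM is gone.
			deleted++
		}
	}
	return deleted
}
```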
That part makes sense to me. I'd still clear node taints from all nodes on startup though - I don't see a reason to keep a (possibly intermittently) unready node tainted when CA restarts. It wouldn't automatically resume deletion anyway.
I made a commit based on the suggestions provided, but I'm not certain that I got it right. Within updateReadinessStats, I switched to using the Cluster State Registry's latest list of nodes unregistered in the cloud provider. I also made a change to the GetReadinessState function so that the ToBeDeleted taint key marks a node as Unready. I don't know if this is necessary, but it felt like it made sense to mark it that way. After building and deploying the changes from this commit 1 to a test cluster, scale-ups were no longer being blocked when a node was tainted with ToBeDeleted. However, I'm not sure that these code changes exactly match what was described. Additionally, I didn't make any attempt to address the scale-down path that may need to call InvalidateNodeInstancesCacheEntry.
+1. No reason to keep taints. I don't think we ever made a conscious decision to skip any nodes when un-tainting; we just implicitly assumed readyNodes ~= allNodes, as normally nodes removed by CA don't become unready until deleted on the cloudprovider side. I'm guessing this is an issue when using
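A minimal sketch of what clearing the taints from allNodes on startup could look like (illustration only; only the in-memory objects are modified here, and a real cleanup would also need to update each node through the Kubernetes API):

```go
package sketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// cleanToBeDeletedTaints sketches the startup cleanup discussed above, applied
// to allNodes rather than just readyNodes, so a node that went unready
// mid-drain (or across a CA restart) does not stay tainted forever.
func cleanToBeDeletedTaints(allNodes []*apiv1.Node) {
	for _, node := range allNodes {
		var kept []apiv1.Taint
		for _, taint := range node.Spec.Taints {
			// "ToBeDeletedByClusterAutoscaler" is the taint key CA uses when
			// draining a node (assumed here for illustration).
			if taint.Key == "ToBeDeletedByClusterAutoscaler" {
				continue
			}
			kept = append(kept, taint)
		}
		node.Spec.Taints = kept
	}
}
```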
I don't think that implements what I described. An "unregistered node" is a VM that has no matching Node object in k8s. I think a "deleted node" should be a Node object in k8s that has no VM. I don't think we can infer this information from unregistered nodes; we need something similar to getNotRegisteredNodes() working "in the other direction" (i.e. for each node, see if there is a matching instance).
Hmm, good point. It certainly seems logical, but I'm not sure whether this has some undesired side effects (e.g. marking the nodepool as unhealthy during a large scale-down?).
Hi @MaciekPytel & @x13n, I've created a new pull request that incorporates some of the changes (tests for UpcomingNodes, clearing taints from ALL nodes on startup) and new changes to identify nodes that no longer have a corresponding VM on the cloud provider. Please let me know your feedback on the proposed changes. If possible, I'd like to have a discussion regarding InvalidateNodeInstancesCacheEntry during scale-down. Thanks!
This has been obsoleted by #5054 now.
/close
@x13n: Closed this PR.
@x13n Correct me if I'm wrong, but I believe #5054 doesn't actually address the main problem described in #3949. Our problem is not nodes that have been deleted in the cloud provider while still having a k8s node, but rather that nodes that are being drained by cluster-autoscaler are counted as "upcoming" nodes for binpacking. Or is the PR that you linked fixing the problem by rewiring a bunch of logic so that it gets fixed "on the way", too?
The "right way" to address #3949 was discussed above in this PR. In order to get rid of the bug, CA needs to only identify deleted (as in - VM removed) nodes as Once proper counting of deleted instances is implemented, the case you described in #3949 will stop happening: existing instance will be considered by CA as Does that make sense? |
Fixes #3949