fix: canary scaledown with maxsurge #1429
I still think this is not correct. According to the sequence of events:
Notice that at T-2, we have now violated maxUnavailable because we only have 2 pods available when the spec (maxUnavailable: 0) requires 4 to be available.
Although this is a day-0 issue, I think the proper fix may be to address #738. If we fix #738, then this problem may also be fixed.
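A minimal sketch of the arithmetic behind that observation, assuming a desired replica count of 4 (inferred from "requires 4 to be available" with maxUnavailable: 0) and an absolute rather than percentage maxUnavailable. The function names are hypothetical and are not taken from the codebase under review:

```go
package main

import "fmt"

// minAvailable returns how many pods must stay available for a given
// desired replica count and an absolute maxUnavailable value.
// Hypothetical helper for illustration only.
func minAvailable(replicas, maxUnavailable int) int {
	if maxUnavailable > replicas {
		return 0
	}
	return replicas - maxUnavailable
}

// violatesMaxUnavailable reports whether the number of currently
// available pods has dropped below the required floor.
func violatesMaxUnavailable(replicas, maxUnavailable, available int) bool {
	return available < minAvailable(replicas, maxUnavailable)
}

func main() {
	// The T-2 state described above: 4 desired, maxUnavailable 0, 2 available.
	fmt.Println(violatesMaxUnavailable(4, 0, 2)) // prints "true"
}
```

With maxUnavailable: 0 the floor equals the full replica count, so any scale-down that happens before replacement pods become available counts as a violation.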
Yes, you are correct that changes are required to adhere to both maxSurge and minAvailable. But the bug referenced in #738 is about bringing the weights in line with Deployment. The issue I am trying to fix here is to not scale down completely on either stable or canary, which I will address as part of this PR; I will use #738 for the refactor that fixes the whole calculation. (I don't want to refactor the whole block at once, and would rather take it step by step based on user feedback.)
Can you add the 1.2 label to #738?
The two are closely related, if not the same issue. The behavior you saw in #1415 is a manifestation of the same underlying problem but with different parameters of maxSurge/minAvailable.
This fix appears to address part of the symptom, whereas #738 is the actual root problem. I believe that if we had fixed #738, the incident in #1415 would not have happened.
I do recognize that this fix improves the current behavior, but I'm pretty sure the code in this PR will need to be thrown away in order to address #738.
Agreed, #738 would be the fix we need to act on. With the current issue, there is a high chance of downtime when bad images are deployed. The given fix will improve the behavior and fix that issue. We need to refactor most of the calculation to fix #738, and that can go into 1.2 given the complexity.