fix: canary scaledown with maxsurge #1429
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #1429      +/-   ##
==========================================
+ Coverage   81.38%   81.41%   +0.03%
==========================================
  Files         108      108
  Lines       10043    10060      +17
==========================================
+ Hits         8173     8190      +17
  Misses       1309     1309
  Partials      561      561

Continue to review the full report at Codecov.
@alexmt / @jessesuen Please review.
@jessesuen Please review.
CalculateReplicaCountsForCanary is arguably the most critical part of the rollout code, and the new code is not covered by unit tests. I will insist that we cover this with additional unit tests.
test/e2e/canary_test.go
WaitForRolloutReplicas(6).
Then().
ExpectCanaryStablePodCount(4, 2)
I still think this is not correct. According to the sequence of events:
- At T-0, we were at: 6 pods (2 canary, 4 stable, 4 available)
- At T-1, we had a scale-up event and went to: 10 pods (6 canary, 4 stable, 4 available).
- At T-2, we had a scale-down event and went back down to 6 pods (4 canary, 2 stable, 2 available)
Notice that at T-2, we have now violated maxUnavailable because we only have 2 pods available when the spec (maxUnavailable: 0) requires 4 to be available.
Although this is a day-0 issue, I think the proper fix may be to address #738. If we fix #738, then this problem may also be fixed.
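Below is a minimal, self-contained sketch of the arithmetic behind this scenario, assuming spec.replicas = 4 with maxSurge: 2 and maxUnavailable: 0; the pod counts are taken from the sequence above, but the rest of the setup is illustrative only and is not the controller or test code.

```go
package main

import "fmt"

func main() {
	// Assumed rollout settings for this scenario (inferred from the discussion
	// above, not taken from the actual test manifest).
	const specReplicas = 4
	const maxUnavailable = 0
	minAvailable := specReplicas - maxUnavailable // 4 pods must stay available

	// Available pod counts at each point in the sequence described above.
	type step struct {
		name      string
		total     int
		available int
	}
	steps := []step{
		{"T-0", 6, 4},  // 2 canary, 4 stable, 4 available
		{"T-1", 10, 4}, // scale-up: 6 canary, 4 stable, 4 available
		{"T-2", 6, 2},  // scale-down: 4 canary, 2 stable, 2 available
	}
	for _, s := range steps {
		fmt.Printf("%s: total=%d available=%d violatesMaxUnavailable=%v\n",
			s.name, s.total, s.available, s.available < minAvailable)
	}
	// T-2 is the violation: only 2 pods available when 4 are required.
}
```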
Yes, you are correct that it requires changes to adhere to both maxSurge and minAvailable. But the bug referred to in #738 is about fixing the weights in line with Deployment behavior. What I am trying to fix here is to avoid scaling down either the stable or the canary ReplicaSet completely; I will address that as part of this PR and use #738 for the refactor that fixes the whole calculation. (I don't want to refactor the whole block at once; I'd rather take it step by step based on user feedback.)
Can you add the 1.2 label to #738?
"But the bug referred to in #738 is about fixing the weights in line with Deployment behavior"
The two are closely related, if not the same issue. The behavior you saw in #1415 is a manifestation of the same underlying problem, just with different maxSurge/maxUnavailable parameters.
This fix appears to address part of the symptom, whereas #738 is the root problem. I believe that if we had fixed #738, the incident in #1415 would not have happened.
I do recognize that this fix improves the current behavior, but I'm pretty sure the code in this PR will need to be thrown away in order to address #738.
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
@harikrongali great work!
So with this fix, we now fix the bug where a scale-down could violate maxUnavailable. But we still have the issue that a scale-up could violate maxSurge (#738), correct?
// 1. Adjust adjustRSReplicaCount to be within maxSurge
adjustRSReplicaCount = adjustRSReplicaCount - overTheLimitVal
// 2. Calculate the availability corresponding to the adjusted count
totalAvailableReplicas = totalAvailableReplicas + minValue(adjustRS.Status.AvailableReplicas, adjustRSReplicaCount)
// 3. Calculate the decrease in availability of adjustRS caused by (1)
extraAvailableAdjustRS = maxValue(0, adjustRS.Status.AvailableReplicas-adjustRSReplicaCount)

// 4. Now calculate how far the count is from the maxUnavailable limit
moreToNeedAvailableReplicas := maxValue(0, minAvailableReplicaCount-totalAvailableReplicas)
// 5. From (3) we have the decrease in availability caused by (1);
// take the min of (3) and (4) and add it back to adjustRS.
// The remainder of moreToNeedAvailableReplicas can be ignored because it belongs to drainRS:
// drainRS can never push us past the maxUnavailable limit, since in that case the
// scale-down calculation would not run at all; the caller already checks
// replicasToScaleDown <= minAvailableReplicaCount.
adjustRSReplicaCount = adjustRSReplicaCount + minValue(extraAvailableAdjustRS, moreToNeedAvailableReplicas)
// 6. Calculate the final overTheLimit value after the adjustment
overTheLimitVal = maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed)
// 7. We can now safely subtract from drainRS; as noted in (5), this cannot
// cause a deviation from the maxUnavailable limit
drainRSReplicaCount = drainRSReplicaCount - overTheLimitVal
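For readers following along, here is a self-contained numeric walk-through of steps (1)-(7). The variable names mirror the diff above, but the input counts, the minValue/maxValue helpers, and the surrounding setup are only a sketch of my own, not the controller's actual code; the real controller derives these values from the Rollout spec and ReplicaSet status.

```go
package main

import "fmt"

func minValue(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

func maxValue(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

func main() {
	// Hypothetical inputs chosen to match the 6-pod scenario in this thread.
	var (
		adjustRSReplicaCount     int32 = 6 // desired replicas of the RS being adjusted
		adjustRSAvailable        int32 = 4 // stands in for adjustRS.Status.AvailableReplicas
		drainRSReplicaCount      int32 = 4 // desired replicas of the RS being drained
		maxReplicaCountAllowed   int32 = 6 // spec.replicas + maxSurge
		minAvailableReplicaCount int32 = 4 // spec.replicas - maxUnavailable
		totalAvailableReplicas   int32 = 0 // availability accumulated so far
		overTheLimitVal          int32 = 4 // adjust + drain - maxReplicaCountAllowed
	)

	// 1. Pull adjustRSReplicaCount back within maxSurge.
	adjustRSReplicaCount -= overTheLimitVal // 6 -> 2
	// 2. Availability corresponding to the adjusted count.
	totalAvailableReplicas += minValue(adjustRSAvailable, adjustRSReplicaCount) // 0 -> 2
	// 3. Availability lost from adjustRS because of step 1.
	extraAvailableAdjustRS := maxValue(0, adjustRSAvailable-adjustRSReplicaCount) // 2
	// 4. Shortfall against the maxUnavailable limit.
	moreToNeedAvailableReplicas := maxValue(0, minAvailableReplicaCount-totalAvailableReplicas) // 2
	// 5. Give back to adjustRS only what is needed to stay within maxUnavailable.
	adjustRSReplicaCount += minValue(extraAvailableAdjustRS, moreToNeedAvailableReplicas) // 2 -> 4
	// 6. Recompute how far the combined count now exceeds the surge limit.
	overTheLimitVal = maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed) // 2
	// 7. Take the remainder out of drainRS.
	drainRSReplicaCount -= overTheLimitVal // 4 -> 2

	fmt.Println("adjustRS:", adjustRSReplicaCount, "drainRS:", drainRSReplicaCount) // 4 2
}
```

With these assumed inputs the final split is 4 and 2, which respects both the total cap of 6 and, since all 4 adjusted replicas are available in this example, the minimum of 4 available.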
Really appreciate the detailed explanation for the calculation.
@jessesuen yes, issue #738 still exists, where a scale-up could violate maxSurge.
…oproj#1429)" This reverts commit 7c6a0c5.
fixes #1415