fix: canary scaledown with maxsurge #1429

Merged
merged 17 commits into argoproj:master on Aug 23, 2021

Conversation

@harikrongali (Contributor) commented Aug 19, 2021

Signed-off-by: hari rongali <[email protected]>

fixes #1415

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

codecov bot commented Aug 19, 2021

Codecov Report

Merging #1429 (cd27fdb) into master (0b70775) will increase coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1429      +/-   ##
==========================================
+ Coverage   81.38%   81.41%   +0.03%     
==========================================
  Files         108      108              
  Lines       10043    10060      +17     
==========================================
+ Hits         8173     8190      +17     
  Misses       1309     1309              
  Partials      561      561              
Impacted Files Coverage Δ
utils/replicaset/canary.go 83.82% <100.00%> (+1.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0b70775...cd27fdb.

@harikrongali (Contributor, Author) commented:

@alexmt / @jessesuen Please review.

test/e2e/canary_test.go: outdated review thread (resolved)
@jessesuen modified the milestone: v1.1 (Aug 19, 2021)
@harikrongali (Contributor, Author) commented:

@jessesuen Please review

@jessesuen (Member) left a comment:

CalculateReplicaCountsForCanary is arguably the most critical part of rollout code and the new code is not covered in unit tests. I will insist that we test this with additional unit tests.

test/e2e/canary_test.go: outdated review thread (resolved)
utils/replicaset/canary.go: outdated review thread (resolved)
Comment on lines 216 to 218
WaitForRolloutReplicas(6).
Then().
ExpectCanaryStablePodCount(4, 2)
@jessesuen (Member) commented:

I still think this is not correct. According to the sequence of events:

  1. At T-0, we were at: 6 pods (2 canary, 4 stable, 4 available)
  2. At T-1, we had a scale-up event and went to: 10 pods (6 canary, 4 stable, 4 available).
  3. At T-2, we had a scale-down event and went back down to 6 pods (4 canary, 2 stable, 2 available)

Notice that at T-2, we have now violated maxUnavailable because we only have 2 pods available when the spec (maxUnavailable: 0) requires 4 to be available.

Although this is a day-0 issue, I think the proper fix may be to address #738. If we fix #738, then this problem may also be fixed.
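To make the violation at T-2 concrete, here is a minimal sketch of the availability floor check; the spec values (replicas=4, maxUnavailable=0) are assumptions inferred from the numbers in this scenario, not taken from #1415 directly:

```go
// Illustrative arithmetic only, not controller code. The spec values
// below are assumed from the scenario described above.
package main

import "fmt"

func main() {
	specReplicas := int32(4)
	maxUnavailable := int32(0)
	minAvailable := specReplicas - maxUnavailable // 4 pods must remain available

	availableAtT2 := int32(2) // from the T-2 state above: only 2 pods available
	if availableAtT2 < minAvailable {
		fmt.Printf("T-2: %d available < required %d, so maxUnavailable is violated\n", availableAtT2, minAvailable)
	}
}
```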

@harikrongali (Contributor, Author) commented on Aug 20, 2021:

Yes, you are correct that it requires changes to adhere to both maxSurge and minAvailable. But the bug referred to in #738 is about fixing weights in line with the Deployment. What I am trying to fix here is the complete scale-down of either the stable or the canary ReplicaSet, which I will address as part of this PR; I will take #738 for the refactor that fixes the whole calculation. (I don't want to refactor the whole block at once; I'd rather take it step by step based on user feedback.)

@harikrongali (Contributor, Author) commented on Aug 20, 2021:

Can you add the 1.2 label to #738?

@jessesuen (Member) commented on Aug 21, 2021:

> But the bug referred to in #738 is about fixing weights in line with the Deployment.

The two are closely related, if not the same issue. The behavior you saw in #1415 is a manifestation of the same underlying problem, just with different maxSurge/maxUnavailable parameters.

This fix appears to address part of the symptom, whereas #738 is the actual root problem. I believe that if we had fixed #738, the incident in #1415 would not have happened.

I do recognize that this fix improves the current behavior, but I'm pretty sure the code in this PR will need to be thrown away in order to address #738.

@harikrongali (Contributor, Author) commented on Aug 21, 2021:

Agreed, #738 would be the fix we need to act on. With the current issue, there is a high chance of downtime when bad images are deployed, and the given fix will improve and address that. We need to refactor most of the calculation to fix #738, and that can go into 1.2 given the complexity.

@sonarqubecloud commented:

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)
Coverage: no coverage information
Duplication: 0.0%

@jessesuen (Member) left a comment:

@harikrongali great work!

So with this fix we now fix the bug where scale-down could violate maxUnavailable. But we still have the issue that scale-up could violate maxSurge (#738), correct?
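For reference, a short sketch of how the two limits relate to the spec; the variable names mirror the diff quoted below, and the concrete values are assumed purely for illustration:

```go
// Sketch only: how the ceiling (maxSurge) and floor (maxUnavailable)
// are derived from the spec. Values are assumed example numbers.
package main

import "fmt"

func main() {
	specReplicas := int32(4)
	maxSurge := int32(2)       // assumed example value
	maxUnavailable := int32(0) // assumed example value

	maxReplicaCountAllowed := specReplicas + maxSurge         // ceiling that scale-up must respect (#738)
	minAvailableReplicaCount := specReplicas - maxUnavailable // floor that scale-down must respect (this PR)

	fmt.Println("max total replicas allowed:", maxReplicaCountAllowed)        // 6
	fmt.Println("min available replicas required:", minAvailableReplicaCount) // 4
}
```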

Comment on lines +227 to +247
// 1. adjust adjustRSReplicaCount to be within maxSurge
adjustRSReplicaCount = adjustRSReplicaCount - overTheLimitVal
// 2. Calculate availability corresponding to adjusted count
totalAvailableReplicas = totalAvailableReplicas + minValue(adjustRS.Status.AvailableReplicas, adjustRSReplicaCount)
// 3. Calculate decrease in availability of adjustRS because of (1)
extraAvailableAdjustRS = maxValue(0, adjustRS.Status.AvailableReplicas-adjustRSReplicaCount)

// 4. Now calculate how far count is from maxUnavailable limit
moreToNeedAvailableReplicas := maxValue(0, minAvailableReplicaCount-totalAvailableReplicas)
// 5. From (3), we got the count for decrease in availability because of (1),
// take the min of (3) & (4) and add it back to adjustRS
// the remainder of moreToNeedAvailableReplicas can be ignored as it belongs to drainRS;
// there is no case of deviating from the maxUnavailable limit because of drainRS, since in that event
// the scaleDown calculation won't even occur: the caller checks
// replicasToScaleDown <= minAvailableReplicaCount before calling this function
adjustRSReplicaCount = adjustRSReplicaCount + minValue(extraAvailableAdjustRS, moreToNeedAvailableReplicas)
// 6. Calculate final overTheLimit because of adjustment
overTheLimitVal = maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed)
// 7. we can safely subtract overTheLimitVal from drainRS; other cases, such as deviating from the
// maxUnavailable limit, won't occur, as explained in (5)
drainRSReplicaCount = drainRSReplicaCount - overTheLimitVal
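To make steps 1 to 7 concrete, here is a self-contained sketch that walks the same clamp arithmetic with assumed example numbers; minValue and maxValue are stand-ins for the package helpers, and none of the values are taken from a real rollout:

```go
// Walks the seven steps above with assumed example values; this is not the
// actual CalculateReplicaCountsForCanary code, just the same clamp arithmetic.
package main

import "fmt"

func minValue(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

func maxValue(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

func main() {
	// Assumed state: up to 6 replicas allowed in total, at least 5 must stay available.
	maxReplicaCountAllowed := int32(6)
	minAvailableReplicaCount := int32(5)
	adjustRSReplicaCount := int32(6)   // desired count of the ReplicaSet being adjusted
	adjustRSAvailable := int32(6)      // its currently available pods
	drainRSReplicaCount := int32(2)    // desired count of the ReplicaSet being drained
	totalAvailableReplicas := int32(0) // availability already accounted for elsewhere

	// 1. Pull adjustRS back within maxSurge.
	overTheLimitVal := maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed) // 2
	adjustRSReplicaCount -= overTheLimitVal                                                         // 4
	// 2. Availability corresponding to the adjusted count.
	totalAvailableReplicas += minValue(adjustRSAvailable, adjustRSReplicaCount) // 4
	// 3. Availability lost in adjustRS because of step 1.
	extraAvailableAdjustRS := maxValue(0, adjustRSAvailable-adjustRSReplicaCount) // 2
	// 4. Distance from the maxUnavailable floor.
	moreToNeedAvailableReplicas := maxValue(0, minAvailableReplicaCount-totalAvailableReplicas) // 1
	// 5. Give back just enough of the lost availability to adjustRS.
	adjustRSReplicaCount += minValue(extraAvailableAdjustRS, moreToNeedAvailableReplicas) // 5
	// 6. Recompute the surge overflow after the give-back.
	overTheLimitVal = maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed) // 1
	// 7. Take the remaining overflow out of drainRS.
	drainRSReplicaCount -= overTheLimitVal // 1

	fmt.Println("adjustRS:", adjustRSReplicaCount, "drainRS:", drainRSReplicaCount) // adjustRS: 5 drainRS: 1
}
```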
@jessesuen (Member) commented:

Really appreciate the detailed explanation for the calculation.

@harikrongali (Contributor, Author) commented:

@jessesuen yes, issue #738 still exists, where scale-up could violate maxSurge.

@jessesuen merged commit 7c6a0c5 into argoproj:master on Aug 23, 2021
jessesuen pushed a commit that referenced this pull request Aug 23, 2021
jessesuen added a commit to jessesuen/argo-rollouts that referenced this pull request Sep 2, 2021
Labels: none yet
Projects: none yet
Development
Successfully merging this pull request may close these issues:
Canary deployment scaledDown stable pods during HPA scale event (#1415)
2 participants