fix: canary scaledown with maxsurge #1429

Merged
merged 17 commits into argoproj:master on Aug 23, 2021

Conversation

@harikrongali (Contributor) commented Aug 19, 2021

Signed-off-by: hari rongali <[email protected]>

fixes #1415

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

codecov bot commented Aug 19, 2021

Codecov Report

Merging #1429 (cd27fdb) into master (0b70775) will increase coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #1429      +/-   ##
==========================================
+ Coverage   81.38%   81.41%   +0.03%     
==========================================
  Files         108      108              
  Lines       10043    10060      +17     
==========================================
+ Hits         8173     8190      +17     
  Misses       1309     1309              
  Partials      561      561              
Impacted Files Coverage Δ
utils/replicaset/canary.go 83.82% <100.00%> (+1.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0b70775...cd27fdb.

@harikrongali (Contributor, Author) commented:

@alexmt / @jessesuen Please review.

test/e2e/canary_test.go: outdated review thread (resolved)
@jessesuen modified the milestone: v1.1 (Aug 19, 2021)
@harikrongali (Contributor, Author) commented:

@jessesuen Please review

@jessesuen (Member) left a comment:

CalculateReplicaCountsForCanary is arguably the most critical part of rollout code and the new code is not covered in unit tests. I will insist that we test this with additional unit tests.

test/e2e/canary_test.go: outdated review thread (resolved)
utils/replicaset/canary.go: outdated review thread (resolved)
Comment on lines 216 to 218
WaitForRolloutReplicas(6).
Then().
ExpectCanaryStablePodCount(4, 2)
@jessesuen (Member) commented:

I still think this is not correct. According to the sequence of events:

  1. At T-0, we were at: 6 pods (2 canary, 4 stable, 4 available)
  2. At T-1, we had a scale-up event and went to: 10 pods (6 canary, 4 stable, 4 available).
  3. At T-2, we had a scale-down event and went back down to 6 pods (4 canary, 2 stable, 2 available)

Notice that at T-2, we have now violated maxUnavailable because we only have 2 pods available when the spec (maxUnavailable: 0) requires 4 to be available.

Although this is a day-0 issue, I think the proper fix may be to address #738. If we fix #738, then this problem may also be fixed.
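To make the violation at T-2 concrete, here is a minimal sketch of the availability floor check; the spec values (replicas=4, maxUnavailable=0) are assumptions inferred from the numbers in this scenario, not taken from #1415 directly:

```go
// Illustrative arithmetic only, not controller code. The spec values
// below are assumed from the scenario described above.
package main

import "fmt"

func main() {
	specReplicas := int32(4)
	maxUnavailable := int32(0)
	minAvailable := specReplicas - maxUnavailable // 4 pods must remain available

	availableAtT2 := int32(2) // from the T-2 state above: only 2 pods available
	if availableAtT2 < minAvailable {
		fmt.Printf("T-2: %d available < required %d, so maxUnavailable is violated\n", availableAtT2, minAvailable)
	}
}
```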

@harikrongali (Contributor, Author) commented on Aug 20, 2021:

Yes, you are correct that it requires changes to adhere to both maxSurge and minAvailable. But the bug referred to in #738 is about fixing weights in line with the Deployment. What I am trying to fix here is the complete scale-down of either the stable or the canary ReplicaSet, which I will address as part of this PR; I will take #738 for the refactor that fixes the whole calculation. (I don't want to refactor the whole block at once; I'd rather take it step by step based on user feedback.)

@harikrongali (Contributor, Author) commented on Aug 20, 2021:

Can you add the 1.2 label to #738?

@jessesuen (Member) commented on Aug 21, 2021:

> But the bug referred to in #738 is about fixing weights in line with the Deployment.

The two are closely related, if not the same issue. The behavior you saw in #1415 is a manifestation of the same underlying problem, just with different maxSurge/maxUnavailable parameters.

This fix appears to address part of the symptom, whereas #738 is the actual root problem. I believe that if we had fixed #738, the incident in #1415 would not have happened.

I do recognize that this fix improves the current behavior, but I'm pretty sure the code in this PR will need to be thrown away in order to address #738.

@harikrongali (Contributor, Author) commented on Aug 21, 2021:

Agreed, #738 would be the fix we need to act on. With the current issue, there is a high chance of downtime when bad images are deployed, and the given fix will improve and address that. We need to refactor most of the calculation to fix #738, and that can go into 1.2 given the complexity.

@sonarqubecloud commented:

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)
Coverage: no coverage information
Duplication: 0.0%

@jessesuen (Member) left a comment:

@harikrongali great work!

So with this fix we now fix the bug where scale-down could violate maxUnavailable. But we still have the issue that scale-up could violate maxSurge (#738), correct?
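For reference, a short sketch of how the two limits relate to the spec; the variable names mirror the diff quoted below, and the concrete values are assumed purely for illustration:

```go
// Sketch only: how the ceiling (maxSurge) and floor (maxUnavailable)
// are derived from the spec. Values are assumed example numbers.
package main

import "fmt"

func main() {
	specReplicas := int32(4)
	maxSurge := int32(2)       // assumed example value
	maxUnavailable := int32(0) // assumed example value

	maxReplicaCountAllowed := specReplicas + maxSurge         // ceiling that scale-up must respect (#738)
	minAvailableReplicaCount := specReplicas - maxUnavailable // floor that scale-down must respect (this PR)

	fmt.Println("max total replicas allowed:", maxReplicaCountAllowed)        // 6
	fmt.Println("min available replicas required:", minAvailableReplicaCount) // 4
}
```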

Comment on lines +227 to +247
// 1. adjust adjustRSReplicaCount to be within maxSurge
adjustRSReplicaCount = adjustRSReplicaCount - overTheLimitVal
// 2. Calculate availability corresponding to adjusted count
totalAvailableReplicas = totalAvailableReplicas + minValue(adjustRS.Status.AvailableReplicas, adjustRSReplicaCount)
// 3. Calculate decrease in availability of adjustRS because of (1)
extraAvailableAdjustRS = maxValue(0, adjustRS.Status.AvailableReplicas-adjustRSReplicaCount)

// 4. Now calculate how far count is from maxUnavailable limit
moreToNeedAvailableReplicas := maxValue(0, minAvailableReplicaCount-totalAvailableReplicas)
// 5. From (3), we got the count for decrease in availability because of (1),
// take the min of (3) & (4) and add it back to adjustRS
// the remainder of moreToNeedAvailableReplicas can be ignored as it belongs to drainRS;
// there is no case of deviating from the maxUnavailable limit because of drainRS, since in that event
// the scaleDown calculation won't even occur: the caller checks
// replicasToScaleDown <= minAvailableReplicaCount before calling this function
adjustRSReplicaCount = adjustRSReplicaCount + minValue(extraAvailableAdjustRS, moreToNeedAvailableReplicas)
// 6. Calculate final overTheLimit because of adjustment
overTheLimitVal = maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed)
// 7. we can safely subtract overTheLimitVal from drainRS; other cases, such as deviating from the
// maxUnavailable limit, won't occur, as explained in (5)
drainRSReplicaCount = drainRSReplicaCount - overTheLimitVal
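To make steps 1 to 7 concrete, here is a self-contained sketch that walks the same clamp arithmetic with assumed example numbers; minValue and maxValue are stand-ins for the package helpers, and none of the values are taken from a real rollout:

```go
// Walks the seven steps above with assumed example values; this is not the
// actual CalculateReplicaCountsForCanary code, just the same clamp arithmetic.
package main

import "fmt"

func minValue(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

func maxValue(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

func main() {
	// Assumed state: up to 6 replicas allowed in total, at least 5 must stay available.
	maxReplicaCountAllowed := int32(6)
	minAvailableReplicaCount := int32(5)
	adjustRSReplicaCount := int32(6)   // desired count of the ReplicaSet being adjusted
	adjustRSAvailable := int32(6)      // its currently available pods
	drainRSReplicaCount := int32(2)    // desired count of the ReplicaSet being drained
	totalAvailableReplicas := int32(0) // availability already accounted for elsewhere

	// 1. Pull adjustRS back within maxSurge.
	overTheLimitVal := maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed) // 2
	adjustRSReplicaCount -= overTheLimitVal                                                         // 4
	// 2. Availability corresponding to the adjusted count.
	totalAvailableReplicas += minValue(adjustRSAvailable, adjustRSReplicaCount) // 4
	// 3. Availability lost in adjustRS because of step 1.
	extraAvailableAdjustRS := maxValue(0, adjustRSAvailable-adjustRSReplicaCount) // 2
	// 4. Distance from the maxUnavailable floor.
	moreToNeedAvailableReplicas := maxValue(0, minAvailableReplicaCount-totalAvailableReplicas) // 1
	// 5. Give back just enough of the lost availability to adjustRS.
	adjustRSReplicaCount += minValue(extraAvailableAdjustRS, moreToNeedAvailableReplicas) // 5
	// 6. Recompute the surge overflow after the give-back.
	overTheLimitVal = maxValue(0, adjustRSReplicaCount+drainRSReplicaCount-maxReplicaCountAllowed) // 1
	// 7. Take the remaining overflow out of drainRS.
	drainRSReplicaCount -= overTheLimitVal // 1

	fmt.Println("adjustRS:", adjustRSReplicaCount, "drainRS:", drainRSReplicaCount) // adjustRS: 5 drainRS: 1
}
```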
@jessesuen (Member) commented:

Really appreciate the detailed explanation for the calculation.

@harikrongali (Contributor, Author) commented:

@jessesuen yes, issue #738 still exists, where scale-up could violate maxSurge.

@jessesuen merged commit 7c6a0c5 into argoproj:master on Aug 23, 2021
jessesuen pushed a commit that referenced this pull request Aug 23, 2021
jessesuen added a commit to jessesuen/argo-rollouts that referenced this pull request Sep 2, 2021
Labels: none yet
Projects: none yet
Development
Successfully merging this pull request may close these issues:
Canary deployment scaledDown stable pods during HPA scale event (#1415)
2 participants