
feat: introduce abortScaleDownDelaySeconds to control scale down of preview/canary upon abort #1160

Merged — 5 commits, Jul 19, 2021

Conversation

@huikang (Member) commented May 11, 2021

Addresses #941 and #793

@huikang huikang force-pushed the feat-scaledownonAbort branch from 9f8bc40 to f127146 Compare May 11, 2021 22:43
@huikang huikang marked this pull request as draft May 11, 2021 22:44
@huikang huikang force-pushed the feat-scaledownonAbort branch 8 times, most recently from 4966f26 to 7d7278e Compare May 11, 2021 23:33
@huikang huikang changed the title from Feat scaledownon abort to feat: scaledown on abort May 11, 2021
@codecov (codecov bot) commented May 11, 2021

Codecov Report

Merging #1160 (0113a3a) into master (de0d8e0) will increase coverage by 0.02%.
The diff coverage is 88.09%.

❗ Current head 0113a3a differs from pull request most recent head 9f5394b. Consider uploading reports for the commit 9f5394b to get more accurate results

@@            Coverage Diff             @@
##           master    #1160      +/-   ##
==========================================
+ Coverage   81.27%   81.29%   +0.02%     
==========================================
  Files         107      107              
  Lines        9824     9862      +38     
==========================================
+ Hits         7984     8017      +33     
- Misses       1297     1299       +2     
- Partials      543      546       +3     
Impacted Files               Coverage Δ
rollout/replicaset.go        67.52% <82.75%> (+4.18%) ⬆️
rollout/sync.go              76.47% <100.00%> (+0.04%) ⬆️
utils/defaults/defaults.go   88.13% <100.00%> (+2.42%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de0d8e0...9f5394b.

@huikang huikang force-pushed the feat-scaledownonAbort branch 3 times, most recently from 878d26b to ff9bb55 Compare May 12, 2021 03:49
@huikang huikang marked this pull request as ready for review May 12, 2021 03:49
@huikang huikang force-pushed the feat-scaledownonAbort branch 2 times, most recently from 5ccccc4 to 5710ac4 Compare May 13, 2021 16:19
@huikang (Member, Author) commented May 17, 2021

Hi, @jessesuen , could you please review this PR at your convenience? Thanks.

@jessesuen (Member) commented:

Sure, I'll take a look after the v1.0 release this week. This will need to wait for v1.1, since v1.0 is about to be released.

@huikang huikang force-pushed the feat-scaledownonAbort branch 3 times, most recently from 95e0b6c to 56a747c Compare May 24, 2021 16:26
@huikang (Member, Author) commented May 24, 2021

Hi, @jessesuen , I added two test cases in e2e. Please take a look when you can. Thanks.

@jessesuen (Member) commented May 28, 2021

Had a chance to look at this. The use case is valid, but I'd like to tweak the spec a bit to make this fit with the current design. Some considerations:

  1. If we scale down the canary/desired ReplicaSet immediately upon the abort, there is a possibility of 500 routing errors, because mesh/ingress controllers can take time to shift traffic. Usually this happens within seconds, but in general you want to give it 30s minimum.

  2. I think the time frame in which to scale down the canary after abort should be configurable. A simple boolean would make retrying an aborted rollout go through the whole scale up process and pod provisioning phase.

Therefore, I am suggesting the following change:

Instead of a scaleDownOnAbort: true boolean, I think this should be a duration field, which acts similarly to spec.strategy.canary.scaleDownDelaySeconds:

spec:
  strategy:
    canary:
      abortScaleDownDelay: 1h

When a rollout aborts, it would immediately attach a deadline timestamp in the future to the ReplicaSet. At that timestamp, the rollout controller will scale down the canary stack. During that delay, traffic would have already shifted back to the stable stack.

Having a delay also allows Rollouts to be retried much easier/quicker. If the Rollout is retried within the abortScaleDownDelay, then there is no time wasted provisioning pods because the ReplicaSet of the canary is already running.
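
To make the mechanism concrete, here is a minimal Go sketch of the proposal, assuming a hypothetical "scale-down-deadline" annotation on the canary ReplicaSet. The annotation key, function names, and timestamp format are illustrative only, not the controller's actual implementation:

package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
)

// Hypothetical annotation key used only for this sketch; the real key may differ.
const scaleDownDeadlineAnnotation = "scale-down-deadline"

// markForScaleDown stamps the aborted ReplicaSet with a deadline in the future.
func markForScaleDown(rs *appsv1.ReplicaSet, delay time.Duration) {
	if rs.Annotations == nil {
		rs.Annotations = map[string]string{}
	}
	rs.Annotations[scaleDownDeadlineAnnotation] = time.Now().Add(delay).UTC().Format(time.RFC3339)
}

// shouldScaleDown reports whether the deadline has passed; until then the controller
// leaves the canary up so traffic can shift back to the stable stack first.
func shouldScaleDown(rs *appsv1.ReplicaSet, now time.Time) bool {
	deadline, err := time.Parse(time.RFC3339, rs.Annotations[scaleDownDeadlineAnnotation])
	return err == nil && now.After(deadline)
}

func main() {
	rs := &appsv1.ReplicaSet{}
	markForScaleDown(rs, time.Hour)
	fmt.Println(shouldScaleDown(rs, time.Now())) // false until the hour elapses
}

Retrying the rollout before the deadline would simply remove the annotation and continue using the still-running ReplicaSet.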

@huikang (Member, Author) commented Jun 1, 2021

@jessesuen, thanks for detailing the proposed design. I agree that having a scale-down delay duration would improve the traffic-shifting case. Following your canary example, I think the spec for blue-green would be:

spec:
  strategy:
    blueGreen:
      abortScaleDownDelay: 30s

Is my understanding correct?

@huikang huikang force-pushed the feat-scaledownonAbort branch from 56a747c to c7ec16f Compare June 1, 2021 14:17
@huikang huikang marked this pull request as draft June 2, 2021 01:29
@huikang huikang force-pushed the feat-scaledownonAbort branch 2 times, most recently from bc057be to 0a3a6eb Compare June 2, 2021 01:45
@huikang (Member, Author) left a comment:

@ngms06, thanks for reviewing. The PR is updated.

@@ -115,6 +115,10 @@ spec:
# down. Defaults to nil
scaleDownDelayRevisionLimit: 2

# Adds a delay before scaling down the preview replicaset if update is
@huikang (Member, Author) replied:

Fixed

now := metav1.Now()
scaleDownAt := metav1.NewTime(scaleDownAtTime)
if scaleDownAt.After(now.Time) {
c.log.Infof("RS '%s' has not reached the scaleDownTime", c.newRS.Name)
@huikang (Member, Author) commented Jun 17, 2021:

Since line 129 logs the scale down time scaleDownAtStr, I think we don't need to replicate it here.

@jessesuen jessesuen changed the title from feat: scaledown on abort to feat: introduce abortScaleDownDelaySeconds to scale down preview/canary upon abort Jun 24, 2021
@jessesuen (Member) commented Jun 24, 2021

@huikang I'm finally reviewing this now but during my review, I realized some of my assumptions of v1.0 behavior were not correct. It appears we already scale down canary upon abort, but a bug is preventing this from happening when used with setCanaryScale.

So this PR will be our opportunity to make our behavior consistent across all the strategies (blue-green, canary, canary w/ traffic routing, etc.).

Let me test some behavior with the various options and document what the desired behavior should/will be with all the options:

  • abortScaleDownDelaySeconds = nil (default behavior)
  • abortScaleDownDelaySeconds = 0
  • abortScaleDownDelaySeconds > 0

@jessesuen jessesuen changed the title from feat: introduce abortScaleDownDelaySeconds to scale down preview/canary upon abort to feat: introduce abortScaleDownDelaySeconds to control scale down of preview/canary upon abort Jun 24, 2021
@jessesuen (Member) left a comment:

Sorry for the delay in reviewing this. So here is the matrix of behavior for when abort happens, and what I feel the v1.1 behavior should be:

| strategy | v1.0 behavior | abortScaleDownDelaySeconds | desired v1.1 behavior |
|---|---|---|---|
| blue-green | does not scale down | nil | scales down after 30 seconds |
| blue-green | does not scale down | 0 | does not scale down |
| blue-green | does not scale down | N | scales down after N seconds |
| basic canary | rolling update back to stable | N/A | rolling update back to stable |
| canary w/ traffic routing | scales down immediately | nil | scales down after 30 seconds |
| canary w/ traffic routing | scales down immediately | 0 | does not scale down |
| canary w/ traffic routing | scales down immediately | N | scales down after N seconds |
| canary w/ traffic routing + setCanaryScale | does not scale down (bug) | * | should behave like above |

@huikang - I think v1.1 should make the behavior consistent across all strategies. Basically, upon an abort, we should scale down the new replicaset for all strategies. Users can then choose to leave the new replicaset scaled up indefinitely by setting abortScaleDownDelaySeconds to 0, or adjust the value to something larger (or smaller).
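
As a concrete illustration of that field (the field name comes from this PR; the 600-second value and comments are only an example, not a recommended setting), a rollout spec might look like:

spec:
  strategy:
    canary:
      # keep an aborted canary ReplicaSet running for 10 minutes before scaling it down;
      # 0 would leave it running indefinitely, nil falls back to the 30-second default
      abortScaleDownDelaySeconds: 600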

Comment on lines 302 to 305
# Add a delay before scaling down the canary pods when update
# is aborted for canary strategy using replicas of setCanaryScale.
# Default is 0, meaning canary pods are not scaled down.
AbortScaleDownDelaySeconds: 30
@jessesuen (Member) commented:

abortScaleDownDelaySeconds should be lowercased

Comment on lines 289 to 291
// AbortScaleDownDelaySeconds adds a delay before scaling down the canary pods when update
// is aborted for canary strategy using replicas of setCanaryScale.
// Default is 0, meaning canary pods are not scaled down.
@jessesuen (Member) commented:

Let's add a comment stating that this field is only used with traffic routing and not applicable for basic canary.
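
A rough sketch of what the updated doc comment could look like; the *int32 type and JSON tag shown here are assumptions for illustration, not necessarily the exact field definition in the PR:

// AbortScaleDownDelaySeconds adds a delay before scaling down the canary pods when the
// update is aborted. It is only used with traffic routing and is not applicable to the
// basic canary strategy. A value of 0 means the canary pods are not scaled down.
AbortScaleDownDelaySeconds *int32 `json:"abortScaleDownDelaySeconds,omitempty"`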

@@ -120,10 +119,52 @@ func (c *rolloutContext) reconcileNewReplicaSet() (bool, error) {
if err != nil {
return false, err
}

abortScaleDownDelaySeconds := time.Duration(defaults.GetAbortScaleDownDelaySecondsOrDefault(c.rollout))
@jessesuen (Member) commented:

Let's move this inside the else clause since it's only used there.
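
A self-contained toy sketch of the restructuring being suggested; the types and helper below are stand-ins for the real controller code, used only to show the lookup living in the branch that needs it:

package main

import (
	"fmt"
	"time"
)

// Toy stand-ins for the rollout context and defaults helper referenced in the diff.
type rolloutContext struct{ aborted bool }

func getAbortScaleDownDelaySecondsOrDefault() int32 { return 30 }

func reconcileNewReplicaSet(c rolloutContext) {
	if !c.aborted {
		// normal path: the abort delay is never needed here
		fmt.Println("reconciling new ReplicaSet")
		return
	}
	// computed only in the branch that uses it, as the review suggests
	delay := time.Duration(getAbortScaleDownDelaySecondsOrDefault()) * time.Second
	fmt.Println("scheduling scale-down of the aborted ReplicaSet after", delay)
}

func main() {
	reconcileNewReplicaSet(rolloutContext{aborted: true})
}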

Hui Kang added 3 commits July 15, 2021 22:16
- if an update is aborted, the preview RS and canary RS will be
  scaled down.
- scaleDownOnAbort is false by default
- updated doc
- added e2e test

Signed-off-by: Hui Kang <[email protected]>
@huikang huikang force-pushed the feat-scaledownonAbort branch 2 times, most recently from eabcf84 to 90aca07 Compare July 16, 2021 02:46
@huikang (Member, Author) commented Jul 16, 2021

> Sorry for the delay in reviewing this. So here is the matrix of behavior for when abort happens, and what I feel the v1.1 behavior should be: […]

@jessesuen, thanks for summarizing. I will update the PR to reflect the desired behavior as listed in the table. In addition, I will update the doc to include the table.

- doc: scaledown policy summary

Signed-off-by: Hui Kang <[email protected]>
@huikang huikang force-pushed the feat-scaledownonAbort branch 8 times, most recently from 9c9ad12 to 549fc14 Compare July 17, 2021 03:13
- bg e2e: scale down aborted pod in 30 sec; so check the delay
  annotation
- canary e2e: canary pod is kept running when
  abortScaleDownDelaySeconds=0

Signed-off-by: Hui Kang <[email protected]>
@huikang huikang force-pushed the feat-scaledownonAbort branch from 549fc14 to 9f5394b Compare July 17, 2021 03:35
@sonarqubecloud (bot) commented:

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0)
Vulnerabilities: A (0)
Security Hotspots: A (0)
Code Smells: A (2)
No coverage information
Duplication: 0.0%

@huikang (Member, Author) commented Jul 17, 2021

Hi @jessesuen, the PR has been updated based on the desired v1.1 behavior. Please review it at your convenience. Thanks.

@jessesuen (Member) left a comment:

Great work!

@jessesuen jessesuen merged commit 68cbef9 into argoproj:master Jul 19, 2021
@dudadornelles (Contributor) commented:

Was this released? Really looking forward to using this feature, thanks for the hard work folks!

@alexandresavicki commented:

> Was this released? Really looking forward to using this feature, thanks for the hard work folks!

I'm trying to use this parameter with version 1.0.4, but I'm getting this error:

# rollouts.argoproj.io "dummy-app" was not valid:
# * : Invalid value: "The edited file failed validation": ValidationError(Rollout.spec.strategy.canary): unknown field "abortScaleDownDelaySeconds" in io.argoproj.v1alpha1.Rollout.spec.strategy.canary

@jessesuen (Member) commented:

Not yet. It's a v1.1 feature. You can try it out by installing the manifests from master, which use the :latest images.

@alexandresavicki commented Aug 25, 2021

@jessesuen Thanks a lot for your reply. One more question: I'm looking at this parameter to avoid downtime when the canary fails for any reason (e.g., an Analysis step does not match the specified conditions), so the canary is aborted and a rollback starts. I understand the rollback process is very aggressive, and since traffic shifting is slow it can cause downtime in my application.
Is there another way to avoid this scenario?
