
feat: introduce abortScaleDownDelaySeconds to control scale down of preview/canary upon abort #1160

Merged — 5 commits, Jul 19, 2021

Conversation

@huikang (Member) commented May 11, 2021

Addresses #941 and #793

@huikang huikang force-pushed the feat-scaledownonAbort branch from 9f8bc40 to f127146 Compare May 11, 2021 22:43
@huikang huikang marked this pull request as draft May 11, 2021 22:44
@huikang huikang force-pushed the feat-scaledownonAbort branch 8 times, most recently from 4966f26 to 7d7278e Compare May 11, 2021 23:33
@huikang huikang changed the title from Feat scaledownon abort to feat: scaledown on abort May 11, 2021
@codecov (codecov bot) commented May 11, 2021

Codecov Report

Merging #1160 (0113a3a) into master (de0d8e0) will increase coverage by 0.02%.
The diff coverage is 88.09%.

❗ Current head 0113a3a differs from pull request most recent head 9f5394b. Consider uploading reports for the commit 9f5394b to get more accurate results

@@            Coverage Diff             @@
##           master    #1160      +/-   ##
==========================================
+ Coverage   81.27%   81.29%   +0.02%     
==========================================
  Files         107      107              
  Lines        9824     9862      +38     
==========================================
+ Hits         7984     8017      +33     
- Misses       1297     1299       +2     
- Partials      543      546       +3     
Impacted Files               Coverage Δ
rollout/replicaset.go        67.52% <82.75%> (+4.18%) ⬆️
rollout/sync.go              76.47% <100.00%> (+0.04%) ⬆️
utils/defaults/defaults.go   88.13% <100.00%> (+2.42%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de0d8e0...9f5394b.

@huikang huikang force-pushed the feat-scaledownonAbort branch 3 times, most recently from 878d26b to ff9bb55 Compare May 12, 2021 03:49
@huikang huikang marked this pull request as ready for review May 12, 2021 03:49
@huikang huikang force-pushed the feat-scaledownonAbort branch 2 times, most recently from 5ccccc4 to 5710ac4 Compare May 13, 2021 16:19
@huikang (Member, Author) commented May 17, 2021

Hi, @jessesuen , could you please review this PR at your convenience? Thanks.

@jessesuen (Member) commented:

Sure, I'll take a look after the v1.0 release this week. This will need to wait for v1.1, since v1.0 is about to be released.

@huikang huikang force-pushed the feat-scaledownonAbort branch 3 times, most recently from 95e0b6c to 56a747c Compare May 24, 2021 16:26
@huikang (Member, Author) commented May 24, 2021

Hi, @jessesuen , I added two test cases in e2e. Please take a look when you can. Thanks.

@jessesuen (Member) commented May 28, 2021

Had a chance to look at this. The use case is valid, but I'd like to tweak the spec a bit to make this fit with the current design. Some considerations:

  1. If we scale down the canary/desired ReplicaSet immediately upon the abort, there is a possibility of 500 routing errors, because mesh/ingress controllers can take time to shift traffic. Usually this happens within seconds, but in general you want to give it 30s minimum.

  2. I think the time frame in which to scale down the canary after abort should be configurable. A simple boolean would make retrying an aborted rollout go through the whole scale up process and pod provisioning phase.

Therefore, I am suggesting the following change:

Instead of a scaleDownOnAbort: true boolean, I think this should be a duration field, which acts similarly to spec.strategy.canary.scaleDownDelaySeconds:

spec:
  strategy:
    canary:
      abortScaleDownDelay: 1h

When a rollout aborts, it would immediately attach a deadline timestamp in the future to the ReplicaSet. At that timestamp, the rollout controller will scale down the canary stack. During that delay, traffic would have already shifted back to the stable stack.

Having a delay also allows Rollouts to be retried much easier/quicker. If the Rollout is retried within the abortScaleDownDelay, then there is no time wasted provisioning pods because the ReplicaSet of the canary is already running.
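
To make the mechanism concrete, here is a minimal Go sketch of the proposal, assuming a hypothetical "scale-down-deadline" annotation on the canary ReplicaSet. The annotation key, function names, and timestamp format are illustrative only, not the controller's actual implementation:

package main

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
)

// Hypothetical annotation key used only for this sketch; the real key may differ.
const scaleDownDeadlineAnnotation = "scale-down-deadline"

// markForScaleDown stamps the aborted ReplicaSet with a deadline in the future.
func markForScaleDown(rs *appsv1.ReplicaSet, delay time.Duration) {
	if rs.Annotations == nil {
		rs.Annotations = map[string]string{}
	}
	rs.Annotations[scaleDownDeadlineAnnotation] = time.Now().Add(delay).UTC().Format(time.RFC3339)
}

// shouldScaleDown reports whether the deadline has passed; until then the controller
// leaves the canary up so traffic can shift back to the stable stack first.
func shouldScaleDown(rs *appsv1.ReplicaSet, now time.Time) bool {
	deadline, err := time.Parse(time.RFC3339, rs.Annotations[scaleDownDeadlineAnnotation])
	return err == nil && now.After(deadline)
}

func main() {
	rs := &appsv1.ReplicaSet{}
	markForScaleDown(rs, time.Hour)
	fmt.Println(shouldScaleDown(rs, time.Now())) // false until the hour elapses
}

Retrying the rollout before the deadline would simply remove the annotation and continue using the still-running ReplicaSet.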

@huikang (Member, Author) commented Jun 1, 2021

@jessesuen, thanks for detailing the proposed design. I agree that having a scale-down delay duration would improve the traffic-shifting case. Following your canary example, I think the spec for blue-green would be:

spec:
  strategy:
    blueGreen:
      abortScaleDownDelay: 30s

Is my understanding correct?

@huikang huikang force-pushed the feat-scaledownonAbort branch from 56a747c to c7ec16f Compare June 1, 2021 14:17
@huikang huikang marked this pull request as draft June 2, 2021 01:29
@huikang huikang force-pushed the feat-scaledownonAbort branch 2 times, most recently from bc057be to 0a3a6eb Compare June 2, 2021 01:45
@huikang (Member, Author) left a comment:

@ngms06, thanks for reviewing. The PR is updated.

@@ -115,6 +115,10 @@ spec:
# down. Defaults to nil
scaleDownDelayRevisionLimit: 2

# Adds a delay before scaling down the preview replicaset if update is
@huikang (Member, Author) replied:

Fixed

now := metav1.Now()
scaleDownAt := metav1.NewTime(scaleDownAtTime)
if scaleDownAt.After(now.Time) {
c.log.Infof("RS '%s' has not reached the scaleDownTime", c.newRS.Name)
@huikang (Member, Author) commented Jun 17, 2021:

Since line 129 logs the scale down time scaleDownAtStr, I think we don't need to replicate it here.

@jessesuen jessesuen changed the title from feat: scaledown on abort to feat: introduce abortScaleDownDelaySeconds to scale down preview/canary upon abort Jun 24, 2021
@jessesuen (Member) commented Jun 24, 2021

@huikang I'm finally reviewing this now but during my review, I realized some of my assumptions of v1.0 behavior were not correct. It appears we already scale down canary upon abort, but a bug is preventing this from happening when used with setCanaryScale.

So this PR will be our opportunity to make our behavior consistent across all the strategies (blue-green, canary, canary w/ traffic routing, etc.).

Let me test some behavior with the various options and document what the desired behavior should/will be with all the options:

  • abortScaleDownDelaySeconds = nil (default behavior)
  • abortScaleDownDelaySeconds = 0
  • abortScaleDownDelaySeconds > 0

@jessesuen jessesuen changed the title from feat: introduce abortScaleDownDelaySeconds to scale down preview/canary upon abort to feat: introduce abortScaleDownDelaySeconds to control scale down of preview/canary upon abort Jun 24, 2021
@jessesuen (Member) left a comment:

Sorry for the delay in reviewing this. So here is the matrix of behavior for when abort happens, and what I feel the v1.1 behavior should be:

| strategy | v1.0 behavior | abortScaleDownDelaySeconds | desired v1.1 behavior |
|---|---|---|---|
| blue-green | does not scale down | nil | scales down after 30 seconds |
| blue-green | does not scale down | 0 | does not scale down |
| blue-green | does not scale down | N | scales down after N seconds |
| basic canary | rolling update back to stable | N/A | rolling update back to stable |
| canary w/ traffic routing | scales down immediately | nil | scales down after 30 seconds |
| canary w/ traffic routing | scales down immediately | 0 | does not scale down |
| canary w/ traffic routing | scales down immediately | N | scales down after N seconds |
| canary w/ traffic routing + setCanaryScale | does not scale down (bug) | * | should behave like above |

@huikang - I think v1.1 should make the behavior consistent across all strategies. Basically, upon an abort, we should scale down the new replicaset for all strategies. Users can then choose to leave the new replicaset scaled up indefinitely by setting abortScaleDownDelaySeconds to 0, or adjust the value to something larger (or smaller).
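
As a concrete illustration of that field (the field name comes from this PR; the 600-second value and comments are only an example, not a recommended setting), a rollout spec might look like:

spec:
  strategy:
    canary:
      # keep an aborted canary ReplicaSet running for 10 minutes before scaling it down;
      # 0 would leave it running indefinitely, nil falls back to the 30-second default
      abortScaleDownDelaySeconds: 600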

Comment on lines 302 to 305
# Add a delay before scaling down the canary pods when update
# is aborted for canary strategy using replicas of setCanaryScale.
# Default is 0, meaning canary pods are not scaled down.
AbortScaleDownDelaySeconds: 30
@jessesuen (Member) commented:

abortScaleDownDelaySeconds should be lowercased

Comment on lines 289 to 291
// AbortScaleDownDelaySeconds adds a delay before scaling down the canary pods when update
// is aborted for canary strategy using replicas of setCanaryScale.
// Default is 0, meaning canary pods are not scaled down.
@jessesuen (Member) commented:

Let's add a comment stating that this field is only used with traffic routing and not applicable for basic canary.
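
A rough sketch of what the updated doc comment could look like; the *int32 type and JSON tag shown here are assumptions for illustration, not necessarily the exact field definition in the PR:

// AbortScaleDownDelaySeconds adds a delay before scaling down the canary pods when the
// update is aborted. It is only used with traffic routing and is not applicable to the
// basic canary strategy. A value of 0 means the canary pods are not scaled down.
AbortScaleDownDelaySeconds *int32 `json:"abortScaleDownDelaySeconds,omitempty"`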

@@ -120,10 +119,52 @@ func (c *rolloutContext) reconcileNewReplicaSet() (bool, error) {
if err != nil {
return false, err
}

abortScaleDownDelaySeconds := time.Duration(defaults.GetAbortScaleDownDelaySecondsOrDefault(c.rollout))
@jessesuen (Member) commented:

Let's move this inside the else clause since it's only used there.
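
A self-contained toy sketch of the restructuring being suggested; the types and helper below are stand-ins for the real controller code, used only to show the lookup living in the branch that needs it:

package main

import (
	"fmt"
	"time"
)

// Toy stand-ins for the rollout context and defaults helper referenced in the diff.
type rolloutContext struct{ aborted bool }

func getAbortScaleDownDelaySecondsOrDefault() int32 { return 30 }

func reconcileNewReplicaSet(c rolloutContext) {
	if !c.aborted {
		// normal path: the abort delay is never needed here
		fmt.Println("reconciling new ReplicaSet")
		return
	}
	// computed only in the branch that uses it, as the review suggests
	delay := time.Duration(getAbortScaleDownDelaySecondsOrDefault()) * time.Second
	fmt.Println("scheduling scale-down of the aborted ReplicaSet after", delay)
}

func main() {
	reconcileNewReplicaSet(rolloutContext{aborted: true})
}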

Hui Kang added 3 commits July 15, 2021 22:16
- if an update is aborted, the preview RS and canary RS will be
  scaled down.
- scaleDownOnAbort is false by default
- updated doc
- added e2e test

Signed-off-by: Hui Kang <[email protected]>
@huikang huikang force-pushed the feat-scaledownonAbort branch 2 times, most recently from eabcf84 to 90aca07 Compare July 16, 2021 02:46
@huikang (Member, Author) commented Jul 16, 2021

> Sorry for the delay in reviewing this. So here is the matrix of behavior for when abort happens, and what I feel the v1.1 behavior should be: […]

@jessesuen, thanks for summarizing. I will update the PR to reflect the desired behavior as listed in the table. In addition, I will update the doc to include the table.

- doc: scaledown policy summary

Signed-off-by: Hui Kang <[email protected]>
@huikang huikang force-pushed the feat-scaledownonAbort branch 8 times, most recently from 9c9ad12 to 549fc14 Compare July 17, 2021 03:13
- bg e2e: scale down aborted pod in 30 sec; so check the delay
  annotation
- canary e2e: canary pod is kept running when
  abortScaleDownDelaySeconds=0

Signed-off-by: Hui Kang <[email protected]>
@huikang huikang force-pushed the feat-scaledownonAbort branch from 549fc14 to 9f5394b Compare July 17, 2021 03:35
@sonarqubecloud (bot) commented:

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0)
Vulnerabilities: A (0)
Security Hotspots: A (0)
Code Smells: A (2)
No coverage information
Duplication: 0.0%

@huikang (Member, Author) commented Jul 17, 2021

Hi @jessesuen, the PR has been updated based on the desired v1.1 behavior. Please review it at your convenience. Thanks.

@jessesuen (Member) left a comment:

Great work!

@jessesuen jessesuen merged commit 68cbef9 into argoproj:master Jul 19, 2021
@dudadornelles (Contributor) commented:

Was this released? Really looking forward to using this feature, thanks for the hard work folks!

@alexandresavicki commented:

> Was this released? Really looking forward to using this feature, thanks for the hard work folks!

I'm trying to use this parameter with version 1.0.4, but I'm getting this error:

# rollouts.argoproj.io "dummy-app" was not valid:
# * : Invalid value: "The edited file failed validation": ValidationError(Rollout.spec.strategy.canary): unknown field "abortScaleDownDelaySeconds" in io.argoproj.v1alpha1.Rollout.spec.strategy.canary

@jessesuen (Member) commented:

Not yet. It's a v1.1 feature. You can try it out by installing the manifests from master, which use the :latest images.

@alexandresavicki commented Aug 25, 2021

@jessesuen Thanks a lot for your reply. One more question: I'm looking at this parameter to avoid downtime when the canary fails for any reason (e.g., an Analysis step does not match the specified conditions), so the canary is aborted and a rollback starts. I understand the rollback process is very aggressive, and since traffic shifting is slow it can cause downtime in my application.
Is there another way to avoid this scenario?
