feat: introduce abortScaleDownDelaySeconds to control scale down of preview/canary upon abort #1160
Codecov Report
@@ Coverage Diff @@
## master #1160 +/- ##
==========================================
+ Coverage 81.27% 81.29% +0.02%
==========================================
Files 107 107
Lines 9824 9862 +38
==========================================
+ Hits 7984 8017 +33
- Misses 1297 1299 +2
- Partials 543 546 +3
Hi @jessesuen, could you please review this PR at your convenience? Thanks.
Sure, I'll take a look after the v1.0 release this week. This will need to wait for v1.1, since v1.0 is about to be released.
Hi @jessesuen, I added two e2e test cases. Please take a look when you can. Thanks.
Had a chance to look at this. The use case is valid, but I'd like to tweak the spec a bit to make this fit with the current design. Some considerations:
Therefore, I am suggesting the following change: Instead of a spec:
strategy:
  canary:
    abortScaleDownDelay: 1h
When a rollout aborts, it would immediately attach a deadline timestamp in the future to the ReplicaSet. At that timestamp, the rollout controller will scale down the canary stack. During that delay, traffic would have already shifted back to the stable stack. Having a delay also allows Rollouts to be retried much more easily and quickly. If the Rollout is retried within the abortScaleDownDelay, then no time is wasted provisioning pods, because the canary's ReplicaSet is already running.
@jessesuen, thanks for detailing the proposed design. I agree that having a scale-down delay duration would improve the traffic shifting cases. Following your canary example, I think the spec for blue-green would be
Is my understanding correct? |
@ngms06, thanks for reviewing. The PR is updated.
docs/features/specification.md
Outdated
@@ -115,6 +115,10 @@ spec:
# down. Defaults to nil
scaleDownDelayRevisionLimit: 2

# Adds a delay before scaling down the p preview replicaset if update is
Fixed
now := metav1.Now()
scaleDownAt := metav1.NewTime(scaleDownAtTime)
if scaleDownAt.After(now.Time) {
	c.log.Infof("RS '%s' has not reached the scaleDownTime", c.newRS.Name)
Since line 129 logs the scale down time scaleDownAtStr, I think we don't need to replicate it here.
@huikang I'm finally reviewing this now but during my review, I realized some of my assumptions of v1.0 behavior were not correct. It appears we already scale down canary upon abort, but a bug is preventing this from happening when used with So this PR will be our opportunity to get some consistency of our behavior between all the strategies (blue-green, canary, canary w/ traffic routing, etc...). Let me test some behavior with the various options and document what the desired behavior should/will be with all the options:
Sorry for the delay in reviewing this. So here is the matrix of behavior for when abort happens, and what I feel the v1.1 behavior should be:
| strategy | v1.0 behavior | abortScaleDownDelaySeconds | desired v1.1 behavior |
|---|---|---|---|
| blue-green | does not scale down | nil | scales down after 30 seconds |
| blue-green | does not scale down | 0 | does not scale down |
| blue-green | does not scale down | N | scales down after N seconds |
| basic canary | rolling update back to stable | N/A | rolling update back to stable |
| canary w/ traffic routing | scales down immediately | nil | scales down after 30 seconds |
| canary w/ traffic routing | scales down immediately | 0 | does not scale down |
| canary w/ traffic routing | scales down immediately | N | scales down after N seconds |
| canary w/ traffic routing + setCanaryScale | does not scale down (bug) | * | should behave like above |
@huikang - I think v1.1 should make the behavior consistent across all strategies. Basically, upon an abort, we should scale down the new replicaset for all strategies. Users can then choose to leave the new replicaset scaled up indefinitely by setting abortScaleDownDelaySeconds to 0, or adjust the value to something larger (or smaller).
docs/features/specification.md
Outdated
# Add a delay before scaling down the canary pods when update
# is aborted for canary strategy using replicas of setCanaryScale.
# Default is 0, meaning canary pods are not scaled down.
AbortScaleDownDelaySeconds: 30
abortScaleDownDelaySeconds should be lowercased
pkg/apis/rollouts/v1alpha1/types.go
Outdated
// AbortScaleDownDelaySeconds adds a delay before scaling down the canary pods when update
// is aborted for canary strategy using replicas of setCanaryScale.
// Default is 0, meaning canary pods are not scaled down.
Let's add a comment stating that this field is only used with traffic routing and not applicable for basic canary.
rollout/replicaset.go
Outdated
@@ -120,10 +119,52 @@ func (c *rolloutContext) reconcileNewReplicaSet() (bool, error) {
	if err != nil {
		return false, err
	}

	abortScaleDownDelaySeconds := time.Duration(defaults.GetAbortScaleDownDelaySecondsOrDefault(c.rollout))
Let's move this inside the else clause since it's only used there.
- if an update is aborted, the preview RS and canary RS will be aborted.
- scaleDownOnAbort is false by default
- updated doc
- added e2e test

Signed-off-by: Hui Kang <[email protected]>
Signed-off-by: Hui Kang <[email protected]>
Signed-off-by: Hui Kang <[email protected]>
@jessesuen, thanks for summarizing. I will update the PR to reflect the described behavior as listed in the table. In addition, I will update the doc to include the table.
- doc: scaledown policy summary Signed-off-by: Hui Kang <[email protected]>
- bg e2e: scale down aborted pod in 30 sec; so check the delay annotation
- canary e2e: canary pod is kept running when abortScaleDownDelaySeconds=0

Signed-off-by: Hui Kang <[email protected]>
Kudos, SonarCloud Quality Gate passed!
Hi @jessesuen, the PR has been updated based on the desired v1.1 behavior. Please review at your convenience. Thanks.
Great work!
Was this released? Really looking forward to using this feature. Thanks for the hard work, folks!
I'm trying to use this parameter with version 1.0.4, but I'm getting an error:
|
Not yet. It's a v1.1 feature. You can try the feature out by installing the manifests from master, which use the |
@jessesuen Thanks a lot for your reply. Just one more question: I'm looking at this parameter as a way to avoid downtime when a canary fails for any reason (e.g., an analysis step does not meet the specified conditions), so that the canary is aborted and a rollback starts. As I understand it, the rollback is very aggressive while the traffic shift is slow, which can cause downtime in my application.
Address #941 #793