Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce abortScaleDownDelaySeconds to control scale down of preview/canary upon abort #1160

Merged
merged 5 commits into from
Jul 19, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions docs/features/scaledown-aborted-rs.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Scaledown New Replicaset on Aborted Rollout

Upon an aborted update, we may scale down the new replicaset for all strategies. Users can then choose to leave the new replicaset scaled up indefinitely by setting abortScaleDownDelaySeconds to 0, or adjust the value to something larger (or smaller).

The following table summarizes the behavior under combinations of rollout strategy and `abortScaleDownDelaySeconds`. Note that `abortScaleDownDelaySeconds` is not applicable to argo-rollouts v1.0.
`abortScaleDownDelaySeconds = nil` is the default, which means in v1.1 across all rollout strategies, the new replicaset
is scaled down in 30 seconds on abort by default.

| strategy | v1.0 behavior | abortScaleDownDelaySeconds | v1.1 behavior |
|--------------------------------------------:|:-----------------------------:|:--------------------------:|:-----------------------------:|
| blue-green | does not scale down | nil | scales down after 30 seconds |
| blue-green | does not scale down | 0 | does not scale down |
| blue-green | does not scale down | N | scales down after N seconds |
| basic canary | rolling update back to stable | N/A | rolling update back to stable |
| canary w/ traffic routing | scales down immediately | nil | scales down after 30 seconds |
| canary w/ traffic routing | scales down immediately | 0 | does not scale down |
| canary w/ traffic routing | scales down immediately | N | scales down after N seconds |
| canary w/ traffic routing + setCanaryScale | does not scale down (bug) | * | should behave like canary w/ traffic routing |
9 changes: 9 additions & 0 deletions docs/features/specification.md
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,10 @@ spec:
# down. Defaults to nil
scaleDownDelayRevisionLimit: 2

# Add a delay in second before scaling down the preview replicaset
# if update is aborted. 0 means not to scale down. Default is 30 second
abortScaleDownDelaySeconds: 30

# Anti Affinity configuration between desired and previous ReplicaSet.
# Only one must be specified
antiAffinity:
Expand Down Expand Up @@ -295,6 +299,11 @@ spec:
rootService: root-svc # optional
trafficSplitName: rollout-example-traffic-split # optional

# Add a delay in second before scaling down the canary pods when update
# is aborted for canary strategy with traffic routing (not applicable for basic canary).
# 0 means canary pods are not scaled down. Default is 30 seconds.
abortScaleDownDelaySeconds: 30

status:
pauseConditions:
- reason: StepPause
Expand Down
6 changes: 6 additions & 0 deletions manifests/crds/rollout-crd.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,9 @@ spec:
properties:
blueGreen:
properties:
abortScaleDownDelaySeconds:
format: int32
type: integer
activeMetadata:
properties:
annotations:
Expand Down Expand Up @@ -224,6 +227,9 @@ spec:
type: object
canary:
properties:
abortScaleDownDelaySeconds:
format: int32
type: integer
analysis:
properties:
args:
Expand Down
6 changes: 6 additions & 0 deletions manifests/install.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -9780,6 +9780,9 @@ spec:
properties:
blueGreen:
properties:
abortScaleDownDelaySeconds:
format: int32
type: integer
activeMetadata:
properties:
annotations:
Expand Down Expand Up @@ -9913,6 +9916,9 @@ spec:
type: object
canary:
properties:
abortScaleDownDelaySeconds:
format: int32
type: integer
analysis:
properties:
args:
Expand Down
6 changes: 6 additions & 0 deletions manifests/namespace-install.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -9780,6 +9780,9 @@ spec:
properties:
blueGreen:
properties:
abortScaleDownDelaySeconds:
format: int32
type: integer
activeMetadata:
properties:
annotations:
Expand Down Expand Up @@ -9913,6 +9916,9 @@ spec:
type: object
canary:
properties:
abortScaleDownDelaySeconds:
format: int32
type: integer
analysis:
properties:
args:
Expand Down
10 changes: 10 additions & 0 deletions pkg/apiclient/rollout/rollout.swagger.json
Original file line number Diff line number Diff line change
Expand Up @@ -662,6 +662,11 @@
"activeMetadata": {
"$ref": "#/definitions/github.jparrowsec.cn.argoproj.argo_rollouts.pkg.apis.rollouts.v1alpha1.PodTemplateMetadata",
"title": "ActiveMetadata specify labels and annotations which will be attached to the active pods for\nthe duration which they act as a active pod, and will be removed after"
},
"abortScaleDownDelaySeconds": {
"type": "integer",
"format": "int32",
"title": "AbortScaleDownDelaySeconds adds a delay in second before scaling down the preview replicaset\nif update is aborted. 0 means not to scale down.\nDefault is 30 second\n+optional"
}
},
"title": "BlueGreenStrategy defines parameters for Blue Green deployment"
Expand Down Expand Up @@ -766,6 +771,11 @@
"type": "integer",
"format": "int32",
"title": "ScaleDownDelayRevisionLimit limits the number of old RS that can run at one time before getting scaled down\n+optional"
},
"abortScaleDownDelaySeconds": {
"type": "integer",
"format": "int32",
"title": "AbortScaleDownDelaySeconds adds a delay in second before scaling down the canary pods when update\nis aborted for canary strategy with traffic routing (not applicable for basic canary).\n0 means canary pods are not scaled down.\nDefault is 30 seconds.\n+optional"
}
},
"title": "CanaryStrategy defines parameters for a Replica Based Canary"
Expand Down
11 changes: 11 additions & 0 deletions pkg/apis/rollouts/v1alpha1/types.go
Original file line number Diff line number Diff line change
Expand Up @@ -202,6 +202,11 @@ type BlueGreenStrategy struct {
// ActiveMetadata specify labels and annotations which will be attached to the active pods for
// the duration which they act as a active pod, and will be removed after
ActiveMetadata *PodTemplateMetadata `json:"activeMetadata,omitempty" protobuf:"bytes,13,opt,name=activeMetadata"`
// AbortScaleDownDelaySeconds adds a delay in second before scaling down the preview replicaset
// if update is aborted. 0 means not to scale down.
// Default is 30 second
// +optional
AbortScaleDownDelaySeconds *int32 `json:"abortScaleDownDelaySeconds,omitempty" protobuf:"varint,14,opt,name=abortScaleDownDelaySeconds"`
}

// AntiAffinity defines which inter-pod scheduling rule to use for anti-affinity injection
Expand Down Expand Up @@ -282,6 +287,12 @@ type CanaryStrategy struct {
// ScaleDownDelayRevisionLimit limits the number of old RS that can run at one time before getting scaled down
// +optional
ScaleDownDelayRevisionLimit *int32 `json:"scaleDownDelayRevisionLimit,omitempty" protobuf:"varint,12,opt,name=scaleDownDelayRevisionLimit"`
// AbortScaleDownDelaySeconds adds a delay in second before scaling down the canary pods when update
// is aborted for canary strategy with traffic routing (not applicable for basic canary).
// 0 means canary pods are not scaled down.
// Default is 30 seconds.
// +optional
AbortScaleDownDelaySeconds *int32 `json:"abortScaleDownDelaySeconds,omitempty" protobuf:"varint,13,opt,name=abortScaleDownDelaySeconds"`
}

// ALBTrafficRouting configuration for ALB ingress controller to control traffic routing
Expand Down
10 changes: 10 additions & 0 deletions pkg/apis/rollouts/v1alpha1/zz_generated.deepcopy.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 4 additions & 2 deletions rollout/bluegreen_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,9 +26,11 @@ var (

func newBlueGreenRollout(name string, replicas int, revisionHistoryLimit *int32, activeSvc string, previewSvc string) *v1alpha1.Rollout {
rollout := newRollout(name, replicas, revisionHistoryLimit, map[string]string{"foo": "bar"})
abortScaleDownDelaySeconds := int32(0)
rollout.Spec.Strategy.BlueGreen = &v1alpha1.BlueGreenStrategy{
ActiveService: activeSvc,
PreviewService: previewSvc,
ActiveService: activeSvc,
PreviewService: previewSvc,
AbortScaleDownDelaySeconds: &abortScaleDownDelaySeconds,
}
rollout.Status.CurrentStepHash = conditions.ComputeStepHash(rollout)
rollout.Status.CurrentPodHash = controller.ComputeHash(&rollout.Spec.Template, rollout.Status.CollisionCount)
Expand Down
51 changes: 46 additions & 5 deletions rollout/replicaset.go
Original file line number Diff line number Diff line change
Expand Up @@ -39,13 +39,12 @@ func (c *rolloutContext) removeScaleDownDelay(rs *appsv1.ReplicaSet) error {
}

// addScaleDownDelay injects the `scale-down-deadline` annotation to the ReplicaSet, or if
// scaleDownDelaySeconds is zero, removes it if it exists
func (c *rolloutContext) addScaleDownDelay(rs *appsv1.ReplicaSet) error {
// scaleDownDelaySeconds is zero, removes the annotation if it exists
func (c *rolloutContext) addScaleDownDelay(rs *appsv1.ReplicaSet, scaleDownDelaySeconds time.Duration) error {
if rs == nil {
return nil
}
ctx := context.TODO()
scaleDownDelaySeconds := time.Duration(defaults.GetScaleDownDelaySecondsOrDefault(c.rollout))
if scaleDownDelaySeconds == 0 {
// If scaledown deadline is zero, it means we need to remove any replicasets with the delay
// This might happen if we switch from canary with traffic routing to basic canary
Expand Down Expand Up @@ -91,11 +90,11 @@ func (c *Controller) getReplicaSetsForRollouts(r *v1alpha1.Rollout) ([]*appsv1.R
return cm.ClaimReplicaSets(rsList)
}

// removeScaleDownDelays removes the scale-down-deadline annotation from the new/stable ReplicaSets,
// removeScaleDownDeadlines removes the scale-down-deadline annotation from the new/stable ReplicaSets,
// in the event that we moved back to an older revision that is still within its scaleDownDelay.
func (c *rolloutContext) removeScaleDownDeadlines() error {
var toRemove []*appsv1.ReplicaSet
if c.newRS != nil {
if c.newRS != nil && !c.isScaleDownOnabort() {
toRemove = append(toRemove, c.newRS)
}
if c.stableRS != nil {
Expand All @@ -120,10 +119,52 @@ func (c *rolloutContext) reconcileNewReplicaSet() (bool, error) {
if err != nil {
return false, err
}

if c.isScaleDownOnabort() {
c.log.Infof("Scale down new rs '%s' on abort", c.newRS.Name)

// if the newRS has scale down annotation, check if it should be scaled down now
if scaleDownAtStr, ok := c.newRS.Annotations[v1alpha1.DefaultReplicaSetScaleDownDeadlineAnnotationKey]; ok {
c.log.Infof("New rs '%s' has scaledown deadline annotation: %s", c.newRS.Name, scaleDownAtStr)
scaleDownAtTime, err := time.Parse(time.RFC3339, scaleDownAtStr)
if err != nil {
c.log.Warnf("Unable to read scaleDownAt label on rs '%s'", c.newRS.Name)
} else {
now := metav1.Now()
scaleDownAt := metav1.NewTime(scaleDownAtTime)
if scaleDownAt.After(now.Time) {
c.log.Infof("RS '%s' has not reached the scaleDownTime", c.newRS.Name)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add scaleDownAt to log line?

Copy link
Member Author

@huikang huikang Jun 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since line 129 logs the scale down time scaleDownAtStr, I think we don't need to replicate it here.

remainingTime := scaleDownAt.Sub(now.Time)
if remainingTime < c.resyncPeriod {
c.enqueueRolloutAfter(c.rollout, remainingTime)
return false, nil
}
} else {
c.log.Infof("RS '%s' has reached the scaleDownTime", c.newRS.Name)
newReplicasCount = int32(0)
}
}
} else {
abortScaleDownDelaySeconds := time.Duration(defaults.GetAbortScaleDownDelaySecondsOrDefault(c.rollout))
err = c.addScaleDownDelay(c.newRS, abortScaleDownDelaySeconds)
if err != nil {
return false, err
}
}
}

scaled, _, err := c.scaleReplicaSetAndRecordEvent(c.newRS, newReplicasCount)
return scaled, err
}

func (c *rolloutContext) isScaleDownOnabort() bool {
abortScaleDownDelaySeconds := time.Duration(defaults.GetAbortScaleDownDelaySecondsOrDefault(c.rollout))
if c.pauseContext != nil && c.pauseContext.IsAborted() && abortScaleDownDelaySeconds > 0 {
return true
}
return false
}

// reconcileOtherReplicaSets reconciles "other" ReplicaSets.
// Other ReplicaSets are ReplicaSets are neither the new or stable (allRSs - newRS - stableRS)
func (c *rolloutContext) reconcileOtherReplicaSets() (bool, error) {
Expand Down
71 changes: 66 additions & 5 deletions rollout/replicaset_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -121,12 +121,16 @@ func TestGetReplicaSetsForRollouts(t *testing.T) {

func TestReconcileNewReplicaSet(t *testing.T) {
tests := []struct {
name string
rolloutReplicas int
newReplicas int
scaleExpected bool
expectedNewReplicas int
name string
rolloutReplicas int
newReplicas int
scaleExpected bool
abortScaleDownDelaySeconds int32
abortScaleDownAnnotated bool
abortScaleDownDelayPassed bool
expectedNewReplicas int
}{

{
name: "New Replica Set matches rollout replica: No scale",
rolloutReplicas: 10,
Expand All @@ -147,6 +151,38 @@ func TestReconcileNewReplicaSet(t *testing.T) {
scaleExpected: true,
expectedNewReplicas: 10,
},

{
name: "New Replica scaled down to 0: scale down on abort - deadline passed",
rolloutReplicas: 10,
newReplicas: 10,
// ScaleDownOnAbort: true,
abortScaleDownDelaySeconds: 5,
abortScaleDownAnnotated: true,
abortScaleDownDelayPassed: true,
scaleExpected: true,
expectedNewReplicas: 0,
},
{
name: "New Replica scaled down to 0: scale down on abort - deadline not passed",
rolloutReplicas: 10,
newReplicas: 8,
// ScaleDownOnAbort: true,
abortScaleDownDelaySeconds: 5,
abortScaleDownAnnotated: true,
abortScaleDownDelayPassed: false,
scaleExpected: false,
expectedNewReplicas: 0,
},
{
name: "New Replica scaled down to 0: scale down on abort - add annotation",
rolloutReplicas: 10,
newReplicas: 10,
abortScaleDownDelaySeconds: 5,
abortScaleDownAnnotated: false,
scaleExpected: false,
expectedNewReplicas: 0,
},
}
for i := range tests {
test := tests[i]
Expand All @@ -164,8 +200,33 @@ func TestReconcileNewReplicaSet(t *testing.T) {
argoprojclientset: &fake,
kubeclientset: &k8sfake,
recorder: record.NewFakeEventRecorder(),
resyncPeriod: 30 * time.Second,
},
}
roCtx.enqueueRolloutAfter = func(obj interface{}, duration time.Duration) {}
if test.abortScaleDownDelaySeconds > 0 {
roCtx.pauseContext = &pauseContext{
rollout: rollout,
}
rollout.Status.Abort = true
// rollout.Spec.ScaleDownOnAbort = true
rollout.Spec.Strategy = v1alpha1.RolloutStrategy{
BlueGreen: &v1alpha1.BlueGreenStrategy{
AbortScaleDownDelaySeconds: &test.abortScaleDownDelaySeconds,
},
}

if test.abortScaleDownAnnotated {
var deadline string
if test.abortScaleDownDelayPassed {
deadline = metav1.Now().Add(-time.Duration(test.abortScaleDownDelaySeconds) * time.Second).UTC().Format(time.RFC3339)
} else {
deadline = metav1.Now().Add(time.Duration(test.abortScaleDownDelaySeconds) * time.Second).UTC().Format(time.RFC3339)
}
roCtx.newRS.Annotations[v1alpha1.DefaultReplicaSetScaleDownDeadlineAnnotationKey] = deadline
}
}

scaled, err := roCtx.reconcileNewReplicaSet()
if err != nil {
t.Errorf("unexpected error: %v", err)
Expand Down
5 changes: 3 additions & 2 deletions rollout/sync.go
Original file line number Diff line number Diff line change
Expand Up @@ -390,7 +390,7 @@ func (c *rolloutContext) scaleReplicaSet(rs *appsv1.ReplicaSet, newScale int32,
oldScale := defaults.GetReplicasOrDefault(rs.Spec.Replicas)
*(rsCopy.Spec.Replicas) = newScale
annotations.SetReplicasAnnotations(rsCopy, rolloutReplicas)
if fullScaleDown {
if fullScaleDown && !c.isScaleDownOnabort() {
delete(rsCopy.Annotations, v1alpha1.DefaultReplicaSetScaleDownDeadlineAnnotationKey)
}
rs, err = c.kubeclientset.AppsV1().ReplicaSets(rsCopy.Namespace).Update(ctx, rsCopy, metav1.UpdateOptions{})
Expand Down Expand Up @@ -927,7 +927,8 @@ func (c *rolloutContext) promoteStable(newStatus *v1alpha1.RolloutStatus, reason
// Now that we've marked the desired RS as stable, start the scale-down countdown on the previous stable RS
previousStableRS, _ := replicasetutil.GetReplicaSetByTemplateHash(c.olderRSs, previousStableHash)
if replicasetutil.GetReplicaCountForReplicaSets([]*appsv1.ReplicaSet{previousStableRS}) > 0 {
err := c.addScaleDownDelay(previousStableRS)
scaleDownDelaySeconds := time.Duration(defaults.GetScaleDownDelaySecondsOrDefault(c.rollout))
err := c.addScaleDownDelay(previousStableRS, scaleDownDelaySeconds)
if err != nil {
return err
}
Expand Down
Loading