
fix: canary scaledown with maxsurge #1429

Merged
merged 17 commits on Aug 23, 2021
60 changes: 56 additions & 4 deletions test/e2e/canary_test.go
@@ -112,6 +112,58 @@ func (s *CanarySuite) TestRolloutScalingWhenPaused() {
ExpectCanaryStablePodCount(1, 3)
}


// TestRolloutWithMaxSurgeScalingDuringUpdate verifies behavior when scaling a rollout up/down in middle of update and with maxSurge 100%
func (s *CanarySuite) TestRolloutWithMaxSurgeScalingDuringUpdate() {
s.Given().
HealthyRollout(`
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: updatescaling
spec:
replicas: 4
strategy:
canary:
maxSurge: 100
selector:
matchLabels:
app: updatescaling
template:
metadata:
labels:
app: updatescaling
spec:
containers:
- name: updatescaling
image: nginx:1.19-alpine
resources:
requests:
memory: 16Mi
cpu: 1m`).
When().
PatchSpec(`
spec:
template:
spec:
containers:
- name: updatescaling
command: [/bad-command]`).
WaitForRolloutReplicas(7).
Then().
ExpectCanaryStablePodCount(4, 3).
When().
ScaleRollout(8).
WaitForRolloutReplicas(11).
Then().
ExpectCanaryStablePodCount(8, 3).
When().
ScaleRollout(4).
WaitForRolloutReplicas(7).
Then().
ExpectCanaryStablePodCount(4, 3)
}
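For orientation, the replica ceiling this test runs up against can be sketched as follows. This is a simplified model, not code from the repository; `maxReplicaCountAllowed` is a hypothetical helper, and it treats the manifest's `maxSurge: 100` as a percentage (per the test's own comment), rounding up as Kubernetes does for surge values:

```go
package main

import (
	"fmt"
	"math"
)

// maxReplicaCountAllowed models the surge ceiling: the total pod count across
// stable + canary may not exceed replicas + maxSurge, where a percentage
// maxSurge is resolved against spec.replicas and rounded up.
func maxReplicaCountAllowed(replicas, maxSurgePercent int32) int32 {
	surge := int32(math.Ceil(float64(replicas) * float64(maxSurgePercent) / 100.0))
	return replicas + surge
}

func main() {
	// With replicas: 4 and maxSurge: 100%, up to 8 pods may exist at once.
	fmt.Println(maxReplicaCountAllowed(4, 100)) // 8
	// After scaling the rollout to 8, the ceiling grows to 16.
	fmt.Println(maxReplicaCountAllowed(8, 100)) // 16
}
```

Under this model the test's observed totals stay within the ceiling: 7 pods (4 canary + 3 stable) at replicas=4, and 11 pods (8 canary + 3 stable) at replicas=8.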

// TestRolloutScalingDuringUpdate verifies behavior when scaling a rollout up/down in middle of update
func (s *CanarySuite) TestRolloutScalingDuringUpdate() {
s.Given().
@@ -156,12 +208,12 @@ spec:
ScaleRollout(8).
WaitForRolloutReplicas(10).
Then().
// NOTE: the numbers below may change in the future.
// See: https://github.com/argoproj/argo-rollouts/issues/738
ExpectCanaryStablePodCount(6, 4).
When().
ScaleRollout(4)
// WaitForRolloutReplicas(4) // this doesn't work yet (bug)
ScaleRollout(4).
WaitForRolloutReplicas(6).
Then().
ExpectCanaryStablePodCount(4, 2)
Member:

I still think this is not correct. According to the sequence of events:

  1. At T-0, we were at: 6 pods (2 canary, 4 stable, 4 available)
  2. At T-1, we had a scale-up event and went to: 10 pods (6 canary, 4 stable, 4 available).
  3. At T-2, we had a scale-down event and went back down to 6 pods (4 canary, 2 stable, 2 available)

Notice that at T-2, we have now violated maxUnavailable because we only have 2 pods available when the spec (maxUnavailable: 0) requires 4 to be available.

Although this is a day-0 issue, I think the proper fix may be to address #738. If we fix #738, then this problem may also be fixed.
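The violation described in the sequence above can be restated as a small check. `violatesMaxUnavailable` is a hypothetical helper, not rollout code; it only encodes "available pods must be at least replicas minus maxUnavailable":

```go
package main

import "fmt"

// violatesMaxUnavailable mirrors the reviewer's arithmetic: with
// spec.replicas = N, at least N - maxUnavailable pods must be available.
func violatesMaxUnavailable(specReplicas, maxUnavailable, availablePods int32) bool {
	minAvailable := specReplicas - maxUnavailable
	return availablePods < minAvailable
}

func main() {
	// T-0: 4 of the required 4 pods available -> no violation
	fmt.Println(violatesMaxUnavailable(4, 0, 4)) // false
	// T-2: only 2 pods available while the spec requires 4 -> violation
	fmt.Println(violatesMaxUnavailable(4, 0, 2)) // true
}
```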

@harikrongali (Contributor, Author) commented on Aug 20, 2021:

Yes, you are correct that it requires changes to adhere to both maxSurge and minAvailable. But the bug referred to in #738 is about fixing weights to be in line with Deployment behavior. What I am trying to fix here is that neither the stable nor the canary ReplicaSet should ever be scaled down completely; I will address that as part of this PR, and take #738 as a follow-up refactor of the whole calculation. (I don't want to refactor the entire block at once; I'd rather take it step by step based on user feedback.)

@harikrongali (Contributor, Author) commented on Aug 20, 2021:

Can you add label 1.2 to #738 ?

@jessesuen (Member) commented on Aug 21, 2021:

> But bug referred in #738 is to fix weights inline to deployment

The two are closely related, if not the same issue. The behavior you saw in #1415 is a manifestation of the same underlying problem, just with different maxSurge/minAvailable parameters.

This fix appears to address part of the symptom, whereas #738 is the root problem. I believe that if we had fixed #738, the incident in #1415 would not have happened.

I do recognize that this fix improves the current behavior, but I'm pretty sure the code in this PR will need to be thrown away in order to address #738.

@harikrongali (Contributor, Author) commented on Aug 21, 2021:

Agreed, #738 would be the fix we need to act on. With the current issue, there is a high chance of downtime when bad images are deployed. The given fix will improve things and resolve that issue. We need to refactor most of the calculation to fix #738, and that can go into 1.2 given the complexity.

}

// TestReduceWeightAndHonorMaxUnavailable verifies we honor maxUnavailable when decreasing weight or aborting
45 changes: 25 additions & 20 deletions utils/replicaset/canary.go
@@ -180,39 +180,44 @@ func CalculateReplicaCountsForCanary(rollout *v1alpha1.Rollout, newRS *appsv1.Re
// weight (e.g. we are aborting), then we can ignore pod availability of the canaryRS.
isIncreasing := newRS == nil || desiredNewRSReplicaCount >= *newRS.Spec.Replicas
replicasToScaleDown := GetReplicasForScaleDown(newRS, !isIncreasing) + GetReplicasForScaleDown(stableRS, isIncreasing)

if replicasToScaleDown <= minAvailableReplicaCount {
// Cannot scale down stableRS or newRS without going below min available replica count
return newRSReplicaCount, stableRSReplicaCount
}

scaleDownCount := replicasToScaleDown - minAvailableReplicaCount
// placeholder; recomputed below as maxReplicaCountAllowed minus the updated ReplicaSet count (to stay within maxSurge)
allowedAvailableReplicaCount := int32(0)
if !isIncreasing {
// Skip scalingDown Stable replicaSet when Canary availability is not taken into calculation for scaleDown
newRSReplicaCount, allowedAvailableReplicaCount = calculateScaleDownReplicaCount(newRS, desiredNewRSReplicaCount, maxReplicaCountAllowed, scaleDownCount, newRSReplicaCount)
if allowedAvailableReplicaCount < stableRSReplicaCount {
// ScaleDown stableAvailableSet to allowed value to adhere to maxSurge
stableRSReplicaCount = allowedAvailableReplicaCount
}
} else {
// Skip scalingDown canary replicaSet when StableSet availability is not taken into calculation for scaleDown
stableRSReplicaCount, allowedAvailableReplicaCount = calculateScaleDownReplicaCount(stableRS, desiredStableRSReplicaCount, maxReplicaCountAllowed, scaleDownCount, stableRSReplicaCount)
if allowedAvailableReplicaCount < newRSReplicaCount {
// ScaleDown canaryAllowedSet to allowed value to adhere to maxSurge
newRSReplicaCount = allowedAvailableReplicaCount
}
}
return newRSReplicaCount, stableRSReplicaCount
}

if newRS != nil && *newRS.Spec.Replicas > desiredNewRSReplicaCount {
func calculateScaleDownReplicaCount(replicaSet *appsv1.ReplicaSet, desireRSReplicaCount int32, maxReplicaCountAllowed int32, scaleDownCount int32, updatedReplicaCount int32) (int32, int32) {
if replicaSet != nil && *replicaSet.Spec.Replicas > desireRSReplicaCount {
// if the controller doesn't have to use every replica to achieve the desired count, it only scales down to the
// desired count.
if *newRS.Spec.Replicas-scaleDownCount < desiredNewRSReplicaCount {
newRSReplicaCount = desiredNewRSReplicaCount
// Calculating how many replicas were used to scale down to the desired count
scaleDownCount = scaleDownCount - (*newRS.Spec.Replicas - desiredNewRSReplicaCount)
if *replicaSet.Spec.Replicas-scaleDownCount < desireRSReplicaCount {
updatedReplicaCount = desireRSReplicaCount
} else {
// The controller is using every replica it can to get closer to desired state.
newRSReplicaCount = *newRS.Spec.Replicas - scaleDownCount
scaleDownCount = 0
updatedReplicaCount = *replicaSet.Spec.Replicas - scaleDownCount
}
}

if scaleStableRS && *stableRS.Spec.Replicas > desiredStableRSReplicaCount {
// This follows the same logic as scaling down the newRS except with the stableRS and it does not need to
// set the scaleDownCount again since it's not used again
if *stableRS.Spec.Replicas-scaleDownCount < desiredStableRSReplicaCount {
stableRSReplicaCount = desiredStableRSReplicaCount
} else {
stableRSReplicaCount = *stableRS.Spec.Replicas - scaleDownCount
}
}

return newRSReplicaCount, stableRSReplicaCount
return updatedReplicaCount, maxReplicaCountAllowed - updatedReplicaCount
}
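To follow the new helper's arithmetic in isolation, here is a restatement using plain int32 values in place of `*appsv1.ReplicaSet` — a sketch mirroring the logic above, not the actual function:

```go
package main

import "fmt"

// scaleDownReplicaCount restates the diff's helper without Kubernetes types.
// Given the ReplicaSet's current spec replicas, its desired count, the overall
// ceiling (maxReplicaCountAllowed), the scale-down budget, and the running
// count, it returns the updated count for this ReplicaSet plus the head-room
// left for the other ReplicaSet under the ceiling.
func scaleDownReplicaCount(specReplicas, desired, maxAllowed, scaleDownCount, current int32) (int32, int32) {
	if specReplicas > desired {
		if specReplicas-scaleDownCount < desired {
			// The budget overshoots the target; stop exactly at the desired count.
			current = desired
		} else {
			// Spend the whole budget moving toward the desired count.
			current = specReplicas - scaleDownCount
		}
	}
	return current, maxAllowed - current
}

func main() {
	// A canary at 8 replicas, desired 4, ceiling 7, budget of 3:
	// 8-3 = 5 >= 4, so it lands at 5, leaving 7-5 = 2 for the stable set.
	fmt.Println(scaleDownReplicaCount(8, 4, 7, 3, 8)) // 5 2
}
```

The second return value is what the caller compares against the other ReplicaSet's count, scaling that set down to the head-room when it exceeds it, which is how the branch above enforces maxSurge.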

// BeforeStartingStep checks if canary rollout is at the starting step