Deploymets scaled down very quickly despite low maxDisruption #143

michael-careplanner · 2022-01-21T11:58:59Z

We have several deployments scaled by WPA on fairly spiky queues. In order to limit how spiky our scale downs are we set maxDisruption to a low value.

However we still see scale downs happening very quicky. Looking at the logs, it seems WPA is taking scaling decisions several times a second. For example in once instance, despite having a maxDisruption set to 1% we saw a deployment scale from 53 replicas to 30 replicas in 5 seconds. Each time the deployment was decreased by 1 replica (so obeying the maxDisruption value set) but WPA made 26 individual scaling decisions in those 5 seconds.

I think a (possibly configurable) cooldown needs adding after a scaling decision, so that fast scale downs can be prevented by setting a low maxDisruption.

alok87 · 2022-01-22T09:44:47Z

Specifying the secondsToProcessOneJob will help here.

WPA will keep your minReplicas based on the RPM on the queue if secondsToProcessOneJob is specified.

Basically this is the logic, code:

workersBasedOnMessagesSent := int32(math.Ceil((secondsToProcessOneJob * messagesSentPerMinute) / 60))
if workersBasedOnMessagesSent > minWorkers {
return workersBasedOnMessagesSent
}

return minWorkers

messageSentPerMinute in SQS is avg number of msg sent in 1 minute, code.
minWorkers is the replicas you have specified as min in WPA spec. kubectl edit wpa
secondsToProcessOneJob is what you specify

Can you try this and let us know if this helps?

justjkk · 2022-02-03T21:17:01Z

@michael-careplanner I think the resync-period configuration is set to a very low value(like 200ms) which would explain the rapid scaling decisions. The default value is 20 seconds which works well for us. Please confirm if this is the case and if so is there any particular reason for setting it to a value lower than 20 seconds? Scaling up from 0 is anyways instantaneous due to the long-polling logic and I can't figure out another use-case for setting a small value for resync-period.

michael-careplanner · 2022-02-04T08:19:14Z

@justjkk Looking at our config resync-period is actually being set to 40, up from the default of 20. Agreed that having it set to a very low value doesn't make too much sense, but looks like that isn't the case in our config.

@alok87 Thanks for your reply. I am planning to look at secondsToProcessOneJob at some point, but I feel like that's a separate issue to this. I still don't think WPA should be making scale down decisions so close together, especially given the resync period we've set.

alok87 · 2022-02-04T08:23:18Z

Possible to share the log the WPA for a particular queue with -v=4 verbosity? will help in debugging.

michael-careplanner · 2022-02-04T08:54:40Z

wpa-rapid-scale-down.log

Not sure if this is a large enough snippet of logs but it does show the rapid scale down.

alok87 · 2022-02-04T09:41:57Z

Resync-period is not being respected. I see the control loop executing every 200ms, instead of 20second.

I0204 08:25:22.941872       1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 31
I0204 08:25:22.968042       1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 30
I0204 08:25:23.148417       1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 29
I0204 08:25:23.350866       1 controller.go:441] cpapi_live_webhooks qMsgs: 89, desired: 28

I think event handlers needs to use AddEventHandlerWithResyncPeriod
wpa is using AddEventHandler

Made a change.

Can you try this image: practodev/workerpodautoscaler:v1.5.0-5-gfef9160 and send me the logs again if possible. Wanted to test if this fixes the problem!

michael-careplanner · 2022-02-04T11:28:41Z

Tested using that and I am seeing some 40 seconds pauses (in line with our resync-period) - However I'm also seeing every 200ms scale down's for certain deployments:

I0204 11:18:58.042487       1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 61                                                                                                
I0204 11:18:58.243314       1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 60                                                                                                
I0204 11:18:58.442599       1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 59                                                                                                
I0204 11:18:58.642624       1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 58                                                                                                
I0204 11:18:58.842958       1 controller.go:451] cpapi_live_webhooks qMsgs: 153, desired: 57
<many more lines of this>

Seems once it has decided to do a scale down it still loops in the controller without respecting the resync-period until the deployment is at the size WPA wants it to be at.

alok87 · 2022-02-04T12:01:06Z

Ok I think since we are updating the status in WPA every reconciliation loop, hence it is creating new event everytime without respecting resync.

Doing WPA status update only when changes happen and doing noop events can prevent this.

alok87 · 2022-02-05T04:29:49Z

@michael-careplanner can you confirm the 200ms is happening in the queues/wpa objects with else:

         if workerPodAutoScaler.Status.CurrentReplicas == currentWorkers &&
		  workerPodAutoScaler.Status.AvailableReplicas == availableWorkers &&
		  workerPodAutoScaler.Status.DesiredReplicas == desiredWorkers &&
		  workerPodAutoScaler.Status.CurrentMessages == queueMessages {
		  klog.V(4).Infof("%s/%s: WPA status is already up to date\n", namespace, name)
		  return
	  } else {
		  klog.V(4).Infof("%s/%s: Updating wpa status\n", namespace, name)
	  }

michael-careplanner · 2022-02-08T08:48:34Z

Yes, I see lots of messages like this while it's doing the 200ms reconciliation loop:
controller.go:770] live-api/cpapi-transport: Updating wpa status

Followed by a single controller.go:767] live-api/cpapi-transport: WPA status is already up to date at the end of the fast loop. So looks like it's repeatedly executing that if statement above until it hits the truthy case, then it roughly respects the resync-period before running again:

I0208 08:49:18.857614       1 controller.go:767] live-api/cpapi-transport: WPA status is already up to date
I0208 08:49:53.650846       1 controller.go:626] cpapi_live_transport min=1, max=300, targetBacklog=10

alok87 · 2022-02-10T06:46:09Z

Yes then it is confirmed, the update status is leading to very fast re-queue. Cant stop doing updates to the status since queue messages and others details are updated with every update, also the dashboards are there on top of these status.

We may have to consider scale down cool-off after all: --scale-down-delay-after-scale-up

alok87 · 2022-02-10T10:27:25Z

Hello @michael-careplanner

Added scale-down-delay-after-last-scale-activity.

--scale-down-delay-after-last-scale-activity 
scale down delay after last scale up or down (defaults to 600 seconds i.e 10mins)

I have written test cases for it. But have not tested it in any Kubernetes cluster yet.

Here is the image: practodev/workerpodautoscaler:v1.5.0-9-gbeb0731

michael-careplanner · 2022-02-15T11:43:19Z

Just tested this and hitting an issue where it returns scale-up when scaling down and therefore not respecting the cool down:

I0215 11:25:32.221965       1 controller.go:458] cpapi_live_transport current: 96                                                                                                           I0215 11:25:32.221969       1 controller.go:459] cpapi_live_transport qMsgs: 98, desired: 95                                                                                                I0215 11:25:32.240002       1 controller.go:519] cpapi_live_transport scaleOp: scale-up

Looks like GetScaleOperation expects desiredWorkers then currentWorkers, but when it's called it's passing currentWorkers then desiredWorkers, so the GetScaleOperation logic is inverted!

alok87 · 2022-02-15T17:06:21Z

@michael-careplanner Fixed it here.

New image with the above change:
practodev/workerpodautoscaler:v1.5.0-11-g7d12579

michael-careplanner · 2022-02-16T10:12:37Z

Tested that version, it now correctly infers the scale type, but it's still not respecting the cooldown - log:

I0216 09:48:55.389034       1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.389049       1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.389057       1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.389061       1 controller.go:687] cpapi_live_transport current=115, idle=-1
I0216 09:48:55.389066       1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.389073       1 controller.go:458] cpapi_live_transport current: 115
I0216 09:48:55.389079       1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 113
I0216 09:48:55.389091       1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.394090       1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.394109       1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.587844       1 controller.go:827] live-api/cpapi-transport: Updated wpa status
I0216 09:48:55.587944       1 controller.go:660] cpapi_live_transport min=1, max=300, targetBacklog=10
I0216 09:48:55.587958       1 controller.go:683] cpapi_live_transport qMsgs=55, qMsgsPerMin=-1
I0216 09:48:55.587965       1 controller.go:685] cpapi_live_transport secToProcessJob=0, maxDisruption=1%
I0216 09:48:55.587977       1 controller.go:687] cpapi_live_transport current=113, idle=-1
I0216 09:48:55.587982       1 controller.go:689] cpapi_live_transport minComputed=1, maxDisruptable=2
I0216 09:48:55.587988       1 controller.go:458] cpapi_live_transport current: 113
I0216 09:48:55.587994       1 controller.go:459] cpapi_live_transport qMsgs: 55, desired: 111
I0216 09:48:55.588005       1 scale_operation.go:60] cpapi_live_transport scaleDown is allowed, cooloff passed
I0216 09:48:55.597267       1 controller.go:519] cpapi_live_transport scaleOp: scale-down
I0216 09:48:55.597284       1 controller.go:806] live-api/cpapi-transport: Updating wpa status
I0216 09:48:55.788992       1 controller.go:827] live-api/cpapi-transport: Updated wpa status

I've had a good look over your PR but can't see where the bug is...

alok87 · 2022-02-16T11:16:43Z

I am guessing the status update for lastScaleTime is not reflecting in your WPA object.

status:
  LastScaleTime:

This should get updated after every scale activity. It should start with nil.

I am testing this one of our clusters.

#143 (comment)

alok87 · 2022-02-16T12:34:18Z

@michael-careplanner Hey! Issue was scaleDownDelay was being set as 0 due to a bug. I fixed it in f1be3a3.

Thanks for bringing out this issue!! I saw the same problem in one of our high throughput queue. Here is the dashboard image after this change, the scale down becomes smooth 👍🏼 (this worker did not set secondsToProcessOneJob)

If someone still needs the old behaviour, would need to set the scaleDownDelay to a lower value than default of 10mins.

Below image has the fix, let me know how it works out for you!:
practodev/workerpodautoscaler:v1.5.0-12-gf1be3a3

alok87 · 2022-02-21T05:53:59Z

@michael-careplanner This is good to release from our end. Working perfectly in our latest cluster. Did you get the chance to test it?

michael-careplanner · 2022-02-21T08:38:15Z

@alok87 Yeah, this is working well on our end now. Thanks very much for all your work on this!

alok87 · 2022-02-21T09:24:48Z

Release v1.6.0 https://github.com/practo/k8s-worker-pod-autoscaler/releases/tag/v1.6.0

alok87 added the bug Something isn't working label Feb 4, 2022

alok87 added a commit that referenced this issue Feb 15, 2022

Fix bug: #143 (comment)

7d12579

alok87 mentioned this issue Feb 15, 2022

Cooldown after last scale activity: scale-down-delay-after-last-scale-activity #145

Merged

alok87 added a commit that referenced this issue Feb 16, 2022

Fix scaleDownDelay being 0 bug

f1be3a3

#143 (comment)

alok87 closed this as completed in #145 Feb 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deploymets scaled down very quickly despite low maxDisruption #143

Deploymets scaled down very quickly despite low maxDisruption #143

michael-careplanner commented Jan 21, 2022

alok87 commented Jan 22, 2022 •

edited

Loading

justjkk commented Feb 3, 2022

michael-careplanner commented Feb 4, 2022

alok87 commented Feb 4, 2022

michael-careplanner commented Feb 4, 2022

alok87 commented Feb 4, 2022

michael-careplanner commented Feb 4, 2022

alok87 commented Feb 4, 2022 •

edited

Loading

alok87 commented Feb 5, 2022 •

edited

Loading

michael-careplanner commented Feb 8, 2022 •

edited

Loading

alok87 commented Feb 10, 2022 •

edited

Loading

alok87 commented Feb 10, 2022

michael-careplanner commented Feb 15, 2022 •

edited

Loading

alok87 commented Feb 15, 2022

michael-careplanner commented Feb 16, 2022

alok87 commented Feb 16, 2022 •

edited

Loading

alok87 commented Feb 16, 2022 •

edited

Loading

alok87 commented Feb 21, 2022 •

edited

Loading

michael-careplanner commented Feb 21, 2022

alok87 commented Feb 21, 2022

Deploymets scaled down very quickly despite low maxDisruption #143

Deploymets scaled down very quickly despite low maxDisruption #143

Comments

michael-careplanner commented Jan 21, 2022

alok87 commented Jan 22, 2022 • edited Loading

justjkk commented Feb 3, 2022

michael-careplanner commented Feb 4, 2022

alok87 commented Feb 4, 2022

michael-careplanner commented Feb 4, 2022

alok87 commented Feb 4, 2022

michael-careplanner commented Feb 4, 2022

alok87 commented Feb 4, 2022 • edited Loading

alok87 commented Feb 5, 2022 • edited Loading

michael-careplanner commented Feb 8, 2022 • edited Loading

alok87 commented Feb 10, 2022 • edited Loading

alok87 commented Feb 10, 2022

michael-careplanner commented Feb 15, 2022 • edited Loading

alok87 commented Feb 15, 2022

michael-careplanner commented Feb 16, 2022

alok87 commented Feb 16, 2022 • edited Loading

alok87 commented Feb 16, 2022 • edited Loading

alok87 commented Feb 21, 2022 • edited Loading

michael-careplanner commented Feb 21, 2022

alok87 commented Feb 21, 2022

alok87 commented Jan 22, 2022 •

edited

Loading

alok87 commented Feb 4, 2022 •

edited

Loading

alok87 commented Feb 5, 2022 •

edited

Loading

michael-careplanner commented Feb 8, 2022 •

edited

Loading

alok87 commented Feb 10, 2022 •

edited

Loading

michael-careplanner commented Feb 15, 2022 •

edited

Loading

alok87 commented Feb 16, 2022 •

edited

Loading

alok87 commented Feb 16, 2022 •

edited

Loading

alok87 commented Feb 21, 2022 •

edited

Loading