Fix/autoscaling #37
base: main
Conversation
I don't believe that the dask-kubernetes and AKS autoscalers are linked. This configuration specifically scales down when there are no workloads and is unrelated to Dask.
We should eventually still have this functionality at the K8S level to save costs; otherwise the cluster will keep nodes spun up until the default low-workload threshold time is hit.
@tracetechnical There are several points to unpack here. The first is that our configuring down-scaling at the cluster level appears to cause a race/contention issue with worker pod creation initiated by https://github.com/dask/distributed/blob/main/distributed/deploy/adaptive_core.py. The non-Dask workloads running in our cluster (prefect-agent, flow-runner, loki, grafana) are less likely to require resource scaling. As we are running multiple workloads in this cluster, does it make sense for us to use an annotation to prevent Dask resources from being autoscaled down? One issue with this may be the case where the scheduler pod is killed prematurely, resulting in orphaned worker pods that are no longer subject to being autoscaled down. Maybe the best approach is to maintain
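For illustration only, a rough sketch of the annotation idea above, assuming the standard Kubernetes cluster-autoscaler safe-to-evict annotation and the kubernetes Python client (the image and worker arguments below are placeholders, not taken from this repo):

```python
from kubernetes import client as k8s

# Hypothetical worker pod template. The annotation asks the Kubernetes
# cluster autoscaler not to evict this pod when it considers a node for
# scale-down, so Dask workers would only disappear when Dask removes them.
worker_pod = k8s.V1Pod(
    metadata=k8s.V1ObjectMeta(
        annotations={
            "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
        }
    ),
    spec=k8s.V1PodSpec(
        restart_policy="Never",
        containers=[
            k8s.V1Container(
                name="dask-worker",
                image="daskdev/dask:latest",  # placeholder image
                args=[
                    "dask-worker",
                    "--nthreads", "1",
                    "--memory-limit", "2GB",
                    "--death-timeout", "60",
                ],
            )
        ],
    ),
)
```

As noted above, this would not cover orphaned workers left behind if the scheduler pod is killed prematurely; those pods would stay pinned to their nodes until cleaned up by some other mechanism.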
@sharkinsspatial Scrub my previous comment RE: the workings of the autoscaler (now deleted). I think your idea RE: scale-down of unneeded nodes may be covered by the default times in the autoscaler, but we would need to verify this against the standard autoscaler profile. I also imagine that the cost savings delivered by non-standard autoscaler profiles would be far less than the annoyance factor of mystery disappearances based on your examples above. The above, coupled with the fact that these params seem to break the autoscaler, points toward the customisation removed in this PR being worthy of removal.
I thought I would chime in on this based on our experience running Dask clusters in Pangeo. In the early days, we ran Dask Kubernetes on our Pangeo Cloud GKE cluster with autoscaling node pools. Dask's autoscaling requests for more pods triggered GKE to scale up and down accordingly. It seemed to work well. GKE's timescale was a lot slower than Dask's; if the timescales were comparable, I imagine you could get weird behavior (oscillations, for example).
What I am changing
Removed the cluster downscaling configuration which was preventing AKS autoscaling. Downscaling should be managed directly via the dask-kubernetes adaptive implementation.
How I did it
Removed these settings from the cluster Terraform configuration.
How you can test it
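A rough sketch of how the Dask-side adaptive scaling could be exercised once this is deployed (assumes the classic dask-kubernetes KubeCluster API; the image, resource limits, and worker bounds are placeholders):

```python
from dask.distributed import Client
from dask_kubernetes import KubeCluster, make_pod_spec
import dask.array as da

# Placeholder pod spec; adjust the image and resources to match the cluster.
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="2G",
    memory_request="2G",
    cpu_limit=1,
    cpu_request=1,
)

cluster = KubeCluster(pod_spec)
# Let Dask's adaptive implementation (the adaptive_core.py referenced above)
# decide how many worker pods to request; AKS node autoscaling should then
# follow the pending/idle pods rather than fighting a cluster-level
# scale-down policy.
cluster.adapt(minimum=0, maximum=10)

client = Client(cluster)

# Submit enough work to force a scale-up, then let it drain and watch the
# worker pods (and eventually the AKS nodes) scale back down.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())
```

While this runs, AKS should scale nodes up for the pending worker pods and, after the work completes and Dask retires the workers, scale them back down on its default timings.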