This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Fix/autoscaling #37

Open · sharkinsspatial wants to merge 2 commits into main

Conversation

sharkinsspatial
Contributor

What I am changing

Removed the cluster downscaling configuration, which was preventing AKS autoscaling. Downscaling should be managed directly via the dask-kubernetes adaptive implementation.
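
For reference, a minimal sketch of what managing downscaling through dask-kubernetes's adaptive mode looks like; the image, resource requests, and worker bounds below are illustrative placeholders, not this project's actual configuration:

```python
# Minimal sketch (illustrative values): let dask-kubernetes manage worker
# scaling adaptively instead of relying on cluster-level downscaling.
from dask_kubernetes import KubeCluster, make_pod_spec

# Placeholder worker pod template; image and resources are not this repo's
# actual settings.
pod_spec = make_pod_spec(
    image="daskdev/dask:latest",
    memory_limit="4G",
    memory_request="4G",
    cpu_limit=1,
    cpu_request=1,
)

cluster = KubeCluster(pod_spec)

# distributed's adaptive heuristics add and remove worker pods; the AKS node
# autoscaler then follows the pending/idle pods up and down.
cluster.adapt(minimum=0, maximum=20)
```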

How I did it

Removed the downscaling settings from the cluster's Terraform configuration.

How you can test it

@tracetechnical
Contributor

I don't believe that the dask-kubernetes and AKS autoscalers are linked. This configuration specifically scales down when there are no workloads and is unrelated to Dask.

@tracetechnical
Contributor

We should eventually still have this functionality at the K8s level to save costs; otherwise the cluster will keep nodes spun up until the default low-workload threshold time is hit.

@sharkinsspatial
Contributor Author

@tracetechnical There are several points to unpack here. The first is that configuring downscaling at the cluster level appears to cause a race/contention issue with worker pod creation initiated by https://github.com/dask/distributed/blob/main/distributed/deploy/adaptive_core.py. The KubeCluster scales worker pods using distributed's adaptive heuristics, but downscaling at the cluster level can result in new nodes being removed before worker pods are placed on them. Theoretically we could increase the downscaling time range, but given the variability in AKS's autoscaling node launch times that might still be problematic.

The non-Dask workloads running in our cluster (prefect-agent, flow-runner, loki, grafana) are less likely to require resource scaling. Since we are running multiple workloads in this cluster, does it make sense for us to use an annotation to prevent Dask resources from being autoscaled down? One issue with this may be the case where the scheduler pod is killed prematurely, leaving orphaned worker pods that are no longer subject to being autoscaled down. Maybe the best approach is to keep scale_down_unneeded with a very large interval?
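
For concreteness, the annotation in question would presumably be the cluster autoscaler's safe-to-evict setting on the Dask pods; a hedged sketch of attaching it to a worker pod template (the annotation choice, image, and bounds are assumptions, not this repo's configuration):

```python
# Sketch (assumed approach): mark Dask worker pods "safe-to-evict: false" so
# the Kubernetes cluster autoscaler will not drain their nodes out from under
# them. Image and worker bounds are illustrative.
from kubernetes import client
from dask_kubernetes import KubeCluster, make_pod_spec

pod_spec = make_pod_spec(image="daskdev/dask:latest")
pod_spec.metadata = pod_spec.metadata or client.V1ObjectMeta()
pod_spec.metadata.annotations = {
    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
}

cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=0, maximum=20)
```

The trade-off is the one noted above: if the scheduler pod dies prematurely, the orphaned workers carry the same annotation and the autoscaler will never reclaim their nodes.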

@tracetechnical
Contributor

tracetechnical commented Dec 9, 2021

@sharkinsspatial Scrub my previous comment re: the workings of the autoscaler (now deleted). I think your idea re: scale_down_unneeded may be covered by the default times in the autoscaler, but we would need to verify this against the standard autoscaler profile. And I imagine that the cost savings delivered by non-standard autoscaler profiles would be far less than the annoyance factor of mystery disappearances based on your examples above.

The above, coupled with the fact that these params seem to break the autoscaler, points toward the customisation removed in this PR being worthy of removal.

@rabernat

rabernat commented Dec 9, 2021

I thought I would chime in on this based on our experience running Dask clusters in Pangeo.

In the early days, we ran Dask Kubernetes on our Pangeo Cloud GKE cluster with autoscaling node pools. Dask's autoscaling requests for more pods triggered GKE to scale up and down accordingly. It seemed to work well. GKE's timescale was a lot slower than Dask's; if the timescales were comparable, I imagine you could get weird behavior (oscillations, for example).
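
If the two timescales ever did get too close, the Dask side can be slowed down explicitly; a sketch of the relevant knobs on distributed's adaptive loop (image and values are illustrative):

```python
# Sketch (illustrative values): keep Dask's adaptive loop slower and steadier
# than the node autoscaler so the two controllers don't oscillate.
from dask_kubernetes import KubeCluster, make_pod_spec

cluster = KubeCluster(make_pod_spec(image="daskdev/dask:latest"))
cluster.adapt(
    minimum=0,
    maximum=20,
    interval="10s",  # how often the adaptive heuristic re-evaluates
    wait_count=6,    # consecutive scale-down recommendations before acting
)
```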
