Ingester zonal disruptions #9908
Comments
Conceptually, this could definitely be something that exists in Kubernetes directly, because the pattern of "allowing zonal disruptions" is not unique to Mimir. E.g., an Elasticsearch cluster that has documents replicated across "zones" would benefit from this same controller.
Perhaps this is something the https://github.com/grafana/rollout-operator could manage?
In Mimir 2.14 the team added support for putting ingesters into a read-only mode (docs). The documentation on scaling ingesters down was also updated, mentioning the mechanics of multi-zonal deployment (docs). Would this help with what you outlined in the issue?
The mechanics outlined in the documentation are, indeed, supported by the
Keen on a solution for this. I think this was also raised on the rollout-operator just last week: grafana/rollout-operator#194
FYI here: I solved a similar issue by running the ZDB controller from aws/zone-aware-controllers-for-k8s. You can pick up my fork with Golang and the base image refreshed. Setup is quite straightforward, but ping me if you have questions.
Is your feature request related to a problem? Please describe.
When deploying Mimir to K8s, Pod Disruption Budgets (PDBs) are created for some pod types (distributors, ingesters, etc.); however, they tend to be too restrictive, allowing (I believe) only 1 disruption at a time.
Even though metrics are replicated across zones, so that more disruptions would be safe, there isn't a clear way to define a PDB that allows them: PDBs count pods without any notion of zones.
Describe the solution you'd like
It would be nice if there were some way to have a "high-level PDB" where whole zones can be disrupted. A "zone" would be "healthy" or "up" if all pods in that zone are healthy/up; conversely, a disrupted zone is one where at least 1 pod is unhealthy.
So, this might enable something like a "ZDB" with a rule that a majority of zones must be available/undisrupted. That would let you disrupt a single zone entirely (e.g., all pods in that zone), which would speed up draining k8s nodes, since you can safely disrupt 1/3 of total pods; this is really important/helpful when running many pods.
This might be accomplished via some sort of controller/operator.
For example, we have a cluster with 420 ingester pods. With a PDB allowing only 1 disruption, we can drain at most 1 k8s node at a time, when this could be done much more quickly (and just as safely).
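To make the idea above concrete, here is a minimal sketch (hypothetical names; not rollout-operator or Mimir code) of the "ZDB" admission check: a zone counts as disrupted if any of its pods is unhealthy, and an eviction in a zone is permitted only while enough other zones remain fully healthy.

```python
# Illustrative sketch of a "zone disruption budget" check.
# All names (zone_healthy, can_evict, min_healthy_zones) are hypothetical.

def zone_healthy(pods):
    """A zone is healthy only if every pod in it is healthy."""
    return all(pods)

def can_evict(zones, target_zone, min_healthy_zones):
    """Allow an eviction in target_zone if, treating that whole zone as
    disrupted, at least min_healthy_zones other zones stay fully healthy."""
    healthy_others = sum(
        1 for name, pods in zones.items()
        if name != target_zone and zone_healthy(pods)
    )
    return healthy_others >= min_healthy_zones

# Three zones of ingesters, all currently healthy (True = pod healthy).
zones = {
    "zone-a": [True] * 140,
    "zone-b": [True] * 140,
    "zone-c": [True] * 140,
}

# Draining zone-a is fine: the other two zones (a majority) stay healthy.
print(can_evict(zones, "zone-a", min_healthy_zones=2))  # True

# If zone-b already has an unhealthy pod, disrupting zone-a would leave
# only one fully healthy zone, so the eviction is denied.
zones["zone-b"][0] = False
print(can_evict(zones, "zone-a", min_healthy_zones=2))  # False
```

With a majority rule over 3 zones, all 140 pods of one zone become evictable at once instead of 1 pod cluster-wide, which is what makes node drains so much faster.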
Describe alternatives you've considered
This might be something we'll have to create ourselves because (ironically) it's very disruptive.