Ingester zonal disruptions #9908
Comments
Conceptually, this could definitely be something that exists in Kubernetes directly, because the pattern of "allowing zonal disruptions" is not unique to Mimir. E.g., an Elasticsearch cluster that has documents replicated across "zones" would benefit from this same controller.
Perhaps this is something the https://github.com/grafana/rollout-operator could manage?
In Mimir 2.14 the team added support for putting ingesters into a read-only mode (docs). The documentation on scaling ingesters down was also updated, mentioning the mechanics of multi-zonal deployment (docs). Would this help with what you outlined in the issue?
The mechanics outlined in the documentation are, indeed, supported by the
Keen on a solution for this. I think this was also raised on the rollout-operator just last week: grafana/rollout-operator#194
FYI here: I solved a similar issue by running the ZDB controller from aws/zone-aware-controllers-for-k8s. You can pick up my fork with Golang and the base image refreshed. Setup is quite straightforward, but ping me if you have questions.
Is your feature request related to a problem? Please describe.
When deploying Mimir to K8s, Pod Disruption Budgets (PDBs) are created for some pod types (distributors, ingesters, etc.); however, they tend to be too restrictive, allowing (I believe) only 1 disruption at a time.
Even though metrics are replicated across zones, so that more disruptions would be safe, there isn't a clear way to define a PDB that allows them: PDBs count pods without any notion of zones.
Describe the solution you'd like
It would be nice if there were some way to have a "high-level PDB" where whole zones can be disrupted. A "zone" would be "healthy" or "up" if all pods in that zone are healthy/up; conversely, a disrupted zone is one where at least 1 pod is unhealthy.
So, this might enable something like a "ZDB" with a rule that a majority of zones must be available/undisrupted. That would let you disrupt a single zone entirely (e.g., all pods in that zone), which would speed up draining k8s nodes, since you can safely disrupt 1/3 of total pods; this is really important/helpful when running many pods.
This might be accomplished via some sort of controller/operator.
For example, we have a cluster with 420 ingester pods. With a PDB allowing only 1 disruption, we can drain at most 1 k8s node at a time, when this could be done much more quickly (and just as safely).
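To make the idea above concrete, here is a minimal sketch (hypothetical names; not rollout-operator or Mimir code) of the "ZDB" admission check: a zone counts as disrupted if any of its pods is unhealthy, and an eviction in a zone is permitted only while enough other zones remain fully healthy.

```python
# Illustrative sketch of a "zone disruption budget" check.
# All names (zone_healthy, can_evict, min_healthy_zones) are hypothetical.

def zone_healthy(pods):
    """A zone is healthy only if every pod in it is healthy."""
    return all(pods)

def can_evict(zones, target_zone, min_healthy_zones):
    """Allow an eviction in target_zone if, treating that whole zone as
    disrupted, at least min_healthy_zones other zones stay fully healthy."""
    healthy_others = sum(
        1 for name, pods in zones.items()
        if name != target_zone and zone_healthy(pods)
    )
    return healthy_others >= min_healthy_zones

# Three zones of ingesters, all currently healthy (True = pod healthy).
zones = {
    "zone-a": [True] * 140,
    "zone-b": [True] * 140,
    "zone-c": [True] * 140,
}

# Draining zone-a is fine: the other two zones (a majority) stay healthy.
print(can_evict(zones, "zone-a", min_healthy_zones=2))  # True

# If zone-b already has an unhealthy pod, disrupting zone-a would leave
# only one fully healthy zone, so the eviction is denied.
zones["zone-b"][0] = False
print(can_evict(zones, "zone-a", min_healthy_zones=2))  # False
```

With a majority rule over 3 zones, all 140 pods of one zone become evictable at once instead of 1 pod cluster-wide, which is what makes node drains so much faster.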
Describe alternatives you've considered
This might be something we'll have to create ourselves because (ironically) it's very disruptive.