FR: include priorityClassName: system-node-critical in controller deployments
#3800
Comments
Please take a look at the documentation that explains how to customize the manifests to your needs.
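For reference, such a customization might look like the following minimal sketch, assuming the flux-system layout that flux bootstrap generates (gotk-components.yaml plus gotk-sync.yaml); the patch targets every Deployment in the bundle:

```yaml
# flux-system/kustomization.yaml — a minimal sketch, not the project's official guidance;
# adjust paths and the chosen priority class to your needs.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: all    # placeholder; the target selector below decides what gets patched
      spec:
        template:
          spec:
            priorityClassName: system-node-critical  # or system-cluster-critical, as discussed below
    target:
      kind: Deployment
```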
I don't need to be told how to do it. I've been deploying with flux2 since 0.10; I am quite familiar with the docs and make use of Flux to configure Flux the way it should be configured. I'm talking about defaults. I might mention that when gitops is written about, fluxcd is included in the list of gitops systems maybe a third of the time (exceptions include pieces written by outfits that are partnered in some way with fluxcd). If you go through the issues for this and related Flux repos, you find that many of them involve problems with controller stability. There are real code issues and design decisions that are still being worked out, but often the problem boils down to the controller dying in the middle of a reconciliation. Unfortunately, as the gutting of kube's built-in drivers continues and more of those drivers are pushed to the node as daemonsets, the likelihood of Flux controllers dying from goat-level prioritization rises. The defaults pretty much guarantee instability during deployment (actually using fluxcd for its defined purpose). If the cluster operator must update a parameter on every default deployment, then that parameter is, by definition, a default, which is what this issue concerns.
This is no longer the case: starting with Flux v0.41.0, helm-controller handles the SIGTERM sent by kubelet and aborts all in-flight Helm operations by setting the releases' status to failed. When it restarts, those failed releases are retried, as they are no longer left in an inconsistent state.
The Flux controllers are neither static pods nor daemonsets; IMO a more suitable class would be system-cluster-critical.
Wrt the helm-controller, as of 0.41.1, it does not. Pretty sure that's why there is a current initiative to properly roll back to a reconcilable state when it fails. I agree with you regarding using the system-cluster-critical class.
What are you talking about? The SIGTERM handling was not reverted in 0.41.1; we shipped a cgroup fix in that release for the OOMWatcher, which is an opt-in feature.
I'll post links to the issue mentioned when not on mobile. I do not think we're speaking of the same thing.
In the case of pod eviction, due to node pressure or downscaling, before Flux v0.41.0 the inflight releases were left in a locked state. Once a release is locked, helm-controller can't perform any actions on it. Starting with v0.41, the controller reacts to SIGTERM (sent before eviction) and sets the inflight releases as failed, thus removing the lock. When helm-controller starts on a different node, it will retry the failed releases, if configured to do so.
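As a sketch of the "if configured to do so" part: retry behaviour is expressed on the HelmRelease itself via its remediation settings (the podinfo names below are purely illustrative):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo              # illustrative release name
  namespace: default
spec:
  interval: 10m
  chart:
    spec:
      chart: podinfo
      sourceRef:
        kind: HelmRepository
        name: podinfo
  install:
    remediation:
      retries: 3             # retry a failed install up to 3 times
  upgrade:
    remediation:
      retries: 3             # retry a failed upgrade up to 3 times
      remediateLastFailure: true
```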
@stefanprodan thanks. I meant to get back to this but the week got away from me. I must admit: I was running the v0.41 CLI against a v0.40 installation. I'll edit the issue for historical clarity. Thanks for the PR.
Re-opening as the PR implementing the actual feature hasn't been merged yet.
Hello!
@adarhef this has been fixed, thanks for reporting it. To install Flux on GKE, download v2.0.0-rc.3 and run
Thanks! I didn't get the chance to test your earlier suggestion to apply the
when
Describe the bug
flux/cli bootstrap deploys the controller deployments without specifying a priorityClass. This puts them at the front of the line for eviction if the underlying node comes under even the slightest resource pressure, and can result in flux objects (lookin' at you, helmRelease 😁) being left in an inconsistent state.
Steps to reproduce
flux deployments do not include deployment.spec.template.spec.priorityClassName. By the pod priority/eviction docs, this gives them a default priority of 0, or, in symbolic parlance, the priority of a sacrificial goat. In non-flux rolling update scenarios (managed by flux), it is often the case that resource pressure leads to flux controller eviction. I would like to suggest that bootstrap set the controllers' priorityClass to system-node-critical, since it is a long-standing kubernetes default priorityClass.
Expected behavior
When a source makes new data available, the responding fluxcd controllers are around long enough to see the defined changes applied.
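To make the expectation concrete, each bootstrap-generated controller Deployment would carry something like the following fragment (a sketch only; unrelated fields omitted):

```yaml
# Fragment of a controller Deployment with the requested default.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helm-controller        # likewise for source-controller, kustomize-controller, etc.
  namespace: flux-system
spec:
  template:
    spec:
      priorityClassName: system-node-critical  # today this field is unset, i.e. priority 0
```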
Screenshots and recordings
No response
OS / Distro
n/a
Flux version
v0.40.1
edit: specific concern already addressed with the v0.41.1 release
Flux check
n/a
Git provider
n/a
Container Registry provider
n/a
Additional context
No response
Code of Conduct