Umbrella issue for handling Informer Permission Errors and Probes #1422
After thinking more about this issue: there is a serious problem when an informer is not working. The caches are not populated with fresh resources, so until the RBAC issue is solved, the state can be inconsistent. For example, if a new primary resource tries to create a secondary resource on reconciliation and the informer for that secondary resource is not working, the cache will not know about the resource, so the reconciler will try to recreate it over and over again. Therefore, in a first phase I would lean towards simply restarting the operator when there is such an error with the informer, maybe allowing users to configure some retry options for the informer before it stops. There could be a feature flag to not automatically shut down the operator, but instead delegate this to probes and expose the health metrics of informers, as mentioned in point 3.
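A minimal sketch of the behaviour proposed above: retry a few times, then either stop the operator (the default) or only report unhealthy when a feature flag disables the automatic shutdown. All class, enum, and method names here are hypothetical illustrations, not JOSDK API:

```java
// Hypothetical sketch: decide what to do when an informer cannot sync.
// InformerFailurePolicy and Action are illustrative names, not framework API.
class InformerFailurePolicy {
    enum Action { RETRY, STOP_OPERATOR, REPORT_UNHEALTHY }

    private final int maxRetries;
    private final boolean stopOnFailure; // feature flag: default true (exit the operator)

    InformerFailurePolicy(int maxRetries, boolean stopOnFailure) {
        this.maxRetries = maxRetries;
        this.stopOnFailure = stopOnFailure;
    }

    Action onFailure(int attempt) {
        if (attempt < maxRetries) {
            return Action.RETRY;
        }
        // After retries are exhausted, either exit the operator (default)
        // or just flip the health indicator and let probes handle it.
        return stopOnFailure ? Action.STOP_OPERATOR : Action.REPORT_UNHEALTHY;
    }
}
```

With `stopOnFailure` disabled, the operator stays up and the failing informer only surfaces through health reporting, which matches the probe-based alternative mentioned in the comment.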
For the case when watched namespaces are added dynamically, there could be some optimizations, like reconciling a resource only when all the informers watching resources from its namespace are healthy. But these would be very specific cases; it is questionable whether such features are worth the effort.
I think when the informer for a specific namespace cannot be created, we need to remove that namespace from the config and clear the caches. The general problem with the restarting/current behaviour on errors is that malicious users (or simple user error) can cripple the operator for everyone in the cluster. If a user deletes a rolebinding in their own namespace, the operator will go into a restart loop and there is nothing anyone else can do to fix it. The whole cluster and its resources are affected this way, and restarting won't do any good.
@gyfora How should we behave on startup? I guess we should also do the checks there on the watched namespaces. Do you agree that in the case when a permission is revoked, so that from that point on one of the informers won't work and the caches won't function properly, we should restart the operator?
However, validating the namespace list and the related permissions, and filtering the namespace list accordingly, could be something done outside of the operator.
No, I am not really proposing anything about newly added namespaces. What I am proposing is that if at any time any watched namespace becomes inaccessible, we should be able to remove it from the list, clean up the caches, and continue with the rest. I think this should at least be a configurable option (the default could be fail-operator/restart, as you proposed). Basically, we need to address the following situation robustly:
Unless this problem is addressed, there is a big risk for any operator deployment that watches multiple independent user namespaces. One user can basically send the whole operator into a restart loop, which is not something many production platforms can afford.
In some cases it is better to not watch a namespace anymore than to affect other namespaces being watched. |
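The remove-and-continue idea could look roughly like the following sketch: when a watched namespace becomes inaccessible, drop it from the watch list and evict its cached resources so the remaining namespaces keep working. The class and its methods are hypothetical, not actual JOSDK or fabric8 API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the proposed behaviour: on e.g. a 403 Forbidden
// indicating revoked RBAC, remove the namespace from the watch list and
// clear its cache entries instead of restarting the whole operator.
class NamespaceWatchRegistry {
    private final Set<String> watched = new HashSet<>();
    // cache keyed by "namespace/name"
    private final Map<String, Object> cache = new HashMap<>();

    void watch(String namespace) { watched.add(namespace); }

    void put(String namespace, String name, Object resource) {
        cache.put(namespace + "/" + name, resource);
    }

    // Called when an informer for this namespace permanently fails.
    void dropNamespace(String namespace) {
        watched.remove(namespace);
        cache.keySet().removeIf(k -> k.startsWith(namespace + "/"));
    }

    Set<String> watchedNamespaces() { return watched; }
    int cachedCount() { return cache.size(); }
}
```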
Yes, I guess if there is a liveness probe to trigger the restart, then we as users of the framework could validate the permissions ourselves. That would definitely be a step forward.
OK, I understand what you mean. Good point. However, consider this situation: somebody accidentally misconfigures the RBAC and revokes the permission for just a single resource type in a namespace, while the other resources and their informers still work in that namespace. Should we remove all the informers for that namespace? If not, we have the stale-cache problem mentioned above. But it can also happen that that particular namespace was a special one holding some configs for the operator (not a team namespace). So what I mean is that there is currently nothing in the operator's model saying that informers are bound together, i.e. that if one of them stops, the others should be stopped too. There is a flag that means informers should follow the namespace changes of the controller. Theoretically, this would make sense to me for the multi-namespace mode. (Not for the whole-cluster or single-namespace mode; there I think we still need to stop.)
The other alternative, which solves both issues, is to just retry starting the informer with some exponential backoff and log a warning. We can assume the resource won't be reconciled anyway, since there will be no new events from that namespace, and admins will eventually remove the revoked namespace from the configuration.
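The retry-with-exponential-backoff proposal could be sketched like this; the constants and class name are purely illustrative, not framework defaults:

```java
// Illustrative sketch of exponential backoff for restarting a failed
// informer: delay doubles per attempt up to a cap, while a warning
// would be logged on each failure.
class InformerRetryBackoff {
    private final long initialDelayMillis;
    private final double multiplier;
    private final long maxDelayMillis;

    InformerRetryBackoff(long initialDelayMillis, double multiplier, long maxDelayMillis) {
        this.initialDelayMillis = initialDelayMillis;
        this.multiplier = multiplier;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Delay before the given (1-based) retry attempt, capped at maxDelayMillis.
    long delayForAttempt(int attempt) {
        double delay = initialDelayMillis * Math.pow(multiplier, attempt - 1);
        return (long) Math.min(delay, maxDelayMillis);
    }
}
```

For example, with an initial delay of 1s, multiplier 2, and a 30s cap, attempts would wait 1s, 2s, 4s, … up to 30s, so a namespace with revoked permissions only generates a slow trickle of warnings rather than a restart loop.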
Yes, I think the exponential backoff + warning is a good approach. In that case, other informers/namespaces would not be affected.
Yes, that will cover this case in a relatively simple way. But I would put it behind a feature flag; by default the operator will exit if one of the informers is stopped. Later we can improve this with metrics and readiness probes. cc @andreaTP
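The probe-based improvement mentioned here could aggregate per-informer health into a single readiness signal, something like the following sketch (names are hypothetical, not JOSDK API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: collect per-informer health so a Kubernetes
// readiness endpoint can report the operator unready instead of the
// framework killing the whole process.
class InformerHealthIndicator {
    private final Map<String, Boolean> informerHealthy = new HashMap<>();

    // Each informer (or its retry loop) reports its current state here.
    void report(String informerName, boolean healthy) {
        informerHealthy.put(informerName, healthy);
    }

    // A readiness endpoint would return HTTP 200 only when this is true.
    boolean ready() {
        return informerHealthy.values().stream().allMatch(Boolean::booleanValue);
    }
}
```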
sounds great :)
As discussed during the community meeting, here is all the PoC caching stuff I have been working on for how Go operators can handle dynamically changing permissions via RBAC:
Another edge case that is relevant: |
see related issue: fabric8io/kubernetes-client#4527 |
🎉 great work! |
This issue is to cover several related issues and come up with a design for how to handle them; see:
Summary
Informers by definition do not work if the related RBAC is not configured or is misconfigured.
This is true both for informers for primary and secondary resources.
When a permission is revoked, what happens is that the informer (from the fabric8 client) will retry a few times but eventually just stops. (TODO: verify this behavior, and also its configurability.)
Note, however, that the framework supports dynamically changing the watched namespaces, so it can again happen that a new namespace is added for which there are no permissions. It might be desirable to keep the operator functioning on the former namespaces.
There are multiple ways to solve this:
@andreaTP proposed that having a callback function might be useful too.