kube-mgmt throwing errors when pod has been running for quite some time #49
@muzcategui1106-gs can you show an example of a configmap that's annotated with policy-status that should not be? From looking at the configmap replication code in kube-mgmt, it's not obvious how this could happen.

RE: the sync channel closure, that is to be expected with the Kubernetes client. However, the delay on upserting into OPA is NOT expected. I'll try to reproduce the problem.

One thing that might help would be to update the data replicator to use the PATCH method on OPA instead of DELETE followed by a series of PUTs. Here is the code in question: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147
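For illustration, a minimal sketch of what a PATCH-based upsert against OPA's Data API could look like. This is not kube-mgmt's actual code; the function name, data path, and patch payload are hypothetical, but OPA's /v1/data endpoint does accept RFC 6902 JSON Patch documents.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// patchOPA sends a single JSON Patch to OPA's Data API instead of a
// DELETE followed by a series of PUTs. baseURL, path, and the patch
// document are illustrative placeholders.
func patchOPA(baseURL, path string, patch []byte) error {
	req, err := http.NewRequest(http.MethodPatch, baseURL+"/v1/data/"+path, bytes.NewReader(patch))
	if err != nil {
		return err
	}
	// OPA's Data API expects RFC 6902 JSON Patch with this content type.
	req.Header.Set("Content-Type", "application/json-patch+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// OPA responds with 204 No Content on a successful patch.
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical example: "add" acts as add-or-replace on object members,
	// giving upsert semantics in a single request.
	patch := []byte(`[{"op": "add", "path": "/example-configmap", "value": {"key": "value"}}]`)
	if err := patchOPA("http://localhost:8181", "kubernetes/configmaps", patch); err != nil {
		fmt.Println("patch failed:", err)
	}
}
```

A single PATCH would also avoid the window during which the data is absent between the DELETE and the subsequent PUTs.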
Sure. I cannot share the entire configmap, only the error that I get (I changed the filename in the error message).

Version info:

Flags on kube-mgmt:

Regarding resource utilization, I don't have any requests or limits specified on my deployment. I should have, but from what I see the pods don't seem to be starving for resources:

NAME CPU(cores) MEMORY(bytes)

What would be the recommended values in your opinion?

Regarding the code you mention (https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147), I observed the same when looking at it from a performance point of view; it does seem like a good idea to do a PATCH as opposed to a DELETE followed by PUTs.
@muzcategui1106-gs can you confirm whether the openpolicyagent.org/policy label is set on that configmap? I have a test deployment that is emitting the "Unable to decode..." error log, but the configmaps are not being loaded incorrectly:
Logs:
The openpolicyagent.org/policy label is not set, only the annotation.
@muzcategui1106-gs we've cut v0.10, which contains #50. This should reduce CPU usage during resync. The configmap annotation problem is still unclear and I've not been able to reproduce it; any hints you can provide would be helpful. Finally, @patrick-east has looked into the decode errors and has some thoughts.
@tsandall this is awesome; hopefully this will reduce CPU usage and speed up resource sync. Will give it a try. Regarding the configmap sync, let me see what I can dig up in the next couple of days :) In the meantime I will start running 0.10 in my dev environment.
Right, so looking into it, these types of errors occur if the scheme wasn't registered with our client (which happens over here: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/configmap/configmap.go#L104-L112). What isn't clear is which object came over the wire that the client was unable to decode. The current theory is that some object version/schema changed while kube-mgmt was running, or that the client code we're using is out of date or has a bug and is missing part of the schema for some object. As of the last chat with @tsandall, it seems this particular issue isn't likely a big deal, as kube-mgmt will just continue processing the objects we care about.
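As a rough illustration of the scheme registration being described (not the exact kube-mgmt code), here is a minimal client-go/apimachinery sketch; any kind that is not added to the scheme cannot be decoded from the watch stream:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// newConfigMapScheme registers only the ConfigMap kinds. Objects of any
// other kind arriving over the wire would fail to decode, surfacing as
// "Unable to decode..." style errors.
func newConfigMapScheme() *runtime.Scheme {
	scheme := runtime.NewScheme()
	gv := schema.GroupVersion{Group: "", Version: "v1"}
	scheme.AddKnownTypes(gv, &v1.ConfigMap{}, &v1.ConfigMapList{})
	return scheme
}

func main() {
	s := newConfigMapScheme()
	// The scheme recognizes ConfigMap...
	fmt.Println(s.Recognizes(schema.GroupVersionKind{Version: "v1", Kind: "ConfigMap"})) // true
	// ...but not, say, Pod, which was never registered.
	fmt.Println(s.Recognizes(schema.GroupVersionKind{Version: "v1", Kind: "Pod"})) // false
}
```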
Having just installed OPA and kube-mgmt in our 3 dev clusters, I am also seeing this in the logs for each environment (131 "Unable to decode..." messages in the last hour). I also seem to be having some (possibly related?) issues syncing Kubernetes resources, leading to data seemingly being unavailable to some of the OPA instances, which in turn leads to authorization decisions being allowed or denied depending on which of the instances is consulted (the default, of course, set to deny). Let me know if I can provide any data to help you look into this.
Hi @anderseknert, please add more details, like: versions of OPA/kube-mgmt, the startup arguments, and the full error.
Hi @rtoma, and thanks for reaching out. Here are some of the details. Let me know if you need more.

OPA image:
kube-mgmt image:

All in all, the payload returned when hitting

Full error:
Hi, our kube-mgmt logging is full of these errors without functional impact, which is confirmed by:
So maybe focus on the "issues syncing kubernetes resources", which clearly has a functional impact. Maybe it is an idea to create a script that makes /v1/data calls on all pods every minute and compares the results? We've created a metrics collector (which I cannot share) for this purpose; a sketch of the idea follows below.

Another suggestion, since it seems you want to use OPA seriously: develop an OPA regression tester that periodically POSTs artificial AdmissionReview payloads (extracted from OPA's decision log) to the OPA webhook endpoint and matches the results against the expected results. That way you can verify OPA and its policies are behaving as you expect.
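A minimal sketch of the cross-pod comparison idea. The pod addresses, port, and byte-for-byte comparison are assumptions for illustration; a real check would discover pod IPs via the Kubernetes Endpoints API and compare parsed JSON.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchData queries one OPA pod's full data document.
func fetchData(addr string) (string, error) {
	resp, err := http.Get("http://" + addr + ":8181/v1/data")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Hypothetical pod IPs; in practice these would be resolved from the
	// Kubernetes API for the OPA service.
	pods := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	var reference string
	for i, p := range pods {
		data, err := fetchData(p)
		if err != nil {
			fmt.Println("error querying", p, ":", err)
			continue
		}
		if i == 0 {
			reference = data
			continue
		}
		// Divergent responses suggest the pods' replicated data is out of
		// sync. A robust check would compare parsed JSON rather than raw
		// strings, since serialization order is not guaranteed.
		if data != reference {
			fmt.Println("pod", p, "returned different data than the first pod")
		}
	}
}
```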
Thanks @rtoma. Yeah, we are definitely looking at using OPA at a larger scale. So far I have some tests in place, with more to come, though currently they all point at an ingress controller, just like the actual authorization webhook we're looking at using OPA for as our first proof of concept. It's from these tests I've noticed different results when re-running them, even though nothing in the Kubernetes data actually changed during that time frame. Will see about running tests targeting the pods directly; thanks for the pointer. Even if I can verify the sync issue, I'm not sure how to proceed from there, though.
So my tests work something like this:
Sleeping 60 seconds between 1 and 2 seems to solve the problem I had with data getting out of sync, so this looks like a problem with my tests rather than with kube-mgmt. I will extend the test suite to also target individual pods, but for now this seems solved for me. The error message originally reported is still very much present and annoying, but it did not cause the sync issues.
Will anything be done about the "Unable to decode..." errors? Is it an issue with the Kubernetes version? I'm seeing it on both a 1.12 and a 1.15 cluster.
I took another look into the "Unable to decode..." errors; a fix should land in the next release. @muzcategui1106-gs, if you can test this out, that would be great.
@tsandall yes, when the release comes out I will test out the changes in our dev environment.
@muzcategui1106-gs great. There's a new openpolicyagent/kube-mgmt:0.11 image on Docker Hub that you can try out. Please let me know if this resolves the issue.
We've confirmed that v0.11 fixes the logging issue. Since the logging problem and the resource sync issue have been resolved, I'm going to close this for now. |
Unsure if this is an actual issue or if I should log separate issues for each of them.

I am seeing weird behavior in kube-mgmt after it has run for a while, e.g. more than 5 hours:

- kube-mgmt seems to stop honoring the --require-policy-label flag, as all the configmaps in the opa namespace get annotated with the openpolicyagent.org/policy-status annotation
- Resource syncs seem to crash periodically (on an hourly basis)

Unsure what the root cause may be, but here are some details about the versions being used:

kube-mgmt version: 0.9
OPA version: 0.12.2