
kube-mgmt throwing errors when pod has been running for quite some time #49

Closed
muzcategui1106-gs opened this issue Sep 4, 2019 · 18 comments


@muzcategui1106-gs

Unsure if this should be one issue or if I should log separate issues for each of the items below.

I am seeing some weird behavior in kube-mgmt after it has run for a while, e.g. more than 5 hours:

  1. These logs start appearing:

E0904 12:18:45.089926 1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"

  2. kube-mgmt seems to stop honoring --require-policy-label, as all the configmaps in the opa namespace get annotated with the openpolicyagent.org/policy-status annotation.

  3. Resource syncs seem to crash periodically (on an hourly basis):

time="2019-09-04T10:51:03Z" level=info msg="Sync channel for extensions/v1beta1/ingresses closed. Restarting immediately."
time="2019-09-04T10:51:03Z" level=info msg="Syncing extensions/v1beta1/ingresses."
time="2019-09-04T10:51:08Z" level=info msg="Listed extensions/v1beta1/ingresses and got 319 resources with resourceVersion 1188097528. Took 5.462164663s."
time="2019-09-04T10:52:05Z" level=info msg="Loaded extensions/v1beta1/ingresses resources into OPA. Took 56.650552306s. Starting watch at resourceVersion 1188097528."

time="2019-09-04T11:56:18Z" level=info msg="Sync channel for extensions/v1beta1/ingresses closed. Restarting immediately."
time="2019-09-04T11:56:18Z" level=info msg="Syncing extensions/v1beta1/ingresses."
time="2019-09-04T11:56:19Z" level=info msg="Listed extensions/v1beta1/ingresses and got 319 resources with resourceVersion 1188317650. Took 872.624828ms."
time="2019-09-04T11:57:43Z" level=info msg="Loaded extensions/v1beta1/ingresses resources into OPA. Took 1m24.203876182s. Starting watch at resourceVersion 1188317650."

time="2019-09-04T12:39:50Z" level=info msg="Sync channel for extensions/v1beta1/ingresses closed. Restarting immediately."
time="2019-09-04T12:39:50Z" level=info msg="Syncing extensions/v1beta1/ingresses."
time="2019-09-04T12:39:53Z" level=info msg="Listed extensions/v1beta1/ingresses and got 319 resources with resourceVersion 1188471126. Took 2.790341522s."
time="2019-09-04T12:41:36Z" level=info msg="Loaded extensions/v1beta1/ingresses resources into OPA. Took 1m43.350371162s. Starting watch at resourceVersion 1188471126."

Unsure what the root cause may be, but here are some details about the versions being used:

kube-mgmt version: 0.9
OPA version: 0.12.2

@tsandall
Member

tsandall commented Sep 19, 2019

@muzcategui1106-gs can you show an example of a configmap that's annotated with policy-status that should not be? From looking at the configmap replication code in kube-mgmt, it's not obvious how this could happen.

RE: The sync channel closure, that is to be expected with the Kubernetes client. However, the delay on upserting into OPA is NOT expected. I'll try to reproduce the problem.

One thing that might help would be to update the data replicator to use the PATCH method on OPA instead of DELETE followed by a series of PUTs. Here is the code in question: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147
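
For illustration only, a minimal sketch (not from the kube-mgmt codebase) of what a single PATCH against OPA's Data API could look like; patchOp, patchOPAData, opaURL and dataPath are placeholder names for this example:

package opasync

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// patchOp is one RFC 6902 JSON Patch operation, the format OPA's Data API
// accepts on PATCH /v1/data/{path} with content type application/json-patch+json.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// patchOPAData applies all operations in one round trip instead of a DELETE
// followed by a PUT per resource. opaURL would be e.g. "http://localhost:8181"
// and dataPath e.g. "kubernetes/ingresses" (both placeholders).
func patchOPAData(opaURL, dataPath string, ops []patchOp) error {
	body, err := json.Marshal(ops)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPatch,
		fmt.Sprintf("%s/v1/data/%s", opaURL, dataPath), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json-patch+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status from OPA: %s", resp.Status)
	}
	return nil
}

The replicator could then emit one "replace" (or "add"/"remove") operation per changed resource instead of dropping the whole subtree and re-PUTting every item.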

@muzcategui1106-gs
Author

muzcategui1106-gs commented Sep 23, 2019

Sure. I'm not allowed to show the entire configmap, only the error that I get (I changed the filename in the error message).

Version Info:
OPA: 0.14.0
Kube-mgmt: 0.9

Flags on kube-mgmt:
args:
  - "--replicate-cluster=v1/namespaces"
  - "--replicate-cluster=v1/persistentvolumes"
  - "--replicate=v1/persistentvolumeclaims"
  - "--replicate=v1/resourcequotas"
  - "--require-policy-label"

Error on configmap:
openpolicyagent.org/policy-status: '{"status":"error","error":{"code":"invalid_parameter","message":"error(s) occurred while compiling module(s)","errors":[{"code":"rego_parse_error","message":"no match found, expected: [ \\t\\r\\n] or [A-Za-z_]","location":{"file":"opa/example.conf","row":7,"col":11}}]}}'

Regarding resource utilization, I don't have any resource specification on my deployment. I should have, but from what I see the pods don't seem to be starving for resources:

NAME CPU(cores) MEMORY(bytes)
opa-5985f8cf78-kv45t 1571m 132Mi
opa-5985f8cf78-rbq4p 1m 151Mi
opa-5985f8cf78-rcrds 1369m 156Mi
policy-engine-6b6ffbb5f9-5kznf 914m 78Mi
policy-engine-6b6ffbb5f9-64l8d 503m 76Mi
policy-engine-6b6ffbb5f9-bhlht 329m 81Mi

What should be the recommended values in your opinion?

Regarding the code you mention (https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147): I observed the same when looking at it from a performance point of view; it does seem like a good idea to do a PATCH as opposed to a DELETE followed by PUTs.

@tsandall
Member

@muzcategui1106-gs can you confirm whether the openpolicyagent.org/policy label is set on that configmap? I have a test deployment that is emitting the "Unable to decode..." error log but the configmaps are not being loaded incorrectly:

$ kubectl -n torin-test get configmap -o yaml
apiVersion: v1
items:
- apiVersion: v1
  data:
    x.txt: |
      hello
  kind: ConfigMap
  metadata:
    creationTimestamp: "2019-09-19T15:26:12Z"
    name: notpolicy
    namespace: torin-test
    resourceVersion: "7431454"
    selfLink: /api/v1/namespaces/torin-test/configmaps/notpolicy
    uid: d01934e1-daf1-11e9-ae65-024400571c2a
- apiVersion: v1
  data:
    x.rego: |-
      package foo

      p = 1
  kind: ConfigMap
  metadata:
    annotations:
      openpolicyagent.org/policy-status: '{"status":"ok"}'
    creationTimestamp: "2019-09-19T15:26:06Z"
    labels:
      openpolicyagent.org/policy: rego
    name: policy
    namespace: torin-test
    resourceVersion: "7431761"
    selfLink: /api/v1/namespaces/torin-test/configmaps/policy
    uid: cc69ae52-daf1-11e9-ae65-024400571c2a
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Logs:

E0923 13:10:34.000910       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
E0923 13:32:19.022065       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
E0923 13:49:55.032696       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
time="2019-09-23T13:51:44Z" level=info msg="Sync channel for v1/pods closed. Restarting immediately." 
time="2019-09-23T13:51:44Z" level=info msg="Syncing v1/pods." 
time="2019-09-23T13:51:44Z" level=info msg="Listed v1/pods and got 633 resources with resourceVersion 9361168. Took 294.467604ms." 
time="2019-09-23T13:51:45Z" level=info msg="Loaded v1/pods resources into OPA. Took 524.414201ms. Starting watch at resourceVersion 9361168." 

@muzcategui1106-gs
Author

The openpolicyagent.org/policy label is not set, only the annotation.

@tsandall
Member

tsandall commented Oct 1, 2019

@muzcategui1106-gs we've cut v0.10 which contains #50. This should reduce CPU usage during resync. As of yet, the configmap annotation problem is still unclear and I've not been able to reproduce it. Any hints you can provide would be helpful. Finally, @patrick-east has looked into the decode errors and has some thoughts.

@muzcategui1106-gs
Author

@tsandall this is awesome; hopefully it will reduce CPU usage and speed up resource syncs. Will give it a try.

Regarding the configmap sync, let me see what I can dig up in the next couple of days :)

In the meantime I will start running 0.10 in my dev environment.

@patrick-east
Contributor

@patrick-east has looked into the decode errors and has some thoughts.

Right, so looking into it, these types of errors occur if the scheme wasn't registered with our client (which happens over here: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/configmap/configmap.go#L104-L112). What isn't clear is which object came over the wire that it was unable to decode.

The current theory is that some object version/schema changed while kube-mgmt was running, or the client code we're using is out of date or has a bug in it and is missing part of the schema for some object.

As of the last chat with @tsandall it seems like this particular issue isn't likely a big deal as kube-mgmt will just continue processing the objects we care about.
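
To make the scheme-registration point above concrete, here is a hedged sketch of the general client-go mechanism (an illustration only, not necessarily the exact code in kube-mgmt or the eventual fix): registering the watched kinds with a runtime.Scheme covers normal events, while metav1.AddToGroupVersion additionally registers meta kinds such as Status.

package configmapwatch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// newConfigMapScheme builds a scheme able to decode the objects a configmap
// watch can deliver. ConfigMap/ConfigMapList cover ADDED/MODIFIED/DELETED
// events, but the API server can also send a metav1.Status object (for
// example in an ERROR event when a watch expires).
func newConfigMapScheme() *runtime.Scheme {
	scheme := runtime.NewScheme()
	gv := schema.GroupVersion{Version: "v1"}
	scheme.AddKnownTypes(gv, &corev1.ConfigMap{}, &corev1.ConfigMapList{})
	// Without the next line, decoding such an event fails with:
	//   no kind "Status" is registered for version "v1" in scheme ...
	metav1.AddToGroupVersion(scheme, gv)
	return scheme
}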

@anderseknert
Member

Having just installed OPA and kube-mgmt in our 3 dev clusters, I am also seeing this in the logs for each environment (131 "Unable to decode..." messages in the last hour).

I also seem to be having some (possibly related?) issues syncing Kubernetes resources, leading to data seemingly being unavailable to some of the OPA instances, which in turn leads to authorization decisions being allowed or denied depending on which instance is consulted (the default, of course, being deny).

Let me know if I can provide any data to help you look into this.

@rtoma

rtoma commented Oct 30, 2019

Hi @anderseknert, please add more details, like the versions of OPA/kube-mgmt, the startup arguments, and the full error.

@anderseknert
Member

Hi @rtoma, and thanks for reaching out. Here are some of the details. Let me know if you need more.

OPA image: docker.io/openpolicyagent/opa:0.14.2
Startup arguments: run --server --authentication=token --authorization=basic --ignore=.* /policies/authz.rego

kube-mgmt image: openpolicyagent/kube-mgmt:0.10
Startup arguments: --opa-auth-token-file=/policies/token --require-policy-label=true --replicate=v1/services --replicate=v1/pods --replicate=v1/configmaps --replicate=v1/persistentvolumeclaims --replicate=apps/v1/deployments --replicate=apps/v1/statefulsets --replicate=autoscaling/v1/horizontalpodautoscalers --replicate=batch/v1/jobs --replicate=batch/v1beta1/cronjobs --replicate=extensions/v1beta1/ingresses --replicate=extensions/v1beta1/replicasets --replicate=networking.k8s.io/v1/networkpolicies

All in all, the payload returned when hitting /v1/data/ is about 25 MB of JSON, so it's definitely syncing; it just seems to have intermittent problems on some of the instances.

Full error:
E1030 18:02:45.864525 1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"

@rtoma

rtoma commented Oct 30, 2019

Hi, our kube-mgmt logging is full of these errors without any functional impact, which is confirmed by:

As of the last chat with @tsandall it seems like this particular issue isn't likely a big deal as kube-mgmt will just continue processing the objects we care about.

So maybe focus on the 'issues syncing kubernetes resources', which clearly has a functional impact. Maybe it is an idea to create a script that makes /v1/data calls on all pods every minute and compares the results? We've created a metrics collector (which I cannot share) for this purpose. Another suggestion, since it seems you want to use OPA seriously: develop an OPA regression tester that periodically POSTs artificial AdmissionReview payloads (extracted from OPA's decision log) to the OPA webhook endpoint and matches the results against the expected ones. That way you can verify OPA and its policies are behaving as you expect.
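
Not rtoma's collector (which isn't shared), but a rough sketch of the first suggestion, assuming the OPA pods are reachable directly (e.g. via pod IPs or port-forwards on port 8181) and that a bearer token is used as in the setup above; the endpoint list is a placeholder:

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"net/http"
	"time"
)

// compareOPAData fetches /v1/data from every OPA endpoint and prints a hash
// of each response so that a replica whose data has drifted stands out.
// Differences can still be legitimate if resources changed between the calls.
func compareOPAData(endpoints []string, token string) {
	client := &http.Client{Timeout: 10 * time.Second}
	for _, ep := range endpoints {
		req, err := http.NewRequest(http.MethodGet, ep+"/v1/data", nil)
		if err != nil {
			fmt.Printf("%s: %v\n", ep, err)
			continue
		}
		if token != "" {
			req.Header.Set("Authorization", "Bearer "+token)
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("%s: %v\n", ep, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: status=%d sha256=%x\n", ep, resp.StatusCode, sha256.Sum256(body))
	}
}

func main() {
	// Placeholder addresses; in practice these would come from the Endpoints
	// object of the OPA service.
	compareOPAData([]string{"http://10.0.0.11:8181", "http://10.0.0.12:8181"}, "")
}

Running something like this every minute (or wiring the hashes into a metrics collector, as rtoma describes) makes it easy to spot an instance that is missing data.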

@anderseknert
Member

Thanks @rtoma. Yeah, we are definitely looking at using OPA at a larger scale. So far I have some tests in place, and more are definitely to come, though currently they all go through an ingress controller, just like the actual authorization webhook that we're looking at using OPA for as our first proof of concept. It's from these tests that I've noticed different results when re-running them, even though nothing in the Kubernetes data actually changed during that time frame. I will see about running tests targeting the pods directly, thanks for the pointer. Even if I can verify the sync issue, I'm not sure how to proceed from there though.

@anderseknert
Member

So my tests work something like this:

  1. In the setup, create and deploy a Kubernetes resource.
  2. Verify that I can run policy queries targeting OPA with the aforementioned resource as data.
  3. In the teardown, delete the Kubernetes resource.

Sleeping 60 seconds between 1 and 2 seems to solve the problem I had with data getting out of sync, so this seems like a problem with my tests rather than with kube-mgmt. Will extend the test suite to also target individual pods, but for now this seems solved for me.

The error message originally reported is still very much present and annoying, but it did not cause sync issues.

@patoarvizu

Will anything be done about the no kind "Status" is registered for version "v1" error? It's kind of confusing and can make debugging harder.

Is it an issue with the Kubernetes version? I'm seeing it on both a 1.12 and a 1.15 cluster.

@tsandall
Member

tsandall commented Jan 16, 2020

I took another look into the no kind "Status" is registered error message and found (what I think is) a solution. I've posted a PR with the one-line fix: #58. I'll merge this fix later today and cut a release.

@muzcategui1106-gs if you can test this out, that would be great.

@muzcategui1106-gs
Author

@tsandall yes, when the release comes out I will test out the changes in our dev environment.

@tsandall
Member

tsandall commented Jan 16, 2020 via email

@tsandall
Member

We've confirmed that v0.11 fixes the logging issue. Since the logging problem and the resource sync issue have been resolved, I'm going to close this for now.
