
kube-mgmt throwing errors when pod has been running for quite some time #49

Closed
muzcategui1106-gs opened this issue Sep 4, 2019 · 18 comments


@muzcategui1106-gs

Unsure if this should be one issue or if I should log separate issues for each of the items below.

I am seeing some weird behavior in kube-mgmt after it has run for a while, e.g. more than 5 hours:

  1. These logs start appearing:

E0904 12:18:45.089926 1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"

  2. kube-mgmt seems to stop honoring --require-policy-label, as all the configmaps in the opa namespace get annotated with the openpolicyagent.org/policy-status annotation.

  3. Resource syncs seem to crash periodically (on an hourly basis):

time="2019-09-04T10:51:03Z" level=info msg="Sync channel for extensions/v1beta1/ingresses closed. Restarting immediately."
time="2019-09-04T10:51:03Z" level=info msg="Syncing extensions/v1beta1/ingresses."
time="2019-09-04T10:51:08Z" level=info msg="Listed extensions/v1beta1/ingresses and got 319 resources with resourceVersion 1188097528. Took 5.462164663s."
time="2019-09-04T10:52:05Z" level=info msg="Loaded extensions/v1beta1/ingresses resources into OPA. Took 56.650552306s. Starting watch at resourceVersion 1188097528."

time="2019-09-04T11:56:18Z" level=info msg="Sync channel for extensions/v1beta1/ingresses closed. Restarting immediately."
time="2019-09-04T11:56:18Z" level=info msg="Syncing extensions/v1beta1/ingresses."
time="2019-09-04T11:56:19Z" level=info msg="Listed extensions/v1beta1/ingresses and got 319 resources with resourceVersion 1188317650. Took 872.624828ms."
time="2019-09-04T11:57:43Z" level=info msg="Loaded extensions/v1beta1/ingresses resources into OPA. Took 1m24.203876182s. Starting watch at resourceVersion 1188317650."

time="2019-09-04T12:39:50Z" level=info msg="Sync channel for extensions/v1beta1/ingresses closed. Restarting immediately."
time="2019-09-04T12:39:50Z" level=info msg="Syncing extensions/v1beta1/ingresses."
time="2019-09-04T12:39:53Z" level=info msg="Listed extensions/v1beta1/ingresses and got 319 resources with resourceVersion 1188471126. Took 2.790341522s."
time="2019-09-04T12:41:36Z" level=info msg="Loaded extensions/v1beta1/ingresses resources into OPA. Took 1m43.350371162s. Starting watch at resourceVersion 1188471126."

Unsure what the root cause may be, but here are some details about the versions being used:

kube-mgmt version: 0.9
OPA version: 0.12.2

@tsandall
Member

tsandall commented Sep 19, 2019

@muzcategui1106-gs can you show an example of a configmap that's annotated with policy-status that should not be? From looking at the configmap replication code in kube-mgmt, it's not obvious how this could happen.

RE: The sync channel closure, that is to be expected with the Kubernetes client. However, the delay on upserting into OPA is NOT expected. I'll try to reproduce the problem.

One thing that might help would be to update the data replicator to use the PATCH method on OPA instead of DELETE followed by a series of PUTs. Here is the code in question: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147
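
For illustration only, a minimal sketch (not from the kube-mgmt codebase) of what a single PATCH against OPA's Data API could look like; patchOp, patchOPAData, opaURL and dataPath are placeholder names for this example:

package opasync

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// patchOp is one RFC 6902 JSON Patch operation, the format OPA's Data API
// accepts on PATCH /v1/data/{path} with content type application/json-patch+json.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

// patchOPAData applies all operations in one round trip instead of a DELETE
// followed by a PUT per resource. opaURL would be e.g. "http://localhost:8181"
// and dataPath e.g. "kubernetes/ingresses" (both placeholders).
func patchOPAData(opaURL, dataPath string, ops []patchOp) error {
	body, err := json.Marshal(ops)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPatch,
		fmt.Sprintf("%s/v1/data/%s", opaURL, dataPath), bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json-patch+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
		return fmt.Errorf("unexpected status from OPA: %s", resp.Status)
	}
	return nil
}

The replicator could then emit one "replace" (or "add"/"remove") operation per changed resource instead of dropping the whole subtree and re-PUTting every item.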

@muzcategui1106-gs
Author

muzcategui1106-gs commented Sep 23, 2019

Sure. I'm not allowed to show the entire configmap, only the error that I get (I changed the filename in the error message).

Version Info:
OPA: 0.14.0
Kube-mgmt: 0.9

Flags on kube-mgmt:
args:
  - "--replicate-cluster=v1/namespaces"
  - "--replicate-cluster=v1/persistentvolumes"
  - "--replicate=v1/persistentvolumeclaims"
  - "--replicate=v1/resourcequotas"
  - "--require-policy-label"

Error on configmap:
openpolicyagent.org/policy-status: '{"status":"error","error":{"code":"invalid_parameter","message":"error(s) occurred while compiling module(s)","errors":[{"code":"rego_parse_error","message":"no match found, expected: [ \\t\\r\\n] or [A-Za-z_]","location":{"file":"opa/example.conf","row":7,"col":11}}]}}'

Regarding resource utilization, I don't have any resource specification on my deployment. I should have, but from what I see the pods don't seem to be starving for resources:

NAME CPU(cores) MEMORY(bytes)
opa-5985f8cf78-kv45t 1571m 132Mi
opa-5985f8cf78-rbq4p 1m 151Mi
opa-5985f8cf78-rcrds 1369m 156Mi
policy-engine-6b6ffbb5f9-5kznf 914m 78Mi
policy-engine-6b6ffbb5f9-64l8d 503m 76Mi
policy-engine-6b6ffbb5f9-bhlht 329m 81Mi

What should be the recommended values in your opinion?

Regarding the code you mention (https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/data/generic.go#L147): I observed the same when looking at it from a performance point of view; it does seem like a good idea to do a PATCH as opposed to a DELETE followed by PUTs.

@tsandall
Member

@muzcategui1106-gs can you confirm whether the openpolicyagent.org/policy label is set on that configmap? I have a test deployment that is emitting the "Unable to decode..." error log but the configmaps are not being loaded incorrectly:

$ kubectl -n torin-test get configmap -o yaml
apiVersion: v1
items:
- apiVersion: v1
  data:
    x.txt: |
      hello
  kind: ConfigMap
  metadata:
    creationTimestamp: "2019-09-19T15:26:12Z"
    name: notpolicy
    namespace: torin-test
    resourceVersion: "7431454"
    selfLink: /api/v1/namespaces/torin-test/configmaps/notpolicy
    uid: d01934e1-daf1-11e9-ae65-024400571c2a
- apiVersion: v1
  data:
    x.rego: |-
      package foo

      p = 1
  kind: ConfigMap
  metadata:
    annotations:
      openpolicyagent.org/policy-status: '{"status":"ok"}'
    creationTimestamp: "2019-09-19T15:26:06Z"
    labels:
      openpolicyagent.org/policy: rego
    name: policy
    namespace: torin-test
    resourceVersion: "7431761"
    selfLink: /api/v1/namespaces/torin-test/configmaps/policy
    uid: cc69ae52-daf1-11e9-ae65-024400571c2a
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

Logs:

E0923 13:10:34.000910       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
E0923 13:32:19.022065       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
E0923 13:49:55.032696       1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"
time="2019-09-23T13:51:44Z" level=info msg="Sync channel for v1/pods closed. Restarting immediately." 
time="2019-09-23T13:51:44Z" level=info msg="Syncing v1/pods." 
time="2019-09-23T13:51:44Z" level=info msg="Listed v1/pods and got 633 resources with resourceVersion 9361168. Took 294.467604ms." 
time="2019-09-23T13:51:45Z" level=info msg="Loaded v1/pods resources into OPA. Took 524.414201ms. Starting watch at resourceVersion 9361168." 

@muzcategui1106-gs
Author

The openpolicyagent.org/policy label is not set, only the annotation.

@tsandall
Member

tsandall commented Oct 1, 2019

@muzcategui1106-gs we've cut v0.10 which contains #50. This should reduce CPU usage during resync. As of yet, the configmap annotation problem is still unclear and I've not been able to reproduce it. Any hints you can provide would be helpful. Finally, @patrick-east has looked into the decode errors and has some thoughts.

@muzcategui1106-gs
Author

@tsandall this is awesome; hopefully it will reduce CPU usage and speed up resource syncs. Will give it a try.

Regarding the configmap sync, let me see what I can dig up in the next couple of days :)

In the meantime I will start running 0.10 in my dev environment.

@patrick-east
Contributor

@patrick-east has looked into the decode errors and has some thoughts.

Right, so looking into it, these types of errors occur if the scheme wasn't registered with our client (which happens over here: https://github.com/open-policy-agent/kube-mgmt/blob/master/pkg/configmap/configmap.go#L104-L112). What isn't clear is which object came over the wire that it was unable to decode.

The current theory is that some object version/schema changed while kube-mgmt was running, or the client code we're using is out of date or has a bug in it and is missing part of the schema for some object.

As of the last chat with @tsandall it seems like this particular issue isn't likely a big deal as kube-mgmt will just continue processing the objects we care about.
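
To make the scheme-registration point above concrete, here is a hedged sketch of the general client-go mechanism (an illustration only, not necessarily the exact code in kube-mgmt or the eventual fix): registering the watched kinds with a runtime.Scheme covers normal events, while metav1.AddToGroupVersion additionally registers meta kinds such as Status.

package configmapwatch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// newConfigMapScheme builds a scheme able to decode the objects a configmap
// watch can deliver. ConfigMap/ConfigMapList cover ADDED/MODIFIED/DELETED
// events, but the API server can also send a metav1.Status object (for
// example in an ERROR event when a watch expires).
func newConfigMapScheme() *runtime.Scheme {
	scheme := runtime.NewScheme()
	gv := schema.GroupVersion{Version: "v1"}
	scheme.AddKnownTypes(gv, &corev1.ConfigMap{}, &corev1.ConfigMapList{})
	// Without the next line, decoding such an event fails with:
	//   no kind "Status" is registered for version "v1" in scheme ...
	metav1.AddToGroupVersion(scheme, gv)
	return scheme
}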

@anderseknert
Member

Having just installed OPA and kube-mgmt in our 3 dev clusters, I am also seeing this in the logs for each environment (131 "Unable to decode..." messages in the last hour).

I also seem to be having some (possibly related?) issues syncing Kubernetes resources, leading to data seemingly being unavailable to some of the OPA instances, which in turn leads to authorization decisions being allowed or denied depending on which instance is consulted (the default, of course, being deny).

Let me know if I can provide any data to help you look into this.

@rtoma

rtoma commented Oct 30, 2019

Hi @anderseknert, please add more details, like the versions of OPA/kube-mgmt, the startup arguments, and the full error.

@anderseknert
Member

Hi @rtoma, and thanks for reaching out. Here are some of the details. Let me know if you need more.

OPA image: docker.io/openpolicyagent/opa:0.14.2
Startup arguments: run --server --authentication=token --authorization=basic --ignore=.* /policies/authz.rego

kube-mgmt image: openpolicyagent/kube-mgmt:0.10
Startup arguments: --opa-auth-token-file=/policies/token --require-policy-label=true --replicate=v1/services --replicate=v1/pods --replicate=v1/configmaps --replicate=v1/persistentvolumeclaims --replicate=apps/v1/deployments --replicate=apps/v1/statefulsets --replicate=autoscaling/v1/horizontalpodautoscalers --replicate=batch/v1/jobs --replicate=batch/v1beta1/cronjobs --replicate=extensions/v1beta1/ingresses --replicate=extensions/v1beta1/replicasets --replicate=networking.k8s.io/v1/networkpolicies

All in all, the payload returned when hitting /v1/data/ is about 25 MB of JSON, so it's definitely syncing; it just seems to have intermittent problems on some of the instances.

Full error:
E1030 18:02:45.864525 1 streamwatcher.go:109] Unable to decode an event from the watch stream: unable to decode watch event: no kind "Status" is registered for version "v1" in scheme "github.com/open-policy-agent/kube-mgmt/pkg/configmap/configmap.go:102"

@rtoma

rtoma commented Oct 30, 2019

Hi, our kube-mgmt logging is full of these errors without any functional impact, which is confirmed by:

As of the last chat with @tsandall it seems like this particular issue isn't likely a big deal as kube-mgmt will just continue processing the objects we care about.

So maybe focus on the 'issues syncing kubernetes resources', which clearly has a functional impact. Maybe it is an idea to create a script that makes /v1/data calls on all pods every minute and compares the results? We've created a metrics collector (which I cannot share) for this purpose. Another suggestion, since it seems you want to use OPA seriously: develop an OPA regression tester that periodically POSTs artificial AdmissionReview payloads (extracted from OPA's decision log) to the OPA webhook endpoint and matches the results against the expected ones. That way you can verify OPA and its policies are behaving as you expect.
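
Not rtoma's collector (which isn't shared), but a rough sketch of the first suggestion, assuming the OPA pods are reachable directly (e.g. via pod IPs or port-forwards on port 8181) and that a bearer token is used as in the setup above; the endpoint list is a placeholder:

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"net/http"
	"time"
)

// compareOPAData fetches /v1/data from every OPA endpoint and prints a hash
// of each response so that a replica whose data has drifted stands out.
// Differences can still be legitimate if resources changed between the calls.
func compareOPAData(endpoints []string, token string) {
	client := &http.Client{Timeout: 10 * time.Second}
	for _, ep := range endpoints {
		req, err := http.NewRequest(http.MethodGet, ep+"/v1/data", nil)
		if err != nil {
			fmt.Printf("%s: %v\n", ep, err)
			continue
		}
		if token != "" {
			req.Header.Set("Authorization", "Bearer "+token)
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Printf("%s: %v\n", ep, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: status=%d sha256=%x\n", ep, resp.StatusCode, sha256.Sum256(body))
	}
}

func main() {
	// Placeholder addresses; in practice these would come from the Endpoints
	// object of the OPA service.
	compareOPAData([]string{"http://10.0.0.11:8181", "http://10.0.0.12:8181"}, "")
}

Running something like this every minute (or wiring the hashes into a metrics collector, as rtoma describes) makes it easy to spot an instance that is missing data.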

@anderseknert
Member

Thanks @rtoma. Yeah, we are definitely looking at using OPA at a larger scale. So far I have some tests in place, and more are definitely to come, though currently they all go through an ingress controller, just like the actual authorization webhook that we're looking at using OPA for as our first proof of concept. It's from these tests that I've noticed different results when re-running them, even though nothing in the Kubernetes data actually changed during that time frame. I will see about running tests targeting the pods directly, thanks for the pointer. Even if I can verify the sync issue, I'm not sure how to proceed from there though.

@anderseknert
Member

So my tests work something like this:

  1. In the setup, create and deploy a Kubernetes resource.
  2. Verify that I can run policy queries targeting OPA with the aforementioned resource as data.
  3. In the teardown, delete the Kubernetes resource.

Sleeping 60 seconds between 1 and 2 seems to solve the problem I had with data getting out of sync, so this seems like a problem with my tests rather than with kube-mgmt. Will extend the test suite to also target individual pods, but for now this seems solved for me.

The error message originally reported is still very much present and annoying, but it did not cause sync issues.

@patoarvizu

Will anything be done about the no kind "Status" is registered for version "v1" error? It's kind of confusing and can make debugging harder.

Is it an issue with the Kubernetes version? I'm seeing it on both a 1.12 and a 1.15 cluster.

@tsandall
Member

tsandall commented Jan 16, 2020

I took another look into the no kind "Status" is registered error message and found (what I think is) a solution. I've posted a PR with the one-line fix: #58. I'll merge this fix later today and cut a release.

@muzcategui1106-gs if you can test this out, that would be great.

@muzcategui1106-gs
Author

@tsandall yes, when the release comes out I will test out the changes in our dev environment.

@tsandall
Member

tsandall commented Jan 16, 2020 via email

@tsandall
Member

We've confirmed that v0.11 fixes the logging issue. Since the logging problem and the resource sync issue have been resolved, I'm going to close this for now.
