This repository has been archived by the owner on Sep 14, 2020. It is now read-only.

Retry is not happening after exception #308

Closed
gtsystem opened this issue Feb 10, 2020 · 11 comments

Labels
bug Something isn't working

Comments

@gtsystem

Long story short

If an exception occurs inside a handler for volumeattachments objects, the handler never gets retried.

Looking at the logs, it seems that Kopf tries to patch the object's status with {"status": {"kopf": {"processing": ...}}}.
This patch call succeeds, but the object's status is not changed. I guess this is why no further attempts are made.
In theory the /status endpoint should be used to patch the status, but in my testing this has no effect either. It seems that custom fields are not accepted inside "status".

Possible fix: use an annotation to store kopf status?

Description

The code snippet to reproduce the issue

test_va.py

import kopf

class MyException(Exception):
    pass

@kopf.on.create('storage.k8s.io', 'v1', 'volumeattachments', retries=3, backoff=1.0)
def eventual_failure_with_tracebacks(**kwargs):
    raise Exception("An error that is supposed to be recoverable.")

va.yaml

apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-vaa
spec:
  attacher: xxx 
  nodeName: kind-worker
  source:
    persistentVolumeName: a-pv
status:
  attached: false
The exact command to reproduce the issue
kopf run test_va.py &
kubectl create -f va.yaml
The full output of the command that failed
[2020-02-10 10:29:19,209] kopf.reactor.activit [INFO    ] Initial authentication has been initiated.
[2020-02-10 10:29:19,232] kopf.activities.auth [INFO    ] Handler 'login_via_pykube' succeeded.
[2020-02-10 10:29:19,245] kopf.activities.auth [INFO    ] Handler 'login_via_client' succeeded.
[2020-02-10 10:29:19,245] kopf.reactor.activit [INFO    ] Initial authentication has finished.
[2020-02-10 10:29:19,276] kopf.engines.peering [WARNING ] Default peering object not found, falling back to the standalone mode.
[2020-02-10 10:29:45,516] kopf.objects         [ERROR   ] [None/test-vaa] Handler 'eventual_failure_with_tracebacks' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/src/_envl/lib64/python3.7/site-packages/kopf/reactor/handling.py", line 529, in _execute_handler
    lifecycle=lifecycle,  # just a default for the sub-handlers, not used directly.
  File "/src/_envl/lib64/python3.7/site-packages/kopf/reactor/handling.py", line 618, in _call_handler
    **kwargs,
  File "/src/_envl/lib64/python3.7/site-packages/kopf/reactor/invocation.py", line 130, in invoke
    result = await loop.run_in_executor(config.WorkersConfig.get_syn_executor(), real_fn)
  File "/usr/lib64/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "test_va.py", line 9, in eventual_failure_with_tracebacks
    raise Exception("An error that is supposed to be recoverable.")
Exception: An error that is supposed to be recoverable.

Environment

  • Kopf version: 0.25
  • Kubernetes version: 1.16.3
  • Python version: 3.7.3
  • OS/platform: linux
@gtsystem gtsystem added the bug Something isn't working label Feb 10, 2020
@nolar
Contributor

nolar commented Feb 13, 2020

Hello. Thanks for reporting.

Yes, indeed, Kopf relies on patching the object to cause a cascaded reaction with the next steps (maybe not the wisest way, but it was done this way originally, and I have had no reason to redesign it yet).

As far as I know, /status behaves slightly differently when it is declared as a sub-resource of a CRD.

Kopf does not handle this case at the moment and simply assumes that status is a part of the object, which can be controlled for custom resources, but not for built-in resources.

I think it would be easy to change the patching function to patch either the object or its /status sub-resource, based on the resource definition. Since recently, we fetch the resource's meta-information anyway (for the cluster-scoped/namespaced detection); it can be extended to identify when /status is a sub-resource too.
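
A minimal sketch of what that detection could look like, assuming a plain requests client, bearer-token auth, and cluster-scoped URLs for brevity (the function names are illustrative, not Kopf's internals):

import requests  # assumption: plain requests for brevity, not what Kopf uses internally

def has_status_subresource(api_server: str, group_version: str, plural: str, token: str) -> bool:
    # Fetch the APIResourceList for the group/version and look for "<plural>/status".
    resp = requests.get(
        f"{api_server}/apis/{group_version}",
        headers={"Authorization": f"Bearer {token}"},
        verify=False,  # same spirit as the curl -k calls below
    )
    resp.raise_for_status()
    names = {res["name"] for res in resp.json().get("resources", [])}
    return f"{plural}/status" in names

def patch_target_url(api_server: str, group_version: str, plural: str, name: str, token: str) -> str:
    # Patch /status when it is a separate sub-resource, otherwise patch the object itself.
    url = f"{api_server}/apis/{group_version}/{plural}/{name}"
    return url + "/status" if has_status_subresource(api_server, group_version, plural, token) else url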

@nolar
Contributor

nolar commented Feb 13, 2020

A note to self on how it is defined in a GKE cluster:

curl -i -k -H "Authorization: Bearer $token" https://$ip/apis/storage.k8s.io/v1

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "storage.k8s.io/v1",
  "resources": [
    {
      "name": "storageclasses",
      "singularName": "",
      "namespaced": false,
      "kind": "StorageClass",
      "verbs": [
        "create",
        "delete",
        "deletecollection",
        "get",
        "list",
        "patch",
        "update",
        "watch"
      ],
      "shortNames": [
        "sc"
      ]
    },
    {
      "name": "volumeattachments",
      "singularName": "",
      "namespaced": false,
      "kind": "VolumeAttachment",
      "verbs": [
        "create",
        "delete",
        "deletecollection",
        "get",
        "list",
        "patch",
        "update",
        "watch"
      ]
    },
    {
      "name": "volumeattachments/status",
      "singularName": "",
      "namespaced": false,
      "kind": "VolumeAttachment",
      "verbs": [
        "get",
        "patch",
        "update"
      ]
    }
  ]
}

So, volumeattachments/status is obviously there, separated from volumeattachments.

The same happens with CRDs when status is declared as a sub-resource:

$ curl -i -k -H "Authorization: Bearer $token" https://$ip/apis/zalando.org/v1

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "zalando.org/v1",
  "resources": [
    {
      "name": "kopfexamples222",
      "singularName": "kopfexample222",
      "namespaced": true,
      "kind": "KopfExample222",
      "verbs": [
        "delete",
        "deletecollection",
        "get",
        "list",
        "patch",
        "create",
        "update",
        "watch"
      ],
      "shortNames": [
        "kopfexes222",
        "kopfex222",
        "kexes222",
        "kex222"
      ]
    },
    {
      "name": "kopfexamples222/status",
      "singularName": "",
      "namespaced": true,
      "kind": "KopfExample222",
      "verbs": [
        "get",
        "patch",
        "update"
      ]
    },
    {
      "name": "kopfexamples",
      "singularName": "kopfexample",
      "namespaced": true,
      "kind": "KopfExample",
      "verbs": [
        "delete",
        "deletecollection",
        "get",
        "list",
        "patch",
        "create",
        "update",
        "watch"
      ],
      "shortNames": [
        "kopfexes",
        "kopfex",
        "kexes",
        "kex"
      ]
    }
  ]
}

A little experiment shows that it indeed reacts only when /status is patched, and ignores the changes when the object itself is patched with the status: field.

$ kubectl get kex222 -w
$ curl -i -k -X PATCH -H "Authorization: Bearer $token" https://$ip/apis/zalando.org/v1/namespaces/default/kopfexamples222/kopf-example-1 -d '{"status": {"create_fn": {"message":400}}}' -H "Content-Type: application/merge-patch+json"
$ curl -i -k -X PATCH -H "Authorization: Bearer $token" https://$ip/apis/zalando.org/v1/namespaces/default/kopfexamples222/kopf-example-1/status -d '{"status": {"create_fn": {"message":300}}}' -H "Content-Type: application/merge-patch+json"

@nolar
Contributor

nolar commented Feb 13, 2020

Related: #119. Should be fixed with #313.

@nolar nolar self-assigned this Feb 13, 2020
@gtsystem
Author

As I mentioned, it seems that custom fields added to /status are simply ignored:

curl -i -k -X PATCH -H "Content-Type: application/merge-patch+json" -d '{"status": {"create_fn":{"message":400}}}' "http://127.0.0.1:8001/apis/storage.k8s.io/v1/volumeattachments/myva/status"
HTTP/1.1 200 OK
...

{
  "kind": "VolumeAttachment",
  "apiVersion": "storage.k8s.io/v1",
  "metadata": {
    "name": "myva",
    ...
  },
  "spec": {
    ...
  },
  "status": {
    "attached": true
  }
}

@nolar
Contributor

nolar commented Feb 14, 2020

Ouch. That is highly unfortunate. Sorry, I missed that statement.

I couldn't quickly figure out how to create VolumeAttachments in GKE even with a pod+pvc pair — so I could not test it on this resource specifically (yesterday evening). Well, at least, it now works with other resources with status-as-a-subresource (#119).

What I can suggest is a per-resource-kind or per-operator config for where the internal status is stored: .status.kopf by default, optionally configurable to .kopf or any other root field. Can you please check whether that works with VolumeAttachments? Or are they so strict that even arbitrary root fields are not allowed?

Actually, there is a similar task, #23 (and #283), but there the main use-case is multiple operators handling the same resources without colliding with each other. It can be extended to an arbitrary root field name too.

Storing it in an arbitrary top-level field would be the easiest way. Otherwise, annotations can indeed be used; but labels and annotations are allowed to contain strings only, so serialisation/deserialisation would be needed.
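
As a tiny illustration of that string-only constraint (the annotation key and its fields are hypothetical, just to show the round-trip):

import json

# Handler progress as structured data...
state = {"started": "2020-02-14T16:58:25.396364", "retries": 1, "success": True}

# ...has to be serialised into a plain string before it can live in metadata.annotations...
annotations = {"my-op.zalando.org/create_fn_1": json.dumps(state)}

# ...and parsed back on every read.
restored = json.loads(annotations["my-op.zalando.org/create_fn_1"])
assert restored["retries"] == 1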

@gtsystem
Author

I can reproduce this with other built-in resources like pods or PVCs. I just tried patching a top-level field, again with no luck. Maybe newer versions of Kubernetes are getting more strict.

@nolar
Contributor

nolar commented Feb 14, 2020

Indeed. I have now also reproduced this with pods in Minikube on 1.16. Neither the status nor the root accepts arbitrary fields. The only place that accepts arbitrary content is annotations.

So far, so good: let's abuse annotations!

I have made a little dirty-hacking draft that stores the operator's state in annotations, in two formats: per handler and per handler-field. After the handling cycle, these annotations are wiped (as was previously done with the status fields).

Per handler:

apiVersion: zalando.org/v1
kind: KopfExample
metadata:
  annotations:
    my-op.zalando.org/create_fn_1: '{"started": "2020-02-14T16:58:25.396364", "stopped":
      "2020-02-14T16:58:25.401844", "delayed": null, "retries": 1, "success": true,
      "failure": false, "message": null}'
    my-op.zalando.org/create_fn_2: '{"started": "2020-02-14T16:58:25.396421", "stopped":
      null, "delayed": null, "retries": 0, "success": false, "failure": false, "message":
      null}'
  creationTimestamp: "2020-02-14T16:58:25Z"
  finalizers:
  - kopf.zalando.org/KopfFinalizerMarker
  generation: 1

Per handler-field:

apiVersion: zalando.org/v1
kind: KopfExample
metadata:
  annotations:
    my-op.zalando.org/create_fn_1.delayed: "null"
    my-op.zalando.org/create_fn_1.failure: "false"
    my-op.zalando.org/create_fn_1.message: "null"
    my-op.zalando.org/create_fn_1.retries: "1"
    my-op.zalando.org/create_fn_1.started: '"2020-02-14T17:10:42.713227"'
    my-op.zalando.org/create_fn_1.stopped: '"2020-02-14T17:10:42.718587"'
    my-op.zalando.org/create_fn_1.success: "true"
    my-op.zalando.org/create_fn_2.delayed: "null"
    my-op.zalando.org/create_fn_2.failure: "false"
    my-op.zalando.org/create_fn_2.message: "null"
    my-op.zalando.org/create_fn_2.retries: "0"
    my-op.zalando.org/create_fn_2.started: '"2020-02-14T17:10:42.713403"'
    my-op.zalando.org/create_fn_2.stopped: "null"
    my-op.zalando.org/create_fn_2.success: "false"
  creationTimestamp: "2020-02-14T17:10:42Z"
  finalizers:
  - kopf.zalando.org/KopfFinalizerMarker
  generation: 1

Per handler-field, if serialised a little bit smarter:

apiVersion: zalando.org/v1
kind: KopfExample
metadata:
  annotations:
    my-op.zalando.org/create_fn_1.retries: "1"
    my-op.zalando.org/create_fn_1.started: "2020-02-14T17:10:42.713227"
    my-op.zalando.org/create_fn_1.stopped: "2020-02-14T17:10:42.718587"
    my-op.zalando.org/create_fn_1.success: "true"
    my-op.zalando.org/create_fn_2.retries: "0"
    my-op.zalando.org/create_fn_2.started: "2020-02-14T17:10:42.713403"
  creationTimestamp: "2020-02-14T17:10:42Z"
  finalizers:
  - kopf.zalando.org/KopfFinalizerMarker
  generation: 1

Works like a charm, both for CRs and pods.

In both cases, and as per the docs, the generation is not increased when metadata is changed (this happens only for spec). But it generates the watch-events — which makes sense since most of the printed columns of kubectl get -w are taken from status.

Besides this, such an approach allows customising the operator name: instead of the hardcoded kopf.zalando.org, developers can use my-op.companyname.com in the annotations.

The first format is compact. The second format is quite verbose, but it allows the individual fields to be used as printable columns or as filters.

Another approach would be to put the whole state into one annotation, but that is complicated by the merging of the existing state with its updates: unlike REST merge-patching, where different fields are merged into a dict server-side, a single field is always fully overwritten and requires client-side merging (see the sketch below).

I will try to do some magic to make this nice and simple. This can be considered the general direction for built-in resources in the future. For custom resources, I prefer to keep status.kopf by default instead of annotations, but let the developers configure that.
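
For clarity, a rough sketch of the client-side read-modify-write that the single-annotation variant would require (the annotation key and the patch shape are illustrative, not the actual implementation):

import json

def merge_state_annotation(current_annotations: dict, handler_id: str, update: dict) -> dict:
    # Read the whole state blob and merge the update for one handler client-side...
    state = json.loads(current_annotations.get("my-op.zalando.org/state", "{}"))
    state.setdefault(handler_id, {}).update(update)
    # ...then write the entire blob back as a single string value: the server cannot merge inside it.
    return {"metadata": {"annotations": {"my-op.zalando.org/state": json.dumps(state)}}}

patch = merge_state_annotation({}, "create_fn_1", {"retries": 1, "success": True})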

@gtsystem
Author

One annotation per handler already looks good to me 👍

@gtsystem
Author

Any update?

@nolar
Contributor

nolar commented Mar 24, 2020

Sorry, I was a bit busy with this coronavirus aftermath at work recently (well, not "a bit", actually). Nevertheless, I finally have some spare time now, and some of the framework's tasks are shifted from my private time to my work time — so, let's see how much faster it will go now and in the next few days.

PS: For custom resources, there is a workaround described in #321 (with x-kubernetes-preserve-unknown-fields: true). It does not work for built-in resources, of course (there is no control over the schemas).

@gtsystem
Author

Tested kopf==0.27rc5 (which includes #331) and now it works as expected. Thanks!
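
For later readers, a minimal sketch of switching to the annotation-based storage via the 0.27 settings API, as far as I understand it (treat the exact class and attribute names as an assumption and double-check the Kopf docs):

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Keep the handling progress in annotations instead of .status.kopf,
    # under a custom operator prefix (assumed kopf>=0.27 API).
    settings.persistence.progress_storage = kopf.AnnotationsProgressStorage(prefix='my-op.example.com')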
