Resume events keep firing #223

kopf-archiver · 2020-08-18T20:01:08Z

An issue by pshchelo at 2019-11-11 19:08:19+00:00
Original URL: zalando-incubator/kopf#223

Expected Behavior

resume event is handled only once on actual operator start

Actual Behavior

resume event keeps firing at regular intervals

Steps to Reproduce the Problem

Using kopf/master and the following bare-minimal operator handling standard kopfexample resource:

import kopf

@kopf.on.update("zalando.org", "v1", "kopfexamples")
@kopf.on.create("zalando.org", "v1", "kopfexamples")
async def handle_apply(**kwargs):
    ...

@kopf.on.resume("zalando.org", "v1", "kopfexamples")
async def handle_resume(logger, **kwargs):
    logger.info("Got resume event")

and creating a bare-minimal kopfexample:

apiVersion: zalando.org/v1
kind: KopfExample
metadata:
  name: kopf-example-1
  namespace: kopf
spec:
  field: value

and watching the logs I see that resume event is keeping firing at regular intervals (roughly once per minute) even as there was no 'initial' resources in the first place:

tox -eresume-master
resume-master installed: aiohttp==3.6.2,aiojobs==0.2.2,async-timeout==3.0.1,attrs==19.3.0,certifi==2019.9.11,chardet==3.0.4,Click==7.0,idna==2.8,iso8601==0.1.12,kopf==0.23.dev102+gbab7295,multidict==4.5.2,pykube-ng==19.10.0,PyYAML==5.1.2,requests==2.22.0,typing-extensions==3.7.4.1,urllib3==1.25.7,yarl==1.3.0
resume-master run-test-pre: PYTHONHASHSEED='980767318'
resume-master run-test: commands[0] | find /home/ubuntu/kopf-debug -type f -name '*.pyc' -delete
resume-master run-test: commands[1] | kopf run --standalone -n kopf /home/ubuntu/kopf-debug/resume.py
[2019-11-11 18:49:19,190] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'handle_apply' succeeded.
[2019-11-11 18:49:19,190] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for creation.
[2019-11-11 18:50:54,046] kopf.objects         [INFO    ] [kopf/kopf-example-1] Got resume event
[2019-11-11 18:50:54,049] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'handle_resume' succeeded.
[2019-11-11 18:50:54,049] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for resuming.
[2019-11-11 18:52:43,423] kopf.objects         [INFO    ] [kopf/kopf-example-1] Got resume event
[2019-11-11 18:52:43,425] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'handle_resume' succeeded.
[2019-11-11 18:52:43,426] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for resuming.
[2019-11-11 18:53:55,641] kopf.objects         [INFO    ] [kopf/kopf-example-1] Got resume event
[2019-11-11 18:53:55,643] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'handle_resume' succeeded.
[2019-11-11 18:53:55,643] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for resuming.
[2019-11-11 18:55:02,990] kopf.objects         [INFO    ] [kopf/kopf-example-1] Got resume event
[2019-11-11 18:55:02,993] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'handle_resume' succeeded.
[2019-11-11 18:55:02,993] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for resuming.
[2019-11-11 18:56:15,830] kopf.objects         [INFO    ] [kopf/kopf-example-1] Got resume event
[2019-11-11 18:56:15,832] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'handle_resume' succeeded.
[2019-11-11 18:56:15,833] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for resuming.
[2019-11-11 18:57:34,780] kopf.objects         [INFO    ] [kopf/kopf-example-1] Got resume event
[2019-11-11 18:57:34,783] kopf.objects         [INFO    ] [kopf/kopf-example-1] Handler 'handle_resume' succeeded.
[2019-11-11 18:57:34,783] kopf.objects         [INFO    ] [kopf/kopf-example-1] All handlers succeeded for resuming.

Upon logging of of all incoming kwargs to handle_resume one can see that the only different kwargs are started and runtime, everything else is the same.

Restarting the operator (so that 'initial' resource listing is not empty) does not change the behaviour.

Source code for the repro (with logging of all kwargs) can be found at
https://github.com/pshchelo/kopf-debug/tree/893ff6312251c51cb8a39b9e3a62c2e10389989c
and running tox -eresume-master from cloned repo.

Specifications

Platform: Ubuntu Bionic x64
Kubernetes version: v1.15.3-8+d06f4e3032941e
Python version: 3.7.3
Python packages installed:
aiohttp==3.6.2
aiojobs==0.2.2
async-timeout==3.0.1
attrs==19.3.0
certifi==2019.9.11
chardet==3.0.4
Click==7.0
idna==2.8
iso8601==0.1.12
kopf==0.23.dev102+gbab7295
multidict==4.5.2
pip==19.3.1
pykube-ng==19.10.0
PyYAML==5.1.2
requests==2.22.0
setuptools==41.4.0
typing-extensions==3.7.4.1
urllib3==1.25.7
wheel==0.33.6
yarl==1.3.0

Commented by pshchelo at 2019-11-11 19:09:38+00:00

I believe this has to do with 'initial' resource listing being done in 'streaming_watch' method.
Will push a PR soon.

Commented by nolar at 2019-11-12 09:45:08+00:00

Thanks for reporting.

This issue is solved in two look-ahead branches (with tests):

One is directly related to the suggested PR with proper usage of resourceVersion as long as possible (until "410 Gone").

Another one solves this problem completely by introducing an in-memory per-uid memos, so it prevents the resumes even when the resourceVersion is "410 Gone", and a re-listing is needed anyway.

This is ready for couple weeks already. I'll try to speed up the merges now.

Commented by nolar at 2019-11-13 17:09:00+00:00

The mentioned branches are merged and pre-released as kopf==0.23rc1. Will be normally released as kopf==0.23 in few days since now (after some live testing).

Beware: it also includes massive refactoring and internal changes, which can break things (though should not). See the Kopf 0.23rc1 Release Notes.

If the issue reappears with kopf>=0.23rc1, feel free to re-open this issue with a comment.

Commented by pshchelo at 2019-11-18 12:32:41+00:00

nolar testing with master currently, with repro controller as in
https://github.com/pshchelo/kopf-debug/blob/96fbdc20e38b91db214e0d174c0343a2a87aff81/resume.py

when I create 3 kopfexample resources, shutdown controller, edit 2nd, create 4th and delete 3rd resource and start controller again, I see that for the second resource first the 'update' handler is invoked, and then 'resume' handler with event==update.
Is it expected?

Commented by pshchelo at 2019-11-18 15:43:07+00:00

nolar, however, if I roll all (except delete) handlers into one (decorate the same handler function as create, update and resume) that second (spurious?) 'resume handles update event' is not happening (handler de-duplication at works?).

Commented by nolar at 2019-11-18 16:34:31+00:00

Regarding the deduplication:

Yes. The handlers are deduplicated intentionally for the case of multiple decorators for one function. This is for the case of resume+create handlers: either an operator is created, or an object is created:

@kopf.on.resume(...)
@kopf.on.create(...)
def spawn_a_monitor(**_):
    pass

If you need them separately, then these should be separate functions, which internally call a regular function to do something.

Regarding the order:

Yes, the handlers are executed in the order they are registered, and this is intentional.

Though, actually, this is not fully true. The proper answer is: They are attempted to be executed in the order registered, but if there are retries, the 2nd and following attempts of the 1st handler will be after the 1st attempts of the 2nd and other handlers.

This is controlled by kopf.reactor.lifecycles internally. You can define your own lifecycle function and set it as a default if needed. They are exposed as a public interface of kopf (via import kopf; kopf.set_default_lifecycle(kopf.lifecycles.shuffled), etc).

Regarding reason==UPDATE:

Yes, it is intentional, though discussable.

As it is implemented now, the resuming handlers are mixed into the selection of handlers to be executed, no matter the reason of this execution. As such, the reason can be UPDATE, so both on-update & on-resume handlers are executed.

The RESUME reason is an artificial pseudo-reason for the cases when the regular reasons are not applicable (nothing is created, nothing is updated, nothing is deleted), but we still need to continue with the resuming handlers (because it is there).

The current naming is a bit confusing though. I thought about renaming the RESUME reason to NOTICE, as in "I do not see anything changed, but here it is, exists out there, and I see it for the first time" — this would semantically separate the "resuming" handlers from the "noticing"/"updating"/etc reasons of these handlers being triggered (unlike the case of creation/update/deletion reasons & handlers, where the mapping is obvious and straightforward).

It was originally done so to keep the logic of resuming separated from the logic of creation/updating — in the assumption that resuming usually implies a background task/thread spawning.

I would be happy to understand other use-cases of resuming except as for task/thread spawning. Can you please explain how do you intend to use the reason kwarg (ex-event)? What is an expected logic?

Commented by pshchelo at 2019-11-18 17:34:04+00:00

I don't think the naming is confusing, and as you describe 'notice' - is is 'create' semantically (seeing for the first time part).

I would expect the 'resume' handler to process only resources that have not changed in between controller shutdown and start. Everything else should be covered by other handlers - if something changed while controller was gone - on.update, if created new - on.create, if deleted - on.delete.

Our main use case for 'resume' is actually handling controller updates/upgrades - usually controller creates some other 'child' resources in k8s, and the way it creates those may (and will?) change from one version to the next, and we obviously want the result of controller actions to be the same regardless of if one deployed a new one for the first time, or upgraded from the older version.

For that we aim to use the 'resume' handler to catch CRs that were not changed while controller is restarting, and re-render templates for / patch all existing child resources of that CR (as created by previous controller version) to bring those to the state how new controller version would've created those from scratch.

To reduce any possible duplicate heavy-lifting (like external API calls) it would be nice if double-appearing events were not happening.

However at the moment we use single function for create/update/resume (kind of like kubectl apply), so we shouldn't be affected thanks to handler de-duplication (and we also can filter on reason if we have to).

Also currently we use single instance of controller, so it would be interesting to test if 'resume' fires up on controller unfreeze as part of rolling update of a controller deployed like kind: Deployment.

Commented by nolar at 2019-11-18 18:09:12+00:00

Thanks. Just few thoughts to write down:

My first question is: why an on-resume handler is not enough, without mixing it with on-update/-on-create decorators? That would cover all the CRs, both changed and unmodified.

But if the goal is to catch all the "unmodified only" CRs, this will be a problem indeed.

This brings me to an idea that another handler should be introduced (@kopf.on.notice()?) for this purpose — to keep aligned with the previous semantics of the on-resume handlers as they were intended originally (but were broken all this time before 0.23rc1). Or maybe the resuming semantics should be narrowed only to the unmodified objects only before I do the 0.23 release with the fixes (which means: quickly).

On the other hand, both the docs, the examples, and even our own use-cases always rely on @kopf.on.create + @kopf.on.resume pairs for the same function — which implies that they are separate, non-overlapping reasons, as you explained in your use-case.

This may be also in relation with the @kopf.daemon() and @kopf.on.timer() decorators (#19), which are currently in progress: to have tasks/threads easily spawned without explicit orchestration (drafted and were functional for some time before I broke it), which should replace @kopf.on.resume as a mean of task spawning for all the CRs, and thus removing the need to have anything to react to the object existence in any state ("noticing" the existence), but only in the unmodified state.

I need some time to process this information and to simulate the use-cases. Especially those that could at all exist in the previous broken state in kopf<=0.22.

Commented by nolar at 2019-11-18 20:53:00+00:00

I've re-read some PRs from the past — it seems, this dilemma of resuming handlers' semantic was already there when resuming was introduced:

[PR] Add "resuming" handlers for resources (as co-handlers) #105 for the current solution with the resume handlers mixed into create/update/delete reasons on the operator restart.
[PR] Add "resuming" handlers for resources #96 for a declined solution with the resuming reason directly mapped to the resuming handlers, with no mixing in (see [PR] Add "resuming" handlers for resources #96 (comment)).

Specifically, this case:

However, if the object was changed (e.g. its status) before the crash or restart, the UPDATE cause is detected, and the @kopf.on.update handlers are called. The @kopf.on.resume handlers are NOT called in case the object was changed — this breaks the intuitive expectations and semantics of a background thread starter (the original purpose of this handler).

Starting the threads from the UPDATE/DELETE handlers is not desired, as it means we have to keep track of the thread pool state per object — extra hassle.

The RESUME cause has to be handled even if the object has changed (but maybe not if deleted) — before or after the UPDATE handler in the same handling cycle.

Which means, the idea of mixing them in was not only intentional, but conscious — after previously trying an alternative solution with the "unmodified only" reason & handlers.

So, I will keep @kopf.on.resume() handlers as they are — mixed in for any regular reasons (CREATE/UPDATE/DELETE) on the initial listing — as this indeed was the original intention since kopf==0.16.

But I admit that one use-case is missing now and has no workarounds —the one described above— the "unmodified only" handlers.

I am going to release 0.23 without this feature. This feature will be added a bit later (maybe in 0.24). Naming suggestions are welcome. @kopf.on.notice()? @kopf.on.recall()? @kopf.on.existence()? Anything else?

I've created #241 for this type of handlers. It seems easy to implement, as all the moving parts are already there, but have no name tag on them.

The text was updated successfully, but these errors were encountered:

kopf-archiver bot added the archive label Aug 18, 2020

kopf-archiver bot closed this as completed Aug 18, 2020

kopf-archiver bot changed the title ~~[archival placeholder]~~ Resume events keep firing Aug 19, 2020

kopf-archiver bot added the bug Something isn't working label Aug 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resume events keep firing #223

Resume events keep firing #223

kopf-archiver bot commented Aug 18, 2020 •

edited

Loading

Resume events keep firing #223

Resume events keep firing #223

Comments

kopf-archiver bot commented Aug 18, 2020 • edited Loading

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Specifications

kopf-archiver bot commented Aug 18, 2020 •

edited

Loading