
informer library for shared watching with retries and updated list #494

Merged: 1 commit (grosser/informer into ManageIQ:master), Mar 30, 2021

Conversation

@grosser (Contributor) commented Mar 5, 2021

fixes #456 (Shared Informer Pattern)

@grosser force-pushed the grosser/informer branch 6 times, most recently from 48a91bd to c4bbdd9 on March 6, 2021 at 00:24
@grosser force-pushed the grosser/informer branch 3 times, most recently from 38f2f39 to 912bf44 on March 6, 2021 at 22:12
@cben (Collaborator) left a comment:

Minor questions aside, this is great ❤️

Review threads on lib/kubeclient/informer.rb:
end

def fill_cache
reply = @client.get_entities(nil, @resource_name, raw: true, resource_version: '0')
@cben (Collaborator):

Will version "0" result in a .metadata.resourceVersion we can watch from? It means "any version is OK", which allows the server to return a cached response, but I vaguely remember hearing that it might allow a version so old that the watch returns 410 (?)

@cben (Collaborator):

Additionally, when List+Watch restarts, this might lead to replaying state and events that are older than what we already observed?

https://kubernetes.io/docs/reference/using-api/api-concepts/#the-resourceversion-parameter doesn't exactly answer this, but does say about supplying resourceVersion="0" to watch directly:

Get State and Start at Any: Warning: Watches initialized this way may return arbitrarily stale data! Please review this semantic before using it, and favor the other semantics where possible. Start a watch at any resource version; the most recent resource version available is preferred, but not required; any starting resource version is allowed. It is possible for the watch to start at a much older resource version than the client has previously observed, particularly in high availability configurations, due to partitions or stale caches. Clients that cannot tolerate this should not start a watch with this semantic. To establish initial state, the watch begins with synthetic "Added" events for all resource instances that exist at the starting resource version. All following watch events are for all changes that occurred after the resource version the watch started at.

@grosser (Contributor, author):

  • if the version was too old, it would result in a restart of the list/watch loop, so we should be good (I never saw this happen)
  • the watch does not receive the 0 but whatever resourceVersion the list returned ... it could still end up in a replay, since the list might have returned old data ... so ideally we'd keep track of the last resourceVersion we saw and restart from that, as in the sketch below (with a "latest" approach we'd lose some events though)
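
A minimal sketch of that idea, assuming kubeclient-style get_entities/watch_entities calls; cache_key, @client, @resource_name and the version tracking are illustrative, not necessarily the PR's final code:

# Sketch: list at resourceVersion '0', then watch from the version the
# *list* returned, remembering the newest version seen so a restarted
# watch resumes there instead of replaying already-observed events.
def fill_cache
  reply = @client.get_entities(nil, @resource_name, raw: true, resource_version: '0')
  @cache = reply['items'].each_with_object({}) { |item, h| h[cache_key(item)] = item }
  @started = reply.dig('metadata', 'resourceVersion') # the watch starts here, not at '0'
end

def watch_to_update_cache
  @client.watch_entities(@resource_name, resource_version: @started).each do |notice|
    # remember the last version we observed, so a restart resumes from it
    @started = notice[:object][:metadata][:resourceVersion] if notice[:object]
    # ... update @cache from the notice (see the case statement below)
  end
end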

case notice[:type]
when 'ADDED', 'MODIFIED' then @cache[cache_key(notice[:object])] = notice[:object]
when 'DELETED' then @cache.delete(cache_key(notice[:object]))
when 'ERROR' then break # restart
@cben (Collaborator):

  • for 410 (Gone / Expired), this will restart the fill_cache -> watch_to_update_cache loop ✔️
  • can any other errors be interesting for the user?

@grosser (Contributor, author):

no idea, I never had trouble just ignoring them :D

when 'ADDED', 'MODIFIED' then @cache[cache_key(notice[:object])] = notice[:object]
when 'DELETED' then @cache.delete(cache_key(notice[:object]))
when 'ERROR' then break # restart
else raise "Unsupported event type #{notice[:type]}"
@cben (Collaborator):

If this happens (it shouldn't), it'd raise in the worker thread. Would Ruby even print it?

@grosser (Contributor, author):

might be most consistent to ignore it so the worker does not crash ...
signaling back is kinda tricky; that's what I'm planning to use a logger for
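
A hedged sketch of that approach: log the unexpected type instead of raising, so the worker thread keeps running (watcher and @logger are assumed names; the rest mirrors the diff context above):

# Same case statement as above, but an unknown event type is logged
# (if a logger was given) instead of crashing the worker thread.
watcher.each do |notice|
  case notice[:type]
  when 'ADDED', 'MODIFIED' then @cache[cache_key(notice[:object])] = notice[:object]
  when 'DELETED' then @cache.delete(cache_key(notice[:object]))
  when 'ERROR' then break # e.g. 410 Gone / Expired: restart the list+watch loop
  else @logger&.error("Unsupported event type #{notice[:type]}")
  end
end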

@cben added this to the ()b0m2mE milestone on Mar 7, 2021
@grosser force-pushed the grosser/informer branch 3 times, most recently from a266968 to f17b299 on March 8, 2021 at 03:14
@grosser (Contributor, author) commented Mar 8, 2021

still any showstoppers?
added a logger to make errors more visible

@cben removed this from the mistake milestone on Mar 13, 2021
@cben (Collaborator) commented Mar 13, 2021

[we're awaiting a baby any time now, so my response times will be more erratic than they already were.]

One thought I had is that this actually consists of several parts:

  • an implementation of automatic watch restarts.
  • pub/sub via queue.
  • a cache of last seen state for all objects.

In theory they could be split; one might want restarts without keeping a cache in RAM (or only a subset of the cache).
But it is a pretty convenient package together, and an established pattern from the Go client, and we could always split it later if there's demand.
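
For the pub/sub part, a minimal sketch of fanning out notices through per-subscriber Queues (the @watching registry and the method shape are illustrative, not necessarily the PR's exact code):

# Each subscriber gets its own Queue; the watch loop pushes every notice
# to all registered queues, so subscribers consume independently.
def watch
  queue = Queue.new
  @watching << queue        # assumed list of subscriber queues
  loop { yield queue.pop }  # block the caller, handing over each notice
ensure
  @watching.delete(queue)   # unsubscribe when the caller's block exits
end

# inside the watch loop, after updating the cache:
# @watching.each { |q| q << notice }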

LGTM 👍 Can you check the test failure?

  1) Failure:
RetryTest#test_timeout [/home/runner/work/kubeclient/kubeclient/test/test_informer.rb:122]:
not all expectations were satisfied
unsatisfied expectations:
- expected exactly once, invoked twice: #<AnyInstance:Kubeclient::Common::WatchStream>.finish(any_parameters)

@grosser force-pushed the grosser/informer branch 5 times, most recently from 30b9e61 to 2431240 on March 28, 2021 at 20:44
@grosser (Contributor, author) commented Mar 28, 2021

I think I got the tests fixed now ... TruffleRuby had some timing issues, so I had to disable one test there

@cben (Collaborator) commented Mar 30, 2021

LGTM. I'm still uncertain about using resourceVersion "0" for list, but we can revise that later if needed.

@cben merged commit 19a5d14 into ManageIQ:master on Mar 30, 2021