
improve informer test #594

Merged 1 commit into ManageIQ:master on Jan 3, 2023
Conversation

grosser (Contributor) commented Dec 22, 2022

Making all the threads terminate and no longer killing things ... 3 green CI runs 🤞

Review comment on:

```ruby
  @waiter.run
rescue ThreadError # rubocop:disable Lint/SuppressedException
end
@waiter.join
```
cben (Collaborator) commented:

Oh I think I see — previously, leftover waiter thread(s) could execute @watcher.finish at any time, possibly interrupting unrelated watchers from later tests, right?

But if we somehow (e.g. on 'ERROR') get here early during sleep(@reconcile_timeout), we might block for a long time (the default is 15 min, right?).
Here, what we're potentially blocking is the existing worker thread's loop, which won't be directly felt by the app but can cause significant gaps in the watching?
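For orientation, a minimal sketch of the waiter under discussion (assumed shape; the names @waiter, @watcher, and @reconcile_timeout come from the diff and comments, and the actual Informer code may differ):

```ruby
# Assumed shape of the waiter, not the exact Informer code: sleep out the
# reconcile interval, then stop the watch so the worker re-lists.
@waiter = Thread.new do
  sleep(@reconcile_timeout) # 15 minutes by default, per the comment above
  @watcher.finish           # unblocks @watcher.each in the worker thread
end
```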

grosser (Contributor, Author) replied:

It would have killed a random new waiter before (not really in tests, since they all build their own informer).

Added a Thread.pass to fix the race condition (I don't think it really happens, since the watch is an HTTP request so it will take some time, but it's cheap, so 🤷).

Review comment on:

```ruby
rescue ThreadError
  # thread was already dead
end
thread.join
```
cben (Collaborator) commented:

Similar question about @waiter possibly blocking for many minutes.
Here it'd block the app calling stop_worker, right?

  • I'm thinking there is a subordinate relationship: worker thread -> single watch_to_update_cache run -> waiter thread. 🤔
    What if waiter threads did not refer to the current instance variable via @watcher.finish but closed over a lexically scoped reference to their watcher? Would it then be safe to leak unfinished waiter threads without .join-ing them? Would it be a good idea? (See the sketch after this list.)

  • Waiter threads don't do much — are they safe to .kill? Or is there another way to interrupt a sleep() early?
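A sketch of the lexically scoped variant from the first bullet (hypothetical, not the current code):

```ruby
# Hypothetical variant: capture the watcher in a local, so a leaked waiter
# can only ever finish the watcher it was created for, even if @watcher is
# later reassigned to a new watch.
watcher = @watcher
@waiter = Thread.new do
  sleep(@reconcile_timeout)
  watcher.finish # lexically scoped reference, not the current @watcher
end
```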

agrare (Member) replied:

> Or is there another way to interrupt a sleep() early?

You can use Concurrent::Event with a wait timeout to implement an interruptible sleep.
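A minimal sketch of that suggestion, assuming the same waiter shape as above (concurrent-ruby provides Concurrent::Event; the variable names are assumptions):

```ruby
require 'concurrent'

stop = Concurrent::Event.new

@waiter = Thread.new do
  # Interruptible sleep: returns true as soon as #set is called,
  # false if @reconcile_timeout elapses first.
  stop.wait(@reconcile_timeout)
  @watcher.finish
end

# In stop_worker, wake the waiter right away instead of waiting out the sleep:
stop.set
@waiter.join
```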

grosser (Contributor, Author) replied:

Before, we killed them, but that made the tests brittle, so I like the joining: it makes anything that goes wrong more obvious (and re-raises exceptions too).

A contributor commented:

From my testing, @watcher.each can get blocked on the response stream. Even closing the http_client will not unblock the response body stream.

In that case, join on the waiter thread that runs @watcher.each internally will never return.

cben (Collaborator) commented Dec 22, 2022

@agrare I don't know if you have been following the Kubeclient::Informer class on the master branch, but perhaps it's interesting for future use in manageiq (?). Either way, you may have useful experience to share here from how manageiq kills & restarts watches... That's a long way of saying "please review" 😉

[P.S. entirely out-of-scope here, but there is the unfinished business of #488. The blocker there was that we didn't know how to implement watch .finish with Faraday 🤷 @agrare I was considering maybe dropping .finish in 5.0 and letting you figure out how to bring it back when you want to upgrade ManageIQ 😝 But now I realize Informer relies on it deeply, and I suspect so would any advanced usage of watches?]

agrare (Member) commented Dec 22, 2022

> I don't know if you have been following the Kubeclient::Informer class on master branch, but perhaps it's interesting for future use in manageiq (?)

Hey @cben! No, I hadn't seen this. It definitely looks intriguing, though we try not to cache inventory to keep our memory usage down, so I doubt we'd use it directly. But I can definitely see how this would be useful if your code were only interacting with kubeclient.

There could definitely be a more user-friendly interface around watches, since it seems we are solving the same issues.

> Either way, you may have useful experience to share here from how manageiq is killing & restarting watches... That's a long way of saying "please review"

I notice a few minor differences in how we handle this compared to the Informer class:

  1. For our initial collection, we don't pass resource_version: '0' to the get_
  2. If a watch breaks unexpectedly, we retry with the same resourceVersion, whereas it looks like this gets the whole collection again (?)
  3. If we get an error with 410 Gone, we set resourceVersion to nil and restart the watch.
  4. We use a Concurrent::AtomicBoolean to indicate to the watch thread whether it should break out after the watch.finish or retry (see the sketch below).

I believe we pulled this logic from the kubevirt/cnv ManageIQ provider, which was contributed by the kubevirt team. I'm not saying this is better or worse than the Informer implementation, just noting the differences.
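Roughly, the loop described in points 2-4 might look like the following sketch (client, process, and the initial resource_version handling are assumptions, not ManageIQ's actual code):

```ruby
require 'concurrent'
require 'kubeclient'

# Sketch of a ManageIQ-style watch loop: the AtomicBoolean tells the thread
# whether a finished watch means "stop" or "reconnect and retry".
finished = Concurrent::AtomicBoolean.new(false)
resource_version = nil # in practice, taken from the initial list
watch = nil

watch_thread = Thread.new do
  until finished.true?
    # client: a Kubeclient::Client instance (assumed to exist)
    watch = client.watch_entities('pods', resource_version: resource_version)
    watch.each do |notice|
      if notice.type == 'ERROR' && notice.object&.code == 410
        resource_version = nil # 410 Gone: restart the watch from a fresh list
      else
        resource_version = notice.object.metadata.resourceVersion
        process(notice) # hypothetical handler
      end
    end
    # each returned: either finish was called or the connection broke;
    # the loop condition decides between exiting and retrying.
  end
end
```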

> But now I realize Informer relies on it deeply, and I suspect so would any advanced usage of watches?]

Yes, we do use #finish to get a watch to break out so that we can join cleanly, but we join with a timeout and .kill the thread if it doesn't respond. We also aren't trying to keep a cache consistent from this thread, though.
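Continuing the sketch above, that stop path with a join timeout and kill fallback (the 10-second value is an assumption):

```ruby
# Flip the flag first so a finished watch exits instead of reconnecting,
# then break the blocking each and join, killing only as a last resort.
finished.make_true
watch&.finish
watch_thread.join(10) || watch_thread.kill # Thread#join returns nil on timeout
```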

What is the purpose of the waiter thread? It looks like it'll stop the watch after 15 minutes?

grosser (Contributor, Author) commented Dec 26, 2022

I think this is a good step forward, so I'd prefer to merge and then iterate further if there are still open issues.

cben (Collaborator) commented Jan 3, 2023

Sorry, missed your reply entirely.

FWIW, Thread.pass is advisory, so I'm not sure it can really "fix" things:

> Give the thread scheduler a hint to pass execution to another thread. A running thread may or may not switch, it depends on OS and processor.

I'm still somewhat concerned that a call to stop_worker might get stuck for a long time.
But yes, previous behavior was chaotic, so I guess it's an advance. 👍

And I'm certainly happy about CI looking reliably green now 🎉 Merging.

cben merged commit 0287f32 into ManageIQ:master on Jan 3, 2023
Review comment on:

```ruby
ensure
  sleep(1) # do not overwhelm the api-server if we are somehow broken
end
break if @stopped
```
A contributor commented:

Why not change the loop to until @stopped, then?
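That is, something like this sketch (watch_to_update_cache is the method named earlier in the thread; note that until checks the flag before the first iteration, while the current break checks it after):

```ruby
# Suggested shape: the loop condition replaces the explicit break.
until @stopped
  begin
    watch_to_update_cache
  ensure
    sleep(1) # do not overwhelm the api-server if we are somehow broken
  end
end
```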
