
improve informer test #594

Merged 1 commit into ManageIQ:master on Jan 3, 2023
Conversation

grosser (Contributor) commented Dec 22, 2022

Making all the threads terminate and no longer killing things ... 3 green CI runs 🤞

Review comment on:

```ruby
  @waiter.run
rescue ThreadError # rubocop:disable Lint/SuppressedException
end
@waiter.join
```
cben (Collaborator) commented:

Oh I think I see — previously, leftover waiter thread(s) could execute @watcher.finish at any time, possibly interrupting unrelated watchers from later tests, right?

But if we somehow (e.g. on 'ERROR') get here early during sleep(@reconcile_timeout), we might block for a long time (the default is 15 min, right?).
Here, what we're potentially blocking is the existing worker thread's loop, which won't be directly felt by the app but can cause significant gaps in the watching?
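For orientation, a minimal sketch of the waiter under discussion (assumed shape; the names @waiter, @watcher, and @reconcile_timeout come from the diff and comments, and the actual Informer code may differ):

```ruby
# Assumed shape of the waiter, not the exact Informer code: sleep out the
# reconcile interval, then stop the watch so the worker re-lists.
@waiter = Thread.new do
  sleep(@reconcile_timeout) # 15 minutes by default, per the comment above
  @watcher.finish           # unblocks @watcher.each in the worker thread
end
```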

grosser (Contributor, Author) replied:

It would have killed a random new waiter before (not really in tests, since they all build their own informer).

Added a Thread.pass to fix the race condition (I don't think it really happens, since the watch is an HTTP request so it will take some time, but it's cheap, so 🤷).

Review comment on:

```ruby
rescue ThreadError
  # thread was already dead
end
thread.join
```
cben (Collaborator) commented:

Similar question about @waiter possibly blocking for many minutes.
Here it'd block the app calling stop_worker, right?

  • I'm thinking there is a subordinate relationship: worker thread -> single watch_to_update_cache run -> waiter thread. 🤔
    What if waiter threads did not refer to the current instance variable via @watcher.finish but closed over a lexically scoped reference to their watcher? Would it then be safe to leak unfinished waiter threads without .join-ing them? Would it be a good idea? (See the sketch after this list.)

  • Waiter threads don't do much — are they safe to .kill? Or is there another way to interrupt a sleep() early?
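A sketch of the lexically scoped variant from the first bullet (hypothetical, not the current code):

```ruby
# Hypothetical variant: capture the watcher in a local, so a leaked waiter
# can only ever finish the watcher it was created for, even if @watcher is
# later reassigned to a new watch.
watcher = @watcher
@waiter = Thread.new do
  sleep(@reconcile_timeout)
  watcher.finish # lexically scoped reference, not the current @watcher
end
```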

agrare (Member) replied:

> Or is there another way to interrupt a sleep() early?

You can use Concurrent::Event with a wait timeout to implement an interruptible sleep.
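A minimal sketch of that suggestion, assuming the same waiter shape as above (concurrent-ruby provides Concurrent::Event; the variable names are assumptions):

```ruby
require 'concurrent'

stop = Concurrent::Event.new

@waiter = Thread.new do
  # Interruptible sleep: returns true as soon as #set is called,
  # false if @reconcile_timeout elapses first.
  stop.wait(@reconcile_timeout)
  @watcher.finish
end

# In stop_worker, wake the waiter right away instead of waiting out the sleep:
stop.set
@waiter.join
```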

grosser (Contributor, Author) replied:

Before, we killed them, but that made the tests brittle, so I like the joining: it makes anything that goes wrong more obvious (and re-raises exceptions too).

A contributor commented:

From my testing, @watcher.each can get blocked on the response stream. Even closing the http_client will not unblock the response body stream.

In that case, join on the waiter thread that runs @watcher.each internally will never return.

cben (Collaborator) commented Dec 22, 2022

@agrare I don't know if you have been following the Kubeclient::Informer class on the master branch, but perhaps it's interesting for future use in manageiq (?). Either way, you may have useful experience to share here from how manageiq kills & restarts watches... That's a long way of saying "please review" 😉

[P.S. entirely out-of-scope here, but there is the unfinished business of #488. The blocker there was that we didn't know how to implement watch .finish with Faraday 🤷 @agrare I was considering maybe dropping .finish in 5.0 and letting you figure out how to bring it back when you want to upgrade ManageIQ 😝 But now I realize Informer relies on it deeply, and I suspect so would any advanced usage of watches?]

agrare (Member) commented Dec 22, 2022

> I don't know if you have been following the Kubeclient::Informer class on master branch, but perhaps it's interesting for future use in manageiq (?)

Hey @cben! No, I hadn't seen this. It definitely looks intriguing, though we try not to cache inventory to keep our memory usage down, so I doubt we'd use it directly. But I can definitely see how this would be useful if your code were only interacting with kubeclient.

There could definitely be a more user-friendly interface around watches, since it seems we are solving the same issues.

> Either way, you may have useful experience to share here from how manageiq is killing & restarting watches... That's a long way of saying "please review"

I notice a few minor differences in how we handle this compared to the Informer class:

  1. For our initial collection, we don't pass resource_version: '0' to the get_
  2. If a watch breaks unexpectedly, we retry with the same resourceVersion, whereas it looks like this gets the whole collection again (?)
  3. If we get an error with 410 Gone, we set resourceVersion to nil and restart the watch.
  4. We use a Concurrent::AtomicBoolean to indicate to the watch thread whether it should break out after the watch.finish or retry (see the sketch below).

I believe we pulled this logic from the kubevirt/cnv ManageIQ provider, which was contributed by the kubevirt team. I'm not saying this is better or worse than the Informer implementation, just noting the differences.
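Roughly, the loop described in points 2-4 might look like the following sketch (client, process, and the initial resource_version handling are assumptions, not ManageIQ's actual code):

```ruby
require 'concurrent'
require 'kubeclient'

# Sketch of a ManageIQ-style watch loop: the AtomicBoolean tells the thread
# whether a finished watch means "stop" or "reconnect and retry".
finished = Concurrent::AtomicBoolean.new(false)
resource_version = nil # in practice, taken from the initial list
watch = nil

watch_thread = Thread.new do
  until finished.true?
    # client: a Kubeclient::Client instance (assumed to exist)
    watch = client.watch_entities('pods', resource_version: resource_version)
    watch.each do |notice|
      if notice.type == 'ERROR' && notice.object&.code == 410
        resource_version = nil # 410 Gone: restart the watch from a fresh list
      else
        resource_version = notice.object.metadata.resourceVersion
        process(notice) # hypothetical handler
      end
    end
    # each returned: either finish was called or the connection broke;
    # the loop condition decides between exiting and retrying.
  end
end
```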

> But now I realize Informer relies on it deeply, and I suspect so would any advanced usage of watches?]

Yes, we do use #finish to get a watch to break out so that we can join cleanly, but we join with a timeout and .kill the thread if it doesn't respond. We also aren't trying to keep a cache consistent from this thread, though.
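Continuing the sketch above, that stop path with a join timeout and kill fallback (the 10-second value is an assumption):

```ruby
# Flip the flag first so a finished watch exits instead of reconnecting,
# then break the blocking each and join, killing only as a last resort.
finished.make_true
watch&.finish
watch_thread.join(10) || watch_thread.kill # Thread#join returns nil on timeout
```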

What is the purpose of the waiter thread? It looks like it'll stop the watch after 15 minutes?

grosser (Contributor, Author) commented Dec 26, 2022

I think this is a good step forward, so I'd prefer to merge and then iterate further if there are still open issues.

cben (Collaborator) commented Jan 3, 2023

Sorry, missed your reply entirely.

FWIW, Thread.pass is advisory, so I'm not sure it can really "fix" things:

> Give the thread scheduler a hint to pass execution to another thread. A running thread may or may not switch, it depends on OS and processor.

I'm still somewhat concerned that a call to stop_worker might get stuck for a long time.
But yes, previous behavior was chaotic, so I guess it's an advance. 👍

And I'm certainly happy about CI looking reliably green now 🎉 Merging.

cben merged commit 0287f32 into ManageIQ:master on Jan 3, 2023
Review comment on:

```ruby
ensure
  sleep(1) # do not overwhelm the api-server if we are somehow broken
end
break if @stopped
```
A contributor commented:

Why not change the loop to until @stopped, then?
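That is, something like this sketch (watch_to_update_cache is the method named earlier in the thread; note that until checks the flag before the first iteration, while the current break checks it after):

```ruby
# Suggested shape: the loop condition replaces the explicit break.
until @stopped
  begin
    watch_to_update_cache
  ensure
    sleep(1) # do not overwhelm the api-server if we are somehow broken
  end
end
```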
