Fix long-running retry loop when Scanner is shutdown #3728
Conversation
service/worker/scanner/scanner.go
context      scannerContext
wg           sync.WaitGroup
shutdownOnce channel.ShutdownOnce
How is this different/better than storing a Context and a CancelFunc in the Scanner and cancelling it on Stop? (and using that Context as the base context for ThrottleRetryContext, WithTimeout, etc.) It seems like that's a little less code and doesn't require familiarity with this other library
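For reference, a minimal sketch of the alternative being suggested here: a context and cancel func stored on the scanner and canceled in Stop. The field and constructor names are illustrative, not the actual scanner.go code:

```go
package scanner

import (
	"context"
	"sync"
)

// Illustrative only: a simplified Scanner holding its own lifecycle context,
// as suggested above. The lifecycleCtx/lifecycleCancel names are hypothetical.
type Scanner struct {
	wg              sync.WaitGroup
	lifecycleCtx    context.Context    // base context for ThrottleRetryContext, WithTimeout, etc.
	lifecycleCancel context.CancelFunc // called from Stop to unblock in-flight retries
}

func NewScanner() *Scanner {
	ctx, cancel := context.WithCancel(context.Background())
	return &Scanner{lifecycleCtx: ctx, lifecycleCancel: cancel}
}

func (s *Scanner) Stop() {
	s.lifecycleCancel() // every context derived from lifecycleCtx is now canceled
	s.wg.Wait()         // wait for background goroutines to observe cancellation and exit
}
```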
Contexts should never be stored in structs: https://go.dev/blog/context-and-structs
Also, encapsulating this behavior in ShutdownOnce makes this reusable
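To illustrate why this is reusable, here is a hypothetical sketch of what a ShutdownOnce-style helper can look like. This is not the actual channel.ShutdownOnce implementation or API, only the general shape:

```go
package channelx // hypothetical package for illustration, not the real common/channel package

import (
	"context"
	"sync"
)

// ShutdownOnce is a sketch of a reusable shutdown signal: it can be shut down
// exactly once and observed by any number of goroutines.
type ShutdownOnce struct {
	once sync.Once
	ch   chan struct{}
}

func NewShutdownOnce() *ShutdownOnce {
	return &ShutdownOnce{ch: make(chan struct{})}
}

// Shutdown closes the channel exactly once, even if called concurrently.
func (s *ShutdownOnce) Shutdown() {
	s.once.Do(func() { close(s.ch) })
}

// Channel lets callers select on shutdown alongside other events.
func (s *ShutdownOnce) Channel() <-chan struct{} { return s.ch }

// PropagateShutdown derives a context that is canceled when Shutdown is
// called, so it can serve as the base context for retries and RPCs.
func (s *ShutdownOnce) PropagateShutdown(ctx context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-s.ch:
			cancel()
		case <-ctx.Done():
		}
	}()
	return ctx, cancel
}
```

Any component that needs the same "stop in-flight work on shutdown" behavior can hold one of these instead of re-implementing the cancel plumbing.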
The best solution would be to change our service API entirely to remove all Start and Stop methods in favor of a Run(ctx) function. Then we wouldn't have to do anything complicated, and we could shut down by just calling cancel. However, given that this would require a huge migration, I can't do it for this one fix. This solution gives us a simple and reusable way to ensure that long-running processes spawned by workers are terminated when the worker is terminated, without violating the guidelines.
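For illustration, a sketch of the Run(ctx) shape being described; this is hypothetical and not the current service API:

```go
package service

import "context"

// Hypothetical Run(ctx)-style API, sketched for illustration; the current
// server API uses Start/Stop instead. Shutdown becomes "cancel the context".
type Service interface {
	// Run blocks until ctx is canceled, then cleans up and returns.
	Run(ctx context.Context) error
}
```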
There are times when it makes sense to put contexts in structs, and we do it in many places, including this file already: see the BackgroundActivityContext in Start.
The arguments in that article are mainly about a component storing a context while also receiving contexts as arguments from callers. But the scanner is a background process that only calls out; nothing calls into it. So that doesn't really apply. If you think of contexts as representing "where a call is coming from", it seems to me like the scanner and similar background processes do deserve their own contexts.
And we have one already, the BackgroundActivityContext, which does carry useful contextual information (callerinfo). If we just add a cancel to that, and use it as the base context for the start-workflow calls in addition to activities, then we get what we want.
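A minimal sketch of that direction, using the SDK worker's BackgroundActivityContext option as the shared, cancelable base context. The field names, task queue name, and wiring are illustrative, not the real scanner code:

```go
package scanner

import (
	"context"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// Sketch only: backgroundCtx/backgroundCancel and the surrounding wiring are
// illustrative, not copied from the actual scanner.
type Scanner struct {
	sdkClient        client.Client
	backgroundCtx    context.Context
	backgroundCancel context.CancelFunc
}

func (s *Scanner) Start() error {
	// One cancelable context serves both the activities run by the worker and
	// the StartWorkflow calls issued directly by the scanner.
	s.backgroundCtx, s.backgroundCancel = context.WithCancel(context.Background())

	w := worker.New(s.sdkClient, "scanner-task-queue", worker.Options{
		BackgroundActivityContext: s.backgroundCtx,
	})
	return w.Start()
}

func (s *Scanner) startWorkflow(id, workflowType string) error {
	// Using backgroundCtx here means Stop aborts in-flight StartWorkflow retries.
	_, err := s.sdkClient.ExecuteWorkflow(s.backgroundCtx, client.StartWorkflowOptions{ID: id}, workflowType)
	return err
}

func (s *Scanner) Stop() {
	s.backgroundCancel()
}
```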
Moved to #3818
What changed?
I modified the Stop() method of our scanners to cancel running attempts to start workflows.
Why?
I did this because it fixes a race condition that often occurs when the server is shut down soon after starting. When the server is shut down, the scanner may still be trying startWorkflow. This method uses an SDK client. However, when the server shuts down all of its services, it may shut down the frontend handler before the scanner. As a result, all calls to the SDK client will fail, causing startWorkflow to fail. We currently retry startWorkflow many times, so the scanner can enter a long-running retry loop. To fix this, we now cancel the retries as soon as the scanner is shut down.
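As a rough sketch of the behavior this fix guarantees, here is a hypothetical retry helper (standing in for the real retry machinery) that abandons the loop as soon as the scanner's context is canceled:

```go
package scanner

import (
	"context"
	"time"
)

// retryWithContext is a hypothetical stand-in for the real retry helper: it
// retries op with a fixed backoff but returns as soon as ctx is canceled.
func retryWithContext(ctx context.Context, backoff time.Duration, op func(context.Context) error) error {
	for {
		if err := op(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// The scanner was stopped: abandon the retry loop instead of
			// hammering a frontend that has already shut down.
			return ctx.Err()
		case <-time.After(backoff):
		}
	}
}
```

With Stop() canceling the context that feeds these retries, the scanner exits promptly instead of looping against an already-stopped frontend.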
How did you test it?
I added a test case which simulates the setup described above.
Potential risks
We may see some more error logs about SDK requests being canceled when servers are started/stopped.
Is hotfix candidate?
No.