Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix long-running retry loop when Scanner is shutdown #3728

Closed

Conversation

MichaelSnowden
Copy link
Contributor

What changed?
I modified the Stop() method of our scanners to cancel running attempts to start workflows.

Why?
I did this because it fixes a race condition which often occurs when the server is shutdown soon after starting. When the server is shutdown, the scanner may still be trying startWorkflow. This method utilizes an SDK client. However, when the server shutdown all of its services, it may shutdown the frontend handler before the scanner. As a result, all calls to the SDK client will fail, causing startWorkflow to fail. Currently, we retry startWorkflow a lot, so the scanner can currently enter a long-running retry loop. To fix this, we now cancel the retries as soon as the scanner is shutdown.

How did you test it?
I added a test case which simulates the setup described above.

Potential risks
We may see some more error logs about SDK requests being canceled when servers are started/stopped.

Is hotfix candidate?
No.

@MichaelSnowden MichaelSnowden requested a review from a team as a code owner December 19, 2022 20:30
@MichaelSnowden MichaelSnowden changed the title Fix a rare deadlock in scanner.Stop Fix long-running retry loop when Scanner is shutdown Dec 19, 2022
wg sync.WaitGroup
context scannerContext
wg sync.WaitGroup
shutdownOnce channel.ShutdownOnce
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different/better than storing a Context and a CancelFunc in the Scanner and cancelling it on Stop? (and using that Context as the base context for ThrottleRetryContext, WithTimeout, etc.) It seems like that's a little less code and doesn't require familiarity with this other library

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contexts should never be stored in structs: https://go.dev/blog/context-and-structs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, encapsulating this behavior in ShutdownOnce makes this reusable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best solution is to change our service API entirely to remove all Start and Stop methods in favor of a Run(ctx) function. Then, we wouldn't have to do anything complicated and we can shutdown by just calling cancel. However, given that this would require a huge migration, I can't do it for this one fix. This solution gives us a simple and reusable way to ensure that long-running processes spawned by workers are terminated when the worker is terminated, without violating the guidelines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are times when it makes sense to put contexts in structs, and we do it in many places. Including this file already: see the BackgroundActivityContext in Start.

The arguments in that article are mainly about a component storing a context and also receiving contexts as arguments from callers. But the scanner is a background process that only calls out, things don't call into it. So that doesn't really apply. If you think about the idea of contexts as representing "where a call is coming from", it seems to me like the scanner and similar background processes do deserve their own contexts.

And we have one already, the BackgroundActivityContext, which does have useful contextual information (callerinfo). If we just add a cancel to that, and use it as the base context for the start workflow calls, in addition to activities, then we get what we want.

@MichaelSnowden MichaelSnowden force-pushed the snowden/fix-scanner-shutdown branch from bc05257 to 9327e8e Compare January 19, 2023 00:31
@MichaelSnowden MichaelSnowden deleted the branch snowden/shutdown-once January 19, 2023 17:45
@MichaelSnowden
Copy link
Contributor Author

Moved to #3818

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants