Fix long-running retry loop when Scanner is shutdown #3728

MichaelSnowden · 2022-12-19T20:30:59Z

What changed?
I modified the Stop() method of our scanners to cancel running attempts to start workflows.

Why?
I did this because it fixes a race condition that often occurs when the server is shut down soon after starting. When the server is shut down, the scanner may still be trying to call startWorkflow. This method uses an SDK client. However, when the server shuts down all of its services, it may shut down the frontend handler before the scanner. As a result, all calls to the SDK client fail, causing startWorkflow to fail. We currently retry startWorkflow many times, so the scanner can enter a long-running retry loop. To fix this, we now cancel the retries as soon as the scanner is shut down.
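To illustrate the fix described above, here is a minimal sketch of a retry loop that stops as soon as its context is canceled. It is not the scanner's actual code: `retryUntilCanceled`, `simulateShutdown`, and `errUnavailable` are hypothetical names, and the failing operation merely stands in for an SDK call made after the frontend has gone away.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errUnavailable is a hypothetical stand-in for the error an SDK call
// returns once the frontend handler has already been shut down.
var errUnavailable = errors.New("frontend unavailable")

// retryUntilCanceled calls op until it succeeds or ctx is canceled.
// It returns the number of attempts made and the final error.
func retryUntilCanceled(ctx context.Context, interval time.Duration, op func(context.Context) error) (int, error) {
	attempts := 0
	for {
		attempts++
		if err := op(ctx); err == nil {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			// Shutdown was requested: give up instead of retrying again.
			return attempts, ctx.Err()
		case <-time.After(interval):
			// Back off, then try again.
		}
	}
}

// simulateShutdown runs an operation that always fails and cancels it after
// stopAfterMillis, mimicking Stop() being called while retries are in flight.
// It reports the attempt count and whether the loop ended due to cancellation.
func simulateShutdown(stopAfterMillis int) (int, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(time.Duration(stopAfterMillis)*time.Millisecond, cancel)
	attempts, err := retryUntilCanceled(ctx, 5*time.Millisecond, func(context.Context) error {
		return errUnavailable
	})
	return attempts, errors.Is(err, context.Canceled)
}

func main() {
	attempts, canceled := simulateShutdown(30)
	fmt.Printf("attempts=%d canceled=%v\n", attempts, canceled)
}
```

Without the `ctx.Done()` case, the loop would keep retrying for as long as the operation keeps failing, which is the long-running behavior this PR is meant to remove.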

How did you test it?
I added a test case which simulates the setup described above.

Potential risks
We may see some more error logs about SDK requests being canceled when servers are started/stopped.

Is hotfix candidate?
No.

dnr · 2023-01-03T21:09:23Z

service/worker/scanner/scanner.go

-		wg      sync.WaitGroup
+		context      scannerContext
+		wg           sync.WaitGroup
+		shutdownOnce channel.ShutdownOnce


How is this different/better than storing a Context and a CancelFunc in the Scanner and cancelling it on Stop? (and using that Context as the base context for ThrottleRetryContext, WithTimeout, etc.) It seems like that's a little less code and doesn't require familiarity with this other library.

MichaelSnowden · 2023-01-18

Contexts should never be stored in structs: https://go.dev/blog/context-and-structs

Also, encapsulating this behavior in ShutdownOnce makes this reusable.
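For readers unfamiliar with the idea, a "shutdown once" primitive can be sketched roughly as follows. This is a hypothetical, minimal stand-in written for illustration (the type `shutdownOnce`, its methods, and `runWorkers` are assumptions), not the library type referenced in the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownOnce is a hypothetical helper that lets any number of callers
// request shutdown while the underlying channel is closed exactly once.
type shutdownOnce struct {
	once sync.Once
	ch   chan struct{}
}

func newShutdownOnce() *shutdownOnce {
	return &shutdownOnce{ch: make(chan struct{})}
}

// Shutdown closes the channel on the first call; later calls are no-ops.
func (s *shutdownOnce) Shutdown() {
	s.once.Do(func() { close(s.ch) })
}

// Done returns a channel that is closed once Shutdown has been called.
func (s *shutdownOnce) Done() <-chan struct{} { return s.ch }

// IsShutdown reports whether Shutdown has been called.
func (s *shutdownOnce) IsShutdown() bool {
	select {
	case <-s.ch:
		return true
	default:
		return false
	}
}

// runWorkers starts n goroutines that block until shutdown is signaled and
// returns how many of them exited.
func runWorkers(n int) int {
	s := newShutdownOnce()
	var wg sync.WaitGroup
	var mu sync.Mutex
	exited := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-s.Done() // wait for the shutdown signal
			mu.Lock()
			exited++
			mu.Unlock()
		}()
	}
	s.Shutdown()
	s.Shutdown() // calling it twice is safe
	wg.Wait()
	return exited
}

func main() {
	fmt.Println("workers exited:", runWorkers(3))
}
```

The reusability argument is that any component with a Stop method could embed a value like this and have its background goroutines select on `Done()`.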

The best solution is to change our service API entirely to remove all Start and Stop methods in favor of a Run(ctx) function. Then, we wouldn't have to do anything complicated and we could shut down by just calling cancel. However, given that this would require a huge migration, I can't do it for this one fix. This solution gives us a simple and reusable way to ensure that long-running processes spawned by workers are terminated when the worker is terminated, without violating the guidelines.
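As a rough sketch of what the Run(ctx) shape mentioned above could look like, the following uses a hypothetical `service` type and helper `runThenCancel`. None of this is the project's actual API; it only shows that shutdown reduces to calling cancel.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// service is a hypothetical component with a single blocking Run method
// in place of separate Start and Stop methods.
type service struct{}

// Run does periodic work until ctx is canceled, then returns ctx.Err().
func (s *service) Run(ctx context.Context) error {
	ticker := time.NewTicker(2 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Periodic work would go here.
		}
	}
}

// runThenCancel runs the service for runForMillis, cancels it, and reports
// whether Run returned promptly with context.Canceled.
func runThenCancel(runForMillis int) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	svc := &service{}
	done := make(chan error, 1)
	go func() { done <- svc.Run(ctx) }()

	time.Sleep(time.Duration(runForMillis) * time.Millisecond)
	cancel() // shutting down is just canceling the context

	select {
	case err := <-done:
		return errors.Is(err, context.Canceled)
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("stopped cleanly:", runThenCancel(10))
}
```

In this shape the context is passed as an argument rather than stored, so the caller owns the lifetime of the service.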

dnr · 2023-01-18

There are times when it makes sense to put contexts in structs, and we do it in many places, including in this file already: see the BackgroundActivityContext in Start.

The arguments in that article are mainly about a component storing a context and also receiving contexts as arguments from callers. But the scanner is a background process that only calls out; things don't call into it. So that doesn't really apply. If you think about the idea of contexts as representing "where a call is coming from", it seems to me like the scanner and similar background processes do deserve their own contexts.

And we have one already, the BackgroundActivityContext, which does have useful contextual information (callerinfo). If we just add a cancel to that, and use it as the base context for the start workflow calls, in addition to activities, then we get what we want.
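The alternative proposed in this comment can be sketched as follows, assuming a hypothetical `scanner` type that keeps a cancelable base context carrying caller information. The names `callerKey`, `newScanner`, `startWorkflow`, and `stopDuringCall` are illustrative only and do not reflect how BackgroundActivityContext is actually built.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callerKey is a hypothetical context key for caller information.
type callerKey struct{}

// scanner is a hypothetical background component that owns a base context.
type scanner struct {
	baseCtx context.Context
	cancel  context.CancelFunc
}

// newScanner builds a base context that carries caller info and can be canceled.
func newScanner(caller string) *scanner {
	ctx := context.WithValue(context.Background(), callerKey{}, caller)
	ctx, cancel := context.WithCancel(ctx)
	return &scanner{baseCtx: ctx, cancel: cancel}
}

// Stop cancels the base context and, with it, every context derived from it.
func (s *scanner) Stop() { s.cancel() }

// startWorkflow is a stand-in for an outbound call that never succeeds.
// Its per-call timeout context inherits both the caller value and the
// cancellation of the base context.
func (s *scanner) startWorkflow() (string, error) {
	ctx, cancel := context.WithTimeout(s.baseCtx, time.Minute)
	defer cancel()
	caller, _ := ctx.Value(callerKey{}).(string)
	<-ctx.Done() // blocks until the timeout fires or Stop is called
	return caller, ctx.Err()
}

// stopDuringCall calls Stop while startWorkflow is blocked and reports the
// caller value seen by the call and whether it ended due to cancellation.
func stopDuringCall() (string, bool) {
	s := newScanner("scanner")
	time.AfterFunc(10*time.Millisecond, s.Stop)
	caller, err := s.startWorkflow()
	return caller, errors.Is(err, context.Canceled)
}

func main() {
	caller, canceled := stopDuringCall()
	fmt.Printf("caller=%s canceled=%v\n", caller, canceled)
}
```

Because the one-minute timeout context is derived from the base context, calling Stop ends the blocked call right away instead of waiting for the timeout.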

MichaelSnowden · 2023-01-19T18:53:53Z

Moved to #3818

MichaelSnowden requested a review from a team as a code owner December 19, 2022 20:30

MichaelSnowden changed the title ~~Fix a rare deadlock in scanner.Stop~~ Fix long-running retry loop when Scanner is shutdown Dec 19, 2022

yux0 approved these changes Dec 21, 2022


dnr reviewed Jan 3, 2023


Fix a rare deadlock in scanner.Stop

9327e8e

MichaelSnowden force-pushed the snowden/fix-scanner-shutdown branch from bc05257 to 9327e8e January 19, 2023 00:31

Merge branch 'snowden/shutdown-once' into snowden/fix-scanner-shutdown

3ed61ee

MichaelSnowden deleted the branch snowden/shutdown-once January 19, 2023 17:45

MichaelSnowden closed this Jan 19, 2023

MichaelSnowden commented Dec 19, 2022

dnr Jan 3, 2023

MichaelSnowden Jan 18, 2023

MichaelSnowden Jan 18, 2023

MichaelSnowden Jan 18, 2023

dnr Jan 18, 2023

MichaelSnowden commented Jan 19, 2023