Fix a rare deadlock in scanner.Stop #3818
Merged
What changed?
I modified the Stop() method of our scanners to cancel running attempts to start workflows.
Why?
I did this because it fixes a race condition which often occurs when the server is shut down soon after starting. When the server is shut down, the scanner may still be trying to run startWorkflow. This method uses an SDK client. However, when the server shuts down all of its services, it may shut down the frontend handler before the scanner. As a result, all calls to the SDK client fail, causing startWorkflow to fail. Currently, we retry startWorkflow many times, so the scanner can enter a long-running retry loop. To fix this, we now cancel the retries as soon as the scanner is shut down.
How did you test it?
I added a test case which simulates the setup described above.
Potential risks
We may see some more error logs about SDK requests being canceled when servers are started/stopped.
Is hotfix candidate?
No.