Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reliability issues in the parallelize tool #1632

Merged
merged 2 commits into from
Feb 18, 2021

Conversation

gix
Copy link
Contributor

@gix gix commented Feb 8, 2021

Fixes #1619

* output_collecting_pipe's destructor could hang on exit because it did not
  clear the running flag before cancelling. The threadpool callback would
  then start another IO operation which never completes.
* Use a manual reset event for overlapped IO to prevent potential memory
  corruption. AFAICT waiting on threadpool IO with named pipe handles is
  unreliable and may result in the kernel writing to a free'd OVERLAPPED
  struct.
@gix gix requested a review from a team as a code owner February 8, 2021 13:57
@ghost
Copy link

ghost commented Feb 8, 2021

CLA assistant check
All CLA requirements met.

@StephanTLavavej StephanTLavavej added the infrastructure Related to repository automation label Feb 9, 2021
@StephanTLavavej
Copy link
Member

Thanks! I'm far from a Windows API expert but I compared this to the documentation (e.g. the OVERLAPPED docs) and this all looks good to me. 😸

@CaseyCarter
Copy link
Contributor

FYI @BillyONeal. Feel free to review if you like, mostly I want you to know we've found a race condition in parallelize that you may want to pick up if you've reused the tool elsewhere.

Copy link
Member

@barcharcraz barcharcraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me after clarifications to STL's questions.

@BillyONeal
Copy link
Member

a call to GetOverlappedResult (or an equivalent wait operation) must be inserted after CancelIoEx.

The thread pool is responsible for performing that wait, and then we wait on the threadpool with WaitForThreadpoolIoCallbacks.

Doing several non-blocking GetOverlappedResult calls (without a manual reset event)

You need separate manual reset events if you issue multiple outstanding IOs on the same file handle. If an event is not supplied, the "secret event" in the file handle itself is used as the synchronization entity. ( https://devblogs.microsoft.com/oldnewthing/20190719-00/?p=102722 ) We never issue concurrent IOs here so the one in the file handle should be OK.

@gix
Copy link
Contributor Author

gix commented Feb 11, 2021

The thread pool is responsible for performing that wait, and then we wait on the threadpool with WaitForThreadpoolIoCallbacks.

To my knowledge this isn't true. The docs state:

When fCancelPendingCallbacks is set to TRUE, only queued callbacks are canceled. Pending I/O requests are not canceled. Therefore, the caller should call GetOverlappedResult for the OVERLAPPED structure to check whether the I/O operation has completed before freeing the structure.

Completed here means that the result is written back to the OVERLAPPED structure (aborted or not). So in the repro case, there aren't any queued callbacks, just a single pending one (from the last call to read_some) which is then cancelled, but nothing waits for it to actually complete, and the OVERLAPPED goes out of scope before the result is written back, corrupting the stack.

You need separate manual reset events if you issue multiple outstanding IOs on the same file handle. If an event is not supplied, the "secret event" in the file handle itself is used as the synchronization entity. ( https://devblogs.microsoft.com/oldnewthing/20190719-00/?p=102722 ) We never issue concurrent IOs here so the one in the file handle should be OK.

This is my understanding as well. But here the named pipe handle cannot be used to wait, it's never signaled (again?). But "waiting" with a loop of non-blocking GetOverlappedResult (which is basically the same as checking the lower DWORD of OVERLAPPED.Internal for an NTSTATUS code) clearly shows that the operation is still incomplete. This also happens when the WaitForThreadpoolIoCallbacks call is moved before or just after CancelIoEx.

@BillyONeal
Copy link
Member

To my knowledge this isn't true. The docs state:

When fCancelPendingCallbacks is set to TRUE, only queued callbacks are canceled. Pending I/O requests are not canceled. Therefore, the caller should call GetOverlappedResult for the OVERLAPPED structure to check whether the I/O operation has completed before freeing the structure.

I see, then maybe the right fix is to pass false rather than true there?

But here the named pipe handle cannot be used to wait, it's never signaled (again?).

Yes, because we passed FILE_SKIP_SET_EVENT_ON_HANDLE. We are not ever waiting on these events, we are using completion ports (via the thread pool) instead of these blocking primitives.

@StephanTLavavej
Copy link
Member

The gh1619-repro branch indeed repros for me, thanks. (I observe The API "CancelIoEx" failed unexpectedly; last error 0x00000490.) Calling WaitForThreadpoolWaitCallbacks(wait, FALSE); instead of TRUE doesn't fix it. The event does fix it.

It's interesting that enum_files_using_git() doesn't actually do anything with the result; it seems that merely beginning and completing the subprocess_executive is enough to cause the repro. (It's also interesting that repro seems to happen quite readily when run in the tools\parallelize directory, where it processes only one file.) Changing the command to git status appears to make the repro vanish, so perhaps it's related to processing a large amount of returned input?

I think I'm satisfied with this investigation. We have a repro, we have a change that fixes it, the change doesn't appear to be harmful in any way, and although we can't really explain why it works, the documentation is overall very vague (as Billy noted) so this isn't too surprising. We know of no other solutions, so I think we should merge the fix and revise it later if we learn more. (And I ultimately suspect that this is responsible for occasional infrastructure hangs, so there is a real productivity improvement here.)

@gix
Copy link
Contributor Author

gix commented Feb 15, 2021

The gh1619-repro branch indeed repros for me, thanks. (I observe The API "CancelIoEx" failed unexpectedly; last error 0x00000490.) Calling WaitForThreadpoolWaitCallbacks(wait, FALSE); instead of TRUE doesn't fix it. The event does fix it.

I still believe that it's not fully correct (why it already works with just an event I don't know). Either:

  1. Remove FILE_SKIP_SET_EVENT_ON_HANDLE (which I initially missed) and wait after cancelling with GetOverlappedResult.
  2. Keep FILE_SKIP_SET_EVENT_ON_HANDLE but instead use event and also wait with GetOverlappedResult (this one now seems unnecessary since we don't have multiple IO operations on the same handle).

I don't think just using WaitForThreadpoolWaitCallbacks(wait, FALSE); is enough. In my tests it seems to work but the docs don't support it.

@StephanTLavavej
Copy link
Member

I've mirrored this to the MSVC-internal repo for merging. If you can improve the fix with GetOverlappedResult that would be great as a followup PR.

@StephanTLavavej StephanTLavavej self-assigned this Feb 18, 2021
@StephanTLavavej StephanTLavavej merged commit 188c35a into microsoft:main Feb 18, 2021
@StephanTLavavej
Copy link
Member

Thanks for improving this tool, and congratulations on your first microsoft/STL commit! 🎉 ⚙️ 😸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infrastructure Related to repository automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parallelize tool may hang or crash
5 participants