Test failure: Interop\COM\NativeClients\Events\Events.cmd #37236
@jkoritzinsky @AaronRobinsonMSFT another interop gcstress failure |
I can't reproduce this locally at all, despite several attempts. There was a recent change in #37116 which fixed a long-standing issue in x86 with GCStress. If this reoccurs in the future, a deeper investigation will be warranted. |
@AaronRobinsonMSFT This failed again this weekend in the weekly GCStress run. |
I had to run this in a loop 3000 times... it finally triggered on run 2984, which took about 2.5 hrs. This is not the test itself, but rather the runner. I have a full DMP of the issue.
@janvorli or @jkotas any suggestions on where to look are helpful. I will try the few things each of you have suggested in the past and see if anything pops. |
Looks like the object pointer on which |
@BruceForstall Yeah I was starting down that path. I believe I have the object address, but it isn't in the stresslog. The object pointer I am looking at is a reasonable address but when dereferenced points to |
The failure is happening on line 65 below. The test is a bit complicated because the actual test is a native EXE which is launched by a managed entry point that is run under GCStress. Based on the DMP there are two things going on:
1. I can't find the object address in the stress logs, so it is unclear where it came from; perhaps it is new or was never moved?
2. Given where the launcher is in the process-creation steps, the native process hasn't been started yet. We are at the early stages of setting up pipes to communicate with the as-yet-unlaunched process.

runtime/src/coreclr/tests/src/Interop/common/ExeLauncherProgram.cs Lines 65 to 75 in bbb5902
|
I see where the process handle from (2) above is coming from. That is being used during the first pipe creation for stdout, the call to get a pipe for stderr is where this is failing. |
Since it fails in |
@janvorli Good call. The previous object is a |
Or some unsafe code manipulating the string has written behind the end of the string by accident. It might be interesting to get the GC roots of that string to see if it reveals where it is being used. |
EDIT: Perhaps not. Unsure if this is related, but the comment did seem to imply something similar was occurring. |
After chatting about this with @jkotas and debugging through the GC Cover code, I believe this is an issue with the insertion of GC Cover breakpoints. The following appears in both DMPs I have been able to collect.

Note that this is an extremely rare occurrence and has nothing to do with COM. The failure in this test is from a small runner app that launches another exe. The generated code on Windows x64 is as follows for the two DMPs:

00007ffa`acc325fa ffd0 call rax
00007ffa`acc325fc 488b9548ffffff mov rdx,qword ptr [rbp-0B8h]
00007ffa`acc32603 c6420c01 mov byte ptr [rdx+0Ch],1
00007ffa`acc32607 48baf0ddb90cfb7f0000 mov rdx,offset coreclr!g_TrapReturningThreads (00007ffb`0cb9ddf0)
00007ffa`acc32611 833a00 cmp dword ptr [rdx],0
00007ffa`acc32614 740c je 00007ffa`acc32622
00007ffa`acc32616 48b93073b90cfb7f0000 mov rcx,offset coreclr!hlpDynamicFuncTable+0x150 (00007ffb`0cb97330)
00007ffa`acc32620 ff11 call qword ptr [rcx] (gcstress) (JitHelp: CORINFO_HELP_STOP_FOR_GC)
00007ffa`acc32622 898554ffffff mov dword ptr [rbp-0ACh],eax

The call at 00007ffa`acc32620 is the one instrumented for GC stress. I believe that if the JIT could mark the inline check as "uninterruptible" the issue would go away. I have two DMPs if anyone wants to look at examples. As mentioned, this is a very rare failure to observe. The first run took more than 2 hours to trigger; luckily the second took only 10 minutes, but it is rare. /cc @dotnet/jit-contrib |
Would not surprise me if there is a race here, though I do not quite understand where things go wrong. Per the later comments if this is a stress-only issue, we've been unwilling to modify the GC info just to handle those cases. I wonder if a fix similar to #37432 makes sense here -- if we hit a stress interrupt and the thread is in preemptive mode (and perhaps, not at a call), don't initiate a GC. |
The problem is that the thread is marked as being in cooperative mode already, but it is not fully switched over yet. The only reliable way to tell we are in this spot is by looking at the method code. We can make a pessimistic estimate e.g. give up when |
We can make the check precise if the linking/unlinking of PInvoke frames is done exactly around the call. We have discussed it as the right thing to do - more context in #34526 (comment). |
Sorry to be dense, I'd like to understand this better. Is the bug that if we have both a normal GC and a stress GC while preemptive mode is enabled and the pinvoke frame is still active, we do different GC reporting? Or is it a race of some kind? |
Yes, it is a race. The problem happens when:
We may fail to properly suspend the system in this situation, both threads enter the GC, and corrupt the GC state in arbitrary way. |
Ok, thanks. We need to zero in on how we're going to proceed here. I suspect a number of our recent sporadic stress failures may be caused by this race. Fadi suggests most of the work for precise link/unlink is in the runtime:
Do we have a handle on how much work this is, or should we look into suppressing stress interrupts if the system is trapping returning threads, or can we duplicate more of the pinvoke post-call logic within |
I don't fully understand what Fadi is talking about - my ignorance in this area. If someone can elaborate on the suggestion that would be helpful. |
The runtime supports this contract for R2R; that is on this plan already. There are a few places in the runtime (look for |
We may combine the check for trapping returning threads with a check for an active PInvoke frame. It should catch this situation and still be rare enough. Something like:
|
We would need a condition like this even with the fix that pops the frame around PInvoke callsites. |
Ok, will start in on this. I'm first going to see if adding a nop sled (or similar) to the post-pinvoke call sequence widens the race window and so boosts the reproducibility of this bug, so that once we have a fix there's a more reliable way of telling if the issue is really addressed. |
If I understand correctly, this seems like it should also address #330. If the runtime's exception handler reliably pops off the linked ICF, then we should be able to use ICFs in try regions without restrictions, right? |
Currently adding 16 mfence instructions:

G_M45425_IG05:
488B9578FFFFFF mov rdx, qword ptr [rbp-88H]
C6420C01 mov byte ptr [rdx+12], 1
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
833D1372F55F00 cmp dword ptr [(reloc)], 0
7406 je SHORT G_M45425_IG06
FF156B07F55F call [CORINFO_HELP_STOP_FOR_GC] |
Unfortunately not. I had the following set against a Win-x64 Checked version of the runtime:

set COMPlus_GCStress=0xC
set COMPlus_JITMinOpts=1
set COMPlus_HeapVerify=1

As noted, the following script ran for a good long time (~2.5 hrs) before it triggered:

@ECHO OFF
SETLOCAL EnableDelayedExpansion
FOR /l %%S in (1, 1, 3000) DO (
VERIFY >nul
ECHO B %%S : !ERRORLEVEL!
D:\runtime\artifacts\tests\coreclr\Windows_NT.x64.Checked\Tests\Core_Root\corerun.exe Events.dll
ECHO E %%S : !ERRORLEVEL!
IF NOT "!ERRORLEVEL!"=="100" GOTO :ERROR
)
GOTO :EOF
:ERROR
ECHO Failed! |
Was the machine running anything else significant during this time? Races will sometimes repro more readily if the machine is heavily loaded (as the CI machines likely are) since the OS scheduler is forced to be more creative. |
Yeah. This is an old QA trick that I occasionally try. In this case it was running on my local dev machine without much input from me. I did have Outlook, Teams, and a web browser up so they were definitely doing things, but nothing was hitting the CPU with much force (i.e. compile or video decoding). |
Yup. I assume that you meant the linking/unlinking around the call. I am not sure whether linking/unlinking around the call is strictly required to fix this. The condition that I have suggested in #37236 (comment) may be enough. |
Aaron's crash had:
Maybe this makes the difference. |
I have two repros under the debugger now that look exactly like the above (out of ~1200 runs). It is puzzling to me why a much wider window would not lead to failures somewhere in the stretch between switching to preemptive mode and the observed failure spot. I'm running with 256 mfences in there now.

The GC stress opcode doesn't normally get removed until after DoGCStress has returned. So could it be the call that's causing the GC, not the instruction before? I'm probably wrong, because we'd expect that call to have been handled specially and restored already.

I'm going to try and add a bit more instrumentation. There is some state capture going on via
|
Looked at this again, and it's the mov that sets up the address for the stop_for_gc helper that is the instruction at issue. I suppose I can move the "nop" sled down to sit between the check for trap returning threads and the call to the stop helper, and that should boost the reproducibility of this.

This also seems to jibe with the observation that if the offset of the helper table entry can be RIP-relative, then we seemingly don't fail: there is just the call instruction with no mov before it, and GC interrupts from the call get suppressed. Might not be too hard to extend the current call protection to the mov too. |
Testing a solution along the lines of Jan's suggestion above: #37236 (comment)

This would be more general than trying to pattern match all the possible instructions we emit between observing that g_TrapReturningThreads is set and calling CORINFO_HELP_STOP_FOR_GC.

Seems to be holding up so far, and has reached the point (~500 runs) where I was seeing repros before. I'll let it go for a few thousand runs. Also will check that the total number of GCs is similar with and without the fix. |
Proposed fix:

@@ -1425,6 +1425,25 @@ BOOL OnGcCoverageInterrupt(PCONTEXT regs)
return TRUE;
}
+ // If we're in cooperative mode, we're supposed to stop for GC,
+ // and there's an active ICF, don't initiate a stress GC.
+ if (g_TrapReturningThreads && pThread->PreemptiveGCDisabled())
+ {
+ Frame* pFrame = pThread->GetFrame();
+
+ // Note if we're fully in COOP mode there may be no frame,
+ // but if we're in coop mode just after a pinvoke we may
+ // not have unlinked the ICF yet.
+ if ((pFrame != NULL)
+ && (pFrame != FRAME_TOP)
+ && (pFrame->GetVTablePtr() == InlinedCallFrame::GetMethodFrameVPtr())
+ && InlinedCallFrame::FrameHasActiveCall(pFrame))
+ {
+ RemoveGcCoverageInterrupt(instrPtr, savedInstrPtr);
+ return TRUE;
+ }
+ }
+ |
It is sufficient to just call the method shown here:

runtime/src/coreclr/src/vm/frames.h Lines 2955 to 2963 in 17d413f
|
Thanks, I'll update. With the fix, I'm now up to 1250 runs without a repro.... |
In the post-call part of a pinvoke inline call frame, it's not safe to start a stress mode GC in the window between checking `g_TrapReturningThreads` and the call to `CORINFO_HELP_STOP_FOR_GC`. The call instruction is already getting special treatment, but there may be other instructions between the check and call. Instead of trying to pattern match them all, suppress GC stress if `g_TrapReturningThreads` is true, the thread is in cooperative mode, and there's an active inline call frame. Closes dotnet#37236.
failed in job: runtime-coreclr gcstress0x3-gcstress0xc 20200531.1