Test failure: Interop\COM\NativeClients\Events\Events.cmd #37236
@jkoritzinsky @AaronRobinsonMSFT another interop gcstress failure |
I can't reproduce this locally at all, despite several attempts. There was a recent change in #37116 which fixed a long-standing issue in x86 with GCStress. If this reoccurs in the future, a deeper investigation will be warranted. |
@AaronRobinsonMSFT This failed again this weekend in the weekly GCStress run. |
I had to run this in a loop 3000 times... it finally triggered on run 2984, which took about 2.5 hrs. This is not the test itself, but rather the runner. I have a full DMP of the issue.
@janvorli or @jkotas any suggestions on where to look are helpful. I will try the few things each of you have suggested in the past and see if anything pops. |
Looks like the object pointer on which |
@BruceForstall Yeah I was starting down that path. I believe I have the object address, but it isn't in the stresslog. The object pointer I am looking at is a reasonable address but when dereferenced points to |
The failure is happening on line 65 below. The test is a bit complicated because the actual test is a native EXE which is launched by a managed entry point that is run under GCStress. Based on the DMP there are two things going on:
1. I can't find the object address in the stress logs, so it is unclear where it came from; perhaps it is new or was never moved?
2. Given where the launcher is in the process-creation steps, the native process hasn't been started yet. We are at the early stages of setting up pipes to communicate with the as-yet-unlaunched process.

runtime/src/coreclr/tests/src/Interop/common/ExeLauncherProgram.cs Lines 65 to 75 in bbb5902
|
I see where the process handle from (2) above is coming from. That is being used during the first pipe creation for stdout, the call to get a pipe for stderr is where this is failing. |
Since it fails in |
@janvorli Good call. The previous object is a |
Or some unsafe code manipulating the string has written behind the end of the string by accident. It might be interesting to get the GC roots of that string to see if it reveals where it is being used. |
EDIT: Perhaps not. Unsure if this is related, but the comment did seem to imply something similar was occurring. |
After chatting about this with @jkotas and debugging through the GC Cover code, I believe this is an issue with the insertion of GC Cover breakpoints. The following appears in both DMPs I have been able to collect.

Note that this is an extremely rare occurrence and has nothing to do with COM. The failure in this test is from a small runner app that launches another exe. The generated code on Windows x64 is as follows for the two DMPs:

00007ffa`acc325fa ffd0 call rax
00007ffa`acc325fc 488b9548ffffff mov rdx,qword ptr [rbp-0B8h]
00007ffa`acc32603 c6420c01 mov byte ptr [rdx+0Ch],1
00007ffa`acc32607 48baf0ddb90cfb7f0000 mov rdx,offset coreclr!g_TrapReturningThreads (00007ffb`0cb9ddf0)
00007ffa`acc32611 833a00 cmp dword ptr [rdx],0
00007ffa`acc32614 740c je 00007ffa`acc32622
00007ffa`acc32616 48b93073b90cfb7f0000 mov rcx,offset coreclr!hlpDynamicFuncTable+0x150 (00007ffb`0cb97330)
00007ffa`acc32620 ff11 call qword ptr [rcx] (gcstress) (JitHelp: CORINFO_HELP_STOP_FOR_GC)
00007ffa`acc32622 898554ffffff mov dword ptr [rbp-0ACh],eax

The call at 00007ffa`acc32620 is the one instrumented for GC stress. I believe that if the JIT could mark the inline check as "uninterruptible" the issue would go away. I have two DMPs if anyone wants to look at examples. As mentioned, this is a very rare failure to observe. The first run took more than 2 hours to trigger; luckily the second took only 10 minutes, but it is rare. /cc @dotnet/jit-contrib |
Would not surprise me if there is a race here, though I do not quite understand where things go wrong. Per the later comments if this is a stress-only issue, we've been unwilling to modify the GC info just to handle those cases. I wonder if a fix similar to #37432 makes sense here -- if we hit a stress interrupt and the thread is in preemptive mode (and perhaps, not at a call), don't initiate a GC. |
The problem is that the thread is marked as being in cooperative mode already, but it is not fully switched over yet. The only reliable way to tell we are in this spot is by looking at the method code. We can make a pessimistic estimate e.g. give up when |
We can make the check precise if the linking/unlinking of PInvoke frames is done exactly around the call. We have discussed it as the right thing to do - more context in #34526 (comment). |
Sorry to be dense, I'd like to understand this better. Is the bug that if we have both a normal GC and a stress GC while preemptive mode is enabled and the pinvoke frame is still active, we do different GC reporting? Or is it a race of some kind? |
Yes, it is a race. The problem happens when:
We may fail to properly suspend the system in this situation, both threads enter the GC, and corrupt the GC state in arbitrary way. |
Ok, thanks. We need to zero in on how we're going to proceed here. I suspect a number of our recent sporadic stress failures may be caused by this race. Fadi suggests most of the work for precise link/unlink is in the runtime:
Do we have a handle on how much work this is, or should we look into suppressing stress interrupts if the system is trapping returning threads, or can we duplicate more of the pinvoke post-call logic within |
I don't fully understand what Fadi is talking about - my ignorance in this area. If someone can elaborate on the suggestion that would be helpful. |
The runtime supports this contract for R2R; that is on this plan already. There are a few places in the runtime (look for |
We may combine the check for trapping returning threads with a check for an active PInvoke frame. It should catch this situation and still be rare enough. Something like:
|
We would need a condition like this even with the fix that pops the frame around PInvoke callsites. |
Ok, will start in on this. I'm first going to see if adding a nop sled (or similar) to the post-pinvoke call sequence widens the race window and so boosts the reproducibility of this bug, so that once we have a fix there's a more reliable way of telling if the issue is really addressed. |
If I understand correctly, this seems like it should also address #330. If the runtime's exception handler reliably pops off the linked ICF, then we should be able to use ICFs in try regions without restrictions, right? |
Currently adding 16 mfence instructions:

G_M45425_IG05:
488B9578FFFFFF mov rdx, qword ptr [rbp-88H]
C6420C01 mov byte ptr [rdx+12], 1
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
0FAEF0 mfence
833D1372F55F00 cmp dword ptr [(reloc)], 0
7406 je SHORT G_M45425_IG06
FF156B07F55F call [CORINFO_HELP_STOP_FOR_GC] |
Unfortunately not. I had the following set against a Win-x64 Checked version of the runtime:

set COMPlus_GCStress=0xC
set COMPlus_JITMinOpts=1
set COMPlus_HeapVerify=1

As noted, the following script ran for a good long time (~2.5 hrs) before it triggered:

@ECHO OFF
SETLOCAL EnableDelayedExpansion
FOR /l %%S in (1, 1, 3000) DO (
VERIFY >nul
ECHO B %%S : !ERRORLEVEL!
D:\runtime\artifacts\tests\coreclr\Windows_NT.x64.Checked\Tests\Core_Root\corerun.exe Events.dll
ECHO E %%S : !ERRORLEVEL!
IF NOT "!ERRORLEVEL!"=="100" GOTO :ERROR
)
GOTO :EOF
:ERROR
ECHO Failed! |
Was the machine running anything else significant during this time? Races will sometimes repro more readily if the machine is heavily loaded (as the CI machines likely are) since the OS scheduler is forced to be more creative. |
Yeah. This is an old QA trick that I occasionally try. In this case it was running on my local dev machine without much input from me. I did have Outlook, Teams, and a web browser up so they were definitely doing things, but nothing was hitting the CPU with much force (i.e. compile or video decoding). |
Yup. I assume that you meant the linking/unlinking around the call. I am not sure whether linking/unlinking around the call is strictly required to fix this. The condition that I have suggested in #37236 (comment) may be enough. |
Aaron's crash had:
Maybe this makes the difference. |
I have two repros under the debugger now that look exactly like the above (out of ~1200 runs). It is puzzling to me why a much wider window would not lead to failures somewhere in the stretch between switching to preemptive mode and the observed failure spot. I'm running with 256 mfences in there now.

The GC stress opcode doesn't normally get removed until after DoGCStress has returned. So could it be the call that's causing the GC, not the instruction before? I'm probably wrong, because we'd expect that call to have been handled specially and restored already.

I'm going to try and add a bit more instrumentation. There is some state capture going on via
|
Looked at this again, and it's the mov that sets up the address for the stop_for_gc helper that is the instruction at issue. I suppose I can move the "nop" sled down to sit between the check for trap returning threads and the call to the stop helper, and that should boost the reproducibility of this.

This also seems to jibe with the observation that if the offset of the helper table entry can be RIP-relative, then we seemingly don't fail: there is just the call instruction with no mov before it, and GC interrupts from the call get suppressed. Might not be too hard to extend the current call protection to the mov too. |
Testing a solution along the lines of Jan's suggestion above: #37236 (comment)

This would be more general than trying to pattern match all the possible instructions we emit between observing that g_TrapReturningThreads is set and calling CORINFO_HELP_STOP_FOR_GC.

Seems to be holding up so far, and has reached the point (~500 runs) where I was seeing repros before. I'll let it go for a few thousand runs. Also will check that the total number of GCs is similar with and without the fix. |
Proposed fix:

@@ -1425,6 +1425,25 @@ BOOL OnGcCoverageInterrupt(PCONTEXT regs)
return TRUE;
}
+ // If we're in cooperative mode, we're supposed to stop for GC,
+ // and there's an active ICF, don't initiate a stress GC.
+ if (g_TrapReturningThreads && pThread->PreemptiveGCDisabled())
+ {
+ Frame* pFrame = pThread->GetFrame();
+
+ // Note if we're fully in COOP mode there may be no frame,
+ // but if we're in coop mode just after a pinvoke we may
+ // not have unlinked the ICF yet.
+ if ((pFrame != NULL)
+ && (pFrame != FRAME_TOP)
+ && (pFrame->GetVTablePtr() == InlinedCallFrame::GetMethodFrameVPtr())
+ && InlinedCallFrame::FrameHasActiveCall(pFrame))
+ {
+ RemoveGcCoverageInterrupt(instrPtr, savedInstrPtr);
+ return TRUE;
+ }
+ }
+ |
It is sufficient to just call the method shown here:

runtime/src/coreclr/src/vm/frames.h Lines 2955 to 2963 in 17d413f
|
Thanks, I'll update. With the fix, I'm now up to 1250 runs without a repro.... |
In the post-call part of a pinvoke inline call frame, it's not safe to start a stress mode GC in the window between checking `g_TrapReturningThreads` and the call to `CORINFO_HELP_STOP_FOR_GC`. The call instruction is already getting special treatment, but there may be other instructions between the check and call. Instead of trying to pattern match them all, suppress GC stress if `g_TrapReturningThreads` is true, the thread is in cooperative mode, and there's an active inline call frame. Closes dotnet#37236.
failed in job: runtime-coreclr gcstress0x3-gcstress0xc 20200531.1