Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: Interop\\COM\\NativeClients\\Events\\Events.cmd #37236

Closed
v-haren opened this issue Jun 1, 2020 · 39 comments · Fixed by #38246
Closed

Test failure: Interop\\COM\\NativeClients\\Events\\Events.cmd #37236

v-haren opened this issue Jun 1, 2020 · 39 comments · Fixed by #38246
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI GCStress
Milestone

Comments

@v-haren
Copy link

v-haren commented Jun 1, 2020

failed in job: runtime-coreclr gcstress0x3-gcstress0xc 20200531.1

Error message


Return code: 1
Raw output file: C:\h\w\A7A808D4\w\AD6E08F2\e\Interop\COM\Reports\Interop.COM\NativeClients\Events\Events.output.txt
Raw output:
BEGIN EXECUTION
Copying 'C:\h\w\A7A808D4\p\CoreShim.dll'...
 1 file(s) copied.
 "C:\h\w\A7A808D4\p\corerun.exe" Events.dll 
Searching for exe to launch in C:\h\w\A7A808D4\w\AD6E08F2\e\Interop\COM\NativeClients\Events...
Launching 'C:\h\w\A7A808D4\w\AD6E08F2\e\Interop\COM\NativeClients\Events\COMClientEvents.exe'...

Expected: 100
Actual: -1073741819
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=C:\h\w\A7A808D4\p
> C:\h\w\A7A808D4\w\AD6E08F2\e\Interop\COM\NativeClients\Events\Events.cmd
Expected: True
Actual: False


Stack trace
   at Interop_COM._NativeClients_Events_Events_._NativeClients_Events_Events_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.x86.Checked\TestWrappers\Interop.COM\Interop.COM.XUnitWrapper.cs:line 916
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Infrastructure-coreclr untriaged New issue has not been triaged by the area owner labels Jun 1, 2020
@BruceForstall
Copy link
Member

CoreCLR Windows_NT x86 Checked gcstress0xc @ Windows.10.Amd64.Open

@BruceForstall
Copy link
Member

@jkoritzinsky @AaronRobinsonMSFT another interop gcstress failure

@AaronRobinsonMSFT
Copy link
Member

I can't reproduce this locally at all. I have tried COMPlus_GCStress=0xC and COMPlus_GCStress=0x3 10 times each. I tried this on both DEBUG and CHECKED builds without any success. I am going to close this for now, but please reopen if it reproduces again.

There was a recent change in #37116 which fixed a long standing issue in x86 with GCStress. If this reoccurs in the future a deeper investigation will be warranted.

@BruceForstall
Copy link
Member

BruceForstall commented Jun 15, 2020

@BruceForstall BruceForstall reopened this Jun 15, 2020
@AaronRobinsonMSFT
Copy link
Member

I had to run this in a loop 3000 times... it finally triggered on 2984 which took about 2.5 hrs. This is not the test itself, but rather the runner. I have a full DMP of the issue.

0:000> !clrstack
OS Thread Id: 0x4da4 (0)
        Child SP               IP Call Site
000000E62897CE88 00007ffb62a4f97f [RedirectedThreadFrame: 000000e62897ce88] 
000000E62897D070 00007ffaacc32616 [InlinedCallFrame: 000000e62897d070] Interop+Kernel32.CreatePipe(Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, SECURITY_ATTRIBUTES ByRef, Int32)
000000E62897D040 00007ffaacc32616 ILStubClass.IL_STUB_PInvoke(Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, SECURITY_ATTRIBUTES ByRef, Int32)
000000E62897D1A0 00007ffaacc32389 System.Diagnostics.Process.CreatePipeWithSecurityAttributes(Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, SECURITY_ATTRIBUTES ByRef, Int32) [D:\runtime\src\libraries\System.Diagnostics.Process\src\System\Diagnostics\Process.Windows.cs @ 828]
000000E62897D1F0 00007ffaacc32275 System.Diagnostics.Process.CreatePipe(Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, Microsoft.Win32.SafeHandles.SafeFileHandle ByRef, Boolean) [D:\runtime\src\libraries\System.Diagnostics.Process\src\System\Diagnostics\Process.Windows.cs @ 866]
000000E62897D270 00007ffaacc30eac System.Diagnostics.Process.StartWithCreateProcess(System.Diagnostics.ProcessStartInfo) [D:\runtime\src\libraries\System.Diagnostics.Process\src\System\Diagnostics\Process.Windows.cs @ 513]
000000E62897D680 00007ffaacc2fd8e System.Diagnostics.Process.StartCore(System.Diagnostics.ProcessStartInfo)
000000E62897D6B0 00007ffaacc2c4d8 System.Diagnostics.Process.Start() [D:\runtime\src\libraries\System.Diagnostics.Process\src\System\Diagnostics\Process.cs @ 1228]
000000E62897D760 00007ffaacc2b7d2 System.Diagnostics.Process.Start(System.Diagnostics.ProcessStartInfo) [D:\runtime\src\libraries\System.Diagnostics.Process\src\System\Diagnostics\Process.cs @ 1277]
000000E62897D7B0 00007ffaacc19e5b Program.Main(System.String[]) [D:\runtime\src\coreclr\tests\src\Interop\common\ExeLauncherProgram.cs @ 65]
0:000> kb
 # RetAddr           : Args to Child                                                           : Call Site
00 00007ffb`0bfc1b4f : 00000000`00000002 00007ffb`0c96bfdc 000002b9`ed105fc8 00007ffb`0cb9d2d0 : KERNELBASE!RaiseFailFastException+0xaf [minkernel\kernelbase\xcpt.c @ 1133] 
01 (Inline Function) : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!TerminateOnAssert+0x17 [F:\workspace\_work\1\s\src\coreclr\src\utilcode\debug.cpp @ 190] 
02 00007ffb`0bfc12c0 : 00007ffb`0c983248 fefefefe`00001d46 00007ffb`0cb9a9c0 fefefefe`00000000 : coreclr!_DbgBreakCheck+0x483 [F:\workspace\_work\1\s\src\coreclr\src\utilcode\debug.cpp @ 428] 
03 00007ffb`0bfc14a2 : 00000000`00000000 00007ffb`0cb9a9c0 00000000`00000001 00000000`00000001 : coreclr!_DbgBreakCheckNoThrow+0x50 [F:\workspace\_work\1\s\src\coreclr\src\utilcode\debug.cpp @ 535] 
04 00007ffb`0bfc7d25 : 00007ffb`0c983248 000002b9`00001d46 000002b9`eebf3c48 000000e6`2897bbe8 : coreclr!DbgAssertDialog+0x18e [F:\workspace\_work\1\s\src\coreclr\src\utilcode\debug.cpp @ 696] 
05 00007ffb`0c020a86 : 000000e6`2897bbe8 000002b9`ed0fc6b8 000002b9`ed0c9e90 000002b9`ed0ee640 : coreclr!CHECK::Trigger+0x2c9 [F:\workspace\_work\1\s\src\coreclr\src\utilcode\check.cpp @ 140] 
06 00007ffb`0c020547 : 00000000`00000028 000002b9`eab2f890 00000000`00000001 000000e6`2897c110 : coreclr!CLRVectoredExceptionHandlerPhase3+0x2b2 [F:\workspace\_work\1\s\src\coreclr\src\vm\excep.cpp @ 7493] 
07 00007ffb`0c02048d : 00000000`00000028 00000000`00000028 000002b9`eaba1230 000002b9`eab2f890 : coreclr!CLRVectoredExceptionHandlerPhase2+0x97 [F:\workspace\_work\1\s\src\coreclr\src\vm\excep.cpp @ 7220] 
08 00007ffb`0c020d00 : 000002b9`eab2f890 00000000`00000000 000002b9`eaba1230 ffffffff`ffffffff : coreclr!CLRVectoredExceptionHandler+0x215 [F:\workspace\_work\1\s\src\coreclr\src\vm\excep.cpp @ 7186] 
09 00007ffb`655b85b6 : 00007ffb`0c020b60 000000e6`2897c100 000002b9`ec560880 00007ffb`65593550 : coreclr!CLRVectoredExceptionHandlerShim+0x1a0 [F:\workspace\_work\1\s\src\coreclr\src\vm\excep.cpp @ 7922] 
0a 00007ffb`655aa056 : 000000e6`2897c870 000000e6`2897c380 00007ffb`62971e00 00007ffb`0c7a2429 : ntdll!RtlpCallVectoredHandlers+0x196 [minkernel\ntdll\vectxcpt.c @ 204] 
0b (Inline Function) : --------`-------- --------`-------- --------`-------- --------`-------- : ntdll!RtlCallVectoredExceptionHandlers+0xe [minkernel\ntdll\vectxcpt.c @ 339] 
0c 00007ffb`655dfe3e : 00007ffb`0cb0cb00 00007ffb`65560299 00000000`00000000 000000e6`2897c448 : ntdll!RtlDispatchException+0x66 [minkernel\ntos\rtl\amd64\exdsptch.c @ 453] 
0d 00007ffb`0c6921e2 : 00000000`00000001 000000e6`2897cb39 00000000`00000000 000002b9`80150b80 : ntdll!KiUserExceptionDispatch+0x2e [minkernel\ntos\rtl\amd64\trampoln.asm @ 749] 
0e (Inline Function) : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!WKS::CObjectHeader::GetMethodTable+0xa [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 3673] 
0f (Inline Function) : --------`-------- --------`-------- --------`-------- --------`-------- : coreclr!WKS::my_get_size+0xa [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 9089] 
10 00007ffb`0c67c26f : 00000000`00000002 ffffffff`00000000 00007ffa`ad07bb60 00000000`00000000 : coreclr!WKS::gc_heap::verify_heap+0x316 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 34721] 
11 00007ffb`0c667a72 : 00000000`00002b01 00000000`00000000 00000000`00000000 00007ffb`0c683377 : coreclr!WKS::gc_heap::garbage_collect+0x273 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 17550] 
12 00007ffb`0c66a79a : 00000000`00000000 000000e6`ffffffff 00007ffb`00000008 000002b9`800d0320 : coreclr!WKS::GCHeap::GarbageCollectGeneration+0x286 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 37034] 
13 00007ffb`0c36b5e6 : 000002b9`eaba1230 000000e6`0000a5f4 00007ffa`acd053d0 000002b9`eaba1288 : coreclr!WKS::GCHeap::StressHeap+0x4ca [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 35982] 
14 00007ffb`0c11a2ae : 000002b9`ed0634d0 000002b9`ed063400 00007ffa`ad124100 00007ffb`0c9a94f4 : coreclr!DoGcStress+0x362 [F:\workspace\_work\1\s\src\coreclr\src\vm\gccover.cpp @ 1815] 
15 00007ffb`0c314bd6 : 00000000`00000000 000000e6`2897d978 00000000`00000000 00000000`00000001 : coreclr!Thread::RedirectedHandledJITCase+0x1f6 [F:\workspace\_work\1\s\src\coreclr\src\vm\threadsuspend.cpp @ 2901] 
16 00007ffa`acc32616 : 00000000`00000000 000000e6`2897d978 00000000`00000000 00000000`00000004 : coreclr!RedirectedHandledJITCaseForGCStress_Stub+0x26 [F:\workspace\_work\1\s\src\coreclr\src\vm\amd64\RedirectedHandledJITCase.asm @ 104]
17 00007ffa`acc32389 : 000000e6`2897d240 000000e6`2897d558 000000e6`2897d248 00007ffb`00000000 : 0x00007ffa`acc32616 

@janvorli or @jkotas any suggestions on where to look are helpful. I will try the few things each of you have suggested in the past and see if anything pops.

@BruceForstall
Copy link
Member

Looks like the object pointer on which coreclr!WKS::CObjectHeader::GetMethodTable() is being called is corrupt, since it appears the object pointer points to illegal memory. You need to figure out where that pointer came from, and why it is corrupt. Use the stress log and see if that has any clues.

@AaronRobinsonMSFT
Copy link
Member

@BruceForstall Yeah I was starting down that path. I believe I have the object address, but it isn't in the stresslog. The object pointer I am looking at is a reasonable address but when dereferenced points to NULL - which is clearly the problem.

@AaronRobinsonMSFT
Copy link
Member

The failure is happening on line 65 below. The test is a bit complicated because the actual test is a native EXE which is launched by a manged entry point that is run with corerun.exe. This convoluted path is needed because the test infrastructure didn't (doesn't?) support tests starting from a native context.

Based on the DMP there are two things going on:

  1. The native process is starting to be launched.
  2. There is a SafeProcessHandle in the finalizer queue. I have no idea why there would be a SafeProcessHandle in the queue, but it has an invalid handle value and is not marked as "closed" or "disposed" with a ref count of 1.

I can't find the object address in the stresslogs so unsure where it came from or perhaps is new/never moved? Given where the launcher is in the create process steps the native process hasn't been started yet. We are at the early stages of setting up pipes to communicate with the as yet unlaunched process.

using (Process p = Process.Start(startInfo))
{
p.OutputDataReceived += (_, args) => Console.WriteLine(args.Data);
p.BeginOutputReadLine();
p.ErrorDataReceived += (_, args) => Console.Error.WriteLine(args.Data);
p.BeginErrorReadLine();
p.WaitForExit();
return p.ExitCode;
}

@AaronRobinsonMSFT
Copy link
Member

I see where the process handle from (2) above is coming from. That is being used during the first pipe creation for stdout, the call to get a pipe for stderr is where this is failing.

@janvorli
Copy link
Member

Since it fails in gc_heap::verify_heap, I suppose the heap got corrupted. That method is going through the GC heap in a linear manner, so it seems that maybe a previous object on the heap might have reported a wrong size and the verify_heap is looking into the middle of an object. Or maybe more likely, something has written behind the end of the previous object. Seeing what the previous object is about might help to validate such a theory.
You can run !VerifyHeap to verify it is really a heap corruption and get some more details on it.

@AaronRobinsonMSFT
Copy link
Member

Or maybe more likely, something has written behind the end of the previous object. Seeing what the previous object is about might help to validate such a theory.

@janvorli Good call. The previous object is a System.String with a length of 42484. The entire string is a series of NULL. I would agree the previous object reported something and clearing it out impacted the current object which is determined to be invalid.

@janvorli
Copy link
Member

Or some unsafe code manipulating the string has written behind the end of the string by accident. It might be interesting to get gc roots of that string to see if it reveals where is it being used (gcroot -all the_string_address)

@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Jun 17, 2020

I think this issue is caused by #7670.

EDIT: Perhaps not. Unsure if this is related, but the comment did seem to imply something similar was occurring.

@AaronRobinsonMSFT
Copy link
Member

AaronRobinsonMSFT commented Jun 17, 2020

@AndyAyersMS @janvorli

After chatting about this with @jkotas and debugging through the GC Cover code, I believe this is an issue with the insertion of GC Cover breakpoints - GCCoverageInfo::SprinkleBreakpoints. The comments above the GCCoverageInfo::SprinkleBreakpoints function were quite helpful in developing a hypothesis - Now, a managed thread (T) can race with the GC as follows:.

The following appears in both DMPs I have been able to collect.

  1. A thread is performing a P/Invoke and attempting to re-enter the runtime.
  2. The finalizer thread is executing managed code and triggering a GC.

Note that this is an extremely rare occurrence and has nothing to do with COM. The failure in this test is from a small runner app that launches another exe - see ExeLauncherProgram.cs.

The generated code on Windows x64 is as follows for the two DMPs.

00007ffa`acc325fa ffd0            call    rax
00007ffa`acc325fc 488b9548ffffff  mov     rdx,qword ptr [rbp-0B8h]
00007ffa`acc32603 c6420c01        mov     byte ptr [rdx+0Ch],1
00007ffa`acc32607 48baf0ddb90cfb7f0000 mov rdx,offset coreclr!g_TrapReturningThreads (00007ffb`0cb9ddf0)
00007ffa`acc32611 833a00          cmp     dword ptr [rdx],0
00007ffa`acc32614 740c            je      00007ffa`acc32622
00007ffa`acc32616 48b93073b90cfb7f0000 mov rcx,offset coreclr!hlpDynamicFuncTable+0x150 (00007ffb`0cb97330)
00007ffa`acc32620 ff11            call    qword ptr [rcx] (gcstress) (JitHelp: CORINFO_HELP_STOP_FOR_GC)
00007ffa`acc32622 898554ffffff    mov     dword ptr [rbp-0ACh],eax

The call at 00007ffaacc325fa is the native invocation. Note the inlined case where JIT_RareDisableHelper is invoked if the trap is non-zero. This appears to be the exact scenario alluded to in the comments for GCCoverageInfo::SprinkleBreakpoints. Currently, immediately after 00007ffaacc325fa and at 00007ffaacc32614 a breakpoint is inserted, however the entire inlined portion should be marked "uninterruptible" and not just the call to JIT_RareDisableHelper (i.e. 00007ffaacc32620).

I believe that if the JIT could mark the inline check as "uninterruptible" the issue would go away. I have two DMPs if anyone wants to look at examples. As mentioned, this is a very rare failure to observe. The first run took more than 2 hours to trigger, luckily the second took only 10 minutes, but it is rare.

/cc @dotnet/jit-contrib

@AaronRobinsonMSFT AaronRobinsonMSFT added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner and removed area-Interop-coreclr untriaged New issue has not been triaged by the area owner labels Jun 17, 2020
@AndyAyersMS
Copy link
Member

I believe that if the JIT could mark the inline check as "uninterruptible" and the issue would go away

Would not surprise me if there is a race here, though I do not quite understand where things go wrong.

Per the later comments if this is a stress-only issue, we've been unwilling to modify the GC info just to handle those cases. I wonder if a fix similar to #37432 makes sense here -- if we hit a stress interrupt and the thread is in preemptive mode (and perhaps, not at a call), don't initiate a GC.

@jkotas
Copy link
Member

jkotas commented Jun 17, 2020

The problem is that the thread is marked as being in cooperative mode already, but it is not fully switched over yet. The only reliable way to tell we are in this spot is by looking at the method code. We can make a pessimistic estimate e.g. give up when g_TrapReturningThreads != 0, but it has a risk of reducing GC stress coverage quite a bit.

@jkotas
Copy link
Member

jkotas commented Jun 17, 2020

We can make the check precise if the linking/unlinking of PInvoke frames is done exactly around the call. We have discussed it as the right thing to do - more context in #34526 (comment).

@AndyAyersMS
Copy link
Member

Sorry to be dense, I'd like to understand this better. Is the bug that if we have both a normal GC and a stress GC while preemptive mode is enabled and the pinvoke frame is still active, we do different GC reporting? Or is it a race of some kind?

@jkotas
Copy link
Member

jkotas commented Jun 17, 2020

Yes, it is a race. The problem happens when:

  • One thread is triggering GC (typically GC stress induced GC)
  • Second thread is returning from PInvoke and hits GC stress marker induced GC on an instruction between flipping the GC thread state (CORINFO_EE_INFO::offsetOfGCState) and calling JIT_RareDisableHelper

We may fail to properly suspend the system in this situation, both threads enter the GC, and corrupt the GC state in arbitrary way.

@AndyAyersMS
Copy link
Member

Ok, thanks.

We need to zero in on how we're going proceed here. Suspect a number of our recent sporadic stress failures may be caused by this race.

Fadi suggests most of the work for precise link/unlink is in the runtime:

Should be an easy JIT change, but there's a lot of places on the runtime side I've seen where the assumption of disconnected linking/activation is baked in, and that will be a bigger change

Do we have a handle on how much work this is, or should we look into suppressing stress interrupts if the system is trapping returning threads, or can we duplicate more of the pinvoke post-call logic within OnCodeCoverageInterrupt to avoid the issue?

cc @BruceForstall

@AaronRobinsonMSFT
Copy link
Member

Do we have a handle on how much work this is

I don't fully understand what Fadi is talking about - my ignorance in this area. If someone can elaborate on the suggestion that would be helpful.

@jkotas
Copy link
Member

jkotas commented Jun 19, 2020

Do we have a handle on how much work this is

The runtime supports this contract for R2R that is on this plan already. There are 2 few places in the runtime (look for ExecutionManager::IsReadyToRunCode) that will need to fixed up to take the R2R path unconditionally. That's all (assuming R2R path does not have bugs).

@jkotas
Copy link
Member

jkotas commented Jun 19, 2020

should we look into suppressing stress interrupts if the system is trapping returning threads

We may combine check to trapping returning threads and check for active PInvoke frame. It should catch this situation and be rare enough at the same time. Something like:

    if (g_TrapReturningThreads &&
        pFrame->GetVTablePtr() == InlinedCallFrame::GetMethodFrameVPtr() &&
        InlinedCallFrame::FrameHasActiveCall(pFrame))
    {
        skip gcstress interrupt
    }

@jkotas
Copy link
Member

jkotas commented Jun 19, 2020

We would need a condition like this even with the fix that pops the frame around PInvoke callsites.

@AndyAyersMS AndyAyersMS self-assigned this Jun 19, 2020
@AndyAyersMS
Copy link
Member

Ok, will start in on this.

I'm first going to see if adding a nop sled (or similar) the post-pinvoke call sequence widens the race window and so boosts the reproducibility of this bug, so that once we have a fix there's a more reliable way of telling if the issue is really addressed.

@AndyAyersMS
Copy link
Member

If I understand correctly, this seems like it should also address #330. If the runtime's exception handler reliably pops off the linked ICF, then we should be able to use ICFs in try regions without restrictions, right?

@AndyAyersMS
Copy link
Member

Currently adding 16 vmfences after re-entering preemptive mode and before checking for the gc trap (vmfence because it has no operands and high latency). No luck so far getting a repro. @AaronRobinsonMSFT were you do anything special when you got the repros above?

G_M45425_IG05:
       488B9578FFFFFF       mov      rdx, qword ptr [rbp-88H]
       C6420C01             mov      byte  ptr [rdx+12], 1
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       0FAEF0               vmfence  
       833D1372F55F00       cmp      dword ptr [(reloc)], 0
       7406                 je       SHORT G_M45425_IG06
       FF156B07F55F         call     [CORINFO_HELP_STOP_FOR_GC]

@AaronRobinsonMSFT
Copy link
Member

were you do anything special when you got the repros above?

Unfortunately not. I had the following set against a Win-x64 Checked version of the runtime:

set COMPlus_GCStress=0xC
set COMPlus_JITMinOpts=1
set COMPlus_HeapVerify=1

As noted the following script ran for a good long time (~ 2.5 hrs) before it triggered:

@ECHO OFF
SETLOCAL EnableDelayedExpansion

FOR /l %%S in (1, 1, 3000) DO (
	VERIFY >nul
	ECHO B %%S : !ERRORLEVEL! 
	D:\runtime\artifacts\tests\coreclr\Windows_NT.x64.Checked\Tests\Core_Root\corerun.exe Events.dll
	ECHO E %%S : !ERRORLEVEL!

	IF NOT "!ERRORLEVEL!"=="100" GOTO :ERROR
)

GOTO :EOF

:ERROR
ECHO Failed!

@AndyAyersMS
Copy link
Member

Was the machine running anything else significant during this time?

Races will sometimes repro more readily if the machine is heavily loaded (as the CI machines likely are) since the OS scheduler is forced to be more creative.

@AaronRobinsonMSFT
Copy link
Member

Was the machine running anything else significant during this time?

Yeah. This is an old QA trick that I occasionally try. In this case it was running on my local dev machine without much input from me. I did have Outlook, Teams, and a web browser up so they were definitely doing things, but nothing was hitting the CPU with much force (i.e. compile or video decoding).

@jkotas
Copy link
Member

jkotas commented Jun 20, 2020

this seems like it should also address #330

Yup. I assume that you meant the linking/unlinking around the call.

I am not sure whether linking/unlinking around the call is strictly required to fix this. The condition that I have suggested in #37236 (comment) may be enough.

@jkotas
Copy link
Member

jkotas commented Jun 20, 2020

call [CORINFO_HELP_STOP_FOR_GC]

Aaron's crash had:

mov rcx,offset coreclr!hlpDynamicFuncTable+0x150 (00007ffb`0cb97330) <- this is the instruction we faulted on
call    qword ptr [rcx] (gcstress) (JitHelp: CORINFO_HELP_STOP_FOR_GC)

Maybe this makes the difference

@AndyAyersMS
Copy link
Member

I have two repros under the debugger now that look exactly like the above (out of ~1200 runs).

It is puzzling to me why a much wider window would not lead to failures somewhere in the stretch between switching to preemptive mode and the observed failure spot. I'm running with 256 mfences in there now.

The gc stress opcode doesn't normally get removed until after DoGCStress has returned. So could it be the call that's causing the GC, not the instruction before?

I'm probably wrong, because we'd expect that call to have been handled specially and restored by OnGcCoverageInterrupt and we seem to be past that point as we have the redirect frames on the stack. But my guess is that something is interfering with the detection of the call to stop for GC and that's perhaps where the race lies. The call detection looks at the original copy of the code so it's not impacted by the live code updates.

I'm going to try and add a bit more instrumentation. There is some state capture going on via forceStack but it doesn't seem to contain the sorts of values I'd expect.

forceStack                 [Type: void * [8]]
    [0]              : 0x2c36f7c110 [Type: void *]
    [1]              : 0x2c36f7c118 [Type: void *]
    [2]              : 0x2c36f7c108 [Type: void *]
    [3]              : 0x2c36f7c080 [Type: void *]
    [4]              : 0x0 [Type: void *]
    [5]              : 0x0 [Type: void *]
    [6]              : 0x2c36f7ca58 [Type: void *]
    [7]              : 0x0 [Type: void *]

@AndyAyersMS
Copy link
Member

Looked at this again and it's the mov to set up the address for stop_for_gc that is the instruction at issue. I suppose I can move the "nop" sled down to sit between the check for trap returning threads and the call to the stop helper and that should boost the reproducibility of this.

Also seems to jive with the observation that if the offset of the helper table entry can be RIP relative then we seemingly don't fail as then there is just the call instruction with no move before it and gc interrupts from the call get suppressed.

Might not be too hard to extend the current call protection to the mov too.

@AndyAyersMS
Copy link
Member

Testing a solution along the lines of Jan's suggestion above: #37236 (comment)

This would be more general than trying to pattern match all the possible instructions we emit between observing that g_TrapReturningThreads == 1 and calling the STOP_FOR_GC helper.

Seems to be holding up so far, and has reached the point (~500 runs) where I was seeing repros before. I'll let it go for a few thousand runs. Also will check that the total number of GCs is similar with and without the fix.

@AndyAyersMS
Copy link
Member

Proposed fix:

@@ -1425,6 +1425,25 @@ BOOL OnGcCoverageInterrupt(PCONTEXT regs)
         return TRUE;
     }

+    // If we're in cooperative mode, we're supposed to stop for GC,
+    // and there's an active ICF, don't initiate a stress GC.
+    if (g_TrapReturningThreads && pThread->PreemptiveGCDisabled())
+    {
+        Frame* pFrame = pThread->GetFrame();
+
+        // Note if we're fully in COOP mode there may be no frame,
+        // but if we're in coop mode just after a pinvoke we may
+        // not have unlinked the ICF yet.
+        if ((pFrame != NULL)
+            && (pFrame != FRAME_TOP)
+            && (pFrame->GetVTablePtr() == InlinedCallFrame::GetMethodFrameVPtr())
+            && InlinedCallFrame::FrameHasActiveCall(pFrame))
+        {
+            RemoveGcCoverageInterrupt(instrPtr, savedInstrPtr);
+            return TRUE;
+        }
+    }
+

@janvorli
Copy link
Member

It is sufficient to just call the InlinedCallFrame::FrameHasActiveCall without the pre-checks, it has all those inside:

static BOOL FrameHasActiveCall(Frame *pFrame)
{
WRAPPER_NO_CONTRACT;
SUPPORTS_DAC;
return pFrame &&
pFrame != FRAME_TOP &&
InlinedCallFrame::GetMethodFrameVPtr() == pFrame->GetVTablePtr() &&
dac_cast<TADDR>(dac_cast<PTR_InlinedCallFrame>(pFrame)->m_pCallerReturnAddress) != NULL;
}

@AndyAyersMS
Copy link
Member

Thanks, I'll update. With the fix, I'm now up to 1250 runs without a repro....

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jun 23, 2020
In the post-call part of a pinvoke inline call frame, it's not safe
to start a stress mode GC in the window between checking
`g_TrapReturningThreads` and the call to `CORINFO_HELP_STOP_FOR_GC`.

The call instruction is already getting special treatement, but there may
be other instructions between the check and call. Instead of trying
to pattern match them all, suppress GC stress if `g_TrapReturningThreads`
is true, the thread is in cooperative mode, and there's an active inline
call frame.

Closes dotnet#37236.
jkotas pushed a commit that referenced this issue Jun 23, 2020
In the post-call part of a pinvoke inline call frame, it's not safe
to start a stress mode GC in the window between checking
`g_TrapReturningThreads` and the call to `CORINFO_HELP_STOP_FOR_GC`.

The call instruction is already getting special treatement, but there may
be other instructions between the check and call. Instead of trying
to pattern match them all, suppress GC stress if `g_TrapReturningThreads`
is true, the thread is in cooperative mode, and there's an active inline
call frame.

Closes #37236.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI GCStress
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants