Test failure: baseservices/threading/generics/WaitCallback/thread11/thread11.sh #36636

Closed

v-haren opened this issue May 18, 2020 · 18 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), GCStress, JitStress (CLR JIT issues involving JIT internal stress modes), tenet-reliability (Reliability/stability related issue: stress, load problems, etc.)

v-haren commented May 18, 2020

failed in job: runtime-coreclr gcstress-extra 20200517.1

failed tests:
baseservices/threading/generics/WaitCallback/thread11/thread11.sh
baseservices/threading/generics/WaitCallback/thread24/thread24.sh

Error message

Assert failure(PID 71086 [0x000115ae], Thread: 71086 [0x115ae]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
 File: /__w/1/s/src/coreclr/src/vm/object.cpp Line: 545


 Image: /home/helixbot/work/B0A30949/p/corerun

/home/helixbot/work/B0A30949/w/A2ED08E4/e/baseservices/threading/generics/WaitCallback/thread11/thread11.sh: line 317: 71086 Aborted $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code: 1
Raw output file: /home/helixbot/work/B0A30949/w/A2ED08E4/e/baseservices/threading/Reports/baseservices.threading/generics/WaitCallback/thread11/thread11.output.txt
Raw output:
BEGIN EXECUTION
/home/helixbot/work/B0A30949/p/corerun thread11.dll ''
Writing minidump to file /home/helixbot/dotnetbuild/dumps/coredump.71086.dmp
Written 81625088 bytes (19928 pages) to core file
Dump successfully written
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=/home/helixbot/work/B0A30949/p
> /home/helixbot/work/B0A30949/w/A2ED08E4/e/baseservices/threading/generics/WaitCallback/thread11/thread11.sh
Expected: True
Actual: False

category:correctness
theme:gc-stress
skill-level:expert
cost:medium

ghost commented May 18, 2020

Tagging subscribers to this area: @Maoni0
Notify danmosemsft if you want to be subscribed.

Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label May 18, 2020
@BruceForstall (Member)

Case: CoreCLR Linux x64 Checked gcstress0xc_zapdisable_jitstress2 @ Ubuntu.1804.Amd64.Open

BruceForstall added this to the 5.0 milestone May 18, 2020
BruceForstall removed the untriaged label May 18, 2020
Maoni0 added the tenet-reliability label and removed the area-GC-coreclr label May 18, 2020
BruceForstall added the area-CodeGen-coreclr and JitStress labels May 26, 2020
@BruceForstall (Member)

@AndyAyersMS Can you look at this?

@AndyAyersMS (Member)

Sure.

@AndyAyersMS (Member)

Can't repro this locally (yet).

There are core dumps from the failing runs above. After downloading the thread11 dump and the matching build artifacts, lldb shows the following backtrace:

    frame #7: 0x00007f86cc2c6ba3 libcoreclr.so`::DbgAssertDialog(szFile="", iLine=545, szExpr=<unavailable>) at debug.cpp:0
    frame #8: 0x00007f86cc362c97 libcoreclr.so`Object::ValidateInner(this=0x00007f8624011350, bDeep=YES, bVerifyNextHeader=YES, bVerifySyncBlock=YES) at object.cpp:545
    frame #9: 0x00007f86cc712bcf libcoreclr.so`ValidateFetchObjrefForHandle(objref=OBJECTREF @ 0x00007fff06c0f138) at handletable.cpp:331
    frame #10: 0x00007f86cc705cd4 libcoreclr.so`WKS::GCHeap::StressHeap(gc_alloc_context*) [inlined] HndFetchHandle(handle=0x00007f86ce3a13e0) at handletable.h:183
    frame #11: 0x00007f86cc705c7b libcoreclr.so`WKS::GCHeap::StressHeap(this=<unavailable>, context=<unavailable>) at gc.cpp:35592
    frame #12: 0x00007f86cc43c63f libcoreclr.so`DoGcStress(regs=0x0000558e89d74890, nativeCodeVersion=<unavailable>) at gccover.cpp:1796
    frame #13: 0x00007f86cc43c1bf libcoreclr.so`OnGcCoverageInterrupt(regs=0x0000558e89d74890) at gccover.cpp:1427

It seems the handle table kept by the GC for stress holds an invalid reference.

The managed stack that led to this GC stress point is:

00007FFF06C0F1E0 00007f86cdd9723a [ResumableFrame: 00007fff06c0f1e0] 
00007FFF06C101B8 00007f865314608e [InlinedCallFrame: 00007fff06c101b8] System.Threading.ThreadPool.RequestWorkerThread()
00007FFF06C101B0 00007F865314608E System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 441]
00007FFF06C10270 00007F8653145CE2 System.Threading.ThreadPoolWorkQueue.Enqueue(System.Object, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 490]
00007FFF06C102B0 00007F8653140EEB System.Threading.ThreadPool.QueueUserWorkItem(System.Threading.WaitCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 1093]
00007FFF06C102E0 00007F865314D908 Gen`1[[System.Guid, System.Private.CoreLib]].ThreadPoolTest()
00007FFF06C10330 00007F865313F8A1 Test.Main() [/__w/1/s/src/coreclr/tests/src/baseservices/threading/generics/WaitCallback/thread11.cs @ 75]

VerifyHeap does not detect any heap corruption.

The address being validated is 0x00007f8624011350.

Number of GC Heaps: 1
generation 0 starts at 0x00007F8624012D20
generation 1 starts at 0x00007F8624012B90
generation 2 starts at 0x00007F8623FFF000
ephemeral segment allocation context: none
         segment             begin         allocated              size
00007F8623FFE000  00007F8623FFF000  00007F86240AA950  0xab950(702800)
Large object heap starts at 0x00007F8633FFF000
         segment             begin         allocated              size
00007F8633FFE000  00007F8633FFF000  00007F8634003480  0x4480(17536)
Total Size:              Size: 0xafdd0 (720336) bytes.

I suspect this may not be a codegen issue; jitted code does not interact with this handle table.

@Maoni0 any suggestions on where to look next?

@AndyAyersMS (Member)

@Maoni0 ping...? Is there a "gc-contrib" alias that I could nag instead?

@Maoni0 (Member) commented Jun 5, 2020

Hmm, I did not get any notification the first time you asked. It seems GitHub notifications are just not reliable.

I presume you are saying that the failures occurred with heap verification enabled, and that the failure occurring on a gen2 object suggests this is not a GC problem. It would also be odd if this were a handle table problem, as the handle table hasn't changed in a while. Do you happen to have a rough timeline of when this started to happen?

@AndyAyersMS (Member)

My understanding (which could be wrong) is that when GC stress is enabled, the GC allocates a bunch of big strings and roots them with a handle table internal to the GC, then over time trims/frees those strings to encourage compaction so that live objects move around.

The assert here is complaining that one of the handles in that GC-internal handle table is not a valid object reference. I am at a loss to see how jitted codegen could cause this, so I was looking for some guidance on either how to investigate further or who to reassign this to.
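
For what it's worth, a minimal self-contained sketch of that mechanism as I understand it might look like the following. This is stand-in code, not gc.cpp's StressHeap: BigString, StressStep, the table size, and the magic-number Validate() are all made up for illustration.

    // Rough sketch of "root big allocations in a small internal handle table,
    // validate and trim them at each stress trigger" (hypothetical stand-in code).
    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <cstdlib>
    #include <string>

    struct BigString
    {
        unsigned magic = 0xC0FFEE;                  // stand-in for a valid MethodTable pointer
        std::string payload = std::string(64 * 1024, 'x');
        bool Validate() const { return magic == 0xC0FFEE; }
    };

    static std::array<BigString*, 8> g_stressHandles{};   // the "internal handle table"

    // Called at each stress trigger (compare DoGcStress -> StressHeap in the backtrace above).
    void StressStep()
    {
        size_t slot = static_cast<size_t>(std::rand()) % g_stressHandles.size();
        if (g_stressHandles[slot] != nullptr)
        {
            // This is the kind of check that fired in the failing run: the object
            // reached through the stress handle no longer looked valid.
            assert(g_stressHandles[slot]->Validate());
            delete g_stressHandles[slot];            // "trim" to open a hole for compaction
            g_stressHandles[slot] = nullptr;
        }
        else
        {
            g_stressHandles[slot] = new BigString(); // root a fresh big allocation
        }
    }

    int main()
    {
        for (int i = 0; i < 100; ++i)
            StressStep();
        for (BigString* s : g_stressHandles)         // drop remaining roots
            delete s;
        return 0;
    }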

@Maoni0 (Member) commented Jun 6, 2020

I was just trying to get more info to see which areas the problem is unlikely to be in, so we can rule those out. StressHeap creates some GC handles like any other handle usage; the GC doesn't treat them in any special way. I agree codegen looks very unlikely as well.

Just making sure: when you said VerifyHeap didn't report anything, did you mean the SOS !VerifyHeap command or setting COMPlus_HeapVerify to 1? If you haven't done the latter, I would suggest always trying that first.

Without a repro this would be quite difficult to debug.

@AndyAyersMS (Member)

It was SOS's !VerifyHeap.

I'll see if I can get this to repro... a few of these other thread tests are sporadically failing, so we might get lucky. We could also try looking at the other core dumps to see if they look like the same failure.

@AndyAyersMS (Member)

I can't repro this after 10K+ runs locally.

https://dev.azure.com/dnceng/public/_test/analytics?definitionId=662&contextType=build shows thread11, thread17, thread24, thread28 have all failed once in the past 30 days out of 168 total runs each.

I'm going to look into the thread28 failure, which happened just two days ago, and see if it looks similar to this one.

The stress log from the thread11 dump doesn't contain any useful info:

STRESS LOG:
    facilitiesToLog  = 0xffffffff
    levelToLog       = 6
    MaxLogSizePerThread = 0x20000 (131072)
    MaxTotalLogSize = 0x2000000 (33554432)
    CurrentTotalLogChunk = 21
    ThreadsWithLogs  = 6
    Clock frequency  = 1.000 GHz
    Start time             Last message time      Total elapsed time 2.613 sec

THREAD  TIMESTAMP     FACILITY                              MESSAGE
  ID  (sec from start)
--------------------------------------------------------------------------------------
115bc   2.613246884 : `SYNC`               
115bc   2.613235684 : `SYNC`               
115bc   2.612156676 : `SYNC`               
115bc   2.612146276 : `SYNC`               
115bc   2.611066267 : `SYNC`               
115bc   2.611055467 : `SYNC`               
115bc   2.609964759 : `SYNC`               
115bc   2.609954159 : `SYNC`               
115bc   2.608872150 : `SYNC`               
115bc   2.608861250 : `SYNC`               
115bc   2.607778742 : `SYNC`               
115bc   2.607767942 : `SYNC`               
115bc   2.606687434 : `SYNC`               
115bc   2.606675834 : `SYNC`               
115bc   2.605569825 : `SYNC`               
115bc   2.605537625 : `SYNC`               
115ae   2.604356416 : `ASSERT`             
115b7   2.604256515 : `SYNC`               
115bc   2.604240615 : `SYNC`               
115bc   2.604236315 : `SYNC`               
115bc   2.604210915 : `SYNC`               
115b7   2.604143414 : `SYNC`               
115b7   2.604142714 : `SYNC`               
115b7   2.604137314 : `GC`GCALLOC`GCROOTS` 

@janvorli how should we configure these stress runs to maximize chances that we can debug these hard to repro failures from a dump + stress log?

@janvorli (Member) commented Jun 9, 2020

should we configure these stress runs to maximize chances that we can debug these hard to repro failures from a dump + stress log?

I think it is a good idea. The only downside is a little higher memory consumption and a larger crash dump.

@janvorli (Member) commented Jun 9, 2020

As for the GC handle issue, I guess it could be caused by a double free of a handle. I've seen that when looking into #32171. If a double free of a handle happens, it can later lead to a double allocation of the same handle. So something else gets the same handle too, frees it, and the other place that kept a copy ends up with an invalid handle.
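
A tiny self-contained sketch of that sequence (a toy free-list allocator, not CoreCLR's handle table; AllocHandle/FreeHandle are made-up names) shows how a double free turns into two owners of the same handle:

    #include <cassert>
    #include <cstdio>
    #include <vector>

    using Handle = int;                         // index into a slot table, stand-in for a GC handle
    static std::vector<Handle> g_freeList = {0, 1, 2, 3};

    Handle AllocHandle()        { Handle h = g_freeList.back(); g_freeList.pop_back(); return h; }
    void   FreeHandle(Handle h) { g_freeList.push_back(h); }    // no double-free detection

    int main()
    {
        Handle a = AllocHandle();   // owner A gets handle 3
        FreeHandle(a);              // A frees it once...
        FreeHandle(a);              // ...and erroneously frees it again (the bug)

        Handle b = AllocHandle();   // owner B is handed handle 3
        Handle c = AllocHandle();   // owner C is handed handle 3 as well (double allocation)
        assert(b == c);             // two live owners now alias the same slot

        FreeHandle(c);              // C frees "its" handle; B's copy is now effectively dangling,
                                    // which is how a stress handle could end up failing validation
        std::printf("b=%d c=%d\n", b, c);
        return 0;
    }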

@janvorli (Member) commented Jun 9, 2020

Ah, I see the question was "how should we" and not "should we".
So I meant that we could raise the total stress log size to 1GB and set the per-thread size to 4MB (that preserves the ratio between per-thread and total size). Or, if we are sure we will always run a limited number of threads, we can set the per-thread limit higher.

@janvorli (Member) commented Jun 9, 2020

(The default ratio basically means that up to 256 threads with a 128KB buffer each could fit into the total size of 32MB.)

@janvorli (Member) commented Jun 9, 2020

One more thing regarding the issue. When I was debugging the one I mentioned, I added instrumentation around GC handle creation and freeing, keeping a list of allocated handles and checking at each free that the handle is in the list, and at each allocation that the handle is not. Both checks were under an added spinlock. That helped me catch the offender.
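
Roughly, that instrumentation has the following shape. This is a minimal sketch under assumed names (OnHandleCreated/OnHandleFreed), using a std::mutex where the actual debugging patch used a spinlock; it is not the real change.

    #include <cassert>
    #include <mutex>
    #include <unordered_set>

    using Handle = void*;                          // stand-in for a GC handle value

    static std::mutex                 g_lock;      // the "added spinlock" from the comment above
    static std::unordered_set<Handle> g_live;      // handles currently allocated

    // Call right after the real handle-creation routine returns.
    void OnHandleCreated(Handle h)
    {
        std::lock_guard<std::mutex> guard(g_lock);
        bool inserted = g_live.insert(h).second;
        assert(inserted && "handle allocated twice without an intervening free");
    }

    // Call right before the real handle-free routine runs.
    void OnHandleFreed(Handle h)
    {
        std::lock_guard<std::mutex> guard(g_lock);
        auto erased = g_live.erase(h);
        assert(erased == 1 && "freeing a handle that is not live (double free?)");
    }

    int main()
    {
        int dummy = 0;
        Handle h = &dummy;
        OnHandleCreated(h);
        OnHandleFreed(h);
        // OnHandleFreed(h);   // a second free here would fire the assert and identify the offender
        return 0;
    }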

@AndyAyersMS (Member)

The thread28 and thread17 failures hit this assert:

IsExecutingOnAltStack() || (curSP <= m_pFrame && m_pFrame < m_CacheStackBase) || m_pFrame == (Frame*) -1
File: /__w/1/s/src/coreclr/src/vm/threads.h Line: 1859

thread24 failure:

Segmentation fault 

So I will look at thread24 next, I guess.

@AndyAyersMS (Member)

These haven't failed in the past 3 weekend runs, and I've never been able to repro them locally. So closing.

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020