Test failure: baseservices/threading/generics/WaitCallback/thread11/thread11.sh #36636

Closed

v-haren opened this issue May 18, 2020 · 18 comments

Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), GCStress, JitStress (CLR JIT issues involving JIT internal stress modes), tenet-reliability (Reliability/stability related issue: stress, load problems, etc.)

v-haren commented May 18, 2020

failed in job: runtime-coreclr gcstress-extra 20200517.1

failed tests:
baseservices/threading/generics/WaitCallback/thread11/thread11.sh
baseservices/threading/generics/WaitCallback/thread24/thread24.sh

Error message

Assert failure(PID 71086 [0x000115ae], Thread: 71086 [0x115ae]): !CREATE_CHECK_STRING(pMT && pMT->Validate())
 File: /__w/1/s/src/coreclr/src/vm/object.cpp Line: 545


 Image: /home/helixbot/work/B0A30949/p/corerun

/home/helixbot/work/B0A30949/w/A2ED08E4/e/baseservices/threading/generics/WaitCallback/thread11/thread11.sh: line 317: 71086 Aborted $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

Return code: 1
Raw output file: /home/helixbot/work/B0A30949/w/A2ED08E4/e/baseservices/threading/Reports/baseservices.threading/generics/WaitCallback/thread11/thread11.output.txt
Raw output:
BEGIN EXECUTION
/home/helixbot/work/B0A30949/p/corerun thread11.dll ''
Writing minidump to file /home/helixbot/dotnetbuild/dumps/coredump.71086.dmp
Written 81625088 bytes (19928 pages) to core file
Dump successfully written
Expected: 100
Actual: 134
END EXECUTION - FAILED
Test Harness Exitcode is : 1
To run the test:
> set CORE_ROOT=/home/helixbot/work/B0A30949/p
> /home/helixbot/work/B0A30949/w/A2ED08E4/e/baseservices/threading/generics/WaitCallback/thread11/thread11.sh
Expected: True
Actual: False

category:correctness
theme:gc-stress
skill-level:expert
cost:medium

ghost commented May 18, 2020

Tagging subscribers to this area: @Maoni0
Notify danmosemsft if you want to be subscribed.

Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label May 18, 2020
@BruceForstall (Member)

Case: CoreCLR Linux x64 Checked gcstress0xc_zapdisable_jitstress2 @ Ubuntu.1804.Amd64.Open

BruceForstall added this to the 5.0 milestone May 18, 2020
BruceForstall removed the untriaged label May 18, 2020
Maoni0 added the tenet-reliability label and removed the area-GC-coreclr label May 18, 2020
BruceForstall added the area-CodeGen-coreclr and JitStress labels May 26, 2020
@BruceForstall (Member)

@AndyAyersMS Can you look at this?

@AndyAyersMS (Member)

Sure.

@AndyAyersMS (Member)

Can't repro this locally (yet).

There are core dumps from the failing runs above. After downloading the thread11 dump and the matching build artifacts, lldb shows the following backtrace:

    frame #7: 0x00007f86cc2c6ba3 libcoreclr.so`::DbgAssertDialog(szFile="", iLine=545, szExpr=<unavailable>) at debug.cpp:0
    frame #8: 0x00007f86cc362c97 libcoreclr.so`Object::ValidateInner(this=0x00007f8624011350, bDeep=YES, bVerifyNextHeader=YES, bVerifySyncBlock=YES) at object.cpp:545
    frame #9: 0x00007f86cc712bcf libcoreclr.so`ValidateFetchObjrefForHandle(objref=OBJECTREF @ 0x00007fff06c0f138) at handletable.cpp:331
    frame #10: 0x00007f86cc705cd4 libcoreclr.so`WKS::GCHeap::StressHeap(gc_alloc_context*) [inlined] HndFetchHandle(handle=0x00007f86ce3a13e0) at handletable.h:183
    frame #11: 0x00007f86cc705c7b libcoreclr.so`WKS::GCHeap::StressHeap(this=<unavailable>, context=<unavailable>) at gc.cpp:35592
    frame #12: 0x00007f86cc43c63f libcoreclr.so`DoGcStress(regs=0x0000558e89d74890, nativeCodeVersion=<unavailable>) at gccover.cpp:1796
    frame #13: 0x00007f86cc43c1bf libcoreclr.so`OnGcCoverageInterrupt(regs=0x0000558e89d74890) at gccover.cpp:1427

It seems the handle table kept by the GC for stress holds an invalid reference.

The managed stack that led to this GC stress point is:

00007FFF06C0F1E0 00007f86cdd9723a [ResumableFrame: 00007fff06c0f1e0] 
00007FFF06C101B8 00007f865314608e [InlinedCallFrame: 00007fff06c101b8] System.Threading.ThreadPool.RequestWorkerThread()
00007FFF06C101B0 00007F865314608E System.Threading.ThreadPoolWorkQueue.EnsureThreadRequested() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 441]
00007FFF06C10270 00007F8653145CE2 System.Threading.ThreadPoolWorkQueue.Enqueue(System.Object, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 490]
00007FFF06C102B0 00007F8653140EEB System.Threading.ThreadPool.QueueUserWorkItem(System.Threading.WaitCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 1093]
00007FFF06C102E0 00007F865314D908 Gen`1[[System.Guid, System.Private.CoreLib]].ThreadPoolTest()
00007FFF06C10330 00007F865313F8A1 Test.Main() [/__w/1/s/src/coreclr/tests/src/baseservices/threading/generics/WaitCallback/thread11.cs @ 75]

VerifyHeap does not detect any heap corruption.

The address being validated is 0x00007f8624011350.

Number of GC Heaps: 1
generation 0 starts at 0x00007F8624012D20
generation 1 starts at 0x00007F8624012B90
generation 2 starts at 0x00007F8623FFF000
ephemeral segment allocation context: none
         segment             begin         allocated              size
00007F8623FFE000  00007F8623FFF000  00007F86240AA950  0xab950(702800)
Large object heap starts at 0x00007F8633FFF000
         segment             begin         allocated              size
00007F8633FFE000  00007F8633FFF000  00007F8634003480  0x4480(17536)
Total Size:              Size: 0xafdd0 (720336) bytes.

I suspect this may not be a codegen issue; jitted code does not interact with this handle table.

@Maoni0 any suggestions on where to look next?

@AndyAyersMS (Member)

@Maoni0 ping...? Is there a "gc-contrib" alias that I could nag instead?

@Maoni0 (Member) commented Jun 5, 2020

Hmm, I did not get any notification the first time you asked. It seems GitHub notifications are just not reliable.

I presume you are saying that the failures occurred with heap verification enabled, and that the failure occurring on a gen2 object suggests this is not a GC problem. It would also be odd if this were a handle table problem, as the handle table hasn't changed in a while. Do you happen to have a rough timeline of when this started to happen?

@AndyAyersMS (Member)

My understanding (which could be wrong) is that when GC stress is enabled, the GC allocates a bunch of big strings and roots them with a handle table internal to the GC, then over time trims/frees those strings to encourage compaction so that live objects move around.

The assert here is complaining that one of the handles in that GC-internal handle table is not a valid object reference. I am at a loss to see how jitted codegen could cause this, so I was looking for some guidance on either how to investigate further or who to reassign this to.
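
For what it's worth, a minimal self-contained sketch of that mechanism as I understand it might look like the following. This is stand-in code, not gc.cpp's StressHeap: BigString, StressStep, the table size, and the magic-number Validate() are all made up for illustration.

    // Rough sketch of "root big allocations in a small internal handle table,
    // validate and trim them at each stress trigger" (hypothetical stand-in code).
    #include <array>
    #include <cassert>
    #include <cstddef>
    #include <cstdlib>
    #include <string>

    struct BigString
    {
        unsigned magic = 0xC0FFEE;                  // stand-in for a valid MethodTable pointer
        std::string payload = std::string(64 * 1024, 'x');
        bool Validate() const { return magic == 0xC0FFEE; }
    };

    static std::array<BigString*, 8> g_stressHandles{};   // the "internal handle table"

    // Called at each stress trigger (compare DoGcStress -> StressHeap in the backtrace above).
    void StressStep()
    {
        size_t slot = static_cast<size_t>(std::rand()) % g_stressHandles.size();
        if (g_stressHandles[slot] != nullptr)
        {
            // This is the kind of check that fired in the failing run: the object
            // reached through the stress handle no longer looked valid.
            assert(g_stressHandles[slot]->Validate());
            delete g_stressHandles[slot];            // "trim" to open a hole for compaction
            g_stressHandles[slot] = nullptr;
        }
        else
        {
            g_stressHandles[slot] = new BigString(); // root a fresh big allocation
        }
    }

    int main()
    {
        for (int i = 0; i < 100; ++i)
            StressStep();
        for (BigString* s : g_stressHandles)         // drop remaining roots
            delete s;
        return 0;
    }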

@Maoni0 (Member) commented Jun 6, 2020

I was just trying to get more info to see which areas the problem is unlikely to be in, so we can rule those out. StressHeap creates some GC handles like any other handle usage; the GC doesn't treat them in any special way. I agree codegen looks very unlikely as well.

Just making sure: when you said VerifyHeap didn't report anything, did you mean the SOS !VerifyHeap command or setting COMPlus_HeapVerify to 1? If you haven't done the latter, I would suggest always trying that first.

Without a repro this would be quite difficult to debug.

@AndyAyersMS (Member)

It was SOS's !VerifyHeap.

I'll see if I can get this to repro... a few of these other thread tests are sporadically failing, so we might get lucky. We could also try looking at the other core dumps to see if they look like the same failure.

@AndyAyersMS (Member)

I can't repro this after 10K+ runs locally.

https://dev.azure.com/dnceng/public/_test/analytics?definitionId=662&contextType=build shows thread11, thread17, thread24, thread28 have all failed once in the past 30 days out of 168 total runs each.

I'm going to look into the thread28 failure, which happened just two days ago, and see if it looks similar to this one.

The stress log from the thread11 dump doesn't contain any useful info:

STRESS LOG:
    facilitiesToLog  = 0xffffffff
    levelToLog       = 6
    MaxLogSizePerThread = 0x20000 (131072)
    MaxTotalLogSize = 0x2000000 (33554432)
    CurrentTotalLogChunk = 21
    ThreadsWithLogs  = 6
    Clock frequency  = 1.000 GHz
    Start time             Last message time      Total elapsed time 2.613 sec

THREAD  TIMESTAMP     FACILITY                              MESSAGE
  ID  (sec from start)
--------------------------------------------------------------------------------------
115bc   2.613246884 : `SYNC`               
115bc   2.613235684 : `SYNC`               
115bc   2.612156676 : `SYNC`               
115bc   2.612146276 : `SYNC`               
115bc   2.611066267 : `SYNC`               
115bc   2.611055467 : `SYNC`               
115bc   2.609964759 : `SYNC`               
115bc   2.609954159 : `SYNC`               
115bc   2.608872150 : `SYNC`               
115bc   2.608861250 : `SYNC`               
115bc   2.607778742 : `SYNC`               
115bc   2.607767942 : `SYNC`               
115bc   2.606687434 : `SYNC`               
115bc   2.606675834 : `SYNC`               
115bc   2.605569825 : `SYNC`               
115bc   2.605537625 : `SYNC`               
115ae   2.604356416 : `ASSERT`             
115b7   2.604256515 : `SYNC`               
115bc   2.604240615 : `SYNC`               
115bc   2.604236315 : `SYNC`               
115bc   2.604210915 : `SYNC`               
115b7   2.604143414 : `SYNC`               
115b7   2.604142714 : `SYNC`               
115b7   2.604137314 : `GC`GCALLOC`GCROOTS` 

@janvorli how should we configure these stress runs to maximize chances that we can debug these hard to repro failures from a dump + stress log?

@janvorli (Member) commented Jun 9, 2020

should we configure these stress runs to maximize chances that we can debug these hard to repro failures from a dump + stress log?

I think it is a good idea. The only downside is a little higher memory consumption and a larger crash dump.

@janvorli (Member) commented Jun 9, 2020

As for the GC handle issue, I guess it could be caused by a double free of a handle. I've seen that when looking into #32171. If a double free of a handle happens, it can later lead to a double allocation of the same handle. So something else gets the same handle too, frees it, and the other place that kept a copy ends up with an invalid handle.
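
A tiny self-contained sketch of that sequence (a toy free-list allocator, not CoreCLR's handle table; AllocHandle/FreeHandle are made-up names) shows how a double free turns into two owners of the same handle:

    #include <cassert>
    #include <cstdio>
    #include <vector>

    using Handle = int;                         // index into a slot table, stand-in for a GC handle
    static std::vector<Handle> g_freeList = {0, 1, 2, 3};

    Handle AllocHandle()        { Handle h = g_freeList.back(); g_freeList.pop_back(); return h; }
    void   FreeHandle(Handle h) { g_freeList.push_back(h); }    // no double-free detection

    int main()
    {
        Handle a = AllocHandle();   // owner A gets handle 3
        FreeHandle(a);              // A frees it once...
        FreeHandle(a);              // ...and erroneously frees it again (the bug)

        Handle b = AllocHandle();   // owner B is handed handle 3
        Handle c = AllocHandle();   // owner C is handed handle 3 as well (double allocation)
        assert(b == c);             // two live owners now alias the same slot

        FreeHandle(c);              // C frees "its" handle; B's copy is now effectively dangling,
                                    // which is how a stress handle could end up failing validation
        std::printf("b=%d c=%d\n", b, c);
        return 0;
    }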

@janvorli (Member) commented Jun 9, 2020

Ah, I see the question was "how should we" and not "should we".
So I meant that we could raise the total stress log size to 1GB and set the per-thread size to 4MB (that preserves the ratio between per-thread and total size). Or, if we are sure we will always run a limited number of threads, we can set the per-thread limit higher.

@janvorli (Member) commented Jun 9, 2020

(The default ratio basically means that up to 256 threads with a 128KB buffer each could fit into the total size of 32MB.)

@janvorli (Member) commented Jun 9, 2020

One more thing regarding the issue. When I was debugging the one I mentioned, I added instrumentation around GC handle creation and freeing, keeping a list of allocated handles and checking at each free that the handle is in the list, and at each allocation that the handle is not. Both checks were under an added spinlock. That helped me catch the offender.
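
Roughly, that instrumentation has the following shape. This is a minimal sketch under assumed names (OnHandleCreated/OnHandleFreed), using a std::mutex where the actual debugging patch used a spinlock; it is not the real change.

    #include <cassert>
    #include <mutex>
    #include <unordered_set>

    using Handle = void*;                          // stand-in for a GC handle value

    static std::mutex                 g_lock;      // the "added spinlock" from the comment above
    static std::unordered_set<Handle> g_live;      // handles currently allocated

    // Call right after the real handle-creation routine returns.
    void OnHandleCreated(Handle h)
    {
        std::lock_guard<std::mutex> guard(g_lock);
        bool inserted = g_live.insert(h).second;
        assert(inserted && "handle allocated twice without an intervening free");
    }

    // Call right before the real handle-free routine runs.
    void OnHandleFreed(Handle h)
    {
        std::lock_guard<std::mutex> guard(g_lock);
        auto erased = g_live.erase(h);
        assert(erased == 1 && "freeing a handle that is not live (double free?)");
    }

    int main()
    {
        int dummy = 0;
        Handle h = &dummy;
        OnHandleCreated(h);
        OnHandleFreed(h);
        // OnHandleFreed(h);   // a second free here would fire the assert and identify the offender
        return 0;
    }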

@AndyAyersMS (Member)

The thread28 and thread17 failures hit this assert:

IsExecutingOnAltStack() || (curSP <= m_pFrame && m_pFrame < m_CacheStackBase) || m_pFrame == (Frame*) -1
File: /__w/1/s/src/coreclr/src/vm/threads.h Line: 1859

thread24 failure:

Segmentation fault 

So I will look at thread24 next, I guess.

@AndyAyersMS (Member)

These haven't failed in the past 3 weekend runs, and I've never been able to repro them locally. So closing.

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020