Test failure: baseservices/threading/generics/WaitCallback/thread11/thread11.sh #36636
Tagging subscribers to this area: @Maoni0
Case:
@AndyAyersMS Can you look at this?
Sure.
Can't repro this locally (yet). There are core dumps from the failing runs above. After downloading the thread11 dump and the matching build artifacts, lldb shows the following backtrace:
It seems the handle table kept by the GC for stress has an invalid reference. The managed stack that led to this GC stress point is:
VerifyHeap does not detect any heap corruption. The address being validated is
I suspect this may not be a codegen issue; jitted code does not interact with this handle table. @Maoni0, any suggestions on where to look next?
@Maoni0 ping...? Is there a "gc-contrib" alias that I could nag instead?
hmm, I did not get any notification the first time you asked. it seems like GH notification is just not reliable. I presume you are saying the failures were from runs with verify heap enabled, and that the failure occurring on a gen2 object would suggest this is not a GC problem. it would also be odd if this was a handle table problem, as the handle table hasn't changed in a while. do you happen to have a rough timeline for when this started to happen?
My understanding (could be wrong) is that when GC stress is enabled, the GC will allocate a bunch of big strings and root them with a handle table internal to the GC, and then over time trim/free these strings to encourage compaction, so live objects move around. The assert here is complaining that one of the handles in that GC-internal handle table is not a valid object reference. I am at a loss to see how jitted codegen could cause this, so I was looking for some guidance on either how to investigate further or perhaps who to reassign this to.
I was just trying to get more info to see which areas the problem is unlikely to be in, so we can rule those out. StressHeap creates some GC handles like any other handle usage; the GC doesn't treat them in any special way. I agree codegen looks very unlikely as well. just making sure: when you said VerifyHeap didn't report anything, did you mean the sos !VerifyHeap command or setting COMPlus_HeapVerify to 1? if you haven't done the latter, I would suggest always trying that first. without a repro this would be quite difficult to debug.
It was sos's VerifyHeap. I'll see if I can get this to repro... a few others of these thread tests are sporadically failing, so we might get lucky. We could also try looking at the other core dumps to see if they look like the same failure.
I can't repro this after 10K+ runs locally. https://dev.azure.com/dnceng/public/_test/analytics?definitionId=662&contextType=build shows thread11, thread17, thread24, and thread28 have all failed once in the past 30 days out of 168 total runs each. I am going to look into the thread28 failure, which just happened two days ago, and see if it looks similar to this one. The stress log from the thread11 dump doesn't contain any useful info:
@janvorli how should we configure these stress runs to maximize chances that we can debug these hard to repro failures from a dump + stress log? |
I think it is a good idea. The only downside is a little higher memory consumption and a larger crash dump. |
As for the GC handle issue, I guess it could be caused by a double free of a handle. I've seen that when looking into #32171. If a double free of a handle happens, it can later lead to double-allocation of the same handle. So something else gets the same handle too, frees it, and the other place that holds a copy becomes invalid.
Ah, I see the question was "how should we" and not "should we". |
(The default ratio basically means that up to 256 threads with a 128KB buffer each could fit into the total size of 32MB.)
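The arithmetic behind the quoted default can be sanity-checked in a line:

```cpp
// 256 threads * 128KB/thread == 32MB of total stress log buffer space.
static_assert(256 * 128 * 1024 == 32 * 1024 * 1024,
              "256 buffers of 128KB fill exactly 32MB");
```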
One more thing regarding the issue. When I was debugging the one I mentioned, I added instrumentation around GC handle creation and freeing, keeping a list of allocated handles and checking at each free that the handle is in the list, and at each allocation that the handle is not there, both under an added spinlock. That helped me catch the offender.
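A sketch of that kind of instrumentation (the class, set, and hook names are hypothetical; the actual change lived inside the runtime's handle allocation and free paths): track every live handle in a set and assert on double-alloc or double-free, with a spinlock serializing the checks across threads.

```cpp
#include <atomic>
#include <cassert>
#include <unordered_set>

// Hypothetical audit layer around handle create/free. Asserts fire at the
// offending call site instead of at the later, confusing symptom.
class HandleAudit {
    std::atomic_flag lock_ = ATOMIC_FLAG_INIT;
    std::unordered_set<void*> live_;

    // Minimal RAII spinlock guard.
    struct Guard {
        std::atomic_flag& flag_;
        explicit Guard(std::atomic_flag& f) : flag_(f) {
            while (flag_.test_and_set(std::memory_order_acquire)) { /* spin */ }
        }
        ~Guard() { flag_.clear(std::memory_order_release); }
    };

public:
    void OnCreate(void* handle) {
        Guard g(lock_);
        // Catches double-allocation: the same handle handed out twice.
        assert(live_.insert(handle).second && "handle allocated twice");
    }
    void OnFree(void* handle) {
        Guard g(lock_);
        // Catches double-free: freeing a handle that is not live.
        assert(live_.erase(handle) == 1 && "handle freed (or double-freed) while not live");
    }
};
```

With this in place, a second `OnFree` on the same handle asserts immediately at the bad free, rather than surfacing much later as an invalid reference in an unrelated owner.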
thread28 and thread17 failure asserts:
thread24 failure:
So will look at thread24 next I guess. |
These haven't failed in the past 3 weekend runs, and I've never been able to repro them locally. So closing. |
failed in job: runtime-coreclr gcstress-extra 20200517.1
failed test:
baseservices/threading/generics/WaitCallback/thread11/thread11.sh
baseservices/threading/generics/WaitCallback/thread24/thread24.sh
Error message
category:correctness
theme:gc-stress
skill-level:expert
cost:medium