Fix GCStress timeouts in JIT/jit64 #85040
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
PTAL @kunalspathak (and this should help with the weekend gcstress failure)
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
@trylek @davidwrighton We've been hitting gcstress timeouts every time we add merged test groups. The behavior indicates some degradation over time within a gcstress process (probably the original motivation for striping). However, we've also seen individual tests take much longer, even when first or early in a merged test group run. My new theory is that the extra stack frames have a prohibitively high cost (and likely it's just the test executor methods with the N try/catch blocks).

The current iteration of this PR is (overly) aggressive at simplifying the stack. It also still marks several tests as RequiresProcessIsolation, leftover from my initial experiments. Before I go further, I was hoping to get some feedback on the area. My thought is to go to one test per TestExecutor (and therefore simplify the logic there), make XHarnessTestRunner match it for consistency, and keep the RPIs in order to get gcstress testing unblocked. They can be removed in the future, though this is low priority since individual tests don't hurt test throughput too much.
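For readers outside the test infrastructure, the two executor shapes being compared can be sketched as follows. This is a hedged illustration in Java rather than the generated C#, and all names here (`ExecutorShapes`, `runOne`, `test1`…) are hypothetical, not the actual generated code: one executor method carrying a separate try/catch region per wrapped test, versus a thin helper giving each test its own single-try/catch frame.

```java
public class ExecutorShapes {
    static void test1() { }
    static void test2() { throw new RuntimeException("boom"); }
    static void test3() { }

    // Shape suspected of causing the expensive frames: one executor method
    // whose body contains N separate try/catch regions, one per wrapped test.
    static int runAllInline() {
        int failures = 0;
        try { test1(); } catch (Throwable e) { failures++; }
        try { test2(); } catch (Throwable e) { failures++; }
        try { test3(); } catch (Throwable e) { failures++; }
        return failures;
    }

    // Proposed shape: each test call goes through a helper that holds exactly
    // one try/catch, so no single frame accumulates N exception-handling regions.
    static int runOne(Runnable test) {
        try { test.run(); return 0; } catch (Throwable e) { return 1; }
    }

    static int runViaHelpers() {
        return runOne(ExecutorShapes::test1)
             + runOne(ExecutorShapes::test2)
             + runOne(ExecutorShapes::test3);
    }

    public static void main(String[] args) {
        System.out.println(runAllInline());   // 1 failure either way;
        System.out.println(runViaHelpers());  // only the frame shape differs
    }
}
```

Both shapes report the same failures; the difference is purely in how the exception-handling metadata is distributed across stack frames, which is what matters under gcstress.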
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
fyi - I'm now looking at using BuildAsStandalone in gcstress builds to completely avoid merged test groups for now. See #85284, though it will probably take a few rounds for me to get the yaml right.
@markples - Do you think we might be able to reduce some of these costs by emitting calls to the individual test entrypoints through helper methods so that each such helper method would have just the one try-catch block?
@trylek This PR currently does that (it was easy by setting the grouping value to 1). I think it helped but still hit a problem (though it's been long enough that I don't remember the details), which is why I had shelved this and was trying the BuildAsStandalone approach. However, that has hit an issue: at least one of the HardwareIntrinsics projects is big enough to time out on its own (test merging can stripe within a project since it is dealing with individual tests).
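For context, "striping" here means partitioning a project's individual tests across multiple CI work items so that no single item runs long enough to hit the timeout. A minimal round-robin sketch of the idea (class and method names are hypothetical, not the actual merged-test infrastructure):

```java
import java.util.ArrayList;
import java.util.List;

public class TestStriping {
    // Round-robin assignment of tests to `stripes` work items: test i lands
    // in stripe i % stripes, so stripe sizes differ by at most one.
    static List<List<String>> stripe(List<String> tests, int stripes) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < stripes; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int i = 0; i < tests.size(); i++) {
            buckets.get(i % stripes).add(tests.get(i));
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> tests = List.of("t0", "t1", "t2", "t3", "t4");
        System.out.println(stripe(tests, 2)); // [[t0, t2, t4], [t1, t3]]
    }
}
```

This only works when the harness can address individual tests, which is why merged test groups can stripe within a project while a BuildAsStandalone project is an indivisible unit.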
fyi - this is close but I'm waiting for test results
@trylek I propose that we move forward with these fixes for now. They might be overkill, and we might change things again in the future, but this gets jit64 gcstress under control and lets us move forward. A few JIT\Regression legs are still slow but working. (Also resetting @kunalspathak's review since much has changed since then.)
Looks great to me, thanks Mark!
/azp run runtime-coreclr outerloop
Azure Pipelines successfully started running 1 pipeline(s).
MemorySsa is failing elsewhere.
running gcstress yet again because my other change restructured the groups
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
/azp run runtime, runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 2 pipeline(s).
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
This reverts commit 0992368.
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
Previous test run might have passed, but the devops machine flaked out. JIT/jit64 and JIT/opt appear to be ok.
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
Some of the gcstress legs are still quite slow, suggesting more striping would be desirable. Hopefully this run is sufficient to unblock testing, with striping handled separately, though osx arm64 continues to be stubborn.
Build analysis is showing a failure from a previous run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=277134&view=results
This includes several changes that seem to help with the timeouts. It might be overkill but seems like a good direction as this has been broken for a while.
Should fix #85590