Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Program throws ArgumentOutOfRangeException in Release #88091

Closed
EgorBo opened this issue Jun 27, 2023 · 10 comments · Fixed by #88159
Closed

Program throws ArgumentOutOfRangeException in Release #88091

EgorBo opened this issue Jun 27, 2023 · 10 comments · Fixed by #88159
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Jun 27, 2023

The following program throws ArgumentOutOfRangeException in Release and finishes normally in Debug:

https://gist.github.com/EgorBo/bdd1c3c2c65a80681e96f840df40cd2a

cc @AndyAyersMS @dotnet/jit-contrib

@EgorBo EgorBo added this to the 8.0.0 milestone Jun 27, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 27, 2023
@EgorBo EgorBo added bug area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner area-System.Collections labels Jun 27, 2023
@dotnet dotnet deleted a comment Jun 27, 2023
@AndyAyersMS
Copy link
Member

Running with DOTNET_JitCloneLoops=0 to simplify things a little.

The problem goes away if optJumpThreadPhi is disabled.
The problem goes away if CSE is diabled.

The actual bug is caused by this CSE:

Considering CSE #02 {$212, $28a} [def=52.182163, use=52.182163, cost=  6, call]
CSE Expression : 
N006 (  6,  6) CSE #02 (def)[000024] ---XG------                         *  ADD       int    <l:$215, c:$214>
N004 (  4,  4) CSE #01 (def)[000092] ---XG------                         +--*  IND       int    <l:$210, c:$211>
N003 (  2,  2)              [000133] -------N---                         |  \--*  ADD       byref  <l:$3c1, c:$3c2>
N001 (  1,  1)              [000020] -----------                         |     +--*  LCL_VAR   ref    V02 loc1         u:3 <l:$287, c:$103>
N002 (  1,  1)              [000132] -----------                         |     \--*  CNS_INT   long   16 Fseq[_size] $382
N005 (  1,  1)              [000023] -----------                         \--*  CNS_INT   int    -2 $cc

Moderate CSE Promotion (CSE is live across a call) (156.546490 >= 100.000000)
cseRefCnt=156.546490, aggressiveRefCnt=200.000000, moderateRefCnt=100.000000
defCnt=52.182163, useCnt=52.182163, cost=6, size=6, LiveAcrossCall
def_cost=2, use_cost=1, extra_no_cost=10, extra_yes_cost=100
CSE cost savings check (323.092979 >= 256.546490) passes

Promoting CSE:

lvaGrabTemp returning 10 (V10 rat0) (a long lifetime temp) called for CSE - moderate.
CSE #02 is single-def, so associated CSE temp V10 will be in SSA
New refCnts for V10: refCnt =  2, refCntWtd = 1.04
New refCnts for V10: refCnt =  3, refCntWtd = 1.57

CSE #02 def at [000024] replaced in BB03 with def of V10
optValnumCSE morphed tree:
N010 ( 13, 11)              [000025] DA-XG------                         *  STORE_LCL_VAR int    V04 loc3         d:2 <l:$28d, c:$28c>
N009 ( 13, 11)              [000227] -A-XG------                         \--*  COMMA     int    <l:$215, c:$214>
N007 ( 10,  9) CSE #02 (def)[000225] DA-XG------                            +--*  STORE_LCL_VAR int    V10 cse0         d:1 $VN.Void
N006 (  6,  6)              [000024] ---XG------                            |  \--*  ADD       int    <l:$215, c:$214>
N004 (  4,  4) CSE #01 (def)[000092] ---XG------                            |     +--*  IND       int    <l:$210, c:$211>
N003 (  2,  2)              [000133] -------N---                            |     |  \--*  ADD       byref  <l:$3c1, c:$3c2>
N001 (  1,  1)              [000020] -----------                            |     |     +--*  LCL_VAR   ref    V02 loc1         u:3 <l:$287, c:$103>
N002 (  1,  1)              [000132] -----------                            |     |     \--*  CNS_INT   long   16 Fseq[_size] $382
N005 (  1,  1)              [000023] -----------                            |     \--*  CNS_INT   int    -2 $cc
N008 (  3,  2)              [000226] -----------                            \--*  LCL_VAR   int    V10 cse0         u:1 <l:$215, c:$214>


Working on the replacement of the CSE #02 use at [000054] in BB11
Unmark CSE use #01 at [000116]:   1 ->   0
optValnumCSE morphed tree:
N002 (  3,  3)              [000055] DA--G------                         *  STORE_LCL_VAR int    V05 loc4         d:4 <l:$28d, c:$28c>
N001 (  3,  2)              [000228] -----------                         \--*  LCL_VAR   int    V10 cse0         u:1 <l:$212, c:$213>

The value of V02._size can change between the fetch in BB03 and the one in BB11, via this call

***** BB07
STMT00012 ( 0x033[E-] ... 0x03D )
N003 ( 16,  9) [000043] --CXG------                         *  CALL      void   Program:DoWork(System.Collections.Generic.List`1[NamedSet],int) $VN.Void
N001 (  1,  1) [000041] ----------- arg0 in rcx             +--*  LCL_VAR   ref    V02 loc1         u:3 <l:$287, c:$103>
N002 (  1,  1) [000042] ----------- arg1 in rdx             \--*  LCL_VAR   int    V04 loc3         u:3 $247

CSE availability propagates from BB03 through BB07 and BB10 to BB11.

image

If RBO's optJumpThreadPhi is disabled, then, BB10 has an additional pred BB09 where the CSE is not available.
CSE then treats the instance in BB11 as a def instead of a use, and so no CSE happens.

CSE is matching liberal VNs here. So looking at VN:

BB03

N001 [000020]   LCL_VAR   V02 loc1         u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000132]   CNS_INT   16 Fseq[_size] => $382 {LngCns:  16}
N003 [000133]   ADD       => <l:$3c1 {ADD($287, $382)}, c:$3c2 {ADD($103, $382)}>
    VNForHandle(_size) is $44, fieldType is int, size = 4
    VNForMapSelect($307, $44):mem returns $406 {$307[$44]}
    VNForMapSelect($406, $287):int returns $20f {$406[$287]}
N004 [000092]   IND       => <l:$210 {norm=$20f {$406[$287]}, exc=$28a {NullPtrExc($287)}}, c:$211 {norm=$146 {MemOpaque:L01}, exc=$28b {NullPtrExc($103)}}>
N005 [000023]   CNS_INT   -2 => $cc {IntCns -2}
N006 [000024]   ADD       => <l:$215 {norm=$212 {ADD($cc, $20f)}, exc=$28a {NullPtrExc($287)}}, c:$214 {norm=$213 {ADD($cc, $146)}, exc=$28b {NullPtrExc($103)}}>

BB11

N001 [000050]   LCL_VAR   V02 loc1         u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000161]   CNS_INT   16 Fseq[_size] => $382 {LngCns:  16}
N003 [000162]   ADD       => <l:$3c1 {ADD($287, $382)}, c:$3c2 {ADD($103, $382)}>
    VNForHandle(_size) is $44, fieldType is int, size = 4
    VNForMapSelect($307, $44):mem returns $406 {$307[$44]}
    VNForMapSelect($406, $287):int returns $20f {$406[$287]}
N004 [000116]   IND       => <l:$210 {norm=$20f {$406[$287]}, exc=$28a {NullPtrExc($287)}}, c:$218 {norm=$147 {MemOpaque:L00}, exc=$28b {NullPtrExc($103)}}>
N005 [000053]   CNS_INT   -2 => $cc {IntCns -2}
N006 [000054]   ADD       => <l:$215 {norm=$212 {ADD($cc, $20f)}, exc=$28a {NullPtrExc($287)}}, c:$21a {norm=$219 {ADD($cc, $147)}, exc=$28b {NullPtrExc($103)}}>

We see the same field _size fetched from the heap off of the same base $287.

As noted, the path from BB03 to BB11 includes call that takes V02 as an arg.

***** BB07, STMT00012(before)
N003 ( 16,  9) [000043] --CXG------                         *  CALL      void   Program:DoWork(System.Collections.Generic.List`1[NamedSet],int)
N001 (  1,  1) [000041] ----------- arg0 in rcx             +--*  LCL_VAR   ref    V02 loc1         u:3
N002 (  1,  1) [000042] ----------- arg1 in rdx             \--*  LCL_VAR   int    V04 loc3         u:3

N001 [000041]   LCL_VAR   V02 loc1         u:3 => <l:$287 {$404[$340]}, c:$103 {MemOpaque:L00}>
N002 [000042]   LCL_VAR   V04 loc3         u:3 => $247 {PhiDef($4, $3, $209)}
  fgCurMemoryVN[GcHeap] assigned for CALL at [000043] to VN: $1c8.
N003 [000043]   CALL      => $VN.Void

Which should trash any indirs in the GC heap based off of V02. It introduces a new memory VN.

But this memory VN merges back into the memory PHI in BB10 creating memory VN $307 which is used in the CSEs above.
BB10 is pred to both BB03 and BB11. So per VN, they seem to be reading from the same memory.

  Building phi application: $cb = SSA# 9.
  Building phi application: $c6 = SSA# 6.
  Building phi application: $305 = phi($c6, $cb).
  Building phi application: $c5 = SSA# 5.
  Building phi application: $306 = phi($c5, $305).
The SSA definition for GcHeap (#6) at start of BB10 is $307 {PhiMemoryDef($43, $306)}

There is a loop here from BB03->BB10, with entry at BB09. But inside there we see an unrecognized loop that excludes BB09?

=======================

So, options?

  • The bug is in RBO. We should not jump thread through a block with a memory PHI.
    • Inhibiting RBO this way "fixes" the bug but feels hacky. Seems like a similar case could be found that does not require RBO. But I do not know this for sure.
  • The bug is in CSE. CSE should kill availability of CSE #02 along the path from BB03 to BB11.
    • Note CSE never kills availability explicitly today, and it's not obvious how it could/should. Many different tree shapes can have the same VN. Consider a case where we assign a constant to a ref class field and then propagate that constant to unescaped byrefs. Subsequent loads of the field or byref should be CSEs, but only some of these will get killed by heap havoc. Seems like we would have to track availability per tree instead of per VN? Perhaps more simply we can just kill the whole CSE if any of its constituent trees is killed?
  • The bug is in VN. We should not assign the same VN to the trees in BB03 and BB11.
    • One invariant we should hold dear is that if two trees have the same VN they represent the same value at runtime. RBO for example relies on two predicate trees with the same VN to evaluate the same way. But consider that hoisting already has to do additional tests for "memory dependent" VNs to establish loop invariance, so even just a single tree may not produce the same value from one loop iteration to the next. The solution there was to separately track the implicit memory dependence of the VN and show that the same memory is available before the loop as within. But in this case the same memory is avaiable for both BB03 and BB11 indirs. So perhaps the invariant that holds here is more subtle?
    • The only "variable" in the VNs here is the VN for memory, and VN gives all memory SSA defs the same VN, leading to.....
  • The bug is in SSA. We should not have the same SSA def for memory in BB03 and BB11; if they differed, we would not have the same VN and so not try and CSE.
    • There are no memory stores in BB10, BB03 or BB11, so no obvious way to figure out where or why to introduce this new def. So this also feels somewhat artificial/hacky.
  • The bug is LOOPs, caused by the un-recognized loop headed by BB10
    • In particular we have a "middle exit" loop where the side effect that causes trouble happens between the loop exit and loop header. Maybe for such loops we need to induce a new SSA def when we leave the loop, to represent potential ambiguity into which iteration might represent various bits of memory?

One question I've been asking myself here without any clear insight is why this issue is specific to VNs involving memory. Could we have the same situation with just a simple local that's conditionally defined? If so, it seems like for sure we would have hit this case already.

Also not clear how narrow/fragile the repro is; Egor suggests it is very sensitive to this code shape. Perhaps some experimentation will suggest which of the elements above is necessary?

@dotnet dotnet deleted a comment Jun 27, 2023
@AndyAyersMS
Copy link
Member

Deleted the bot comment since it contained the large source code example and was bogging down viewing the bug.

@AndyAyersMS AndyAyersMS self-assigned this Jun 27, 2023
@AndyAyersMS
Copy link
Member

Still not seeing a really clean fix here.

  • It is not easy to back-map from a VN to its "memory dependence". We have a special side table for this if a VN ends up snapping through a MapSelect but we don't record info for loads that bottom out on opaque memory VNs or get blocked by PHIs. And it's not easy to walk the VN tree abstractly in general because of all those crazy special encodings for some VN func args (seems like we ought to fix this someday so we can have say a VN tree walker of sorts).
    • Perhaps this analysis would be easier with trees, but we don't capture the intermediate memory SSA states on the trees, we'd have to dig these out of the associated VNs.
  • Tried just re-ordering CSE and RBO, but ended up with lots of diffs; would take a while to dig through these to figure out what's going on. EG asp.net
[13:35:47] --------------------------------------------------------------------------------
[13:35:47] 2,608 contexts with diffs (657 improvements, 1,608 regressions, 343 same size)
[13:35:47]   -6,300/+19,074 bytes

@AndyAyersMS
Copy link
Member

Here are two graphs showing memory SSA states (MI = memory in, MO = memory out) and VNs. One is before RBO and the other after

After VN After RBO
image image

@EgorBo
Copy link
Member Author

EgorBo commented Jun 28, 2023

Btw, when I was investigating this I limitted RBO to only run for one specific BB only (the one that cuased problems) and only do Jump-threading-Phi on it to minimze unrelated code movement, but maybe it's what you did here as well

@AndyAyersMS
Copy link
Member

Diffs from blocking optJumpThreadPhi for blocks with any kind of memory phi:

Overall (+117,686 bytes)
Collection Base size (bytes) Diff size (bytes)
aspnet.run.windows.x64.checked.mch 38,046,237 +30,260
benchmarks.run.windows.x64.checked.mch 7,293,703 +3,252
benchmarks.run_pgo.windows.x64.checked.mch 27,312,724 +24,278
benchmarks.run_tiered.windows.x64.checked.mch 10,236,085 +2,421
coreclr_tests.run.windows.x64.checked.mch 369,434,284 +26,015
libraries.crossgen2.windows.x64.checked.mch 6,295,320 +217
libraries.pmi.windows.x64.checked.mch 59,196,125 +10,403
libraries_tests.pmi.windows.x64.checked.mch 115,763,833 +19,722
realworld.run.windows.x64.checked.mch 13,758,228 +1,118

https://gist.github.com/AndyAyersMS/5f8bd9ce5f68d55e94346da3da76db98

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jun 28, 2023
If phi-based RBO bypasses a block with a memory PHI, it is possible for CSE to find
invalid memory-based CSEs. An example of this is seen in the attached test case.

Ideally perhaps CSE would kill availability of these CSEs at the point where memory
can change, but that seems difficult to arrange. Instead, we mark the bypased block
as one that will not propagate any incoming CSEs, as the failures we know of require
CSEs to flow back through this block.

Fixes dotnet#88091.
@AndyAyersMS
Copy link
Member

I am going to test out an alternate fix, where if RBO bypasses a block with memory PHI, we mark that block so that it kills off any incoming CSEs. This seems fairly surgical but also a bit harder to justify as a solid fix. The assumption here is that this issue can only arise if there is a path from one CSE instance to another that passes through a memory PHI, so if we block propagation of all CSEs then that second instance will not be a use of the first.

Ideally perhaps we'd kill off only the impacted CSEs, but that seems harder to do...

@SingleAccretion
Copy link
Contributor

@AndyAyersMS I am confused by one thing in the memory SSA picture:

image

How does the MI (I am assuming "Memory Incoming") PHI end up having the value of $307 here, given $307 is one of its input arguments (or is that an artifact of the dumping method)?

@AndyAyersMS
Copy link
Member

How does the MI (I am assuming "Memory Incoming") PHI end up having the value of $307 here, given $307 is one of its input arguments (or is that an artifact of the dumping method)?

It is a result of the BB10->BB03->BB10 path where the code can (apparently) loop around without modifying memory.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jun 28, 2023

Note that when we actually value number BB10 we won't have that VN available, but the VN memory phis hold only the SSA numbers, not the value numbers, so after the fact we can't be sure which VNs where there already when building the phi, and which weren't.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 28, 2023
AndyAyersMS added a commit that referenced this issue Jun 29, 2023
If phi-based RBO bypasses a block with a memory PHI, it is possible for CSE to find
invalid memory-based CSEs. An example of this is seen in the attached test case.

Ideally perhaps CSE would kill availability of these CSEs at the point where memory
can change, but that seems difficult to arrange. Instead, we mark the bypased block
as one that will not propagate any incoming CSEs, as the failures we know of require
CSEs to flow back through this block.

Fixes #88091.
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 29, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants