Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undoing boxing doesn't work with C# 7 pattern matching #10195

Closed
svick opened this issue Apr 19, 2018 · 7 comments
Closed

Undoing boxing doesn't work with C# 7 pattern matching #10195

svick opened this issue Apr 19, 2018 · 7 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization Priority:2 Work that is important, but not critical for the release
Milestone

Comments

@svick
Copy link
Contributor

svick commented Apr 19, 2018

Short version: I was reading @stephentoub's article Performance Improvements in .NET Core 2.1. I noticed that his example for avoiding boxing allocations thanks to dotnet/coreclr#14698 uses is followed by a cast, when in C# 7, the same code could be simplified using pattern matching. So I was wondering if using C# 7 features also results in the same efficient code. It turns out it doesn't and I think this should be improved.

More details:

Consider this code:

using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        Cast(new Dog());
        Pattern(new Dog());
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Cast<T>(T thing)
    {
        if (thing is IAnimal)
            ((IAnimal)thing).MakeSound();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Pattern<T>(T thing)
    {
        if (thing is IAnimal animal)
            animal.MakeSound();
    }    
}

struct Dog : IAnimal
{
    public void Bark() { }
    void IAnimal.MakeSound() => Bark();
}

interface IAnimal
{
    void MakeSound();
}

The IL for the relevant methods is:

.method private hidebysig static void  Cast<T>(!!T thing) cil managed noinlining
{
  // Code size       30 (0x1e)
  .maxstack  8
  IL_0000:  ldarg.0
  IL_0001:  box        !!T
  IL_0006:  isinst     IAnimal
  IL_000b:  brfalse.s  IL_001d
  IL_000d:  ldarg.0
  IL_000e:  box        !!T
  IL_0013:  castclass  IAnimal
  IL_0018:  callvirt   instance void IAnimal::MakeSound()
  IL_001d:  ret
}

.method private hidebysig static void  Pattern<T>(!!T thing) cil managed noinlining
{
  // Code size       22 (0x16)
  .maxstack  2
  .locals init (class IAnimal V_0)
  IL_0000:  ldarg.0
  IL_0001:  box        !!T
  IL_0006:  isinst     IAnimal
  IL_000b:  dup
  IL_000c:  stloc.0
  IL_000d:  brfalse.s  IL_0015
  IL_000f:  ldloc.0
  IL_0010:  callvirt   instance void IAnimal::MakeSound()
  IL_0015:  ret
}

Notice how in Pattern, the boxed object is saved to a local variable (typed as the interface).

The disassembly from .Net Core 2.1.0-preview2-26406-04 win10-x64 is:

; Assembly listing for method Program:Cast(struct)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  2,  2   )  struct ( 8) [rsp+0x08]   do-not-enreg[XS] addr-exposed
;* V01 tmp0         [V01    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact
;* V02 tmp1         [V02    ] (  0,  0   )  struct ( 8) zero-ref    do-not-enreg[SF] class-hnd exact
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;
; Lcl frame size = 0

G_M19994_IG01:
       48894C2408           mov      qword ptr [rsp+08H], rcx

G_M19994_IG02:
       C3                   ret      

; Total bytes of code 6, prolog size 0 for method Program:Cast(struct)
; ============================================================
; Assembly listing for method Program:Pattern(struct)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  4,  4   )  struct ( 8) [rsp+0x30]   do-not-enreg[XSF] addr-exposed
;  V01 loc0         [V01,T02] (  3,  2   )     ref  ->  rax         class-hnd exact
;  V02 tmp0         [V02,T00] (  4,  8   )     ref  ->  rax         class-hnd exact
;  V03 tmp1         [V03,T01] (  2,  4   )     ref  ->  rax         class-hnd exact
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40

G_M22101_IG01:
       4883EC28             sub      rsp, 40
       48894C2430           mov      qword ptr [rsp+30H], rcx

G_M22101_IG02:
       48B9005F64B2F87F0000 mov      rcx, 0x7FF8B2645F00
       E8A86B0F5F           call     CORINFO_HELP_NEWSFAST
       480FBE4C2430         movsx    rcx, byte  ptr [rsp+30H]
       884808               mov      byte  ptr [rax+8], cl
       488BC8               mov      rcx, rax
       E897FBFFFF           call     Dog:IAnimal.MakeSound():this
       90                   nop      

G_M22101_IG03:
       4883C428             add      rsp, 40
       C3                   ret      

; Total bytes of code 47, prolog size 4 for method Program:Pattern(struct)
; ============================================================

Notice how for Cast, almost all the code, including the boxing allocation, is optimized away (the remaining mov seems to be unnecessary, but that's not really relevant here). But for Pattern, all the code is still there, including an allocation and a non-inlined call to Dog.IAnimal.MakeSound.

The two versions of the code do the same thing, so I think they should have comparable performance. Especially since the pattern matching version is more readable and I suspect it's also going to be more common in new code than the other version.

How hard would it be to make this optimization work even in the pattern matching version?

If it would be too hard to perform this optimization in the JIT, is there a reasonable way for the C# compiler to emit IL that would be optmized?

cc (?): @AndyAyersMS, @benaadams, @justinvp

category:cq
theme:importer
skill-level:expert
cost:medium

@AndyAyersMS
Copy link
Member

The dup gets in the way currently. See #9118. I would like to fix this in the jit.

@AndyAyersMS
Copy link
Member

Was curious if the prototype stack allocation code (see #4584) could unblock this, but not quite.

What the dup does here is block the part of devirtualization that asks for an unboxing stub. That is, in Pattern we devirtualize the interface call, but then don't realize what we have is a boxed entry point. And we don't think to ask, as the "this" is a temp copy of the boxed reference and not the Box itself. So we leave the boxed entry point there. Then when the inliner goes to inline it fails to get any information. So we end up not inlining. And because of this the box looks like it escapes and can't be stack allocated.

One might imagine that the devirtualization code could always ask for the unboxed entry if there is one (or the initial devirtualization check with the VM could return a flag saying this is the boxed entry) and then we could update the devirtualized call to invoke the unboxed entry on the internals of the boxed object. We would know at this point that the call could not make the boxed object escape (actually that is more generally true: calls to value class methods can't make their "this" pointers escape). So even if the method did not get inlined the stack alloc code might kick in and remove the newobj. But we'd still end up making a copy and it's possible that copy or the payload pointer math might confuse us and block stack allocation.

Still think we are better off pursuing the "Multi-use Box" ideas where instead of having a Box temp we forward the Box itself around, and refcount it so that if we manage to knock out enough uses we can remove the newobj that way.

In this case that might mean using the single-def local non-null info I've also been prototyping to remove the early null check after the isinst, so that the dual-use box is immeidately reduced to single-use and normal devirt + unboxing + inlining would apply.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@davidfowl
Copy link
Member

@AndyAyersMS this seems more important to optimize as pattern matching becomes more prevalent.

@AndyAyersMS
Copy link
Member

@davidfowl agreed.

Unfortunately it's not easy. We need to do this early in the jit, but the "multi-use" box optimization requires a moderate amount of analysis (in particular: find all the consumers of the box, and to verify none of them can modify the boxed value), and our powers of deduction early in the jit are limited.

@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 24, 2023
@markples markples modified the milestones: Future, 8.0.0 Apr 14, 2023
@markples markples self-assigned this Apr 14, 2023
@markples markples added the Priority:2 Work that is important, but not critical for the release label Jun 6, 2023
@markples markples modified the milestones: 8.0.0, 9.0.0 Jul 18, 2023
@AndyAyersMS
Copy link
Member

Suspect we won't get to this in .NET 9, so moving to future.

@AndyAyersMS AndyAyersMS modified the milestones: 9.0.0, Future Apr 14, 2024
@am11
Copy link
Member

am11 commented Apr 14, 2024

Equation changes when there are multiple usages:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Cast<T>(T thing)
    {
        if (thing is IAnimal)
        {
            var animal = (IAnimal)thing;
            animal.MakeSound();
            animal.MakeSound();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Pattern<T>(T thing)
    {
        if (thing is IAnimal animal)
        {
            animal.MakeSound();
            animal.MakeSound();
        }    
    }

now Pattern() has better perf score.

with loop, codegen becomes identical:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Cast<T>(T thing)
    {
        if (thing is IAnimal)
        {
            var animal = ((IAnimal)thing);
            for (int i = 0; i < 10_000; i++)
                animal.MakeSound();
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Pattern<T>(T thing)
    {
        if (thing is IAnimal animal)
        {
            for (int i = 0; i < 10_000; i++)
                animal.MakeSound();
        }    
    }

so it's essentially the single-use scenario for which Pattern has suboptimal codegen. Perhaps roslyn or JIT importer could explicitly detect this pattern and adjust if it is common enough?

AndyAyersMS added a commit that referenced this issue Jul 1, 2024
Enable object stack allocation for ref classes and extend the support to include boxed value classes. Use a specialized unbox helper for stack allocated boxes, both to avoid apparent escape of the box by the helper, and to ensure all box field accesses are visible to the JIT. Update the local address visitor to rewrite trees involving address of stack allocated boxes in some cases to avoid address exposure. Disable old promotion for stack allocated boxes (since we have no field handles) and allow physical promotion to enregister the box method table and/or payload as appropriate. In OSR methods handle the fact that the stack allocation may actually have been a heap allocation by the Tier0 method.

The analysis TP cost is around 0.4-0.7% (notes below). Boxes are much less likely to escape than ref classes (roughly ~90% of boxes escape, ~99.8% of ref classes escape). Codegen impact is diminished somewhat because many of the boxes are dead and were already getting optimized away.
 
Fixes #4584, #9118, #10195, #11192, #53585, #58554, #85570

---------

Co-authored-by: Jakob Botsch Nielsen <[email protected]>
Co-authored-by: Jan Kotas <[email protected]>
@AndyAyersMS
Copy link
Member

Fixed by #103361

@github-actions github-actions bot locked and limited conversation to collaborators Aug 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization Priority:2 Work that is important, but not critical for the release
Projects
None yet
Development

No branches or pull requests

8 participants