Improve `new Vector2/3/4(c1, c2, c3, c4)` codegen #14393

mikedn · 2017-10-09T18:58:10Z

Fixes #17256

CarolEidt · 2018-04-13T16:07:11Z

Thanks for the update. I'll set this one as "Future", and you can decide whether to close or leave it open.

BruceForstall · 2018-04-16T23:25:44Z

Maybe the JIT could generate pools in a crossgen or AOT mode. Or if some metric like number of use instances warrants it.

At the very least I'd like the JIT to dump these data sections so that they show up in diffs (I have work done for that but it is not pushed yet).

It we don't do that already, then by all means we should dump them!

mikedn · 2018-04-17T17:33:44Z

Well, I don't know if we'll end up doing NEG/ABS but the data section improvements are good for #17256 and that one really should be fixed, it generates way too much code. So I'll keep working on this.

mikedn · 2018-05-09T15:27:08Z

@CarolEidt What do you think about lowering SIMDIntrinsicInitN(c1, c2, c3, c4) to IND(CLS_VAR_ADDR(dataSectionHandle)) to solve #17256? See the last commit above.

It looks like the JIT never attempted to create data sections before codegen but there doesn't appear to be anything preventing us from doing that. The alternative is to mark all 4 arguments as contained but that requires more code changes (ContainCheckSIMD, BuildSIMD and genSIMDIntrinsicInitN). It's also not clear if we can then make the constant vector contained like we could probably contain the indir.

return new Vector4(1, 2, 3, 4) + new Vector4(4, 3, 2, 1); generates

G_M33474_IG01:
       C5F877               vzeroupper
G_M33474_IG02:
       C4E179100524000000   vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       C4E179100D2B000000   vmovupd  xmm1, xmmword ptr [reloc @RWD16]
       C4E17858C1           vaddps   xmm0, xmm1 ; could be contained
       C4E1791101           vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
G_M33474_IG03:
       C3                   ret

RWD00  db       000h, 000h, 080h, 03Fh, 000h, 000h, 000h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 080h, 040h
RWD16  db       000h, 000h, 080h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 000h, 040h, 000h, 000h, 080h, 03Fh

CarolEidt · 2018-09-14T00:57:39Z

@CarolEidt What do you think about lowering SIMDIntrinsicInitN(c1, c2, c3, c4) to IND(CLS_VAR_ADDR(dataSectionHandle)) to solve #17256? See the last commit above.

I like it - I think it's a good approach; I wouldn't have thought it would be this straightforward to do this - thanks (and sorry for the long delay in response).

CarolEidt

LGTM with a couple of questions and a request for a function header.

CarolEidt · 2018-09-14T01:03:39Z

src/jit/emit.cpp

-        emitDataGenBeg(4, false, false);
-        emitDataGenData(0, &zero, 4);
-        emitDataGenEnd();
+        assert(size <= 16);


Wouldn't this assert hold for any case (not just when align is true)?

Constants can have any size (e.g. 32 bytes for AVX stuff), it's just that alignment is currently supported only for 4, 8 and 16 bytes constants. Hmm, maybe I should check those specific sizes instead

I added a comment to clarify this assert.

CarolEidt · 2018-09-14T01:05:54Z

src/jit/emit.cpp

@@ -5533,6 +5562,111 @@ void emitter::emitOutputDataSec(dataSecDsc* sec, BYTE* dst)
    }
 }

+#ifdef DEBUG
+void emitter::emitDispDataSec(dataSecDsc* section)


Nice! Might I ask for a basic function header?

Sure, will add one.

CarolEidt · 2018-09-14T01:09:20Z

src/jit/emit.cpp

+                    }
+                    else
+                    {
+                        printf("dd\t%08Xh", ig->igOffs - igFirst->igOffs);


Even in non-diffable format would it be useful to print the labels as well?

Hmm, I don't remember why I did it like this. I probably thought that being able to see the actual numbers could be useful. Maybe not, I'll check to see how it looks either way. Or perhaps better, in non-diffable mode have

dd offset_number ; label in comment

Hah, I forgot that it does something like this already. Here's the diffable output:

RWD00 dd G_M55886_IG03 - G_M55886_IG02 dd G_M55886_IG05 - G_M55886_IG02 dd G_M55886_IG07 - G_M55886_IG02 dd G_M55886_IG09 - G_M55886_IG02 dd G_M55886_IG10 - G_M55886_IG02

and the non-diffable output:

RWD00 dd 00000023h ; case G_M55886_IG03 dd 00000029h ; case G_M55886_IG05 dd 0000002Fh ; case G_M55886_IG07 dd 00000035h ; case G_M55886_IG09 dd 0000003Ch ; case G_M55886_IG10

mikedn · 2018-09-14T05:16:08Z

I wouldn't have thought it would be this straightforward to do this - thanks (and sorry for the long delay in response).

Well, it would be better if the constants were sorted by size to minimize alignment holes. There's also no support for 32 byte aligned constants, the VM lacks an option for that.

No problem about the delay, there's plenty of things to do in the JIT world and this change doesn't block anything else :)

mikedn · 2018-09-16T09:53:30Z

it would be better if the constants were sorted by size to minimize alignment holes

Actually they are. As is now all 16 byte constants are created in lowering and all other constants (4/8 bytes in size) are created during codegen. So all 16 byte constants come first and there are no alignment holes between them.

This will only be a problem if we try to extend it to 32 byte constants for AVX intrinsics. But for Vector4 this works just fine.

mikedn · 2018-09-16T10:02:06Z

This works well even for Vector2 where one may expect that building the vector from components isn't that bad. It kind of is:

; before
...
       vmovss   xmm0, dword ptr [reloc @RWD00]
       vmovss   xmm1, dword ptr [reloc @RWD04]
       vxorps   xmm2, xmm2
       vmovss   xmm2, xmm2, xmm1
       vpslldq  xmm2, 4
       vmovss   xmm2, xmm2, xmm0
       vmovaps  xmm0, xmm2
       vmovsd   qword ptr [rsp+10H], xmm0
       vmovss   xmm0, dword ptr [reloc @RWD08]
       vmovss   xmm1, dword ptr [reloc @RWD12]
       vxorps   xmm2, xmm2
       vmovss   xmm2, xmm2, xmm1
       vpslldq  xmm2, 4
       vmovss   xmm2, xmm2, xmm0
       vmovaps  xmm0, xmm2
       vmovsd   qword ptr [rsp+08H], xmm0
       vmovsd   xmm0, qword ptr [rsp+10H]
       vmovsd   xmm1, qword ptr [rsp+08H]
       vaddps   xmm0, xmm1
       vmovd    rax, xmm0
...

RWD00  dd       3F800000h
RWD04  dd       40000000h
RWD08  dd       40800000h
RWD12  dd       40400000h

; after
...
       vmovsd   xmm0, qword ptr [reloc @RWD00]
       vmovsd   qword ptr [rsp+10H], xmm0
       vmovsd   xmm0, qword ptr [reloc @RWD16]
       vmovsd   qword ptr [rsp+08H], xmm0
       vmovsd   xmm0, qword ptr [rsp+10H]
       vmovsd   xmm1, qword ptr [rsp+08H]
       vaddps   xmm0, xmm1
       vmovd    rax, xmm0
...
RWD00  db       000h, 000h, 080h, 03Fh, 000h, 000h, 000h, 040h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  db       000h, 000h, 080h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h

Uses 16 bytes more of data but saves 70 bytes of code (138 bytes vs. 68 bytes).

CarolEidt · 2018-09-17T21:19:46Z

@dotnet-bot test Ubuntu x64 Checked CoreFX Test
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Formattin

4creators · 2018-09-19T19:38:20Z

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Checked CoreFX Test
@dotnet-bot test Ubuntu x64 Formatting

4creators · 2018-09-19T19:45:10Z

@dotnet-bot test Ubuntu x64 Checked CoreFX Test

4creators · 2018-09-19T20:55:33Z

@dotnet-bot test Ubuntu x64 Checked CoreFX Test

4creators · 2018-09-25T07:41:16Z

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Checked CoreFX Test

4creators · 2018-09-28T14:29:24Z

@dotnet-bot test Ubuntu x64 Checked CoreFX Test

mikedn · 2018-10-24T20:44:31Z

@dotnet-bot test Ubuntu x64 Checked CoreFX Tests

mikedn · 2018-10-27T06:46:07Z

The WIP bot got stuck, it didn't update the WIP check even after changing the title.

tannergooding · 2019-01-10T19:14:32Z

Has there been any consideration on optimizing the case where most (but not all) of the inputs are constant?

For example: Create(c, c, v, c) (where c is a constant and v is a variable input)? I would guess that it would be better to create a 16-byte vector constant where the variable argument is zero and then have codegen do a single insertion (similar to what the C# compiler does for the arrays that are mostly constant).

mikedn · 2019-01-10T19:50:58Z

For example: Create(c, c, v, c) (where c is a constant and v is a variable input)? I would guess that it would be better to create a 16-byte vector constant where the variable argument is zero and then have codegen do a single insertion (similar to what the C# compiler does for the arrays that are mostly constant).

Yes, that's doable. It's just that it's a lot of work - looking at various interesting patterns, figuring out the right trade offs, deciding what instructions to generate depending on the number of variables to insert, their position, the available ISA etc. I've looked a bit at what native compilers do here and it seems pretty daunting to do something comprehensive.

Just an example: For

_mm256_set_epi32(x, 1, x, 1, x, 1, x, 1);

VC++ generates

vmovdqu xmm0, XMMWORD PTR __xmm@00000000000000010000000000000001
vpinsrd xmm0, xmm0, ecx, 1
vpinsrd xmm0, xmm0, ecx, 3
vinsertf128 ymm0, ymm0, xmm0, 1

4creators · 2019-01-13T23:48:45Z

@BruceForstall @mikedn Any chance of having this PR reviewed and merged before 3.0 ships?

fiigii · 2019-01-22T20:57:54Z

@CarolEidt @mikedn Is this PR ready to merge? I want to make the similar optimizations for Vector128/256 as well.

mikedn · 2019-01-23T06:49:42Z

This one is done but I'm considering using a somewhat different approach in the future.

We should probably have a GT_CNS_SIMD node and handle that the same way floating point constants are handled. That would help in other cases (e.g. initializing a block with a non-zero bit pattern, avoid the need for special codegen for NI_SSE41_TestAllOnes) and perhaps would allow reusing constants already loaded into registers via LSRA's constant intervals.

But that's probably quite a bit of work and thus seems unlikely to catch .NET 3.0. So for now this is probably the best we can do.

briansull

Remove uses of LEGACY_BACKEND

briansull · 2019-01-23T18:57:27Z

src/jit/emit.h

@@ -216,6 +216,15 @@ class emitLocation
    unsigned  codePos; // the code position within the IG (see emitCurOffset())
 };

+#ifndef LEGACY_BACKEND


I believe that we have deprecated LEGACY_BACKEND

Yes, thanks!

mikedn · 2019-01-24T16:40:04Z

OSX failed due to the usual Jenkins exception:

00:32:07.313 Test passed.
00:32:07.338 ...FATAL: command execution failed
00:32:21.662 java.nio.channels.ClosedChannelException
00:32:21.662 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
00:32:21.662 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
00:32:21.662 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

fiigii · 2019-01-24T19:54:55Z

@CarolEidt Can we merge this PR?

tannergooding · 2019-01-24T20:29:52Z

@fiigii, are you going to be putting up a similar PR for Vector64<T>, Vector128<T>, and Vector256<T>?

fiigii · 2019-01-24T20:32:46Z

Yes, but I am busy with other things now. Will look into it next week.

Improve `new Vector2/3/4(c1, c2, c3, c4)` codegen Commit migrated from dotnet/coreclr@aab30e3

dnfclas added the cla-already-signed label Oct 9, 2017

This comment has been minimized.

Sign in to view

jkotas added the area-CodeGen label Dec 28, 2017

mikedn force-pushed the float-neg-abs branch from 07c2079 to 01f2ea0 Compare January 18, 2018 22:31

CarolEidt requested a review from BruceForstall April 12, 2018 22:56

CarolEidt added this to the 2.2.0 milestone Apr 12, 2018

This comment has been minimized.

Sign in to view

CarolEidt modified the milestones: 2.2.0, Future Apr 13, 2018

mikedn force-pushed the float-neg-abs branch from 01f2ea0 to 273bfe3 Compare May 8, 2018 19:55

mikedn mentioned this pull request Jun 22, 2018

Optimize SSE/SSE2 Set[All]Vector128 #18605

Closed

CarolEidt approved these changes Sep 14, 2018

View reviewed changes

mikedn added 3 commits September 16, 2018 13:27

Include the data section in disassembly

dfafa6b

Add support for 16 byte aligned constants

f1ca994

Optimize new Vector4(c1,c2,c3,c4)

3d58487

mikedn force-pushed the float-neg-abs branch from 3e7be95 to 3d58487 Compare September 16, 2018 10:33

mikedn changed the title ~~[WIP] Improve FP NEG/ABS codegen~~ [WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen Sep 16, 2018

mikedn changed the title ~~[WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen~~ Improve new Vector2/3/4(c1, c2, c3, c4) codegen Oct 25, 2018

mikedn changed the title ~~Improve new Vector2/3/4(c1, c2, c3, c4) codegen~~ [WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen Oct 25, 2018

mikedn changed the title ~~[WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen~~ Improve new Vector2/3/4(c1, c2, c3, c4) codegen Oct 25, 2018

RussKeldorph removed the cla-already-signed label Jan 11, 2019

briansull approved these changes Jan 23, 2019

View reviewed changes

Remove obsolete LEGACY_BACKEND ifdef

bae7269

CarolEidt merged commit aab30e3 into dotnet:master Jan 24, 2019

mikedn deleted the float-neg-abs branch September 28, 2019 19:23

tannergooding mentioned this pull request Apr 5, 2020

Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods dotnet/runtime#10033

Closed

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022

Merge pull request dotnet/coreclr#14393 from mikedn/float-neg-abs

0cc82aa

Improve `new Vector2/3/4(c1, c2, c3, c4)` codegen Commit migrated from dotnet/coreclr@aab30e3

Improve new Vector2/3/4(c1, c2, c3, c4) codegen #14393

Improve new Vector2/3/4(c1, c2, c3, c4) codegen #14393

Conversation

mikedn commented Oct 9, 2017 • edited Loading

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

CarolEidt commented Apr 13, 2018

BruceForstall commented Apr 16, 2018

mikedn commented Apr 17, 2018

mikedn commented May 9, 2018

CarolEidt commented Sep 14, 2018

CarolEidt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikedn commented Sep 14, 2018

mikedn commented Sep 16, 2018

mikedn commented Sep 16, 2018

CarolEidt commented Sep 17, 2018

4creators commented Sep 19, 2018

4creators commented Sep 19, 2018

4creators commented Sep 19, 2018

4creators commented Sep 25, 2018

4creators commented Sep 28, 2018

mikedn commented Oct 24, 2018

mikedn commented Oct 27, 2018 • edited Loading

tannergooding commented Jan 10, 2019

mikedn commented Jan 10, 2019

4creators commented Jan 13, 2019 • edited Loading

fiigii commented Jan 22, 2019

mikedn commented Jan 23, 2019

briansull left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikedn commented Jan 24, 2019

fiigii commented Jan 24, 2019

tannergooding commented Jan 24, 2019

fiigii commented Jan 24, 2019

Improve `new Vector2/3/4(c1, c2, c3, c4)` codegen #14393

Improve `new Vector2/3/4(c1, c2, c3, c4)` codegen #14393

mikedn commented Oct 9, 2017 •

edited

Loading

mikedn commented Oct 27, 2018 •

edited

Loading

4creators commented Jan 13, 2019 •

edited

Loading