Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve new Vector2/3/4(c1, c2, c3, c4) codegen #14393

Merged
merged 4 commits into from
Jan 24, 2019

Conversation

mikedn
Copy link

@mikedn mikedn commented Oct 9, 2017

Fixes #17256

@mikedn

This comment has been minimized.

@tannergooding

This comment has been minimized.

@CarolEidt CarolEidt requested a review from BruceForstall April 12, 2018 22:56
@CarolEidt CarolEidt added this to the 2.2.0 milestone Apr 12, 2018
@mikedn

This comment has been minimized.

@CarolEidt
Copy link

Thanks for the update. I'll set this one as "Future", and you can decide whether to close or leave it open.

@CarolEidt CarolEidt modified the milestones: 2.2.0, Future Apr 13, 2018
@BruceForstall
Copy link
Member

Maybe the JIT could generate pools in a crossgen or AOT mode. Or if some metric like number of use instances warrants it.

At the very least I'd like the JIT to dump these data sections so that they show up in diffs (I have work done for that but it is not pushed yet).

It we don't do that already, then by all means we should dump them!

@mikedn
Copy link
Author

mikedn commented Apr 17, 2018

Well, I don't know if we'll end up doing NEG/ABS but the data section improvements are good for #17256 and that one really should be fixed, it generates way too much code. So I'll keep working on this.

@mikedn
Copy link
Author

mikedn commented May 9, 2018

@CarolEidt What do you think about lowering SIMDIntrinsicInitN(c1, c2, c3, c4) to IND(CLS_VAR_ADDR(dataSectionHandle)) to solve #17256? See the last commit above.

It looks like the JIT never attempted to create data sections before codegen but there doesn't appear to be anything preventing us from doing that. The alternative is to mark all 4 arguments as contained but that requires more code changes (ContainCheckSIMD, BuildSIMD and genSIMDIntrinsicInitN). It's also not clear if we can then make the constant vector contained like we could probably contain the indir.

return new Vector4(1, 2, 3, 4) + new Vector4(4, 3, 2, 1); generates

G_M33474_IG01:
       C5F877               vzeroupper
G_M33474_IG02:
       C4E179100524000000   vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       C4E179100D2B000000   vmovupd  xmm1, xmmword ptr [reloc @RWD16]
       C4E17858C1           vaddps   xmm0, xmm1 ; could be contained
       C4E1791101           vmovupd  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
G_M33474_IG03:
       C3                   ret

RWD00  db       000h, 000h, 080h, 03Fh, 000h, 000h, 000h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 080h, 040h
RWD16  db       000h, 000h, 080h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 000h, 040h, 000h, 000h, 080h, 03Fh

@CarolEidt
Copy link

@CarolEidt What do you think about lowering SIMDIntrinsicInitN(c1, c2, c3, c4) to IND(CLS_VAR_ADDR(dataSectionHandle)) to solve #17256? See the last commit above.

I like it - I think it's a good approach; I wouldn't have thought it would be this straightforward to do this - thanks (and sorry for the long delay in response).

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple of questions and a request for a function header.

emitDataGenBeg(4, false, false);
emitDataGenData(0, &zero, 4);
emitDataGenEnd();
assert(size <= 16);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this assert hold for any case (not just when align is true)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constants can have any size (e.g. 32 bytes for AVX stuff), it's just that alignment is currently supported only for 4, 8 and 16 bytes constants. Hmm, maybe I should check those specific sizes instead

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment to clarify this assert.

@@ -5533,6 +5562,111 @@ void emitter::emitOutputDataSec(dataSecDsc* sec, BYTE* dst)
}
}

#ifdef DEBUG
void emitter::emitDispDataSec(dataSecDsc* section)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Might I ask for a basic function header?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will add one.

}
else
{
printf("dd\t%08Xh", ig->igOffs - igFirst->igOffs);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even in non-diffable format would it be useful to print the labels as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I don't remember why I did it like this. I probably thought that being able to see the actual numbers could be useful. Maybe not, I'll check to see how it looks either way. Or perhaps better, in non-diffable mode have

dd offset_number ; label in comment

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, I forgot that it does something like this already. Here's the diffable output:

RWD00  dd       G_M55886_IG03 - G_M55886_IG02
       dd       G_M55886_IG05 - G_M55886_IG02
       dd       G_M55886_IG07 - G_M55886_IG02
       dd       G_M55886_IG09 - G_M55886_IG02
       dd       G_M55886_IG10 - G_M55886_IG02

and the non-diffable output:

RWD00  dd       00000023h ; case G_M55886_IG03
       dd       00000029h ; case G_M55886_IG05
       dd       0000002Fh ; case G_M55886_IG07
       dd       00000035h ; case G_M55886_IG09
       dd       0000003Ch ; case G_M55886_IG10

@mikedn
Copy link
Author

mikedn commented Sep 14, 2018

I wouldn't have thought it would be this straightforward to do this - thanks (and sorry for the long delay in response).

Well, it would be better if the constants were sorted by size to minimize alignment holes. There's also no support for 32 byte aligned constants, the VM lacks an option for that.

No problem about the delay, there's plenty of things to do in the JIT world and this change doesn't block anything else :)

@mikedn
Copy link
Author

mikedn commented Sep 16, 2018

it would be better if the constants were sorted by size to minimize alignment holes

Actually they are. As is now all 16 byte constants are created in lowering and all other constants (4/8 bytes in size) are created during codegen. So all 16 byte constants come first and there are no alignment holes between them.

This will only be a problem if we try to extend it to 32 byte constants for AVX intrinsics. But for Vector4 this works just fine.

@mikedn
Copy link
Author

mikedn commented Sep 16, 2018

This works well even for Vector2 where one may expect that building the vector from components isn't that bad. It kind of is:

; before
...
       vmovss   xmm0, dword ptr [reloc @RWD00]
       vmovss   xmm1, dword ptr [reloc @RWD04]
       vxorps   xmm2, xmm2
       vmovss   xmm2, xmm2, xmm1
       vpslldq  xmm2, 4
       vmovss   xmm2, xmm2, xmm0
       vmovaps  xmm0, xmm2
       vmovsd   qword ptr [rsp+10H], xmm0
       vmovss   xmm0, dword ptr [reloc @RWD08]
       vmovss   xmm1, dword ptr [reloc @RWD12]
       vxorps   xmm2, xmm2
       vmovss   xmm2, xmm2, xmm1
       vpslldq  xmm2, 4
       vmovss   xmm2, xmm2, xmm0
       vmovaps  xmm0, xmm2
       vmovsd   qword ptr [rsp+08H], xmm0
       vmovsd   xmm0, qword ptr [rsp+10H]
       vmovsd   xmm1, qword ptr [rsp+08H]
       vaddps   xmm0, xmm1
       vmovd    rax, xmm0
...

RWD00  dd       3F800000h
RWD04  dd       40000000h
RWD08  dd       40800000h
RWD12  dd       40400000h
; after
...
       vmovsd   xmm0, qword ptr [reloc @RWD00]
       vmovsd   qword ptr [rsp+10H], xmm0
       vmovsd   xmm0, qword ptr [reloc @RWD16]
       vmovsd   qword ptr [rsp+08H], xmm0
       vmovsd   xmm0, qword ptr [rsp+10H]
       vmovsd   xmm1, qword ptr [rsp+08H]
       vaddps   xmm0, xmm1
       vmovd    rax, xmm0
...
RWD00  db       000h, 000h, 080h, 03Fh, 000h, 000h, 000h, 040h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  db       000h, 000h, 080h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h

Uses 16 bytes more of data but saves 70 bytes of code (138 bytes vs. 68 bytes).

@mikedn mikedn changed the title [WIP] Improve FP NEG/ABS codegen [WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen Sep 16, 2018
@CarolEidt
Copy link

@dotnet-bot test Ubuntu x64 Checked CoreFX Test
@dotnet-bot test Ubuntu x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Formattin

@4creators
Copy link

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Checked CoreFX Test
@dotnet-bot test Ubuntu x64 Formatting

@4creators
Copy link

@dotnet-bot test Ubuntu x64 Checked CoreFX Test

1 similar comment
@4creators
Copy link

@dotnet-bot test Ubuntu x64 Checked CoreFX Test

@4creators
Copy link

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu x64 Checked CoreFX Test

@4creators
Copy link

@dotnet-bot test Ubuntu x64 Checked CoreFX Test

@mikedn
Copy link
Author

mikedn commented Oct 24, 2018

@dotnet-bot test Ubuntu x64 Checked CoreFX Tests

@mikedn mikedn changed the title [WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen Improve new Vector2/3/4(c1, c2, c3, c4) codegen Oct 25, 2018
@mikedn mikedn changed the title Improve new Vector2/3/4(c1, c2, c3, c4) codegen [WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen Oct 25, 2018
@mikedn mikedn changed the title [WIP] Improve new Vector2/3/4(c1, c2, c3, c4) codegen Improve new Vector2/3/4(c1, c2, c3, c4) codegen Oct 25, 2018
@mikedn
Copy link
Author

mikedn commented Oct 27, 2018

The WIP bot got stuck, it didn't update the WIP check even after changing the title.

@tannergooding
Copy link
Member

Has there been any consideration on optimizing the case where most (but not all) of the inputs are constant?

For example: Create(c, c, v, c) (where c is a constant and v is a variable input)? I would guess that it would be better to create a 16-byte vector constant where the variable argument is zero and then have codegen do a single insertion (similar to what the C# compiler does for the arrays that are mostly constant).

@mikedn
Copy link
Author

mikedn commented Jan 10, 2019

For example: Create(c, c, v, c) (where c is a constant and v is a variable input)? I would guess that it would be better to create a 16-byte vector constant where the variable argument is zero and then have codegen do a single insertion (similar to what the C# compiler does for the arrays that are mostly constant).

Yes, that's doable. It's just that it's a lot of work - looking at various interesting patterns, figuring out the right trade offs, deciding what instructions to generate depending on the number of variables to insert, their position, the available ISA etc. I've looked a bit at what native compilers do here and it seems pretty daunting to do something comprehensive.

Just an example: For

_mm256_set_epi32(x, 1, x, 1, x, 1, x, 1);

VC++ generates

vmovdqu xmm0, XMMWORD PTR __xmm@00000000000000010000000000000001
vpinsrd xmm0, xmm0, ecx, 1
vpinsrd xmm0, xmm0, ecx, 3
vinsertf128 ymm0, ymm0, xmm0, 1

@4creators
Copy link

4creators commented Jan 13, 2019

@BruceForstall @mikedn Any chance of having this PR reviewed and merged before 3.0 ships?

@fiigii
Copy link

fiigii commented Jan 22, 2019

@CarolEidt @mikedn Is this PR ready to merge? I want to make the similar optimizations for Vector128/256 as well.

@mikedn
Copy link
Author

mikedn commented Jan 23, 2019

This one is done but I'm considering using a somewhat different approach in the future.

We should probably have a GT_CNS_SIMD node and handle that the same way floating point constants are handled. That would help in other cases (e.g. initializing a block with a non-zero bit pattern, avoid the need for special codegen for NI_SSE41_TestAllOnes) and perhaps would allow reusing constants already loaded into registers via LSRA's constant intervals.

But that's probably quite a bit of work and thus seems unlikely to catch .NET 3.0. So for now this is probably the best we can do.

Copy link

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove uses of LEGACY_BACKEND

src/jit/emit.h Outdated
@@ -216,6 +216,15 @@ class emitLocation
unsigned codePos; // the code position within the IG (see emitCurOffset())
};

#ifndef LEGACY_BACKEND

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we have deprecated LEGACY_BACKEND

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks!

@mikedn
Copy link
Author

mikedn commented Jan 24, 2019

OSX failed due to the usual Jenkins exception:

00:32:07.313 Test passed.
00:32:07.338 ...FATAL: command execution failed
00:32:21.662 java.nio.channels.ClosedChannelException
00:32:21.662 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
00:32:21.662 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:142)
00:32:21.662 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@fiigii
Copy link

fiigii commented Jan 24, 2019

@CarolEidt Can we merge this PR?

@CarolEidt CarolEidt merged commit aab30e3 into dotnet:master Jan 24, 2019
@tannergooding
Copy link
Member

@fiigii, are you going to be putting up a similar PR for Vector64<T>, Vector128<T>, and Vector256<T>?

@fiigii
Copy link

fiigii commented Jan 24, 2019

Yes, but I am busy with other things now. Will look into it next week.

@mikedn mikedn deleted the float-neg-abs branch September 28, 2019 19:23
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Improve `new Vector2/3/4(c1, c2, c3, c4)` codegen

Commit migrated from dotnet/coreclr@aab30e3
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants