-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Improve new Vector2/3/4(c1, c2, c3, c4)
codegen
#14393
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Thanks for the update. I'll set this one as "Future", and you can decide whether to close or leave it open. |
Maybe the JIT could generate pools in a crossgen or AOT mode. Or if some metric like number of use instances warrants it.
It we don't do that already, then by all means we should dump them! |
Well, I don't know if we'll end up doing NEG/ABS but the data section improvements are good for #17256 and that one really should be fixed, it generates way too much code. So I'll keep working on this. |
@CarolEidt What do you think about lowering It looks like the JIT never attempted to create data sections before codegen but there doesn't appear to be anything preventing us from doing that. The alternative is to mark all 4 arguments as contained but that requires more code changes (
G_M33474_IG01:
C5F877 vzeroupper
G_M33474_IG02:
C4E179100524000000 vmovupd xmm0, xmmword ptr [reloc @RWD00]
C4E179100D2B000000 vmovupd xmm1, xmmword ptr [reloc @RWD16]
C4E17858C1 vaddps xmm0, xmm1 ; could be contained
C4E1791101 vmovupd xmmword ptr [rcx], xmm0
488BC1 mov rax, rcx
G_M33474_IG03:
C3 ret
RWD00 db 000h, 000h, 080h, 03Fh, 000h, 000h, 000h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 080h, 040h
RWD16 db 000h, 000h, 080h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 000h, 040h, 000h, 000h, 080h, 03Fh |
I like it - I think it's a good approach; I wouldn't have thought it would be this straightforward to do this - thanks (and sorry for the long delay in response). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a couple of questions and a request for a function header.
emitDataGenBeg(4, false, false); | ||
emitDataGenData(0, &zero, 4); | ||
emitDataGenEnd(); | ||
assert(size <= 16); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this assert hold for any case (not just when align
is true)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constants can have any size (e.g. 32 bytes for AVX stuff), it's just that alignment is currently supported only for 4, 8 and 16 bytes constants. Hmm, maybe I should check those specific sizes instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment to clarify this assert.
@@ -5533,6 +5562,111 @@ void emitter::emitOutputDataSec(dataSecDsc* sec, BYTE* dst) | |||
} | |||
} | |||
|
|||
#ifdef DEBUG | |||
void emitter::emitDispDataSec(dataSecDsc* section) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Might I ask for a basic function header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add one.
} | ||
else | ||
{ | ||
printf("dd\t%08Xh", ig->igOffs - igFirst->igOffs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even in non-diffable format would it be useful to print the labels as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I don't remember why I did it like this. I probably thought that being able to see the actual numbers could be useful. Maybe not, I'll check to see how it looks either way. Or perhaps better, in non-diffable mode have
dd offset_number ; label in comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, I forgot that it does something like this already. Here's the diffable output:
RWD00 dd G_M55886_IG03 - G_M55886_IG02
dd G_M55886_IG05 - G_M55886_IG02
dd G_M55886_IG07 - G_M55886_IG02
dd G_M55886_IG09 - G_M55886_IG02
dd G_M55886_IG10 - G_M55886_IG02
and the non-diffable output:
RWD00 dd 00000023h ; case G_M55886_IG03
dd 00000029h ; case G_M55886_IG05
dd 0000002Fh ; case G_M55886_IG07
dd 00000035h ; case G_M55886_IG09
dd 0000003Ch ; case G_M55886_IG10
Well, it would be better if the constants were sorted by size to minimize alignment holes. There's also no support for 32 byte aligned constants, the VM lacks an option for that. No problem about the delay, there's plenty of things to do in the JIT world and this change doesn't block anything else :) |
Actually they are. As is now all 16 byte constants are created in lowering and all other constants (4/8 bytes in size) are created during codegen. So all 16 byte constants come first and there are no alignment holes between them. This will only be a problem if we try to extend it to 32 byte constants for AVX intrinsics. But for |
This works well even for ; before
...
vmovss xmm0, dword ptr [reloc @RWD00]
vmovss xmm1, dword ptr [reloc @RWD04]
vxorps xmm2, xmm2
vmovss xmm2, xmm2, xmm1
vpslldq xmm2, 4
vmovss xmm2, xmm2, xmm0
vmovaps xmm0, xmm2
vmovsd qword ptr [rsp+10H], xmm0
vmovss xmm0, dword ptr [reloc @RWD08]
vmovss xmm1, dword ptr [reloc @RWD12]
vxorps xmm2, xmm2
vmovss xmm2, xmm2, xmm1
vpslldq xmm2, 4
vmovss xmm2, xmm2, xmm0
vmovaps xmm0, xmm2
vmovsd qword ptr [rsp+08H], xmm0
vmovsd xmm0, qword ptr [rsp+10H]
vmovsd xmm1, qword ptr [rsp+08H]
vaddps xmm0, xmm1
vmovd rax, xmm0
...
RWD00 dd 3F800000h
RWD04 dd 40000000h
RWD08 dd 40800000h
RWD12 dd 40400000h ; after
...
vmovsd xmm0, qword ptr [reloc @RWD00]
vmovsd qword ptr [rsp+10H], xmm0
vmovsd xmm0, qword ptr [reloc @RWD16]
vmovsd qword ptr [rsp+08H], xmm0
vmovsd xmm0, qword ptr [rsp+10H]
vmovsd xmm1, qword ptr [rsp+08H]
vaddps xmm0, xmm1
vmovd rax, xmm0
...
RWD00 db 000h, 000h, 080h, 03Fh, 000h, 000h, 000h, 040h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16 db 000h, 000h, 080h, 040h, 000h, 000h, 040h, 040h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h Uses 16 bytes more of data but saves 70 bytes of code (138 bytes vs. 68 bytes). |
3e7be95
to
3d58487
Compare
new Vector2/3/4(c1, c2, c3, c4)
codegen
@dotnet-bot test Ubuntu x64 Checked CoreFX Test |
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test |
@dotnet-bot test Ubuntu x64 Checked CoreFX Test |
1 similar comment
@dotnet-bot test Ubuntu x64 Checked CoreFX Test |
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test |
@dotnet-bot test Ubuntu x64 Checked CoreFX Test |
@dotnet-bot test Ubuntu x64 Checked CoreFX Tests |
new Vector2/3/4(c1, c2, c3, c4)
codegennew Vector2/3/4(c1, c2, c3, c4)
codegen
new Vector2/3/4(c1, c2, c3, c4)
codegennew Vector2/3/4(c1, c2, c3, c4)
codegen
new Vector2/3/4(c1, c2, c3, c4)
codegennew Vector2/3/4(c1, c2, c3, c4)
codegen
The WIP bot got stuck, it didn't update the WIP check even after changing the title. |
Has there been any consideration on optimizing the case where most (but not all) of the inputs are constant? For example: |
Yes, that's doable. It's just that it's a lot of work - looking at various interesting patterns, figuring out the right trade offs, deciding what instructions to generate depending on the number of variables to insert, their position, the available ISA etc. I've looked a bit at what native compilers do here and it seems pretty daunting to do something comprehensive. Just an example: For _mm256_set_epi32(x, 1, x, 1, x, 1, x, 1); VC++ generates vmovdqu xmm0, XMMWORD PTR __xmm@00000000000000010000000000000001
vpinsrd xmm0, xmm0, ecx, 1
vpinsrd xmm0, xmm0, ecx, 3
vinsertf128 ymm0, ymm0, xmm0, 1 |
@BruceForstall @mikedn Any chance of having this PR reviewed and merged before 3.0 ships? |
@CarolEidt @mikedn Is this PR ready to merge? I want to make the similar optimizations for |
This one is done but I'm considering using a somewhat different approach in the future. We should probably have a But that's probably quite a bit of work and thus seems unlikely to catch .NET 3.0. So for now this is probably the best we can do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove uses of LEGACY_BACKEND
src/jit/emit.h
Outdated
@@ -216,6 +216,15 @@ class emitLocation | |||
unsigned codePos; // the code position within the IG (see emitCurOffset()) | |||
}; | |||
|
|||
#ifndef LEGACY_BACKEND |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we have deprecated LEGACY_BACKEND
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks!
OSX failed due to the usual Jenkins exception:
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test |
@CarolEidt Can we merge this PR? |
@fiigii, are you going to be putting up a similar PR for |
Yes, but I am busy with other things now. Will look into it next week. |
Improve `new Vector2/3/4(c1, c2, c3, c4)` codegen Commit migrated from dotnet/coreclr@aab30e3
Fixes #17256