x64: Improve memory support in {insert,extract}lane
#5982
Conversation
This commit adds support to Cranelift to emit `pextr{b,w,d,q}` with a memory destination, merging a store-of-extract operation into one instruction. Additionally, AVX support is added for the `pextr*` instructions. I've also tried to ensure that codegen tests and runtests exist for all forms of these instructions.
This is a generalization of #5924 for more vector types and more lanes.
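To make the merged operation concrete, here is a small Rust sketch (purely illustrative, not Cranelift code) of what a `pextrw`-style store-of-extract computes: pick one 16-bit lane of a 128-bit vector and write just that lane to memory.

```rust
// Illustrative model of a store-of-extract such as `pextrw $n, %xmm0, (%rdi)`:
// extract lane `n` from a 128-bit vector viewed as eight little-endian u16
// lanes, and store only that lane. A semantics sketch, not backend code.
fn pextrw_store(vec: u128, lane: usize, mem: &mut [u8]) {
    assert!(lane < 8, "pextrw lane index must be 0..=7");
    let value = ((vec >> (16 * lane)) & 0xFFFF) as u16;
    mem[..2].copy_from_slice(&value.to_le_bytes());
}
```

Previously the `extractlane` would go to a general-purpose register and a separate store instruction would write it out; the point of the PR is to fuse those into the single memory-destination form.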
LGTM but see a couple comments below.
```
          address
          offset))
  (side_effect
    (x64_movsd_store (to_amode flags address offset) value)))

(rule 2 (lower (store flags
                      (has_type $I8 (extractlane value (u8_from_uimm8 n)))
```
I checked and `u8_from_uimm8` does not guarantee that `n` is in the right range (e.g., 0 to 15 here). Did you see any validation that would prevent a wrong `n` here at the CLIF level, or do we rely exclusively on the Wasm-level construction? If there is nothing and it ends up being too tricky to do here in ISLE, maybe `emit.rs` could gain some `assert!`s or something like that.
I think this is the block in the verifier which validates that lanes are always in bounds, so I think we should be good on that front. I can add some extra asserts to the backend too, though.
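For what it's worth, the kind of backend-side assertion suggested above could be as small as this (a hypothetical sketch, not the PR's actual `emit.rs` code):

```rust
// Hypothetical lane-bounds check of the sort emit.rs could gain: a 128-bit
// vector with `lane_count` lanes only has valid indices 0..lane_count, so a
// pextrb on an i8x16 would pass lane_count = 16, a pextrw on an i16x8
// would pass 8, and so on.
fn assert_lane_in_bounds(lane: u8, lane_count: u8) {
    assert!(
        lane < lane_count,
        "lane {lane} out of bounds for a {lane_count}-lane vector"
    );
}
```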
```
; movq %rsp, %rbp
; block0:
; vmovsd 0(%rdi), %xmm3
; vmovsd %xmm0, %xmm3, %xmm0
```
Looking at the other cases, I started to worry that one of these instructions actually zeroes bits we do not want to zero. I think it is this case; from the `MOVSD` documentation:

> Legacy version: When the source and destination operands are XMM registers, bits MAXVL:64 of the destination operand remains unchanged. When the source operand is a memory location and destination operand is an XMM registers, the quadword at bits 127:64 of the destination operand is cleared to all 0s, bits MAXVL:128 of the destination operand remains unchanged.
>
> VEX and EVEX encoded register-register syntax: Moves a scalar double precision floating-point value from the second source operand (the third operand) to the low quadword element of the destination operand (the first operand). Bits 127:64 of the destination operand are copied from the first source operand (the second operand). Bits (MAXVL-1:128) of the corresponding destination register are zeroed.

So it is the legacy SSE case where a `movsd 0(%rdi), ...` would zero too many bits, but in the AVX code we could actually merge these two `vmovsd` into one. Is there a way to do that?
Going by this, which I realize isn't official but has been what I've been using so far, my read is that `VEX.128.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3` has different semantics than `VEX.128.F2.0F 10 /r: VMOVSD xmm1, m64`. So I think AVX and SSE match here: if you use the register-to-register form it preserves the upper bits, but if you use the memory-to-register form it always zeros the upper bits, so I don't think we can fuse?

I'll admit, though, that this is all real subtle, and if the official docs are different I wouldn't be too surprised.
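To restate the subtlety in executable form, here is a small Rust model (my own sketch, based on the semantics quoted above) of the two `movsd` forms, treating an XMM register as a `u128`:

```rust
const LOW64: u128 = u64::MAX as u128;

// Register-to-register form (SSE `movsd` and VEX `vmovsd` alike): the low
// quadword comes from the source, while bits 127:64 of the result are taken
// from the existing destination (SSE) or the first source (VEX) — either
// way, the upper bits are preserved.
fn movsd_reg_reg(upper_src: u128, low_src: u128) -> u128 {
    (upper_src & !LOW64) | (low_src & LOW64)
}

// Memory-to-register form (again SSE and VEX alike): the low quadword comes
// from memory and bits 127:64 are cleared to zero — which is why the
// load-then-merge pair cannot be fused into a single memory-operand vmovsd.
fn movsd_mem_reg(qword: u64) -> u128 {
    qword as u128
}
```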
```
; vmovsd 0(%rdi), %xmm3
; vmovsd %xmm0, %xmm3, %xmm0
; movsd 0(%rdi), %xmm3
; movsd %xmm0, %xmm3, %xmm0
```
👍