i64x2.widen_(low/high)_i32x4_(s/u) instructions #290
Why do these instructions have Are there applications where these instructions would be important for performance?
Good catch, @tlively! Should be
As for the applications, the lack of these instructions came up when porting fixed-point neural network inference microkernels in XNNPACK to WebAssembly SIMD: these lines would benefit from the 64-bit widening instructions.
I think this looks good: it fills some holes in the i64x2 column, and the codegen is good. I'll get started on prototyping in V8, starting with arm64, then x64.
Btw, we have 0xc7 and 0xca reserved in the opcode space (see NewOpcodes.md), so I think we can use those (in BinarySIMD.md).
@ngzhian Assigned opcodes in
Prototyped on arm64 in https://crrev.com/c/2441369
As proposed in WebAssembly/simd#290. As usual, these instructions are available only via builtin functions and intrinsics while they are in the prototyping stage. Differential Revision: https://reviews.llvm.org/D90504
This is prototyped in LLVM (but not Binaryen) via
An experiment to enable these instructions in XNNPACK is in google/XNNPACK#1237, but it fails to link (even with optimization disabled):
- i64x2.eq (WebAssembly/simd#381) - i64x2 widens (WebAssembly/simd#290) - i64x2.bitmask (WebAssembly/simd#368) - signselect ops (WebAssembly/simd#124)
I evaluated the performance impact of these instructions by using them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library, similarly to #376. Performance impact on the requantization primitive is summarized in the table below:
The code modifications can be seen in google/XNNPACK#1237. The benefits of these instructions in this particular use case are superseded by #376.
@Maratyszcza do you have any other potential use cases for these instructions? Otherwise the case for including these is not very compelling due to the redundancy with the merged extending multiply instructions :/
I think in this case we need a good argument to NOT include them, as they are extra forms of existing instructions rather than new instructions. Typically we don't even evaluate all instruction forms, and just assume that a speedup on one of the variants applies to the other instruction variants. And here we have a demonstration that these instruction forms are actually useful.
This was discussed in the 12/22/2020 Sync Meeting (#402). Provisional voting results were: SF: 4, F: 3, N: 1. Link to minutes: https://docs.google.com/document/d/1Tnf-fvRcCVj_vv8CjAN_qUvj4Wa57JnPa8ghGi4_C-s/edit#
These instructions were accepted into the proposal: WebAssembly/simd#290 Bug: v8:10972 Change-Id: Ia2cce2df575786babe770b043b1e90bf953c5f9b Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2643658 Reviewed-by: Deepti Gandluri <[email protected]> Commit-Queue: Zhi An Ng <[email protected]> Cr-Commit-Position: refs/heads/master@{#72243}
…ifdefs Port ec8fbed Original Commit Message: These instructions were accepted into the proposal: WebAssembly/simd#290 [email protected], [email protected], [email protected], [email protected] BUG= LOG=N Change-Id: I69bbe90ab3af30d7748332a7e99b7812c95f96b4 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2644939 Reviewed-by: Junliang Yan <[email protected]> Reviewed-by: Milad Fa <[email protected]> Commit-Queue: Milad Fa <[email protected]> Cr-Commit-Position: refs/heads/master@{#72257}
This was merged in WebAssembly#290. Also tweaked the file generation scripts: - Make simd_arithmetic more generic (allow different instruction name patterns) - create a new file simd_int_to_int_widen to generate all integer widening operations (including the ones implemented in this PR) - remove widening tests from simd_conversions.wast
This is a simple change, validation and execution is already shape-agnostic, so we only need to add it to the syntax. Instructions were merged in WebAssembly#290.
Introduction
This is a proposal to add new variants of the existing widen instructions. The new variants convert the two low or two high 32-bit integer lanes of a SIMD vector into a vector of two 64-bit integers, using sign extension (the _s forms) or zero extension (the _u forms). It is unclear why these variants were left out of #89, as both x86 (since SSE4.1) and ARM NEON include the native equivalents. @AndrewScheidecker asked to reserve opcode space for these instructions, but it seems that there was no discussion about including/excluding the 64-bit forms.
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
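As a point of comparison for the lowerings in this section, the lane semantics of the four instructions can be sketched in scalar C. This is only an illustrative model, not text from the specification: the `i32x4` and `i64x2` structs, the `widen` helper, and the C function names are hypothetical stand-ins chosen for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical plain-C stand-ins for the 128-bit vector shapes. */
typedef struct { int32_t lane[4]; } i32x4;
typedef struct { int64_t lane[2]; } i64x2;

/* Widen lanes [first, first + 1] of x to 64 bits.
 * is_signed selects sign extension (_s) or zero extension (_u). */
static i64x2 widen(i32x4 x, int first, int is_signed) {
    assert(first == 0 || first == 2);
    i64x2 y;
    for (int i = 0; i < 2; i++) {
        int32_t v = x.lane[first + i];
        y.lane[i] = is_signed ? (int64_t)v : (int64_t)(uint32_t)v;
    }
    return y;
}

/* low = lanes 0..1, high = lanes 2..3 */
static i64x2 i64x2_widen_low_i32x4_s(i32x4 x)  { return widen(x, 0, 1); }
static i64x2 i64x2_widen_low_i32x4_u(i32x4 x)  { return widen(x, 0, 0); }
static i64x2 i64x2_widen_high_i32x4_s(i32x4 x) { return widen(x, 2, 1); }
static i64x2 i64x2_widen_high_i32x4_u(i32x4 x) { return widen(x, 2, 0); }
```

Any native instruction sequence for one of these operations should produce the same two 64-bit lanes as the corresponding function here.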
x86/x86-64 processors with AVX instruction set
y = i64x2.widen_low_i32x4_s(x) is lowered to VPMOVSXDQ xmm_y, xmm_x
y = i64x2.widen_low_i32x4_u(x) is lowered to VPMOVZXDQ xmm_y, xmm_x
y = i64x2.widen_high_i32x4_s(x) is lowered to VPUNPCKHQDQ xmm_y, xmm_x, xmm_x + VPMOVSXDQ xmm_y, xmm_y
y = i64x2.widen_high_i32x4_u(x) is lowered to VPXOR xmm_y, xmm_y, xmm_y + VPUNPCKHDQ xmm_y, xmm_x, xmm_y
x86/x86-64 processors with SSE4.1 instruction set
y = i64x2.widen_low_i32x4_s(x) is lowered to PMOVSXDQ xmm_y, xmm_x
y = i64x2.widen_low_i32x4_u(x) is lowered to PMOVZXDQ xmm_y, xmm_x
y = i64x2.widen_high_i32x4_s(x) is lowered to PSHUFD xmm_y, xmm_x, 0xEE + PMOVSXDQ xmm_y, xmm_y
y = i64x2.widen_high_i32x4_u(x) is lowered to PSHUFD xmm_y, xmm_x, 0xEE + PMOVZXDQ xmm_y, xmm_y
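The role of the 0xEE immediate in the SSE4.1 high-half sequences may be easier to see in a scalar model. The sketch below emulates PSHUFD, PMOVSXDQ and PMOVZXDQ on plain arrays; all function names are hypothetical and exist only for this illustration. The immediate 0xEE is 0b11101110, so the two-bit fields select source lanes 2, 3, 2, 3, which moves the high pair into the low position where the widening move reads it.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFD: result lane i takes source lane (imm8 >> (2*i)) & 3. */
static void pshufd_model(int32_t dst[4], const int32_t src[4], unsigned imm8) {
    assert(imm8 <= 0xFF);
    int32_t tmp[4];
    for (int i = 0; i < 4; i++) tmp[i] = src[(imm8 >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++) dst[i] = tmp[i];
}

/* Scalar model of PMOVSXDQ: sign-extend the two low 32-bit lanes. */
static void pmovsxdq_model(int64_t dst[2], const int32_t src[4]) {
    for (int i = 0; i < 2; i++) dst[i] = (int64_t)src[i];
}

/* Scalar model of PMOVZXDQ: zero-extend the two low 32-bit lanes. */
static void pmovzxdq_model(int64_t dst[2], const int32_t src[4]) {
    for (int i = 0; i < 2; i++) dst[i] = (int64_t)(uint32_t)src[i];
}

/* PSHUFD xmm_y, xmm_x, 0xEE + PMOVSXDQ xmm_y, xmm_y */
static void widen_high_s_sse41_model(int64_t y[2], const int32_t x[4]) {
    int32_t t[4];
    pshufd_model(t, x, 0xEE);
    pmovsxdq_model(y, t);
}

/* PSHUFD xmm_y, xmm_x, 0xEE + PMOVZXDQ xmm_y, xmm_y */
static void widen_high_u_sse41_model(int64_t y[2], const int32_t x[4]) {
    int32_t t[4];
    pshufd_model(t, x, 0xEE);
    pmovzxdq_model(y, t);
}
```

With x = {10, 20, -30, -40}, the signed model yields lanes -30 and -40, while the unsigned model yields the same bit patterns interpreted as 32-bit unsigned values.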
x86/x86-64 processors with SSE2 instruction set
y = i64x2.widen_low_i32x4_s(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PCMPGTD xmm_tmp, xmm_x + PUNPCKLDQ xmm_y, xmm_tmp
y = i64x2.widen_low_i32x4_u(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PUNPCKLDQ xmm_y, xmm_tmp
y = i64x2.widen_high_i32x4_s(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PCMPGTD xmm_tmp, xmm_x + PUNPCKHDQ xmm_y, xmm_tmp
y = i64x2.widen_high_i32x4_u(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PUNPCKHDQ xmm_y, xmm_tmp
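SSE2 has no widening move, so the sequences above build the upper 32 bits of each result lane separately: PCMPGTD against a zeroed register produces an all-ones mask for negative lanes (the sign extension bits), and PUNPCKLDQ/PUNPCKHDQ interleave each value with its upper half. The scalar model below is a sketch of that idea under the assumption of little-endian lane layout, as on x86; the function names are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* PCMPGTD model: each lane becomes all ones (-1) where a > b (signed), else 0. */
static void pcmpgtd_model(int32_t dst[4], const int32_t a[4], const int32_t b[4]) {
    for (int i = 0; i < 4; i++) dst[i] = (a[i] > b[i]) ? -1 : 0;
}

/* PUNPCKLDQ (high = 0) / PUNPCKHDQ (high = 1) model:
 * interleave the low or high pair of 32-bit lanes of a and b. */
static void punpckdq_model(int32_t dst[4], const int32_t a[4],
                           const int32_t b[4], int high) {
    assert(high == 0 || high == 1);
    int base = high ? 2 : 0;
    int32_t r[4] = { a[base], b[base], a[base + 1], b[base + 1] };
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}

/* View two adjacent 32-bit lanes as one little-endian 64-bit lane. */
static int64_t join_dwords(int32_t lo, int32_t hi) {
    uint64_t bits = (uint64_t)(uint32_t)lo | ((uint64_t)(uint32_t)hi << 32);
    return (int64_t)bits;
}

/* Model of the SSE2 sequences: tmp holds the upper halves, which are
 * zero for the unsigned forms and the sign mask for the signed forms. */
static void widen_sse2_model(int64_t y[2], const int32_t x[4],
                             int high, int is_signed) {
    int32_t tmp[4] = { 0, 0, 0, 0 };            /* PXOR tmp, tmp */
    if (is_signed) pcmpgtd_model(tmp, tmp, x);  /* PCMPGTD tmp, x: 0 > x */
    int32_t v[4];
    punpckdq_model(v, x, tmp, high);            /* PUNPCKLDQ / PUNPCKHDQ */
    y[0] = join_dwords(v[0], v[1]);
    y[1] = join_dwords(v[2], v[3]);
}
```

The signed and unsigned forms differ only in whether the compare step runs, which matches the single extra PCMPGTD in the signed sequences.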
ARM64 processors
y = i64x2.widen_low_i32x4_s(x) is lowered to SSHLL Vy.2D, Vx.2S, 0
y = i64x2.widen_low_i32x4_u(x) is lowered to USHLL Vy.2D, Vx.2S, 0
y = i64x2.widen_high_i32x4_s(x) is lowered to SSHLL2 Vy.2D, Vx.4S, 0
y = i64x2.widen_high_i32x4_u(x) is lowered to USHLL2 Vy.2D, Vx.4S, 0
ARMv7 processors with NEON instruction set
y = i64x2.widen_low_i32x4_s(x) is lowered to VMOVL.S32 Qy, Dx_lo
y = i64x2.widen_low_i32x4_u(x) is lowered to VMOVL.U32 Qy, Dx_lo
y = i64x2.widen_high_i32x4_s(x) is lowered to VMOVL.S32 Qy, Dx_hi
y = i64x2.widen_high_i32x4_u(x) is lowered to VMOVL.U32 Qy, Dx_hi