This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.widen_(low/high)_i32x4_(s/u) instructions #290

Merged 1 commit into WebAssembly:master on Jan 11, 2021

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Aug 5, 2020

Introduction

This is a proposal to add new variants of the existing widen instructions. The new variants convert the two low/high 32-bit integers in a SIMD vector into a vector of two 64-bit integers. It is unclear why these variants were left out of #89, as both x86 (since SSE4.1) and ARM NEON include native equivalents. @AndrewScheidecker asked to reserve opcode space for these instructions, but there seems to have been no discussion about including or excluding the 64-bit forms.
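For reference, the semantics of the four proposed instructions can be modeled in portable scalar C as follows (an illustrative sketch with hypothetical function names, not code from the proposal; lane 0 is the lowest lane, following the proposal's lane numbering):

```c
#include <stdint.h>

/* Scalar model of the proposed instructions: take the two low (lanes 0-1)
 * or two high (lanes 2-3) 32-bit lanes of a v128 and sign- or zero-extend
 * each one to a 64-bit lane. */
void i64x2_widen_low_i32x4_s(int64_t out[2], const int32_t x[4]) {
  out[0] = (int64_t)x[0];  /* sign-extend lane 0 */
  out[1] = (int64_t)x[1];  /* sign-extend lane 1 */
}

void i64x2_widen_high_i32x4_u(uint64_t out[2], const int32_t x[4]) {
  out[0] = (uint64_t)(uint32_t)x[2];  /* zero-extend lane 2 */
  out[1] = (uint64_t)(uint32_t)x[3];  /* zero-extend lane 3 */
}
```

The `_s`/`_u` suffix only changes how the upper 32 bits of each result lane are filled: copies of the sign bit for the signed forms, zeros for the unsigned forms.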

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to VPMOVSXDQ xmm_y, xmm_x
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to VPMOVZXDQ xmm_y, xmm_x
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to VPUNPCKHQDQ xmm_y, xmm_x, xmm_x + VPMOVSXDQ xmm_y, xmm_y
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to VPXOR xmm_y, xmm_y, xmm_y + VPUNPCKHDQ xmm_y, xmm_x, xmm_y

x86/x86-64 processors with SSE4.1 instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to PMOVSXDQ xmm_y, xmm_x
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to PMOVZXDQ xmm_y, xmm_x
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to PSHUFD xmm_y, xmm_x, 0xEE + PMOVSXDQ xmm_y, xmm_y
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to PSHUFD xmm_y, xmm_x, 0xEE + PMOVZXDQ xmm_y, xmm_y

x86/x86-64 processors with SSE2 instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PCMPGTD xmm_tmp, xmm_x + PUNPCKLDQ xmm_y, xmm_tmp
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PUNPCKLDQ xmm_y, xmm_tmp
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PCMPGTD xmm_tmp, xmm_x + PUNPCKHDQ xmm_y, xmm_tmp
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PUNPCKHDQ xmm_y, xmm_tmp
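The signed SSE2 sequences work by synthesizing the sign half of each 64-bit lane with PCMPGTD against zero (all-ones where the input is negative, zero otherwise) and then interleaving it with the source lanes. A portable sketch of that trick for one lane (my illustration, not code from the proposal):

```c
#include <stdint.h>

/* Emulate the SSE2 signed-widening trick in scalar code:
 * sign = (0 > x) ? 0xFFFFFFFF : 0   -- this is what PCMPGTD computes --
 * then each 64-bit result lane is (sign << 32) | (uint32_t)x, which is
 * exactly what the PUNPCKLDQ/PUNPCKHDQ interleave with xmm_tmp produces. */
int64_t widen_via_compare(int32_t x) {
  uint32_t sign = (0 > x) ? 0xFFFFFFFFu : 0u;              /* PCMPGTD vs. zero */
  return (int64_t)(((uint64_t)sign << 32) | (uint32_t)x);  /* interleave halves */
}
```

The unsigned forms drop the PCMPGTD and interleave with an all-zero register instead, so the upper 32 bits of each lane are simply zero.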

ARM64 processors

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to SSHLL Vy.2D, Vx.2S, 0
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to USHLL Vy.2D, Vx.2S, 0
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to SSHLL2 Vy.2D, Vx.4S, 0
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to USHLL2 Vy.2D, Vx.4S, 0

ARMv7 processors with NEON instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to VMOVL.S32 Qy, Dx_lo
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to VMOVL.U32 Qy, Dx_lo
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to VMOVL.S32 Qy, Dx_hi
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to VMOVL.U32 Qy, Dx_hi

@tlively
Member

tlively commented Aug 6, 2020

Why do these instructions have i32x8 in their names? Are they actually operating on 256 bits somehow? Or should that be i32x4?

Are there applications where these instructions would be important for performance?

@Maratyszcza
Contributor Author

Good catch, @tlively! They should be i64x2 and i32x4, respectively.

@Maratyszcza Maratyszcza changed the title i64x4.widen_(low/high)_i32x8_(s/u) instructions i64x2.widen_(low/high)_i32x4_(s/u) instructions Aug 6, 2020
@Maratyszcza
Contributor Author

Maratyszcza commented Sep 19, 2020

As for the applications, the lack of these instructions came up when porting fixed-point neural network inference microkernels in XNNPACK to WebAssembly SIMD: these lines would benefit from the 64-bit widening instructions.
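The requantization step in question is a fixed-point (Q31) rounding high multiply: the full 64-bit product of two 32-bit values is needed before the result is narrowed back down, which is why each i32x4 input must first be widened to i64x2 halves. A scalar sketch of that pattern (illustrative code with a hypothetical function name, not the actual XNNPACK kernel):

```c
#include <stdint.h>

/* Q31 rounding-doubling high multiply: computes round(2*a*b / 2^31).
 * The 64-bit intermediates correspond to i64x2.widen_*_i32x4_s lanes in
 * the SIMD version. Saturation of the INT32_MIN * INT32_MIN corner case
 * is omitted in this sketch. */
int32_t q31_rounding_multiply_high(int32_t a, int32_t b) {
  int64_t wide_a = (int64_t)a;        /* widen to 64 bits */
  int64_t wide_b = (int64_t)b;
  int64_t product = wide_a * wide_b;  /* full 64-bit product */
  return (int32_t)((2 * product + (INT64_C(1) << 30)) >> 31);  /* round, keep high bits */
}
```

Without 64-bit widening instructions, a SIMD implementation of this step has to shuffle and mask the 32-bit lanes manually before each 64-bit multiply.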

@ngzhian
Member

ngzhian commented Sep 29, 2020

I think this looks good: it fills some holes in the i64x2 column, and the codegen is good. I'll get started on prototyping in V8, starting with arm64, then x64.

@ngzhian
Member

ngzhian commented Sep 29, 2020

Btw, we have 0xc7 through 0xca reserved in the opcode space (see NewOpcodes.md), so I think we can use those (in BinarySIMD.md).

@Maratyszcza
Contributor Author

@ngzhian Assigned opcodes in the 0xc7-0xca range.

@ngzhian
Member

ngzhian commented Oct 8, 2020

Prototyped on arm64 in https://crrev.com/c/2441369

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 30, 2020
As proposed in WebAssembly/simd#290. As usual, these
instructions are available only via builtin functions and intrinsics while they
are in the prototyping stage.

Differential Revision: https://reviews.llvm.org/D90504
@tlively
Member

tlively commented Oct 30, 2020

This is prototyped in LLVM (but not Binaryen) via __builtin_wasm_widen_{low,high}_{s,u}_i32x4_i64x2. It should be usable in tip-of-tree Emscripten in a few hours, as long as you do not pass optimization flags at link time.

@Maratyszcza
Contributor Author

Experiment to enable these instructions in XNNPACK is in google/XNNPACK#1237, but fails to link (even with optimization disabled): [parse exception: invalid code after SIMD prefix: 199 (at 0:1439773)]

tlively added a commit to tlively/binaryen that referenced this pull request Dec 11, 2020
tlively added a commit to WebAssembly/binaryen that referenced this pull request Dec 12, 2020
@Maratyszcza
Contributor Author

I evaluated the performance impact of these instructions by leveraging them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library, similarly to #376. The performance impact on the requantization primitive is summarized in the table below:

| Processor (Device) | WAsm SIMD + i64x2.widen_(low/high)_i32x4_s | WAsm SIMD (Chrome M86-level baseline) | Speedup |
|---|---|---|---|
| Snapdragon 855 (LG G8 ThinQ) | 2.46 GB/s | 2.37 GB/s | 4% |
| Snapdragon 670 (Pixel 3a) | 1.39 GB/s | 1.36 GB/s | 2% |
| Exynos 8895 (Galaxy S8) | 1.33 GB/s | 1.28 GB/s | 4% |

The code modifications can be seen in google/XNNPACK#1237. The benefits of these instructions in this particular use-case are superseded by #376.

@tlively
Member

tlively commented Dec 17, 2020

@Maratyszcza do you have any other potential use cases for these instructions? Otherwise the case for including these is not very compelling due to the redundancy with the merged extending multiply instructions :/

@Maratyszcza
Contributor Author

I think in this case we need a good argument to NOT include them, as they are extra forms of existing instructions rather than new instructions. Typically we don't even evaluate all instruction forms, and just assume that a speedup on one variant applies to the other variants. And here we have a demonstration that these instruction forms are actually useful.

@omnisip

omnisip commented Dec 22, 2020

This was discussed in (#402 12/22/2020 Sync Meeting). Provisional voting results were:

SF: 4, F: 3, N: 1

Link to minutes: https://docs.google.com/document/d/1Tnf-fvRcCVj_vv8CjAN_qUvj4Wa57JnPa8ghGi4_C-s/edit#

@tlively tlively merged commit daa35f5 into WebAssembly:master Jan 11, 2021
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jan 22, 2021
These instructions were accepted into the proposal:
WebAssembly/simd#290

Bug: v8:10972
Change-Id: Ia2cce2df575786babe770b043b1e90bf953c5f9b
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2643658
Reviewed-by: Deepti Gandluri <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#72243}
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jan 22, 2021
…ifdefs

Port ec8fbed

Original Commit Message:

    These instructions were accepted into the proposal:
    WebAssembly/simd#290

[email protected], [email protected], [email protected], [email protected]
BUG=
LOG=N

Change-Id: I69bbe90ab3af30d7748332a7e99b7812c95f96b4
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2644939
Reviewed-by: Junliang Yan <[email protected]>
Reviewed-by: Milad Fa <[email protected]>
Commit-Queue: Milad Fa <[email protected]>
Cr-Commit-Position: refs/heads/master@{#72257}
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 4, 2021
This was merged in WebAssembly#290.

Also tweaked the file generation scripts:

- Make simd_arithmetic more generic (allow different instruction name
patterns)
- create a new file simd_int_to_int_widen to generate all integer
widening operations (including the ones implemented in this PR)
- remove widening tests from simd_conversions.wast
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 4, 2021
This is a simple change, validation and execution is already
shape-agnostic, so we only need to add it to the syntax.

Instructions were merged in WebAssembly#290.
ngzhian added a commit that referenced this pull request Feb 9, 2021
This is a simple change, validation and execution is already
shape-agnostic, so we only need to add it to the syntax.

Instructions were merged in #290.
ngzhian added a commit that referenced this pull request Feb 9, 2021
This was merged in #290.

Also tweaked the file generation scripts:

- Make simd_arithmetic more generic (allow different instruction name
patterns)
- create a new file simd_int_to_int_widen to generate all integer
widening operations (including the ones implemented in this PR)
- remove widening tests from simd_conversions.wast
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021
As proposed in WebAssembly/simd#290. As usual, these
instructions are available only via builtin functions and intrinsics while they
are in the prototyping stage.

Differential Revision: https://reviews.llvm.org/D90504