This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.widen_(low/high)_i32x4_(s/u) instructions #290

Merged 1 commit into WebAssembly:master on Jan 11, 2021

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Aug 5, 2020

Introduction

This is a proposal to add new variants of the existing widen instructions. The new variants convert the two low/high 32-bit integers in a SIMD vector into a vector of two 64-bit integers. It is unclear why these variants were left out of #89, as both x86 (since SSE4.1) and ARM NEON include native equivalents. @AndrewScheidecker asked to reserve opcode space for these instructions, but there seems to have been no discussion about including or excluding the 64-bit forms.
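For reference, the semantics of the four proposed instructions can be modeled in portable scalar C as follows (an illustrative sketch with hypothetical function names, not code from the proposal; lane 0 is the lowest lane, following the proposal's lane numbering):

```c
#include <stdint.h>

/* Scalar model of the proposed instructions: take the two low (lanes 0-1)
 * or two high (lanes 2-3) 32-bit lanes of a v128 and sign- or zero-extend
 * each one to a 64-bit lane. */
void i64x2_widen_low_i32x4_s(int64_t out[2], const int32_t x[4]) {
  out[0] = (int64_t)x[0];  /* sign-extend lane 0 */
  out[1] = (int64_t)x[1];  /* sign-extend lane 1 */
}

void i64x2_widen_high_i32x4_u(uint64_t out[2], const int32_t x[4]) {
  out[0] = (uint64_t)(uint32_t)x[2];  /* zero-extend lane 2 */
  out[1] = (uint64_t)(uint32_t)x[3];  /* zero-extend lane 3 */
}
```

The `_s`/`_u` suffix only changes how the upper 32 bits of each result lane are filled: copies of the sign bit for the signed forms, zeros for the unsigned forms.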

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to VPMOVSXDQ xmm_y, xmm_x
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to VPMOVZXDQ xmm_y, xmm_x
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to VPUNPCKHQDQ xmm_y, xmm_x, xmm_x + VPMOVSXDQ xmm_y, xmm_y
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to VPXOR xmm_y, xmm_y, xmm_y + VPUNPCKHDQ xmm_y, xmm_x, xmm_y

x86/x86-64 processors with SSE4.1 instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to PMOVSXDQ xmm_y, xmm_x
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to PMOVZXDQ xmm_y, xmm_x
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to PSHUFD xmm_y, xmm_x, 0xEE + PMOVSXDQ xmm_y, xmm_y
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to PSHUFD xmm_y, xmm_x, 0xEE + PMOVZXDQ xmm_y, xmm_y

x86/x86-64 processors with SSE2 instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PCMPGTD xmm_tmp, xmm_x + PUNPCKLDQ xmm_y, xmm_tmp
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PUNPCKLDQ xmm_y, xmm_tmp
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PCMPGTD xmm_tmp, xmm_x + PUNPCKHDQ xmm_y, xmm_tmp
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to PXOR xmm_tmp, xmm_tmp + MOVDQA xmm_y, xmm_x + PUNPCKHDQ xmm_y, xmm_tmp
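The signed SSE2 sequences work by synthesizing the sign half of each 64-bit lane with PCMPGTD against zero (all-ones where the input is negative, zero otherwise) and then interleaving it with the source lanes. A portable sketch of that trick for one lane (my illustration, not code from the proposal):

```c
#include <stdint.h>

/* Emulate the SSE2 signed-widening trick in scalar code:
 * sign = (0 > x) ? 0xFFFFFFFF : 0   -- this is what PCMPGTD computes --
 * then each 64-bit result lane is (sign << 32) | (uint32_t)x, which is
 * exactly what the PUNPCKLDQ/PUNPCKHDQ interleave with xmm_tmp produces. */
int64_t widen_via_compare(int32_t x) {
  uint32_t sign = (0 > x) ? 0xFFFFFFFFu : 0u;              /* PCMPGTD vs. zero */
  return (int64_t)(((uint64_t)sign << 32) | (uint32_t)x);  /* interleave halves */
}
```

The unsigned forms drop the PCMPGTD and interleave with an all-zero register instead, so the upper 32 bits of each lane are simply zero.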

ARM64 processors

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to SSHLL Vy.2D, Vx.2S, 0
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to USHLL Vy.2D, Vx.2S, 0
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to SSHLL2 Vy.2D, Vx.4S, 0
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to USHLL2 Vy.2D, Vx.4S, 0

ARMv7 processors with NEON instruction set

  • i64x2.widen_low_i32x4_s
    • y = i64x2.widen_low_i32x4_s(x) is lowered to VMOVL.S32 Qy, Dx_lo
  • i64x2.widen_low_i32x4_u
    • y = i64x2.widen_low_i32x4_u(x) is lowered to VMOVL.U32 Qy, Dx_lo
  • i64x2.widen_high_i32x4_s
    • y = i64x2.widen_high_i32x4_s(x) is lowered to VMOVL.S32 Qy, Dx_hi
  • i64x2.widen_high_i32x4_u
    • y = i64x2.widen_high_i32x4_u(x) is lowered to VMOVL.U32 Qy, Dx_hi

@tlively
Member

tlively commented Aug 6, 2020

Why do these instructions have i32x8 in their names? Are they actually operating on 256 bits somehow? Or should that be i32x4?

Are there applications where these instructions would be important for performance?

@Maratyszcza
Contributor Author

Good catch, @tlively! They should be i64x2 and i32x4, respectively.

@Maratyszcza Maratyszcza changed the title i64x4.widen_(low/high)_i32x8_(s/u) instructions i64x2.widen_(low/high)_i32x4_(s/u) instructions Aug 6, 2020
@Maratyszcza
Contributor Author

Maratyszcza commented Sep 19, 2020

As for the applications, the lack of these instructions came up when porting fixed-point neural network inference microkernels in XNNPACK to WebAssembly SIMD: these lines would benefit from the 64-bit widening instructions.
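The requantization step in question is a fixed-point (Q31) rounding high multiply: the full 64-bit product of two 32-bit values is needed before the result is narrowed back down, which is why each i32x4 input must first be widened to i64x2 halves. A scalar sketch of that pattern (illustrative code with a hypothetical function name, not the actual XNNPACK kernel):

```c
#include <stdint.h>

/* Q31 rounding-doubling high multiply: computes round(2*a*b / 2^31).
 * The 64-bit intermediates correspond to i64x2.widen_*_i32x4_s lanes in
 * the SIMD version. Saturation of the INT32_MIN * INT32_MIN corner case
 * is omitted in this sketch. */
int32_t q31_rounding_multiply_high(int32_t a, int32_t b) {
  int64_t wide_a = (int64_t)a;        /* widen to 64 bits */
  int64_t wide_b = (int64_t)b;
  int64_t product = wide_a * wide_b;  /* full 64-bit product */
  return (int32_t)((2 * product + (INT64_C(1) << 30)) >> 31);  /* round, keep high bits */
}
```

Without 64-bit widening instructions, a SIMD implementation of this step has to shuffle and mask the 32-bit lanes manually before each 64-bit multiply.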

@ngzhian
Member

ngzhian commented Sep 29, 2020

I think this looks good: it fills some holes in the i64x2 column, and the codegen is good. I'll get started on prototyping in V8, starting with arm64, then x64.

@ngzhian
Member

ngzhian commented Sep 29, 2020

Btw, we have 0xc7 through 0xca reserved in the opcode space (see NewOpcodes.md), so I think we can use those (in BinarySIMD.md).

@Maratyszcza
Contributor Author

@ngzhian Assigned opcodes in the 0xc7-0xca range.

@ngzhian
Member

ngzhian commented Oct 8, 2020

Prototyped on arm64 in https://crrev.com/c/2441369

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 30, 2020
As proposed in WebAssembly/simd#290. As usual, these
instructions are available only via builtin functions and intrinsics while they
are in the prototyping stage.

Differential Revision: https://reviews.llvm.org/D90504
@tlively
Member

tlively commented Oct 30, 2020

This is prototyped in LLVM (but not Binaryen) via __builtin_wasm_widen_{low,high}_{s,u}_i32x4_i64x2. It should be usable in tip-of-tree Emscripten in a few hours, as long as you do not pass optimization flags at link time.

@Maratyszcza
Contributor Author

Experiment to enable these instructions in XNNPACK is in google/XNNPACK#1237, but fails to link (even with optimization disabled): [parse exception: invalid code after SIMD prefix: 199 (at 0:1439773)]

tlively added a commit to tlively/binaryen that referenced this pull request Dec 11, 2020
tlively added a commit to WebAssembly/binaryen that referenced this pull request Dec 12, 2020
@Maratyszcza
Contributor Author

I evaluated the performance impact of these instructions by leveraging them in the requantization parts of fixed-point neural network inference primitives in the XNNPACK library, similarly to #376. The performance impact on the requantization primitive is summarized in the table below:

| Processor (Device) | WAsm SIMD + i64x2.widen_(low/high)_i32x4_s | WAsm SIMD (Chrome M86-level baseline) | Speedup |
|---|---|---|---|
| Snapdragon 855 (LG G8 ThinQ) | 2.46 GB/s | 2.37 GB/s | 4% |
| Snapdragon 670 (Pixel 3a) | 1.39 GB/s | 1.36 GB/s | 2% |
| Exynos 8895 (Galaxy S8) | 1.33 GB/s | 1.28 GB/s | 4% |

The code modifications can be seen in google/XNNPACK#1237. The benefits of these instructions in this particular use-case are superseded by #376.

@tlively
Member

tlively commented Dec 17, 2020

@Maratyszcza do you have any other potential use cases for these instructions? Otherwise the case for including these is not very compelling due to the redundancy with the merged extending multiply instructions :/

@Maratyszcza
Contributor Author

I think in this case we need a good argument to NOT include them, as they are extra forms of existing instructions rather than new instructions. Typically we don't even evaluate all instruction forms, and just assume that a speedup on one variant applies to the other variants. And here we have a demonstration that these instruction forms are actually useful.

@omnisip

omnisip commented Dec 22, 2020

This was discussed in (#402 12/22/2020 Sync Meeting). Provisional voting results were:

SF: 4, F: 3, N: 1

Link to minutes: https://docs.google.com/document/d/1Tnf-fvRcCVj_vv8CjAN_qUvj4Wa57JnPa8ghGi4_C-s/edit#

@tlively tlively merged commit daa35f5 into WebAssembly:master Jan 11, 2021
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jan 22, 2021
These instructions were accepted into the proposal:
WebAssembly/simd#290

Bug: v8:10972
Change-Id: Ia2cce2df575786babe770b043b1e90bf953c5f9b
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2643658
Reviewed-by: Deepti Gandluri <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#72243}
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Jan 22, 2021
…ifdefs

Port ec8fbed

Original Commit Message:

    These instructions were accepted into the proposal:
    WebAssembly/simd#290

[email protected], [email protected], [email protected], [email protected]
BUG=
LOG=N

Change-Id: I69bbe90ab3af30d7748332a7e99b7812c95f96b4
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2644939
Reviewed-by: Junliang Yan <[email protected]>
Reviewed-by: Milad Fa <[email protected]>
Commit-Queue: Milad Fa <[email protected]>
Cr-Commit-Position: refs/heads/master@{#72257}
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 4, 2021
This was merged in WebAssembly#290.

Also tweaked the file generation scripts:

- Make simd_arithmetic more generic (allow different instruction name
patterns)
- create a new file simd_int_to_int_widen to generate all integer
widening operations (including the ones implemented in this PR)
- remove widening tests from simd_conversions.wast
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 4, 2021
This is a simple change, validation and execution is already
shape-agnostic, so we only need to add it to the syntax.

Instructions were merged in WebAssembly#290.
ngzhian added a commit that referenced this pull request Feb 9, 2021
This is a simple change, validation and execution is already
shape-agnostic, so we only need to add it to the syntax.

Instructions were merged in #290.
ngzhian added a commit that referenced this pull request Feb 9, 2021
This was merged in #290.

Also tweaked the file generation scripts:

- Make simd_arithmetic more generic (allow different instruction name
patterns)
- create a new file simd_int_to_int_widen to generate all integer
widening operations (including the ones implemented in this PR)
- remove widening tests from simd_conversions.wast
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021
As proposed in WebAssembly/simd#290. As usual, these
instructions are available only via builtin functions and intrinsics while they
are in the prototyping stage.

Differential Revision: https://reviews.llvm.org/D90504