i64x2.min_s and i64x2.max_s instructions #417

Maratyszcza · 2020-12-29T23:55:21Z

Introduction

This is a proposal to add 64-bit variants of the existing min_s and max_s instructions. Only x86 processors with AVX512 natively support these instructions, but ARMv7 NEON, ARM64, and x86 with SSE4.2 or AVX can efficiently emulate them with 2-4 instructions.
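
For reference, the intended lane-wise behavior can be sketched in scalar C. This is only an illustration, assuming the new instructions mirror the semantics of the existing narrower min_s/max_s instructions; the `i64x2` struct and the function names are hypothetical helpers, not part of any WebAssembly API.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit vector viewed as two signed 64-bit lanes. */
typedef struct { int64_t lane[2]; } i64x2;

/* Lane-wise signed minimum (assumed semantics of i64x2.min_s). */
static i64x2 i64x2_min_s(i64x2 a, i64x2 b) {
    i64x2 y;
    for (int i = 0; i < 2; i++)
        y.lane[i] = a.lane[i] < b.lane[i] ? a.lane[i] : b.lane[i];
    return y;
}

/* Lane-wise signed maximum (assumed semantics of i64x2.max_s). */
static i64x2 i64x2_max_s(i64x2 a, i64x2 b) {
    i64x2 y;
    for (int i = 0; i < 2; i++)
        y.lane[i] = a.lane[i] > b.lane[i] ? a.lane[i] : b.lane[i];
    return y;
}
```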

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX512F and AVX512VL instruction sets

i64x2.min_s
- y = i64x2.min_s(a, b) is lowered to VPMINSQ xmm_y, xmm_a, xmm_b
i64x2.max_s
- y = i64x2.max_s(a, b) is lowered to VPMAXSQ xmm_y, xmm_a, xmm_b

x86/x86-64 processors with AVX instruction set

i64x2.min_s
- y = i64x2.min_s(a, b) (y is not a and y is not b) is lowered to:
  - VPCMPGTQ xmm_y, xmm_a, xmm_b
  - VPBLENDVB xmm_y, xmm_a, xmm_b, xmm_y
i64x2.max_s
- y = i64x2.max_s(a, b) (y is not a and y is not b) is lowered to:
  - VPCMPGTQ xmm_y, xmm_a, xmm_b
  - VPBLENDVB xmm_y, xmm_b, xmm_a, xmm_y
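
The compare-then-blend pattern above can be modelled on a single 64-bit lane in scalar C. This is a rough sketch rather than engine code: `cmpgt_s64` stands in for VPCMPGTQ, `blendv_bytes` stands in for the byte-wise variable blend (which looks only at the top bit of each mask byte), and all function names are hypothetical.

```c
#include <stdint.h>

/* Models VPCMPGTQ on one lane: all-ones when a > b (signed), else zero. */
static uint64_t cmpgt_s64(int64_t a, int64_t b) {
    return a > b ? UINT64_MAX : 0;
}

/* Models a byte-wise variable blend on one lane: each result byte comes from
   if_set when the top bit of the matching mask byte is 1, else from if_clear. */
static uint64_t blendv_bytes(uint64_t if_clear, uint64_t if_set, uint64_t mask) {
    uint64_t y = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte_mask = (uint64_t)0xFF << (8 * i);
        int top_bit = (int)((mask >> (8 * i + 7)) & 1);
        y |= (top_bit ? if_set : if_clear) & byte_mask;
    }
    return y;
}

/* min: take b where a > b, otherwise a.
   The cast back to int64_t assumes two's-complement conversion. */
static int64_t min_s64_cmp_blend(int64_t a, int64_t b) {
    uint64_t mask = cmpgt_s64(a, b);
    return (int64_t)blendv_bytes((uint64_t)a, (uint64_t)b, mask);
}

/* max: take a where a > b, otherwise b. */
static int64_t max_s64_cmp_blend(int64_t a, int64_t b) {
    uint64_t mask = cmpgt_s64(a, b);
    return (int64_t)blendv_bytes((uint64_t)b, (uint64_t)a, mask);
}
```

The only difference between the two functions is the order of the blend operands, which matches the swapped xmm_a/xmm_b operands in the two listings.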

x86/x86-64 processors with SSE4.2 instruction set

i64x2.min_s
- y = i64x2.min_s(a, b) (y is not b and a/b/y are not in xmm0) is lowered to:
  - MOVDQA xmm0, xmm_a
  - MOVDQA xmm_y, xmm_a
  - PCMPGTQ xmm0, xmm_b
  - PBLENDVB xmm_y, xmm_b
i64x2.max_s
- y = i64x2.max_s(a, b) (y is not a and a/b/y are not in xmm0) is lowered to:
  - MOVDQA xmm0, xmm_a
  - MOVDQA xmm_y, xmm_b
  - PCMPGTQ xmm0, xmm_b
  - PBLENDVB xmm_y, xmm_a

x86/x86-64 processors with SSE4.1 instruction set

Based on this answer by user aqrit on Stack Overflow

i64x2.min_s
- y = i64x2.min_s(a, b) (y is not a and y is not b and a/b/y are not in xmm0) is lowered to:
  - MOVDQA xmm0, xmm_b
  - MOVDQA xmm_y, xmm_a
  - PSUBQ xmm0, xmm_a
  - PCMPEQD xmm_y, xmm_b
  - PAND xmm0, xmm_y
  - MOVDQA xmm_y, xmm_a
  - PCMPGTD xmm_y, xmm_b
  - POR xmm0, xmm_y
  - MOVDQA xmm_y, xmm_a
  - PSHUFD xmm0, xmm0, 0xF5
  - PBLENDVB xmm_y, xmm_b
i64x2.max_s
- y = i64x2.max_s(a, b) (y is not a and y is not b and a/b/y are not in xmm0) is lowered to:
  - MOVDQA xmm0, xmm_b
  - MOVDQA xmm_y, xmm_a
  - PSUBQ xmm0, xmm_a
  - PCMPEQD xmm_y, xmm_b
  - PAND xmm0, xmm_y
  - MOVDQA xmm_y, xmm_a
  - PCMPGTD xmm_y, xmm_b
  - POR xmm0, xmm_y
  - MOVDQA xmm_y, xmm_b
  - PSHUFD xmm0, xmm0, 0xF5
  - PBLENDVB xmm_y, xmm_a
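
The first nine instructions of each sequence build a signed 64-bit "greater than" mask out of 32-bit compares and one 64-bit subtraction. A scalar sketch of that trick for one lane is given below, assuming the reading of the listing described in the comments; the function name is hypothetical. Only the high 32-bit half of the intermediate result is modelled, because PSHUFD with 0xF5 discards the low half and duplicates the high half.

```c
#include <stdint.h>

/* Returns all-ones if a > b (signed 64-bit), else zero, using only 32-bit
   compares and one wrapping 64-bit subtraction. */
static uint64_t gt_s64_via_32(int64_t a, int64_t b) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint32_t a_hi = (uint32_t)(ua >> 32);
    uint32_t b_hi = (uint32_t)(ub >> 32);

    /* PSUBQ: high half of the wrapping difference b - a. When the high halves
       are equal, this is all-ones exactly when a's low half exceeds b's. */
    uint32_t diff_hi = (uint32_t)((ub - ua) >> 32);

    /* PCMPEQD on the high halves. */
    uint32_t eq_hi = (a_hi == b_hi) ? UINT32_MAX : 0;

    /* PCMPGTD on the high halves: a signed compare, done here by flipping
       the sign bit and comparing as unsigned. */
    uint32_t gt_hi = ((a_hi ^ 0x80000000u) > (b_hi ^ 0x80000000u)) ? UINT32_MAX : 0;

    /* PAND then POR. */
    uint32_t r_hi = (eq_hi & diff_hi) | gt_hi;

    /* PSHUFD 0xF5: copy the high half into both halves of the lane. */
    return ((uint64_t)r_hi << 32) | r_hi;
}
```

In this model the result is always all-ones or all-zeros, which is what the final blend relies on.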

x86/x86-64 processors with SSE2 instruction set

Based on this answer by user aqrit on Stack Overflow

i64x2.min_s
- y = i64x2.min_s(a, b) (y is not a and y is not b) is lowered to:
  - MOVDQA xmm_y, xmm_b
  - MOVDQA xmm_tmp, xmm_a
  - PSUBQ xmm_y, xmm_a
  - PCMPEQD xmm_tmp, xmm_b
  - PAND xmm_y, xmm_tmp
  - MOVDQA xmm_tmp, xmm_a
  - PCMPGTD xmm_tmp, xmm_b
  - POR xmm_y, xmm_tmp
  - MOVDQA xmm_tmp, xmm_b
  - PSHUFD xmm_y, xmm_y, 0xF5
  - PAND xmm_tmp, xmm_y
  - PANDN xmm_y, xmm_a
  - POR xmm_y, xmm_tmp
i64x2.max_s
- y = i64x2.max_s(a, b) (y is not a and y is not b) is lowered to:
  - MOVDQA xmm_y, xmm_b
  - MOVDQA xmm_tmp, xmm_a
  - PSUBQ xmm_y, xmm_a
  - PCMPEQD xmm_tmp, xmm_b
  - PAND xmm_y, xmm_tmp
  - MOVDQA xmm_tmp, xmm_a
  - PCMPGTD xmm_tmp, xmm_b
  - POR xmm_y, xmm_tmp
  - MOVDQA xmm_tmp, xmm_a
  - PSHUFD xmm_y, xmm_y, 0xF5
  - PAND xmm_tmp, xmm_y
  - PANDN xmm_y, xmm_b
  - POR xmm_y, xmm_tmp
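
Without PBLENDVB, the SSE2 sequences finish with a PAND / PANDN / POR bitwise select. A minimal scalar sketch of that tail for one lane follows; the function name is hypothetical, and the instruction comments refer to the min_s listing.

```c
#include <stdint.h>

/* Bitwise select: bits of x where mask is 1, bits of y where mask is 0. */
static uint64_t select_bits(uint64_t mask, uint64_t x, uint64_t y) {
    uint64_t tmp = x & mask;    /* PAND  xmm_tmp, xmm_y */
    uint64_t rest = ~mask & y;  /* PANDN xmm_y, xmm_a   */
    return rest | tmp;          /* POR   xmm_y, xmm_tmp */
}
```

Unlike a byte-wise blend, this select works bit by bit, so it only yields a whole operand when the mask is all-ones or all-zeros within each lane.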

ARM64 processors

i64x2.min_s
- y = i64x2.min_s(a, b) (y is not a and y is not b) is lowered to:
  - CMGT Vy.2D, Va.2D, Vb.2D
  - BSL Vy.16B, Vb.16B, Va.16B
i64x2.max_s
- y = i64x2.max_s(a, b) (y is not a and y is not b) is lowered to:
  - CMGT Vy.2D, Va.2D, Vb.2D
  - BSL Vy.16B, Va.16B, Vb.16B

ARMv7 processors with NEON instruction set

Based on this answer by user aqrit on Stack Overflow

i64x2.min_s
- y = i64x2.min_s(a, b) (y is not a and y is not b) is lowered to:
  - VQSUB.S64 Qy, Qb, Qa
  - VSHR.S64 Qy, Qy, #63
  - VBSL Qy, Qb, Qa
i64x2.max_s
- y = i64x2.max_s(a, b) (y is not a and y is not b) is lowered to:
  - VQSUB.S64 Qy, Qb, Qa
  - VSHR.S64 Qy, Qy, #63
  - VBSL Qy, Qa, Qb
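
The NEON sequences derive the select mask from the sign of a saturating difference instead of a compare. The scalar sketch below models one lane under that reading; the helper names are hypothetical. Saturation matters here: a wrapping b - a could overflow and flip sign, whereas the saturated result always has the sign of the true difference.

```c
#include <stdint.h>

/* Saturating signed 64-bit subtraction x - y, modelling VQSUB.S64. */
static int64_t qsub_s64(int64_t x, int64_t y) {
    if (y > 0 && x < INT64_MIN + y) return INT64_MIN;  /* would underflow */
    if (y < 0 && x > INT64_MAX + y) return INT64_MAX;  /* would overflow  */
    return x - y;
}

/* min: the sign of sat(b - a) is set exactly when b < a, so pick b then.
   VSHR.S64 #63 broadcasts that sign bit into a full mask, and VBSL selects. */
static int64_t min_s64_via_qsub(int64_t a, int64_t b) {
    int b_is_less = qsub_s64(b, a) < 0;
    return b_is_less ? b : a;
}

/* max: same mask, operands of the select swapped. */
static int64_t max_s64_via_qsub(int64_t a, int64_t b) {
    int b_is_less = qsub_s64(b, a) < 0;
    return b_is_less ? a : b;
}
```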

ngzhian · 2021-01-05T00:49:56Z

I don't think this meets the bar for inclusion. The codegen is not great, and half of the use cases are SIMD libraries which expose such instructions (they don't use them).

Maratyszcza · 2021-01-05T20:52:47Z

It is expected that most uses of 64-bit integer operations are through either high-level wrappers or auto-vectorization: there are usually more efficient ways to do computations within narrower data types, but they are ISA-specific (e.g. on ARM NEON we may use saturated 32-bit arithmetic, but it is not portable to x86). Thus it is mainly code that trades some performance for portability (through high-level wrapper libraries or through auto-vectorization) that uses 64-bit arithmetic.

IMO lowering on recentish systems isn't bad: 4 instructions on SSE4.2, 3 instructions on ARMv7 NEON, 2 instructions on ARM64 and AVX. Without specialized i64x2.min_s/i64x2.max_s instructions, but with i64x2.gt_s, we'd have the same 2/3 instructions on ARM64/ARMv7+NEON, but 6+ instructions on SSE4.2 and 4 instructions on AVX (because they'd have to use v128.bitselect instead of [V]PBLENDVB).

dtig · 2021-01-25T19:31:06Z

Adding a preliminary vote for the inclusion of i64x2 signed min/max operations to the SIMD proposal below. Please vote with -

👍 For including i64x2 signed min/max operations
👎 Against including i64x2 signed min/max operations

Maratyszcza · 2021-02-04T19:30:31Z

The community group unanimously decided against including these instructions in the 1/29/21 meeting (#429).

Maratyszcza mentioned this pull request Jan 5, 2021

Agenda for sync meeting 1/8/2021 #410

Closed

penzn mentioned this pull request Jan 8, 2021

Agenda for sync meeting 1/22/21 #419

Closed

abrown mentioned this pull request Jan 11, 2021

i64x2.min_u and i64x2.max_u instructions #418

Closed

i64x2.min_s and i64x2.max_s instructions

0614819

Maratyszcza force-pushed the minmaxs-64bit branch from 0f30463 to 0614819 Compare January 19, 2021 20:43

tlively mentioned this pull request Jan 23, 2021

Agenda for sync meeting 1/29/21 #429

Closed

ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021

tlively added the post SIMD MVP label Feb 2, 2021

dtig removed the 2021-01-29 Agenda for sync meeting 1/29/21 label Feb 2, 2021

Maratyszcza closed this Feb 4, 2021
