This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Pseudo-Minimum and Pseudo-Maximum instructions #122

Merged
merged 1 commit into from
Sep 11, 2020

Conversation

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented Oct 21, 2019

Introduction

The f32x4.min/f32x4.max/f64x2.min/f64x2.max instructions in WebAssembly SIMD are the natural extensions of the scalar WebAssembly MVP instructions to SIMD instruction sets. These instructions follow JavaScript rules for NaN propagation: if any operand is NaN, the output is NaN. However, the min/max operations as defined in the WebAssembly specification have several drawbacks:

  1. They have very asymmetric cost across popular architectures. E.g. while f32x4.min maps to a single instruction (FMIN Vd.4S, Vn.4S, Vm.4S) on ARM64, it has no direct or even close equivalent in x86 instruction sets. As a result, V8 has to use 8 SSE2 instructions to lower this one WebAssembly instruction.
  2. No operator or function in C or C++ defines a minimum or maximum operation with the same semantics as WebAssembly. Therefore, optimizing compilers can't generate the min/max instructions of WAsm SIMD by auto-vectorizing scalar code.
  3. They lose information. In particular, the set of values { f32x4.min(a, b), f32x4.max(a, b) } is not always equivalent to the set of values { a, b }. Consequently, sorting networks, which underlie SIMD-friendly algorithms for sorting (e.g. bitonic sort) and partial ordering, can't be implemented on top of the min/max operations of WebAssembly SIMD.

New instructions

This PR introduces Pseudo-Minimum (f32x4.pmin and f64x2.pmin) and Pseudo-Maximum (f32x4.pmax and f64x2.pmax) instructions, which implement Pseudo-Minimum and Pseudo-Maximum operations with slightly different semantics than the Minimum and Maximum in the current spec. Pseudo-Minimum is defined as pmin(a, b) := b < a ? b : a and Pseudo-Maximum is defined as pmax(a, b) := a < b ? b : a. "Pseudo" in the name refers to the fact that these operations may not return the true minimum/maximum in case of signed zero inputs, in particular:

  • pmin(+0.0, -0.0) == +0.0
  • pmax(-0.0, +0.0) == -0.0

The new instructions fix some of the issues with WebAssembly min/max instructions:

  1. They have much more uniform cost across different architectures. On x86 processors with the SSE2 instruction set, these instructions map directly to the MINPS/MAXPS/MINPD/MAXPD instructions. ARM processors don't have an exact equivalent, but can implement the same operation with just two instructions. The table below compares the cost (in instructions) of the new f32x4.pmin instruction to the currently available alternatives:

| Instruction sequence | x86 with SSE2 | ARM NEON | ARM64 |
| --- | --- | --- | --- |
| f32x4.min | 8 | 1 | 1 |
| f32x4.bitselect(b, a, f32x4.lt(b, a)) | 4 | 2 | 2 |
| f32x4.pmin | 1 | 2 | 2 |

  2. The definitions of the Pseudo-Minimum and Pseudo-Maximum operations exactly match the std::min<T> and std::max<T> functions in the C++ standard template library. Thus, optimizing compilers are more likely to find opportunities for auto-vectorization in existing scalar code.
  3. Pseudo-Minimum and Pseudo-Maximum operations jointly preserve information about their inputs, i.e. { pmin(a, b), pmax(a, b) } == { a, b }. Thus, they are suitable for efficient implementation of sorting networks, and in particular the bitonic sort algorithm.

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • f32x4.pmin
    • y = f32x4.pmin(a, b) is lowered to VMINPS xmm_y, xmm_b, xmm_a
  • f32x4.pmax
    • y = f32x4.pmax(a, b) is lowered to VMAXPS xmm_y, xmm_b, xmm_a
  • f64x2.pmin
    • y = f64x2.pmin(a, b) is lowered to VMINPD xmm_y, xmm_b, xmm_a
  • f64x2.pmax
    • y = f64x2.pmax(a, b) is lowered to VMAXPD xmm_y, xmm_b, xmm_a

x86/x86-64 processors with SSE2 instruction set

  • f32x4.pmin
    • b = f32x4.pmin(a, b) is lowered to MINPS xmm_b, xmm_a
    • y = f32x4.pmin(a, b) is lowered to MOVAPS xmm_y, xmm_b + MINPS xmm_y, xmm_a
  • f32x4.pmax
    • b = f32x4.pmax(a, b) is lowered to MAXPS xmm_b, xmm_a
    • y = f32x4.pmax(a, b) is lowered to MOVAPS xmm_y, xmm_b + MAXPS xmm_y, xmm_a
  • f64x2.pmin
    • b = f64x2.pmin(a, b) is lowered to MINPD xmm_b, xmm_a
    • y = f64x2.pmin(a, b) is lowered to MOVAPD xmm_y, xmm_b + MINPD xmm_y, xmm_a
  • f64x2.pmax
    • b = f64x2.pmax(a, b) is lowered to MAXPD xmm_b, xmm_a
    • y = f64x2.pmax(a, b) is lowered to MOVAPD xmm_y, xmm_b + MAXPD xmm_y, xmm_a

Other processors and instruction sets

  • f32x4.pmin
    • y = f32x4.pmin(a, b) is lowered like v128.bitselect(b, a, f32x4.lt(b, a))
  • f32x4.pmax
    • y = f32x4.pmax(a, b) is lowered like v128.bitselect(b, a, f32x4.lt(a, b))
  • f64x2.pmin
    • y = f64x2.pmin(a, b) is lowered like v128.bitselect(b, a, f64x2.lt(b, a))
  • f64x2.pmax
    • y = f64x2.pmax(a, b) is lowered like v128.bitselect(b, a, f64x2.lt(a, b))

@penzn
Contributor

penzn commented Oct 29, 2019

This actually seems like a very good idea. On the other hand, would going from scalar "regular" min/max to SIMD quasi min/max for a loop be an issue for optimizers? They would not be able to argue that the semantics are preserved. Should matching operations be added to the scalar instruction set?

@tlively
Member

tlively commented Oct 29, 2019

In LLVM at least, the autovectorizer does not know what kind of min/max instructions are natively supported, so its behavior would not change. But if a user had a loop of min/max operations obeying the semantics of wasm's scalar min/max, those min/max semantics would not be changed by vectorization, so these new vector min/max instructions would not be used.

@sunfishcode
Member

sunfishcode commented Oct 29, 2019

As a minor bikeshed, the word "quasi" involves nondeterminism in the quasi-fma proposal, so it's a little confusing that "quasi" here doesn't involve nondeterminism. What would you think about using the term "asymmetric" instead here?

@penzn
Contributor

penzn commented Oct 30, 2019

It looks like emscripten resolves min and max scalar builtins to calls, rather than to instructions. This makes sense given the semantics. For x86, Clang produces a max instruction from the same example.

$ cat max.c
#include <math.h>

float get_max(float a, float b) {
  return fmax(a, b);
}
$ ~/emscripten/emcc -O3 --target=wasm32 -S max.c -o -
get_max:                                # @get_max
        .functype       get_max (f32, f32) -> (f32)
# %bb.0:                                # %entry
        local.get       0
        f64.promote_f32
        local.get       1
        f64.promote_f32
        f64.call        fmax
        f32.demote_f64
                                        # fallthrough-return-value
        end_function

There might be use in defining min/max with native semantics for Wasm in general, not only for Wasm SIMD. I do have some mild concerns about adding it just here, but I don't feel very strongly either way.

@dtig
Member

dtig commented Oct 31, 2019

The precedent here has been set by the MVP. Having different semantics for the scalar and vector versions of these is problematic for engines that support a scalar fallback for vector operations, because the behavior of these instructions would be subtly different on different code paths. I'm concerned about baking this type of non-conformity into the spec, though arguably the applications that depend on the scalarized code are not the primary target of this proposal. This is one of the cases where consistency with the MVP is at odds with the native semantics, and it sounds like a good candidate to bring up to the broader CG for feedback, and possibly a resolution with a vote.

@Maratyszcza
Contributor Author

@penzn Please note that I propose to add new instructions with the semantics of std::min and std::max in C++. I don't propose to remove or modify the existing f32x4.min/f32x4.max/f64x2.min/f64x2.max instructions. Thus, an optimizing compiler would vectorize f32.min/f32.max/f64.min/f64.max as f32x4.min/f32x4.max/f64x2.min/f64x2.max instructions, and vectorize std::min<float>/std::max<float>/std::min<double>/std::max<double> as f32x4.qmin/f32x4.qmax/f64x2.qmin/f64x2.qmax instructions. The operations of std::min<float>/std::max<float>/std::min<double>/std::max<double> indeed don't have equivalent instructions in WAsm MVP and will be represented as a sequence of instructions. While it would be useful to have scalar equivalents, AFAIU they are out of scope for the SIMD spec.

@Maratyszcza
Contributor Author

@sunfishcode "quasi" is Latin for "almost". Quasi-Fused Multiply-Add is almost fused, in the sense that it is fused on most processors (e.g. all ARM64 processors), but can be non-fused on some (e.g. low-end Intel processors). Quasi Minimum/Maximum is almost Minimum/Maximum, in the sense that it usually produces the minimum/maximum of two numbers, but may produce a "wrong" result in two cases: qmin(+0.0, -0.0) == +0.0 and qmax(-0.0, +0.0) == -0.0.

That said, I care more about having these operations in the SIMD specification than about their names, and am open to alternative naming conventions. Unfortunately, "asymmetric" minimum/maximum would abbreviate to amin/amax, which could be confused with the absolute minimum/maximum operations present in some SIMD instruction sets (e.g. x86 AVX512 and MIPS MSA).

@Maratyszcza
Contributor Author

@penzn C/C++ fmin and fmax have different semantics than floating-point min/max instructions in WebAssembly (and also different semantics than the proposed qmin/qmax instructions): if one of the operands is NaN, fmin and fmax return the other operand. This operation is called minNum/maxNum in IEEE754 specification.

@Maratyszcza
Contributor Author

@dtig I agree that having an inconsistency between scalar and SIMD operations would be concerning. However, my proposal is not about removing or modifying the existing f32x4.min/f32x4.max/f64x2.min/f64x2.max operations, but rather about adding new f32x4.qmin/f32x4.qmax/f64x2.qmin/f64x2.qmax instructions alongside the existing ones. Of course, it would be helpful to have equivalents of the Quasi-Minimum/Maximum operations as scalar instructions, but I'm afraid that falls outside the scope of the SIMD specification.

@penzn
Contributor

penzn commented Oct 31, 2019

@Maratyszcza you are right, fmin/fmax is different from std::min/std::max. My bad!

Looks like std min/max gets lowered into a FP comparison followed by select. It does get vectorized and its vectorized form does not involve Wasm SIMD min/max operations.

$ cat max.cc
#include <algorithm>

float get_max(float a, float b) {
  return std::max<float>(a, b);
}

void get_many(float * a1, float * a2, float * res, unsigned sz) {
  for (unsigned i = 0; i < sz; ++i) {
    *res = std::max<float>(*a1, *a2);
    ++a1;
    ++a2;
    ++res;
  }
}
$ emcc -msimd128 -O3 -S max.cc -o -

This shows std::max getting vectorized into f32x4.lt followed by v128.bitselect. The LLVM IR likewise shows a vector compare followed by a vector select.

@Maratyszcza
Contributor Author

@penzn In lieu of f32x4.qmin, f32x4.lt followed by v128.bitselect would be lowered into 4 instructions (1 for f32x4.lt and 3 for v128.bitselect).

@sunfishcode
Member

It's not important to me what "quasi" means here, as long as it means something consistent within wasm. "different and nondeterministic rounding" and "different interpretation of NaN and -0" to me are different meanings.

Fwiw, minNum and maxNum were removed in the recently-published IEEE 754-2019. IEEE 754 now defines:

  • minimum and maximum, which wasm's min and max correspond to
  • minimumNumber and maximumNumber, which are similar to the old minNum and maxNum, but
    • correct a mistake in the handling of signalling NaN
    • make the handling of -0 deterministic (in the same way wasm does, by interpreting 0 to be "greater" than -0)

x86 is really the only popular platform that can't implement wasm's min and max in a single instruction today. And now that these are standardized in IEEE 754, not to mention JavaScript and other popular languages, it's entirely possible that future x86 extensions will add them. While this would take a while even if true, waiting for it is a plausible strategy, if we consider WebAssembly to be around for a long time.

The C/C++ situation is unfortunate, although on one hand, now that IEEE 754 has minimumNumber and maximumNumber, wasm could probably add operators corresponding to those without too much trouble, in which case C's fmin and fmax could compile to those. And on the other, even with a<b?a:b or std::min, with compiler flags, users can override strict NaN and -0 semantics and recover the optimizations.

(Note: I don't have a strong opinion either way at this point; I want to raise these topics for discussion.)

@Maratyszcza Maratyszcza changed the title from "Quasi-Minimum and Quasi-Maximum instructions" to "Pseudo-Minimum and Pseudo-Maximum instructions" on Nov 11, 2019
@Maratyszcza
Contributor Author

@sunfishcode, @tlively: renamed instructions to Pseudo-Minimum and Pseudo-Maximum to avoid confusion with Quasi-FMA

@dtig
Member

dtig commented Nov 18, 2019

@dtig I agree that having an inconsistency between scalar and SIMD operations would be concerning. However, my proposal is not about removing or modifying the existing f32x4.min/f32x4.max/f64x2.min/f64x2.max operations, but rather about adding new f32x4.qmin/f32x4.qmax/f64x2.qmin/f64x2.qmax instructions alongside the existing ones. Of course, it would be helpful to have equivalents of the Quasi-Minimum/Maximum operations as scalar instructions, but I'm afraid that falls outside the scope of the SIMD specification.

My concern with the proposed operations is not limited to just the existing operations - i.e. adding new pmin/pmax operations still means that there isn't a good way to emulate the new operations using MVP operations. I agree that adding scalar versions of these operations falls outside the scope of this proposal, but without them any scalar fallbacks have added complexity, or will be inaccurate.

@zeux
Contributor

zeux commented Jan 12, 2020

Just a note: I hit this when trying to convert some code to WASM SIMD. I expected max(0.f, v) to result in efficient codegen, but instead it was really inefficient, and noticeably slower than using compare + and (and(v, ge(v, 0.f))).

@Maratyszcza
Contributor Author

@dtig While these SIMD instructions don't have a direct equivalent in WAsm MVP, the scalar operation can be simulated with just two MVP instructions -- f32.lt + f32.select.

@ngzhian
Member

ngzhian commented May 7, 2020

Note: I think the pmax lowering is incorrect:

y = f32x4.pmax(a, b) is lowered like v128.bitselect(a, b, f32x4.lt(b, a))

say b == a; f32x4.lt(b, a) would then be 0, which would select b. But it should select a, since pmax always returns the first input (like std::max) when the inputs are equal.

The lowering should be:

y = f32x4.pmax(a, b) is lowered like v128.bitselect(a, b, f32x4.le(b, a))
or
y = f32x4.pmax(a, b) is lowered like v128.bitselect(b, a, f32x4.gt(b, a))

Same for f64x2. @Maratyszcza please take a look, thanks!

@Maratyszcza
Contributor Author

Maratyszcza commented May 7, 2020

@ngzhian You're right. std::max(a, b) := (a < b) ? b : a, and thus y = f32x4.pmax(a, b) is lowered like v128.bitselect(b, a, f32x4.lt(a, b)). Updated the PR description.

@ngzhian
Member

ngzhian commented May 7, 2020

Thanks! I have some more feedback, on the lowering for ARM.
F64x2Lt is not efficient on ARM at all. The current implementation compares lane by lane and uses a few conditional moves (see https://source.chromium.org/chromium/chromium/src/+/master:v8/src/compiler/backend/arm/code-generator-arm.cc;l=1960;drc=3795f5bbfcf5f8c3f6f740b08a513e09ca818697).
I think pmin and pmax will do something similar, and perhaps don't need the bitselect.

But the lowering suggested here does hide a bit of the slowness on ARM.

(Also, if anyone has ideas on improving the f64x2.lt implementation, please let me know, thanks!)

@Maratyszcza
Contributor Author

The f64 versions are doomed to be slow on 32-bit ARM due to its lack of double-precision SIMD capability. However, the same applies to the standard f64x2.min/f64x2.max ops, so I don't think we're making it worse than it already is.

pull bot pushed a commit to p-g-krish/v8 that referenced this pull request May 8, 2020
This patch implements f32x4.pmin, f32x4.pmax, f64x2.pmin, and f64x2.pmax
for x64 and interpreter.

Pseudo-min and Pseudo-max instructions were proposed in
WebAssembly/simd#122. These instructions
exactly match std::min and std::max in C++ STL, and thus have different
semantics from the existing min and max.

The instruction-selector for x64 switches the operands around, because
it allows for defining the dst to be same as first (really the second
input node), allowing better codegen.

For example, b = f32x4.pmin(a, b) directly maps to vminps(b, b, a) or
minps(b, a), as long as we can define dst == b, and switching the
instruction operands around allows us to do that.

Bug: v8:10501
Change-Id: I06f983fc1764caf673e600ac91d9c0ac5166e17e
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2186630
Commit-Queue: Zhi An Ng <[email protected]>
Reviewed-by: Tobias Tebbi <[email protected]>
Reviewed-by: Deepti Gandluri <[email protected]>
Cr-Commit-Position: refs/heads/master@{#67688}
tlively added a commit to tlively/binaryen that referenced this pull request May 12, 2020
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2e7ddb00c8f9240e148cf5843b50a7ba7b913351
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
@ngzhian
Member

ngzhian commented Sep 11, 2020

This has been accepted into the proposal [0] during the sync on 2020-09-04. I have an outstanding PR [1] to renumber the opcodes based on what's currently reserved in NewOpcodes.md. There also seems to be a merge conflict to be resolved; I suggest a rebase, and then we can merge this in.

Also removed the "pending prototype data", since we have it in #122 (comment) and #122 (comment).

[0] https://docs.google.com/document/d/138cF6aOUa9RZC2tOR7AhlIQWdmX5EtpzXRTVDAN3bfo/edit# see "3. Pseudo min/max"
[1] Maratyszcza#1

@Maratyszcza
Contributor Author

@ngzhian Rebased and updated opcodes as per your PR

@ngzhian ngzhian merged commit e1ff82e into WebAssembly:master Sep 11, 2020
pull bot pushed a commit to Alan-love/v8 that referenced this pull request Sep 11, 2020
F32x4 and F64x2 pmin and pmax were accepted into the proposal [0], this
removes all the ifdefs and todo guarding the prototypes, and moves these
instructions out of the post-mvp flag.

[0] WebAssembly/simd#122

Bug: v8:10904
Change-Id: I4e0c2f29ddc5d7fc19a209cd02b3d369617574a0
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2405802
Reviewed-by: Bill Budge <[email protected]>
Commit-Queue: Zhi An Ng <[email protected]>
Cr-Commit-Position: refs/heads/master@{#69855}
pull bot pushed a commit to Alan-love/v8 that referenced this pull request Sep 14, 2020
Port 3ba4431

Original Commit Message:

    F32x4 and F64x2 pmin and pmax were accepted into the proposal [0], this
    removes all the ifdefs and todo guarding the prototypes, and moves these
    instructions out of the post-mvp flag.

    [0] WebAssembly/simd#122

[email protected], [email protected], [email protected], [email protected]
BUG=
LOG=N

Change-Id: I8b2ae60240f769e1f4c0b00e98d53846519b305e
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2410806
Reviewed-by: Junliang Yan <[email protected]>
Reviewed-by: Milad Farazmand <[email protected]>
Commit-Queue: Milad Farazmand <[email protected]>
Cr-Commit-Position: refs/heads/master@{#69893}
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Oct 14, 2020
…ed status. r=jseward

Background: WebAssembly/simd#122

For all the pseudo-min/max SIMD instructions:

- remove the internal 'Experimental' opcode suffix in the C++ code
- remove the guard on experimental Wasm instructions in all the C++ decoders
- move the test cases from simd/experimental.js to simd/ad-hack.js

I have checked that current V8 and wasm-tools use the same opcode
mappings.  V8 in turn guarantees the correct mapping for LLVM and
binaryen.

Differential Revision: https://phabricator.services.mozilla.com/D92928
jamienicol pushed a commit to jamienicol/gecko that referenced this pull request Oct 15, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 23, 2020
…structions

This patch implements, for aarch64, the following wasm SIMD extensions

  Floating-point rounding instructions
  WebAssembly/simd#232

  Pseudo-Minimum and Pseudo-Maximum instructions
  WebAssembly/simd#122

The changes are straightforward:

* `build.rs`: the relevant tests have been enabled

* `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions
  `fmin_pseudo` and `fmax_pseudo`.  The wasm rounding instructions do not need
  any new CLIF instructions.

* `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is
  pretty much the same as any other unary or binary vector instruction (for
  the rounding and the pmin/max respectively)

* `cranelift/codegen/src/isa/aarch64/lower_inst.rs`:
  - `fmin_pseudo` and `fmax_pseudo` are converted into a two instruction
    sequence, `fcmpgt` followed by `bsl`
  - the CLIF rounding instructions are converted to a suitable vector
    `frint{n,z,p,m}` instruction.

* `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub
  enum VecMisc2` to handle the rounding operations.  And corresponding `emit`
  cases.
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 23, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 23, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 24, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 26, 2020
julian-seward1 added a commit to bytecodealliance/wasmtime that referenced this pull request Oct 26, 2020
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
cfallin pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 30, 2020
7 participants