
Inconsistency in behavior of MIR and SIMD shifts #91237

Open
RalfJung opened this issue Nov 25, 2021 · 6 comments
Labels
  • A-cranelift: Things relevant to the [future] cranelift backend
  • A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
  • A-MIR: Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html
  • A-SIMD: Area: SIMD (Single Instruction Multiple Data)
  • C-discussion: Category: Discussion or questions that doesn't represent real issues.
  • T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@RalfJung
Member

The MIR primitive binops for shifting are well-defined even when the shift offset is too large for the left operand (i.e., at least as large as its bit width): the shift offset is truncated to that bit width before the shift is performed. (MIR building relies on this; with overflow checks disabled, it is entirely safe to produce MIR shift binops without any guards.)
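For concreteness, here is a minimal scalar sketch of that truncating behavior (the function name is illustrative, not the actual MIR lowering). For Rust's power-of-two integer widths, masking the offset with `BITS - 1` is the same as reducing it modulo the bit width, which is also how `wrapping_shl` is defined:

```rust
/// Illustrative only: the effect of the MIR `Shl` binop on a `u32`,
/// assuming the shift offset is truncated (masked) to the operand width.
fn mir_style_shl(lhs: u32, rhs: u32) -> u32 {
    let masked = rhs & (u32::BITS - 1); // truncate the offset: 40 becomes 8
    lhs << masked
}

fn main() {
    assert_eq!(mir_style_shl(1, 40), 1u32.wrapping_shl(40)); // both yield 1 << 8
    assert_eq!(mir_style_shl(1, 40), 256);
}
```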

On the other hand, the SIMD shift intrinsics are currently UB when the shift offset is too large for the element type of the left operand. (This follows from the fact that they lower to LLVM operations that yield poison in that case.)

This is inconsistent. IMO it would be a good idea to make these two primitive shift operations in our language consistent. One is exposed as a binop and one as an intrinsic, but that does not fundamentally make one less primitive than the other. If they are consistent, this reduces possible sources of confusion for backend developers. It also makes implementing the SIMD intrinsics for CTFE/Miri a lot easier, since it can just call the MIR binop in a loop.
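As a rough illustration of the "call the MIR binop in a loop" idea, here is a sketch over plain arrays rather than actual interpreter values (`simd_shl_lanes` is an illustrative name, not Miri code); each lane simply reuses the scalar truncating shift, stood in for here by `wrapping_shl`:

```rust
/// Illustrative model of evaluating `simd_shl` lane-by-lane by reusing the
/// scalar shift semantics. The real CTFE/Miri implementation would operate
/// on interpreter values rather than plain arrays.
fn simd_shl_lanes<const N: usize>(lhs: [u32; N], rhs: [u32; N]) -> [u32; N] {
    let mut out = [0u32; N];
    for i in 0..N {
        // Each lane gets exactly the scalar MIR shift behavior,
        // so an out-of-range offset truncates instead of being UB.
        out[i] = lhs[i].wrapping_shl(rhs[i]);
    }
    out
}

fn main() {
    assert_eq!(simd_shl_lanes([1, 2, 3, 4], [1, 40, 0, 33]), [2, 512, 3, 8]);
}
```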

Also see the prior discussion at rust-lang/miri#1920 (comment).

@RalfJung
Member Author

The consistent behavior should probably be that either both of them truncate, or both of them are UB when the offset is too big. I will try to summarize what I understood to be the arguments in favor of either solution.

Arguments in favor of UB:

  • This matches the LLVM backend, so each MIR binop would directly correspond to an LLVM operation.
  • Defining less behavior in principle means more backends will be directly compatible. The surrounding MIR can then implement whatever behavior it desires for the overflowing case (truncation, panic, or unsafely exposing the UB).

Arguments in favor of truncation:

  • Matches the cranelift backend (if I understand correctly), so by building the truncation into the operation, we avoid double-truncating with this backend.
  • The cost of the additional bitmasking to implement the truncation on LLVM is believed to be insignificant.

@workingjubilee
Member

workingjubilee commented Nov 25, 2021

My current hypothesis regarding the performance of adopting cranelift::simd_shl(V<T, N>, V<T, N>) -> V<T, N> as our rust::simd_shl(V<T, N>, V<T, N>) -> V<T, N>, which would, for LLVM, lower to llvm::simd_shl(lhs, llvm::simd_and(rhs, lhs::T::BITS - 1)), is as follows:

Assuming that LLVM simply generates the appropriate machine::simd_and(V, V) -> V and machine::simd_shl(V, V) -> V, then in a tight loop this may slow the operation to what we can probably squint at and round off to "half as fast", since it must execute both instructions. I am assuming the cost of generating the bitmask vector has either been paid previously or is amortized over the loop.

This depends on the CPU, of course.

  • For some vector ISAs, this should be unpenalized, as it matches the definition of machine::simd_shl. This is the case for the natural lowering for the PowerPC, RISC-V, and WASM vector ISAs. It is reasonable to expect an aggressively optimizing compiler such as LLVM to recognize this opportunity and drop the llvm::simd_and outright.
  • For other vector ISAs, like those of x86 and Arm, an optimizing compiler such as LLVM could use the presence of the llvm::simd_and, if the inputs are const (and one of them would be!), to recognize that the result will fit in an appropriate immediate encoding, and thus fold this into a single operation. So, zero penalty at runtime in that case as well, aside from some compile time.
  • For older x86 CPUs, {V}PAND (x86::simd_and in this narrative) and {V}PSLLW (x86::simd_shl) have similar throughput, but consulting Agner Fog's performance tables, for more recent CPUs the throughput of instructions like {V}PAND is better by a significant ratio, so the overall cost is dominated by {V}PSLLW, making the {V}PAND "almost free" by comparison.
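To make the hypothesized lowering above concrete, here is a scalar-per-lane model in plain Rust (the array form and function name are illustrative; a real backend would emit the corresponding vector `and` and `shl` instructions, or fold the mask away):

```rust
/// Illustrative model of the two-instruction lowering:
/// `simd_shl(lhs, simd_and(rhs, splat(T::BITS - 1)))` for 32-bit lanes.
fn lowered_simd_shl<const N: usize>(lhs: [u32; N], rhs: [u32; N]) -> [u32; N] {
    // The mask is a constant splat; in a loop its cost is paid once
    // (or folded away entirely when `rhs` is a known in-range constant).
    let mask = u32::BITS - 1;
    let mut out = [0u32; N];
    for i in 0..N {
        out[i] = lhs[i] << (rhs[i] & mask); // vector `and` followed by vector `shl`
    }
    out
}

fn main() {
    // With a known in-range rhs, the `and` is a no-op the optimizer can drop.
    assert_eq!(lowered_simd_shl([1u32; 4], [3; 4]), [8; 4]);
}
```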

A side note: apparently Zig exposes a shl_exact that emits the shl with LLVM's nuw and nsw hints.

I would have to recheck everything with respect to simd_ashr and simd_lshr (we encode both as just simd_shr in Rust, with the signedness of the element type selecting between them) to attain 100% confidence, but I believe the logic generalizes.

@camelid camelid added the A-MIR, C-discussion, and T-compiler labels Nov 26, 2021
@workingjubilee workingjubilee added the A-SIMD, A-cranelift, and A-LLVM labels Nov 26, 2021
workingjubilee added a commit to rust-lang/portable-simd that referenced this issue Dec 9, 2021
In all other cases, we implement a wrapping logic where applicable.
This is another case it applies, and per rust-lang/rust#91237,
we may wish to specify this as the natural behavior of simd_{shl,shr}.
workingjubilee added a commit to rust-lang/portable-simd that referenced this issue Dec 9, 2021
For all other operators, we use wrapping logic where applicable.
This is another case it applies. Per rust-lang/rust#91237, we may
wish to specify this as the natural behavior of `simd_{shl,shr}`.
@workingjubilee
Member

workingjubilee commented Jan 21, 2022

On the portable-simd project group (PG-PSIMD) end, we did in fact go the route of wrapping shifts in our implementation, so if that behavior should be moved deeper into the compiler (so that our impl turns into a bare call to the intrinsic), that's fine from our POV. We may still want access to a simd_shr_exact or simd_shr_unchecked or similar, but that would be a new development as far as we are concerned.
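A hedged sketch of what "wrapping shifts in our implementation" might look like at the library level, and of what would become redundant if the intrinsic itself truncated (this is not the actual portable-simd source; `raw_simd_shl` only stands in for the compiler's `simd_shl` intrinsic):

```rust
/// Sketch: mask the shift amounts at the library level so the raw
/// intrinsic never sees an out-of-range offset.
fn wrapping_lane_shl<const N: usize>(lhs: [u32; N], mut rhs: [u32; N]) -> [u32; N] {
    for lane in rhs.iter_mut() {
        *lane &= u32::BITS - 1; // wrap each offset into range first
    }
    // If the intrinsic itself truncated, this wrapper could become a bare call.
    raw_simd_shl(lhs, rhs)
}

/// Stand-in for the intrinsic: lane-wise shift with in-range offsets.
fn raw_simd_shl<const N: usize>(lhs: [u32; N], rhs: [u32; N]) -> [u32; N] {
    let mut out = [0u32; N];
    for i in 0..N {
        out[i] = lhs[i] << rhs[i];
    }
    out
}

fn main() {
    assert_eq!(wrapping_lane_shl([1, 1, 1, 1], [0, 1, 32, 40]), [1, 2, 1, 256]);
}
```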

@RalfJung
Member Author

RalfJung commented Feb 4, 2022

Sounds great. :D So if someone with the required LLVM knowledge could implement the codegen changes for the simd_shr/shl intrinsics, it looks like all involved parties would approve of that change.

@workingjubilee
Member

Hm, if we did it in our own LLVM lowering, we could also use a const-known, in-bounds shift parameter to simply emit shl, ashr, and lshr instead.
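A hedged pseudocode sketch of that fast path (the function and parameter names are hypothetical, not rustc_codegen_llvm APIs): if the shift amounts are compile-time constants that are all in range, the masking `and` is provably a no-op and the bare shift can be emitted.

```rust
// Hypothetical decision logic, not actual rustc_codegen_llvm code.
// `rhs_const_lanes` would hold the lanes of a constant shift vector, if any.
fn needs_mask(rhs_const_lanes: Option<&[u64]>, lane_bits: u64) -> bool {
    match rhs_const_lanes {
        // Every lane is a known, in-range constant: emit the bare shl/ashr/lshr.
        Some(lanes) => !lanes.iter().all(|&lane| lane < lane_bits),
        // Shift amounts unknown at compile time: keep the mask to stay defined.
        None => true,
    }
}

fn main() {
    assert!(!needs_mask(Some(&[1, 2, 3, 4]), 32)); // constant, in range: no mask
    assert!(needs_mask(Some(&[1, 40, 3, 4]), 32)); // constant, out of range: keep mask
    assert!(needs_mask(None, 32));                 // unknown: keep mask
}
```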
