
Inconsistency in behavior of MIR and SIMD shifts #91237

Open
RalfJung opened this issue Nov 25, 2021 · 6 comments
Labels
  • A-cranelift: Things relevant to the [future] cranelift backend
  • A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
  • A-MIR: Area: Mid-level IR (MIR) - https://blog.rust-lang.org/2016/04/19/MIR.html
  • A-SIMD: Area: SIMD (Single Instruction Multiple Data)
  • C-discussion: Category: Discussion or questions that doesn't represent real issues.
  • T-compiler: Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@RalfJung
Member

The MIR primitive binops for shifting are well-defined even when the shift offset is too large for the left operand (i.e., at least as large as its bit width): the shift offset is truncated to that bit width before the shift is performed. (MIR building relies on this; with overflow checks disabled, it is entirely safe to produce MIR shift binops without any guards.)
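For concreteness, here is a minimal scalar sketch of that truncating behavior (the function name is illustrative, not the actual MIR lowering). For Rust's power-of-two integer widths, masking the offset with `BITS - 1` is the same as reducing it modulo the bit width, which is also how `wrapping_shl` is defined:

```rust
/// Illustrative only: the effect of the MIR `Shl` binop on a `u32`,
/// assuming the shift offset is truncated (masked) to the operand width.
fn mir_style_shl(lhs: u32, rhs: u32) -> u32 {
    let masked = rhs & (u32::BITS - 1); // truncate the offset: 40 becomes 8
    lhs << masked
}

fn main() {
    assert_eq!(mir_style_shl(1, 40), 1u32.wrapping_shl(40)); // both yield 1 << 8
    assert_eq!(mir_style_shl(1, 40), 256);
}
```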

On the other hand, the SIMD shift intrinsics are currently UB when the shift offset is too large for the element type of the left operand. (This follows from the fact that they lower to LLVM operations that yield poison in that case.)

This is inconsistent. IMO it would be a good idea to make these two primitive shift operations in our language consistent. One is exposed as a binop and one as an intrinsic, but that does not fundamentally make one less primitive than the other. If they are consistent, this reduces possible sources of confusion for backend developers. It also makes implementing the SIMD intrinsics for CTFE/Miri a lot easier, since it can just call the MIR binop in a loop.
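As a rough illustration of the "call the MIR binop in a loop" idea, here is a sketch over plain arrays rather than actual interpreter values (`simd_shl_lanes` is an illustrative name, not Miri code); each lane simply reuses the scalar truncating shift, stood in for here by `wrapping_shl`:

```rust
/// Illustrative model of evaluating `simd_shl` lane-by-lane by reusing the
/// scalar shift semantics. The real CTFE/Miri implementation would operate
/// on interpreter values rather than plain arrays.
fn simd_shl_lanes<const N: usize>(lhs: [u32; N], rhs: [u32; N]) -> [u32; N] {
    let mut out = [0u32; N];
    for i in 0..N {
        // Each lane gets exactly the scalar MIR shift behavior,
        // so an out-of-range offset truncates instead of being UB.
        out[i] = lhs[i].wrapping_shl(rhs[i]);
    }
    out
}

fn main() {
    assert_eq!(simd_shl_lanes([1, 2, 3, 4], [1, 40, 0, 33]), [2, 512, 3, 8]);
}
```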

Also see the prior discussion at rust-lang/miri#1920 (comment).

@RalfJung
Member Author

The consistent behavior should probably be that either both of them truncate, or both of them are UB when the offset is too big. I will try to summarize what I understood to be the arguments in favor of either solution.

Arguments in favor of UB:

  • This matches the LLVM backend, so each MIR binop would directly correspond to an LLVM operation.
  • Defining less behavior in principle means more backends will be directly compatible. The surrounding MIR can then implement whatever behavior it desires for the overflowing case (truncation, panic, or unsafely exposing the UB).

Arguments in favor of truncation:

  • Matches the cranelift backend (if I understand correctly), so by building the truncation into the operation, we avoid double-truncating with this backend.
  • The cost of the additional bitmasking to implement the truncation on LLVM is believed to be insignificant.

@workingjubilee
Member

workingjubilee commented Nov 25, 2021

My current hypothesis regarding the performance of adopting cranelift::simd_shl(V<T, N>, V<T, N>) -> V<T, N> as our rust::simd_shl(V<T, N>, V<T, N>) -> V<T, N>, which would, for LLVM, lower to llvm::simd_shl(lhs, llvm::simd_and(rhs, lhs::T::BITS - 1)), is as follows:

Assuming that LLVM simply generates the appropriate machine::simd_and(V, V) -> V and machine::simd_shl(V, V) -> V, then in a tight loop this may slow the operation to what we can probably squint at and round off to "half as fast", since it must execute both instructions. I am assuming the cost of generating the bitmask vector has either been paid previously or is amortized over the loop.

This depends on the CPU, of course.

  • For some vector ISAs, this should be unpenalized, as it matches the definition of machine::simd_shl. This is the case for the natural lowering for the PowerPC, RISC-V, and WASM vector ISAs. It is reasonable to expect an aggressively optimizing compiler such as LLVM to recognize this opportunity and drop the llvm::simd_and outright.
  • For other vector ISAs, like those of x86 and Arm, an optimizing compiler such as LLVM could use the presence of the llvm::simd_and, if the inputs are const (and one of them would be!), to recognize that the result will fit in an appropriate immediate encoding, and thus fold this into a single operation. So, zero penalty at runtime in that case as well, aside from some compile time.
  • For older x86 CPUs, {V}PAND (x86::simd_and in this narrative) and {V}PSLLW (x86::simd_shl) have similar throughput, but consulting Agner Fog's performance tables, for more recent CPUs the throughput of instructions like {V}PAND is better by a significant ratio, so the overall cost is dominated by {V}PSLLW, making the {V}PAND "almost free" by comparison.
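To make the hypothesized lowering above concrete, here is a scalar-per-lane model in plain Rust (the array form and function name are illustrative; a real backend would emit the corresponding vector `and` and `shl` instructions, or fold the mask away):

```rust
/// Illustrative model of the two-instruction lowering:
/// `simd_shl(lhs, simd_and(rhs, splat(T::BITS - 1)))` for 32-bit lanes.
fn lowered_simd_shl<const N: usize>(lhs: [u32; N], rhs: [u32; N]) -> [u32; N] {
    // The mask is a constant splat; in a loop its cost is paid once
    // (or folded away entirely when `rhs` is a known in-range constant).
    let mask = u32::BITS - 1;
    let mut out = [0u32; N];
    for i in 0..N {
        out[i] = lhs[i] << (rhs[i] & mask); // vector `and` followed by vector `shl`
    }
    out
}

fn main() {
    // With a known in-range rhs, the `and` is a no-op the optimizer can drop.
    assert_eq!(lowered_simd_shl([1u32; 4], [3; 4]), [8; 4]);
}
```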

A side note: apparently Zig exposes a shl_exact that emits the shl with LLVM's nuw and nsw hints.

I would have to recheck everything with respect to simd_ashr and simd_lshr (we encode both as just simd_shr in Rust, with the signedness of the element type selecting between them) to attain 100% confidence, but I believe the logic generalizes.

@camelid camelid added the A-MIR, C-discussion, and T-compiler labels Nov 26, 2021
@workingjubilee workingjubilee added the A-SIMD, A-cranelift, and A-LLVM labels Nov 26, 2021
workingjubilee added a commit to rust-lang/portable-simd that referenced this issue Dec 9, 2021
In all other cases, we implement a wrapping logic where applicable.
This is another case it applies, and per rust-lang/rust#91237,
we may wish to specify this as the natural behavior of simd_{shl,shr}.
workingjubilee added a commit to rust-lang/portable-simd that referenced this issue Dec 9, 2021
For all other operators, we use wrapping logic where applicable.
This is another case it applies. Per rust-lang/rust#91237, we may
wish to specify this as the natural behavior of `simd_{shl,shr}`.
@workingjubilee
Member

workingjubilee commented Jan 21, 2022

On the portable-simd project group (PG-PSIMD) end, we did in fact go the route of wrapping shifts in our implementation, so if that behavior should be moved deeper into the compiler (so that our impl turns into a bare call to the intrinsic), that's fine from our POV. We may still want access to a simd_shr_exact or simd_shr_unchecked or similar, but that would be a new development as far as we are concerned.
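A hedged sketch of what "wrapping shifts in our implementation" might look like at the library level, and of what would become redundant if the intrinsic itself truncated (this is not the actual portable-simd source; `raw_simd_shl` only stands in for the compiler's `simd_shl` intrinsic):

```rust
/// Sketch: mask the shift amounts at the library level so the raw
/// intrinsic never sees an out-of-range offset.
fn wrapping_lane_shl<const N: usize>(lhs: [u32; N], mut rhs: [u32; N]) -> [u32; N] {
    for lane in rhs.iter_mut() {
        *lane &= u32::BITS - 1; // wrap each offset into range first
    }
    // If the intrinsic itself truncated, this wrapper could become a bare call.
    raw_simd_shl(lhs, rhs)
}

/// Stand-in for the intrinsic: lane-wise shift with in-range offsets.
fn raw_simd_shl<const N: usize>(lhs: [u32; N], rhs: [u32; N]) -> [u32; N] {
    let mut out = [0u32; N];
    for i in 0..N {
        out[i] = lhs[i] << rhs[i];
    }
    out
}

fn main() {
    assert_eq!(wrapping_lane_shl([1, 1, 1, 1], [0, 1, 32, 40]), [1, 2, 1, 256]);
}
```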

@RalfJung
Member Author

RalfJung commented Feb 4, 2022

Sounds great. :D So if someone with the required LLVM knowledge could implement the codegen changes for the simd_shr/shl intrinsics, it looks like all involved parties would approve of that change.

@workingjubilee
Member

Hm, if we did it in our own LLVM lowering, we could also use a const-known, in-bounds shift parameter to simply emit shl, ashr, and lshr instead.
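A hedged pseudocode sketch of that fast path (the function and parameter names are hypothetical, not rustc_codegen_llvm APIs): if the shift amounts are compile-time constants that are all in range, the masking `and` is provably a no-op and the bare shift can be emitted.

```rust
// Hypothetical decision logic, not actual rustc_codegen_llvm code.
// `rhs_const_lanes` would hold the lanes of a constant shift vector, if any.
fn needs_mask(rhs_const_lanes: Option<&[u64]>, lane_bits: u64) -> bool {
    match rhs_const_lanes {
        // Every lane is a known, in-range constant: emit the bare shl/ashr/lshr.
        Some(lanes) => !lanes.iter().all(|&lane| lane < lane_bits),
        // Shift amounts unknown at compile time: keep the mask to stay defined.
        None => true,
    }
}

fn main() {
    assert!(!needs_mask(Some(&[1, 2, 3, 4]), 32)); // constant, in range: no mask
    assert!(needs_mask(Some(&[1, 40, 3, 4]), 32)); // constant, out of range: keep mask
    assert!(needs_mask(None, 32));                 // unknown: keep mask
}
```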
