Commit
[mono] LLVM 11: Explicitly zero the unused bits of the result register for AddPairwiseScalar (#53694)

LLVM 11 and above optimize

    %9 = extractelement <2 x float> %arm64_ld1, i32 0
    %10 = extractelement <2 x float> %arm64_ld1, i32 1
    %arm64_faddp_scalar = fadd float %9, %10
    %11 = insertelement <2 x float> undef, float %arm64_faddp_scalar, i32 0

(which is translated to scalar `faddp`) into

    %shift = shufflevector <2 x float> %arm64_ld1, <2 x float> undef, <2 x i32> <i32 1, i32 undef>
    %10 = fadd <2 x float> %arm64_ld1, %shift
    %11 = shufflevector <2 x float> %10, <2 x float> undef, <2 x i32> <i32 0, i32 undef>

(which is translated to a sequence of `dup` and vector `fadd`).

This change works around this by explicitly zeroing the unused bits of the results of `AddPairwiseScalar`; the generated code is noisier, but the semantics are correct.

The "Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile" version G.a calls out the zero-extending semantics of scalar operations that use SIMD registers (see "aarch64/functions/registers/V") but judging by the generated code it doesn't look like LLVM exploits this for optimization.

This also affects `vpadds_f32` in Clang.
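To make the difference between the two lowerings concrete, here is a minimal Python model of a two-lane float register. It is a sketch, not the Mono or LLVM implementation: all function names are hypothetical, and it assumes that `dup` broadcasts lane 1 into both lanes, so the unused lane of the vector `fadd` result holds a leftover value instead of zero.

```python
# Toy model of a 2-lane SIMD register as a Python list [lane0, lane1].
# All names here are hypothetical; nothing below is Mono or LLVM code.

def faddp_scalar(v):
    """Scalar pairwise add: lane 0 gets v[0] + v[1], the unused lane is zero."""
    return [v[0] + v[1], 0.0]

def dup_then_vector_fadd(v):
    """Assumed shape of the optimized lowering: broadcast lane 1, add lane-wise."""
    shift = [v[1], v[1]]  # assumption: `dup` copies lane 1 into both lanes
    return [v[0] + shift[0], v[1] + shift[1]]  # lane 1 is now a leftover value

def zero_unused_lanes(r):
    """The workaround: keep lane 0 and explicitly zero the unused lane."""
    return [r[0], 0.0]

v = [1.5, 2.25]
print(faddp_scalar(v))                             # lane 0 is the sum, lane 1 is zero
print(dup_then_vector_fadd(v))                     # same lane 0, nonzero lane 1
print(zero_unused_lanes(dup_then_vector_fadd(v)))  # matches faddp_scalar again
```

Both lowerings agree on lane 0, which is all the IR promises because the other lane is `undef`. Code that relies on the unused lane being zero only behaves the same after the explicit zeroing step.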