Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[X86] lowerV4F64Shuffle - prefer lowerShuffleAsDecomposedShuffleMerge if we're blending inplace/splatable shuffle inputs on AVX2 targets #126420

Merged
merged 1 commit into from
Feb 9, 2025

Conversation

RKSimon
Copy link
Collaborator

@RKSimon RKSimon commented Feb 9, 2025

More aggressively use broadcast instructions where possible

Fixes #50315

… if we're blending inplace/splatable shuffle inputs on AVX2 targets

More aggressively use broadcast instructions where possible

Fixes llvm#50315
@llvmbot
Copy link
Member

llvmbot commented Feb 9, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

More aggressively use broadcast instructions where possible

Fixes #50315


Full diff: https://github.com/llvm/llvm-project/pull/126420.diff

5 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+19-1)
  • (modified) llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll (+2-2)
  • (modified) llvm/test/CodeGen/X86/horizontal-sum.ll (+10-10)
  • (modified) llvm/test/CodeGen/X86/matrix-multiply.ll (+41-41)
  • (modified) llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll (+19-19)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 744e4e740cb2102..9a916a663a64c20 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -12689,6 +12689,20 @@ static bool isShuffleMaskInputInPlace(int Input, ArrayRef<int> Mask) {
   return true;
 }
 
+/// Test whether the specified input (0 or 1) is a broadcast/splat blended by
+/// the given mask.
+///
+static bool isShuffleMaskInputBroadcastable(int Input, ArrayRef<int> Mask,
+                                            int BroadcastableElement = 0) {
+  assert((Input == 0 || Input == 1) && "Only two inputs to shuffles.");
+  int Size = Mask.size();
+  for (int i = 0; i < Size; ++i)
+    if (Mask[i] >= 0 && Mask[i] / Size == Input &&
+        Mask[i] % Size != BroadcastableElement)
+      return false;
+  return true;
+}
+
 /// If we are extracting two 128-bit halves of a vector and shuffling the
 /// result, match that to a 256-bit AVX2 vperm* instruction to avoid a
 /// multi-shuffle lowering.
@@ -16190,6 +16204,8 @@ static SDValue lowerV4F64Shuffle(const SDLoc &DL, ArrayRef<int> Mask,
 
   bool V1IsInPlace = isShuffleMaskInputInPlace(0, Mask);
   bool V2IsInPlace = isShuffleMaskInputInPlace(1, Mask);
+  bool V1IsSplat = isShuffleMaskInputBroadcastable(0, Mask);
+  bool V2IsSplat = isShuffleMaskInputBroadcastable(1, Mask);
 
   // If we have lane crossing shuffles AND they don't all come from the lower
   // lane elements, lower to SHUFPD(VPERM2F128(V1, V2), VPERM2F128(V1, V2)).
@@ -16198,7 +16214,9 @@ static SDValue lowerV4F64Shuffle(const SDLoc &DL, ArrayRef<int> Mask,
   if (is128BitLaneCrossingShuffleMask(MVT::v4f64, Mask) &&
       !all_of(Mask, [](int M) { return M < 2 || (4 <= M && M < 6); }) &&
       (V1.getOpcode() != ISD::BUILD_VECTOR) &&
-      (V2.getOpcode() != ISD::BUILD_VECTOR))
+      (V2.getOpcode() != ISD::BUILD_VECTOR) &&
+      (!Subtarget.hasAVX2() ||
+       !((V1IsInPlace || V1IsSplat) && (V2IsInPlace || V2IsSplat))))
     return lowerShuffleAsLanePermuteAndSHUFP(DL, MVT::v4f64, V1, V2, Mask, DAG);
 
   // If we have one input in place, then we can permute the other input and
diff --git a/llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll b/llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll
index 1baaab0931cb9ad..26a88ab15e3cca1 100644
--- a/llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll
+++ b/llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll
@@ -151,8 +151,8 @@ define <4 x double> @vec256_eltty_double_source_subvec_0_target_subvec_mask_2_un
 define <4 x double> @vec256_eltty_double_source_subvec_0_target_subvec_mask_2_binary(<4 x double> %x, <4 x double> %y) nounwind {
 ; CHECK-LABEL: vec256_eltty_double_source_subvec_0_target_subvec_mask_2_binary:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm1
-; CHECK-NEXT:    vshufpd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[2]
+; CHECK-NEXT:    vbroadcastsd %xmm1, %ymm1
+; CHECK-NEXT:    vblendps {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
 ; CHECK-NEXT:    retq
   %r = shufflevector <4 x double> %x, <4 x double> %y, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
   ret <4 x double> %r
diff --git a/llvm/test/CodeGen/X86/horizontal-sum.ll b/llvm/test/CodeGen/X86/horizontal-sum.ll
index 5fe1e2996ee9b08..e2cc3ae0dca0af2 100644
--- a/llvm/test/CodeGen/X86/horizontal-sum.ll
+++ b/llvm/test/CodeGen/X86/horizontal-sum.ll
@@ -256,11 +256,11 @@ define <8 x float> @pair_sum_v8f32_v4f32(<4 x float> %0, <4 x float> %1, <4 x fl
 ; AVX2-SLOW-NEXT:    vaddps %xmm2, %xmm1, %xmm1
 ; AVX2-SLOW-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm0[0],xmm1[0]
 ; AVX2-SLOW-NEXT:    vshufpd {{.*#+}} xmm1 = xmm1[1,0]
-; AVX2-SLOW-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm1
-; AVX2-SLOW-NEXT:    vhaddps %xmm7, %xmm6, %xmm2
-; AVX2-SLOW-NEXT:    vhaddps %xmm2, %xmm2, %xmm2
-; AVX2-SLOW-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX2-SLOW-NEXT:    vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[1],ymm1[2],ymm0[2]
+; AVX2-SLOW-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-SLOW-NEXT:    vhaddps %xmm7, %xmm6, %xmm1
+; AVX2-SLOW-NEXT:    vhaddps %xmm1, %xmm1, %xmm1
+; AVX2-SLOW-NEXT:    vbroadcastsd %xmm1, %ymm1
+; AVX2-SLOW-NEXT:    vblendps {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
 ; AVX2-SLOW-NEXT:    retq
 ;
 ; AVX2-FAST-LABEL: pair_sum_v8f32_v4f32:
@@ -277,11 +277,11 @@ define <8 x float> @pair_sum_v8f32_v4f32(<4 x float> %0, <4 x float> %1, <4 x fl
 ; AVX2-FAST-NEXT:    vaddps %xmm2, %xmm1, %xmm1
 ; AVX2-FAST-NEXT:    vmovlhps {{.*#+}} xmm0 = xmm0[0],xmm1[0]
 ; AVX2-FAST-NEXT:    vshufpd {{.*#+}} xmm1 = xmm1[1,0]
-; AVX2-FAST-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm1
-; AVX2-FAST-NEXT:    vhaddps %xmm7, %xmm6, %xmm2
-; AVX2-FAST-NEXT:    vhaddps %xmm0, %xmm2, %xmm2
-; AVX2-FAST-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
-; AVX2-FAST-NEXT:    vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[1],ymm1[2],ymm0[2]
+; AVX2-FAST-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX2-FAST-NEXT:    vhaddps %xmm7, %xmm6, %xmm1
+; AVX2-FAST-NEXT:    vhaddps %xmm0, %xmm1, %xmm1
+; AVX2-FAST-NEXT:    vbroadcastsd %xmm1, %ymm1
+; AVX2-FAST-NEXT:    vblendps {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
 ; AVX2-FAST-NEXT:    retq
   %9 = shufflevector <4 x float> %0, <4 x float> poison, <2 x i32> <i32 0, i32 2>
   %10 = shufflevector <4 x float> %0, <4 x float> poison, <2 x i32> <i32 1, i32 3>
diff --git a/llvm/test/CodeGen/X86/matrix-multiply.ll b/llvm/test/CodeGen/X86/matrix-multiply.ll
index bdc1ff4c157e4fc..a38ca339cd5e133 100644
--- a/llvm/test/CodeGen/X86/matrix-multiply.ll
+++ b/llvm/test/CodeGen/X86/matrix-multiply.ll
@@ -659,57 +659,57 @@ define <9 x double> @test_mul3x3_f64(<9 x double> %a0, <9 x double> %a1) nounwin
 ; AVX2:       # %bb.0: # %entry
 ; AVX2-NEXT:    movq %rdi, %rax
 ; AVX2-NEXT:    vmovsd {{.*#+}} xmm8 = mem[0],zero
-; AVX2-NEXT:    vunpcklpd {{.*#+}} xmm1 = xmm0[0],xmm1[0]
+; AVX2-NEXT:    vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
 ; AVX2-NEXT:    vmovddup {{.*#+}} xmm9 = mem[0,0]
-; AVX2-NEXT:    vmulpd %xmm1, %xmm9, %xmm0
-; AVX2-NEXT:    vunpcklpd {{.*#+}} xmm3 = xmm3[0],xmm4[0]
-; AVX2-NEXT:    vmovddup {{.*#+}} xmm4 = mem[0,0]
-; AVX2-NEXT:    vmulpd %xmm4, %xmm3, %xmm10
-; AVX2-NEXT:    vaddpd %xmm0, %xmm10, %xmm0
+; AVX2-NEXT:    vmulpd %xmm0, %xmm9, %xmm10
+; AVX2-NEXT:    vunpcklpd {{.*#+}} xmm1 = xmm3[0],xmm4[0]
+; AVX2-NEXT:    vmovddup {{.*#+}} xmm3 = mem[0,0]
+; AVX2-NEXT:    vmulpd %xmm3, %xmm1, %xmm4
+; AVX2-NEXT:    vaddpd %xmm4, %xmm10, %xmm4
 ; AVX2-NEXT:    vunpcklpd {{.*#+}} xmm6 = xmm6[0],xmm7[0]
 ; AVX2-NEXT:    vmovddup {{.*#+}} xmm7 = mem[0,0]
 ; AVX2-NEXT:    vmulpd %xmm7, %xmm6, %xmm10
-; AVX2-NEXT:    vaddpd %xmm0, %xmm10, %xmm0
+; AVX2-NEXT:    vaddpd %xmm4, %xmm10, %xmm4
 ; AVX2-NEXT:    vmulsd %xmm2, %xmm9, %xmm9
-; AVX2-NEXT:    vmulsd %xmm4, %xmm5, %xmm4
-; AVX2-NEXT:    vaddsd %xmm4, %xmm9, %xmm4
+; AVX2-NEXT:    vmulsd %xmm3, %xmm5, %xmm3
+; AVX2-NEXT:    vaddsd %xmm3, %xmm9, %xmm3
 ; AVX2-NEXT:    vmulsd %xmm7, %xmm8, %xmm7
-; AVX2-NEXT:    vaddsd %xmm7, %xmm4, %xmm4
-; AVX2-NEXT:    vinsertf128 $1, %xmm4, %ymm0, %ymm4
-; AVX2-NEXT:    vmovddup {{.*#+}} xmm7 = mem[0,0]
-; AVX2-NEXT:    vmulpd %xmm7, %xmm1, %xmm9
+; AVX2-NEXT:    vaddsd %xmm7, %xmm3, %xmm3
+; AVX2-NEXT:    vinsertf128 $1, %xmm3, %ymm4, %ymm3
+; AVX2-NEXT:    vmovddup {{.*#+}} xmm4 = mem[0,0]
+; AVX2-NEXT:    vmulpd %xmm4, %xmm0, %xmm7
+; AVX2-NEXT:    vmovddup {{.*#+}} xmm9 = mem[0,0]
+; AVX2-NEXT:    vmulpd %xmm1, %xmm9, %xmm10
+; AVX2-NEXT:    vaddpd %xmm7, %xmm10, %xmm7
 ; AVX2-NEXT:    vmovddup {{.*#+}} xmm10 = mem[0,0]
-; AVX2-NEXT:    vmulpd %xmm3, %xmm10, %xmm11
-; AVX2-NEXT:    vaddpd %xmm11, %xmm9, %xmm9
-; AVX2-NEXT:    vmovddup {{.*#+}} xmm11 = mem[0,0]
-; AVX2-NEXT:    vmulpd %xmm6, %xmm11, %xmm12
-; AVX2-NEXT:    vaddpd %xmm12, %xmm9, %xmm9
-; AVX2-NEXT:    vmulsd %xmm7, %xmm2, %xmm7
-; AVX2-NEXT:    vmulsd %xmm5, %xmm10, %xmm10
-; AVX2-NEXT:    vaddsd %xmm7, %xmm10, %xmm7
-; AVX2-NEXT:    vmulsd %xmm11, %xmm8, %xmm10
-; AVX2-NEXT:    vaddsd %xmm7, %xmm10, %xmm7
+; AVX2-NEXT:    vmulpd %xmm6, %xmm10, %xmm11
+; AVX2-NEXT:    vaddpd %xmm7, %xmm11, %xmm7
+; AVX2-NEXT:    vmulsd %xmm4, %xmm2, %xmm4
+; AVX2-NEXT:    vmulsd %xmm5, %xmm9, %xmm9
+; AVX2-NEXT:    vaddsd %xmm4, %xmm9, %xmm4
+; AVX2-NEXT:    vmulsd %xmm10, %xmm8, %xmm9
+; AVX2-NEXT:    vaddsd %xmm4, %xmm9, %xmm4
+; AVX2-NEXT:    vmovddup {{.*#+}} xmm9 = mem[0,0]
+; AVX2-NEXT:    vmulpd %xmm0, %xmm9, %xmm0
 ; AVX2-NEXT:    vmovddup {{.*#+}} xmm10 = mem[0,0]
 ; AVX2-NEXT:    vmulpd %xmm1, %xmm10, %xmm1
-; AVX2-NEXT:    vmovddup {{.*#+}} xmm11 = mem[0,0]
-; AVX2-NEXT:    vmulpd %xmm3, %xmm11, %xmm3
-; AVX2-NEXT:    vaddpd %xmm3, %xmm1, %xmm1
-; AVX2-NEXT:    vmovddup {{.*#+}} xmm3 = mem[0,0]
-; AVX2-NEXT:    vmulpd %xmm3, %xmm6, %xmm6
-; AVX2-NEXT:    vaddpd %xmm6, %xmm1, %xmm1
-; AVX2-NEXT:    vmulsd %xmm2, %xmm10, %xmm2
-; AVX2-NEXT:    vmulsd %xmm5, %xmm11, %xmm5
+; AVX2-NEXT:    vaddpd %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    vmovddup {{.*#+}} xmm1 = mem[0,0]
+; AVX2-NEXT:    vmulpd %xmm1, %xmm6, %xmm6
+; AVX2-NEXT:    vaddpd %xmm6, %xmm0, %xmm0
+; AVX2-NEXT:    vmulsd %xmm2, %xmm9, %xmm2
+; AVX2-NEXT:    vmulsd %xmm5, %xmm10, %xmm5
 ; AVX2-NEXT:    vaddsd %xmm5, %xmm2, %xmm2
-; AVX2-NEXT:    vmulsd %xmm3, %xmm8, %xmm3
-; AVX2-NEXT:    vaddsd %xmm3, %xmm2, %xmm2
-; AVX2-NEXT:    vinsertf128 $1, %xmm9, %ymm0, %ymm0
-; AVX2-NEXT:    vshufpd {{.*#+}} ymm0 = ymm4[0],ymm0[1],ymm4[2],ymm0[2]
-; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm7, %ymm3
-; AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm9, %ymm1
-; AVX2-NEXT:    vshufpd {{.*#+}} ymm1 = ymm1[1],ymm3[0],ymm1[2],ymm3[3]
-; AVX2-NEXT:    vmovsd %xmm2, 64(%rdi)
-; AVX2-NEXT:    vmovapd %ymm1, 32(%rdi)
-; AVX2-NEXT:    vmovapd %ymm0, (%rdi)
+; AVX2-NEXT:    vmulsd %xmm1, %xmm8, %xmm1
+; AVX2-NEXT:    vaddsd %xmm1, %xmm2, %xmm1
+; AVX2-NEXT:    vbroadcastsd %xmm7, %ymm2
+; AVX2-NEXT:    vblendpd {{.*#+}} ymm2 = ymm3[0,1,2],ymm2[3]
+; AVX2-NEXT:    vinsertf128 $1, %xmm0, %ymm4, %ymm3
+; AVX2-NEXT:    vinsertf128 $1, %xmm0, %ymm7, %ymm0
+; AVX2-NEXT:    vshufpd {{.*#+}} ymm0 = ymm0[1],ymm3[0],ymm0[2],ymm3[3]
+; AVX2-NEXT:    vmovsd %xmm1, 64(%rdi)
+; AVX2-NEXT:    vmovapd %ymm0, 32(%rdi)
+; AVX2-NEXT:    vmovapd %ymm2, (%rdi)
 ; AVX2-NEXT:    vzeroupper
 ; AVX2-NEXT:    retq
 ;
diff --git a/llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll b/llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll
index 79602a18693dbed..00af58544e25c0b 100644
--- a/llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll
+++ b/llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll
@@ -493,11 +493,11 @@ define void @PR48908(<4 x double> %v0, <4 x double> %v1, <4 x double> %v2, ptr n
 ; X86-AVX2-NEXT:    movl {{[0-9]+}}(%esp), %eax
 ; X86-AVX2-NEXT:    movl {{[0-9]+}}(%esp), %ecx
 ; X86-AVX2-NEXT:    movl {{[0-9]+}}(%esp), %edx
-; X86-AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm3
+; X86-AVX2-NEXT:    vbroadcastsd %xmm1, %ymm3
 ; X86-AVX2-NEXT:    vperm2f128 {{.*#+}} ymm4 = ymm1[2,3],ymm2[0,1]
 ; X86-AVX2-NEXT:    vshufpd {{.*#+}} xmm5 = xmm1[1,0]
 ; X86-AVX2-NEXT:    vperm2f128 {{.*#+}} ymm6 = ymm0[0,1],ymm2[0,1]
-; X86-AVX2-NEXT:    vpermpd {{.*#+}} ymm3 = ymm3[0,2,2,1]
+; X86-AVX2-NEXT:    vperm2f128 {{.*#+}} ymm3 = ymm3[2,3],ymm0[0,1]
 ; X86-AVX2-NEXT:    vblendpd {{.*#+}} ymm3 = ymm6[0],ymm3[1],ymm6[2],ymm3[3]
 ; X86-AVX2-NEXT:    vmovapd %ymm3, (%edx)
 ; X86-AVX2-NEXT:    vblendpd {{.*#+}} ymm3 = ymm5[0,1],ymm0[2,3]
@@ -520,13 +520,13 @@ define void @PR48908(<4 x double> %v0, <4 x double> %v1, <4 x double> %v2, ptr n
 ; X86-AVX512-NEXT:    movl {{[0-9]+}}(%esp), %eax
 ; X86-AVX512-NEXT:    movl {{[0-9]+}}(%esp), %ecx
 ; X86-AVX512-NEXT:    movl {{[0-9]+}}(%esp), %edx
-; X86-AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm4
 ; X86-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm3 = [1,2,8,9]
 ; X86-AVX512-NEXT:    vpermi2pd %zmm2, %zmm1, %zmm3
-; X86-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm5 = [0,10,2,9]
-; X86-AVX512-NEXT:    vperm2f128 {{.*#+}} ymm6 = ymm0[0,1],ymm2[0,1]
-; X86-AVX512-NEXT:    vpermt2pd %zmm4, %zmm5, %zmm6
-; X86-AVX512-NEXT:    vmovapd %ymm6, (%edx)
+; X86-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm4 = [0,8,2,1]
+; X86-AVX512-NEXT:    vpermi2pd %zmm1, %zmm0, %zmm4
+; X86-AVX512-NEXT:    vperm2f128 {{.*#+}} ymm5 = ymm0[0,1],ymm2[0,1]
+; X86-AVX512-NEXT:    vblendpd {{.*#+}} ymm4 = ymm5[0],ymm4[1],ymm5[2],ymm4[3]
+; X86-AVX512-NEXT:    vmovapd %ymm4, (%edx)
 ; X86-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm4 = [0,3,10,1]
 ; X86-AVX512-NEXT:    vpermi2pd %zmm0, %zmm3, %zmm4
 ; X86-AVX512-NEXT:    vmovapd %ymm4, (%ecx)
@@ -563,11 +563,11 @@ define void @PR48908(<4 x double> %v0, <4 x double> %v1, <4 x double> %v2, ptr n
 ;
 ; X64-AVX2-LABEL: PR48908:
 ; X64-AVX2:       # %bb.0:
-; X64-AVX2-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm3
+; X64-AVX2-NEXT:    vbroadcastsd %xmm1, %ymm3
 ; X64-AVX2-NEXT:    vperm2f128 {{.*#+}} ymm4 = ymm1[2,3],ymm2[0,1]
 ; X64-AVX2-NEXT:    vshufpd {{.*#+}} xmm5 = xmm1[1,0]
 ; X64-AVX2-NEXT:    vperm2f128 {{.*#+}} ymm6 = ymm0[0,1],ymm2[0,1]
-; X64-AVX2-NEXT:    vpermpd {{.*#+}} ymm3 = ymm3[0,2,2,1]
+; X64-AVX2-NEXT:    vperm2f128 {{.*#+}} ymm3 = ymm3[2,3],ymm0[0,1]
 ; X64-AVX2-NEXT:    vblendpd {{.*#+}} ymm3 = ymm6[0],ymm3[1],ymm6[2],ymm3[3]
 ; X64-AVX2-NEXT:    vmovapd %ymm3, (%rdi)
 ; X64-AVX2-NEXT:    vblendpd {{.*#+}} ymm3 = ymm5[0,1],ymm0[2,3]
@@ -587,16 +587,16 @@ define void @PR48908(<4 x double> %v0, <4 x double> %v1, <4 x double> %v2, ptr n
 ; X64-AVX512-NEXT:    # kill: def $ymm2 killed $ymm2 def $zmm2
 ; X64-AVX512-NEXT:    # kill: def $ymm1 killed $ymm1 def $zmm1
 ; X64-AVX512-NEXT:    # kill: def $ymm0 killed $ymm0 def $zmm0
-; X64-AVX512-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm3
-; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm4 = [1,2,8,9]
-; X64-AVX512-NEXT:    vpermi2pd %zmm2, %zmm1, %zmm4
-; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm5 = [0,10,2,9]
-; X64-AVX512-NEXT:    vperm2f128 {{.*#+}} ymm6 = ymm0[0,1],ymm2[0,1]
-; X64-AVX512-NEXT:    vpermt2pd %zmm3, %zmm5, %zmm6
-; X64-AVX512-NEXT:    vmovapd %ymm6, (%rdi)
-; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm3 = [0,3,10,1]
-; X64-AVX512-NEXT:    vpermi2pd %zmm0, %zmm4, %zmm3
-; X64-AVX512-NEXT:    vmovapd %ymm3, (%rsi)
+; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm3 = [1,2,8,9]
+; X64-AVX512-NEXT:    vpermi2pd %zmm2, %zmm1, %zmm3
+; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm4 = [0,8,2,1]
+; X64-AVX512-NEXT:    vpermi2pd %zmm1, %zmm0, %zmm4
+; X64-AVX512-NEXT:    vperm2f128 {{.*#+}} ymm5 = ymm0[0,1],ymm2[0,1]
+; X64-AVX512-NEXT:    vblendpd {{.*#+}} ymm4 = ymm5[0],ymm4[1],ymm5[2],ymm4[3]
+; X64-AVX512-NEXT:    vmovapd %ymm4, (%rdi)
+; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm4 = [0,3,10,1]
+; X64-AVX512-NEXT:    vpermi2pd %zmm0, %zmm3, %zmm4
+; X64-AVX512-NEXT:    vmovapd %ymm4, (%rsi)
 ; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} xmm3 = [3,11]
 ; X64-AVX512-NEXT:    vpermi2pd %zmm1, %zmm0, %zmm3
 ; X64-AVX512-NEXT:    vpmovsxbq {{.*#+}} ymm0 = [2,8,9,3]

@RKSimon RKSimon merged commit 4972722 into llvm:main Feb 9, 2025
8 of 10 checks passed
@RKSimon RKSimon deleted the x86-blend-broadcast branch February 9, 2025 17:11
@llvm-ci
Copy link
Collaborator

llvm-ci commented Feb 9, 2025

LLVM Buildbot has detected a new failure on builder fuchsia-x86_64-linux running on fuchsia-debian-64-us-central1-a-1 while building llvm at step 4 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/11/builds/12334

Here is the relevant piece of the build log for the reference
Step 4 (annotate) failure: 'python ../llvm-zorg/zorg/buildbot/builders/annotated/fuchsia-linux.py ...' (failure)
...
[235/2775] Copying CXX header __algorithm/for_each_segment.h
[236/2775] Copying CXX header __algorithm/generate.h
[237/2775] Copying CXX header __algorithm/generate_n.h
[238/2775] Generating header stdint.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/stdint.yaml
[239/2775] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.putchar.dir/putchar.cpp.obj
[240/2775] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.puts.dir/puts.cpp.obj
[241/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.memmem.dir/memmem.cpp.obj
[242/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcmp.dir/strcmp.cpp.obj
[243/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncpy.dir/strncpy.cpp.obj
[244/2775] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj
FAILED: libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj 
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-lo5x3a5v/bin/clang++ --target=aarch64-none-elf -DLIBC_NAMESPACE=__llvm_libc_21_0_0_git -D__LIBC_USE_FLOAT16_CONVERSION -I/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc -isystem /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-lo5x3a5v/include/aarch64-unknown-none-elf --target=aarch64-none-elf -Wno-atomic-alignment "-Dvfprintf(stream, format, vlist)=vprintf(format, vlist)" "-Dfprintf(stream, format, ...)=printf(format)" -D_LIBCPP_PRINT=1 -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -ffunction-sections -fdata-sections -flto -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-lo5x3a5v/runtimes/runtimes-aarch64-none-elf-bins=../../../../llvm-project -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/= -no-canonical-prefixes -Os -DNDEBUG -std=gnu++17 --target=aarch64-none-elf -DLIBC_QSORT_IMPL=LIBC_QSORT_HEAP_SORT -DLIBC_ADD_NULL_CHECKS "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES)" -fpie -ffreestanding -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-builtin -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wdeprecated -Wno-c99-extensions -Wno-gnu-imaginary-constant -Wno-pedantic -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety -Wglobal-constructors -DLIBC_COPT_PUBLIC_PACKAGING -MD -MT libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj -MF libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj.d -o libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj -c /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:16:8: error: unknown type name 'uintptr_t'
   16 | extern uintptr_t __preinit_array_start[];
      |        ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:17:8: error: unknown type name 'uintptr_t'
   17 | extern uintptr_t __preinit_array_end[];
      |        ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:18:8: error: unknown type name 'uintptr_t'
   18 | extern uintptr_t __init_array_start[];
      |        ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:19:8: error: unknown type name 'uintptr_t'
   19 | extern uintptr_t __init_array_end[];
      |        ^
4 errors generated.
[245/2775] Copying CXX header __algorithm/ranges_ends_with.h
[246/2775] Copying CXX header __algorithm/ranges_equal.h
[247/2775] Copying CXX header __algorithm/ranges_equal_range.h
[248/2775] Copying CXX header __algorithm/ranges_fill.h
[249/2775] Copying CXX header __algorithm/ranges_fill_n.h
[250/2775] Copying CXX header __algorithm/ranges_find.h
[251/2775] Copying CXX header __algorithm/ranges_find_end.h
[252/2775] Copying CXX header __algorithm/ranges_find_first_of.h
[253/2775] Copying CXX header __algorithm/ranges_find_if.h
[254/2775] Copying CXX header __algorithm/ranges_find_if_not.h
[255/2775] Copying CXX header __algorithm/ranges_find_last.h
[256/2775] Copying CXX header __algorithm/ranges_fold.h
[257/2775] Copying CXX header __algorithm/ranges_generate.h
[258/2775] Generating header setjmp.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/setjmp.yaml
[259/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncmp.dir/strncmp.cpp.obj
[260/2775] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.fini.dir/fini.cpp.obj
[261/2775] Copying CXX header __algorithm/ranges_for_each.h
[262/2775] Copying CXX header __algorithm/ranges_for_each_n.h
[263/2775] Copying CXX header __algorithm/ranges_generate_n.h
[264/2775] Copying CXX header __algorithm/ranges_includes.h
[265/2775] Copying CXX header __algorithm/ranges_inplace_merge.h
[266/2775] Copying CXX header __algorithm/ranges_is_partitioned.h
[267/2775] Copying CXX header __algorithm/ranges_is_heap.h
[268/2775] Copying CXX header __algorithm/ranges_is_permutation.h
Step 6 (build) failure: build (failure)
...
[235/2775] Copying CXX header __algorithm/for_each_segment.h
[236/2775] Copying CXX header __algorithm/generate.h
[237/2775] Copying CXX header __algorithm/generate_n.h
[238/2775] Generating header stdint.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/stdint.yaml
[239/2775] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.putchar.dir/putchar.cpp.obj
[240/2775] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.puts.dir/puts.cpp.obj
[241/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.memmem.dir/memmem.cpp.obj
[242/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcmp.dir/strcmp.cpp.obj
[243/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncpy.dir/strncpy.cpp.obj
[244/2775] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj
FAILED: libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj 
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-lo5x3a5v/bin/clang++ --target=aarch64-none-elf -DLIBC_NAMESPACE=__llvm_libc_21_0_0_git -D__LIBC_USE_FLOAT16_CONVERSION -I/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc -isystem /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-lo5x3a5v/include/aarch64-unknown-none-elf --target=aarch64-none-elf -Wno-atomic-alignment "-Dvfprintf(stream, format, vlist)=vprintf(format, vlist)" "-Dfprintf(stream, format, ...)=printf(format)" -D_LIBCPP_PRINT=1 -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -ffunction-sections -fdata-sections -flto -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-lo5x3a5v/runtimes/runtimes-aarch64-none-elf-bins=../../../../llvm-project -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/= -no-canonical-prefixes -Os -DNDEBUG -std=gnu++17 --target=aarch64-none-elf -DLIBC_QSORT_IMPL=LIBC_QSORT_HEAP_SORT -DLIBC_ADD_NULL_CHECKS "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES)" -fpie -ffreestanding -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-builtin -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wdeprecated -Wno-c99-extensions -Wno-gnu-imaginary-constant -Wno-pedantic -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety -Wglobal-constructors -DLIBC_COPT_PUBLIC_PACKAGING -MD -MT libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj -MF libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj.d -o libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj -c /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:16:8: error: unknown type name 'uintptr_t'
   16 | extern uintptr_t __preinit_array_start[];
      |        ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:17:8: error: unknown type name 'uintptr_t'
   17 | extern uintptr_t __preinit_array_end[];
      |        ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:18:8: error: unknown type name 'uintptr_t'
   18 | extern uintptr_t __init_array_start[];
      |        ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/startup/baremetal/init.cpp:19:8: error: unknown type name 'uintptr_t'
   19 | extern uintptr_t __init_array_end[];
      |        ^
4 errors generated.
[245/2775] Copying CXX header __algorithm/ranges_ends_with.h
[246/2775] Copying CXX header __algorithm/ranges_equal.h
[247/2775] Copying CXX header __algorithm/ranges_equal_range.h
[248/2775] Copying CXX header __algorithm/ranges_fill.h
[249/2775] Copying CXX header __algorithm/ranges_fill_n.h
[250/2775] Copying CXX header __algorithm/ranges_find.h
[251/2775] Copying CXX header __algorithm/ranges_find_end.h
[252/2775] Copying CXX header __algorithm/ranges_find_first_of.h
[253/2775] Copying CXX header __algorithm/ranges_find_if.h
[254/2775] Copying CXX header __algorithm/ranges_find_if_not.h
[255/2775] Copying CXX header __algorithm/ranges_find_last.h
[256/2775] Copying CXX header __algorithm/ranges_fold.h
[257/2775] Copying CXX header __algorithm/ranges_generate.h
[258/2775] Generating header setjmp.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/setjmp.yaml
[259/2775] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncmp.dir/strncmp.cpp.obj
[260/2775] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.fini.dir/fini.cpp.obj
[261/2775] Copying CXX header __algorithm/ranges_for_each.h
[262/2775] Copying CXX header __algorithm/ranges_for_each_n.h
[263/2775] Copying CXX header __algorithm/ranges_generate_n.h
[264/2775] Copying CXX header __algorithm/ranges_includes.h
[265/2775] Copying CXX header __algorithm/ranges_inplace_merge.h
[266/2775] Copying CXX header __algorithm/ranges_is_partitioned.h
[267/2775] Copying CXX header __algorithm/ranges_is_heap.h
[268/2775] Copying CXX header __algorithm/ranges_is_permutation.h

Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
… if we're blending inplace/splatable shuffle inputs on AVX2 targets (llvm#126420)

More aggressively use broadcast instructions where possible

Fixes llvm#50315
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Messy code for insertelement on AVX2 vectors
3 participants