Enable EVEX feature: embedded broadcast for Vector128/256/512.Add() in limited cases (#84821)

* Enable EVEX feature: embedded broadcast

  Embedded broadcast is enabled in Vector256<float>.Add() in limited cases:
  1. Vector256.Add(Vec, Vector256.Create(DCon));
  2. Vector256<float> VecCns = Vector256.Create(DCon); Vector256.Add(Vec, VecCns);
  3. Vector256.Add(Vec, Vector256.Create(LCL_VAR));
  4. Vector256<float> VecCns = Vector256.Create(LCL_VAR); Vector256.Add(Vec, VecCns);

  Note: Cases 2 and 4 can only be optimized when DOTNET_TieredCompilation = 0.
* Remove some irrelevant changes from previous main.
* Enable containment at the Broadcast intrinsic to improve embedded broadcast enabling.
* Convert the check logic on broadcast into a flag.
* Bug fixes:
  1. Fixed the containment logic in lowering to accommodate the situation where both operands of an EB-compatible node are EB candidates.
  2. Fixed EVEX.b being set unexpectedly on some non-EVEX instructions on x86.
* Apply format patch.
* Add "insOpts" data structure to xarch: insOpts may contain information on the EVEX.b bit, currently only embedded broadcast.
* Add "OperIsBroadcastScalar" check: this check ensures the intrinsic is actually a broadcast scalar intrinsic. It is needed because gentree flags use overlapping definitions and GTF_BROADCAST_EMBEDDED has a conflicting definition, so we need to ensure the flag we check does not come from another overlapping flag.
* Rebase the branch and resolve conflicts.
* Changes based on the reviews:
  1. Removed the gentree flag GTF_EMBEDDED_BROADCAST.
  2. Mark the embedded broadcast node by making it contained.
  3. Improved the logic in GetMemOpSize() to return the correct pointer size when embedded broadcast is enabled.
  4. Improved the logic in genOperandDesc() to emit a scalar when a constant vector operand is found to be created from a scalar.
* Apply format patch.
* Bug fixes.
* Bug fixes.
* Apply format patch.
* Enable embedded broadcast for Vector128<float>.Add.
* Enable embedded broadcast for Vector512<float>.Add.
* Make double supported for embedded broadcast.
* Add EB support to AVX_BroadcastScalarToVector*.
* Apply format patch.
* Enable embedded broadcast for double const vector.
* Enable embedded broadcast for integer Add.
* Changes based on the review:
  1. Changed GenTreeHWIntrinsic::OperIsEmbBroadcastHWIntrinsic to OperIsEmbBroadcastCompatible.
  2. Removed OperIsBroadcastScalar.
  3. Formatting.
  4. Corrected errors in the comments.
* Removed the gentree flag GTF_VECCON_FROMSCALAR.
* Bug fixes on embedded broadcast with AVX_Broadcast.
* Enable embedded broadcast in the R_R_A path.
* Apply format patch.
* Bug fixes: re-introduce "OperIsBroadcastScalar". In some cases a non-broadcast node (e.g. Load, Read) is contained by an embedded broadcast node and embedded broadcast is enabled unexpectedly; this method filters out those cases.
* Changes based on reviews:
  1. Code style improvements.
  2. Fixed typos and errors in the comments.
  3. Extracted the operand swap logic used when lowering a Create node into a function: TryCanonizeEmbBroadcastCandicate().
* Unfold the VecCon node when lowering if the node is eligible for embedded broadcast.
* Apply format patch.
* Bug fixes:
  1. Added a missing default branch.
  2. Filtered out some possible embedded broadcast cases for better optimization.
* Resolve the mishandling of the previous conflict.
* Move the unfolding logic to ContainChecks.
* Code changes based on the review.
* Apply format patch.
* Support embedded broadcast for GT_IND as the operand of a broadcast node.
* Bug fixes: the long type should only be used on 64-bit systems.
* Apply format patch.
* Introduce MakeHWIntrinsicSrcContained(): this function handles the case where a constant vector is the operand of an embedded broadcast op. If the constant vector is eligible for embedded broadcast, it is unfolded into the corresponding broadcast intrinsic form.
* Code changes based on reviews:
  1. Added a helper function to detect the embedded broadcast compatible flag.
  2. Containment logic improvement.
  3. Typo fixes.
* Code changes based on review.
* Apply format patch.
* Code changes based on review:
  1. Deleted irrelevant comments.
  2. Moved the contain check up to cover more cases.
* Code changes based on review:
  1. Updated a comment to keep up with the changes in InstrDesc.
  2. Removed an unneeded argument in the irrelevant method.
dotnet · Jun 2, 2023 · 1e029d0
1 parent e126ca3
commit 1e029d0
Showing 13 changed files with 566 additions and 62 deletions.
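As a rough illustration of what the four cases in the commit message are aiming for, the following C++ sketch models the semantics of embedded broadcast with plain arrays. It is not JIT code: `Vec8f`, `addFullVector`, and `addEmbeddedBroadcast` are hypothetical names, and the 8-lane array merely stands in for `Vector256<float>`. The point is that the broadcast form needs only a single 4-byte scalar in memory, while producing the same result as adding a fully materialized 32-byte constant vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar model of an 8-lane float vector (stands in for Vector256<float>).
using Vec8f = std::array<float, 8>;

// Without embedded broadcast: the constant exists as a full 32-byte vector
// and the add reads all eight lanes of it.
Vec8f addFullVector(const Vec8f& v, const Vec8f& c)
{
    Vec8f r{};
    for (std::size_t i = 0; i < r.size(); ++i)
    {
        r[i] = v[i] + c[i];
    }
    return r;
}

// With embedded broadcast: the add reads one 4-byte scalar from memory and
// uses that same value for every lane.
Vec8f addEmbeddedBroadcast(const Vec8f& v, const float* scalarInMemory)
{
    assert(scalarInMemory != nullptr);
    Vec8f r{};
    for (std::size_t i = 0; i < r.size(); ++i)
    {
        r[i] = v[i] + *scalarInMemory;
    }
    return r;
}
```

Calling both functions with the same input vector and the same constant (once as a filled `Vec8f`, once as a pointer to a single `float`) yields identical results; only the size of the memory operand differs.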
diff --git a/src/coreclr/jit/codegeninterface.h b/src/coreclr/jit/codegeninterface.h
@@ -127,7 +127,9 @@ class CodeGenInterface
 #define INST_FP 0x01 // is it a FP instruction?
 public:
     static bool instIsFP(instruction ins);
-
+#if defined(TARGET_XARCH)
+    static bool instIsEmbeddedBroadcastCompatible(instruction ins);
+#endif // TARGET_XARCH
     //-------------------------------------------------------------------------
     // Liveness-related fields & methods
 public:
@@ -764,6 +766,10 @@ class CodeGenInterface
 
     virtual const char* siStackVarName(size_t offs, size_t size, unsigned reg, unsigned stkOffs) = 0;
 #endif // LATE_DISASM
+
+#if defined(TARGET_XARCH)
+    bool IsEmbeddedBroadcastEnabled(instruction ins, GenTree* op);
+#endif
 };
 
 #endif // _CODEGEN_INTERFACE_H_
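The header diff above only declares the two new predicates; their bodies are not part of this excerpt. The sketch below shows one plausible shape for such checks, assuming a per-instruction flags table similar to the `instInfo` lookup that appears later in the diff. Every name here (`Flag_EmbBroadcastSupported`, `kInstInfo`, `OperandSketch`, and the `...Sketch` functions) is hypothetical, and the operand conditions are an assumption based on the commit message's notes about containment and `OperIsBroadcastScalar`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical instruction ids and flag bits; the JIT's real tables differ.
enum Instruction : unsigned
{
    INS_addps,
    INS_movaps,
    INS_count
};

enum InsFlags : std::uint32_t
{
    Flag_None                  = 0,
    Flag_EmbBroadcastSupported = 1u << 0 // assumed name, for illustration only
};

// One flags word per instruction, indexed by instruction id.
constexpr std::uint32_t kInstInfo[INS_count] = {
    /* INS_addps  */ Flag_EmbBroadcastSupported,
    /* INS_movaps */ Flag_None,
};

// Static, per-opcode question: can this instruction take a broadcast memory operand?
bool instIsEmbeddedBroadcastCompatibleSketch(Instruction ins)
{
    assert(ins < INS_count);
    return (kInstInfo[ins] & Flag_EmbBroadcastSupported) != 0;
}

// Reduced stand-in for the operand node passed to the per-use check.
struct OperandSketch
{
    bool isContained;       // operand is folded into its user
    bool isBroadcastScalar; // operand is a broadcast-from-scalar node
};

// Per-use question: compatible opcode AND an operand marked for embedded broadcast.
bool isEmbeddedBroadcastEnabledSketch(Instruction ins, const OperandSketch& op)
{
    return instIsEmbeddedBroadcastCompatibleSketch(ins) && op.isContained && op.isBroadcastScalar;
}
```

Splitting the check in two keeps the opcode property static and table-driven, while the per-use decision can still reject operands that are contained for some other reason.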
diff --git a/src/coreclr/jit/emit.h b/src/coreclr/jit/emit.h
@@ -781,6 +781,9 @@ class emitter
         unsigned _idCallRegPtr : 1; // IL indirect calls: addr in reg
         unsigned _idCallAddr : 1;   // IL indirect calls: can make a direct call to iiaAddr
         unsigned _idNoGC : 1;       // Some helpers don't get recorded in GC tables
+#if defined(TARGET_XARCH)
+        unsigned _idEvexbContext : 1; // does EVEX.b need to be set.
+#endif                                //  TARGET_XARCH
 
 #ifdef TARGET_ARM64
         opSize   _idOpSize : 3;    // operand size: 0=1 , 1=2 , 2=4 , 3=8, 4=16
@@ -814,8 +817,8 @@ class emitter
 
         ////////////////////////////////////////////////////////////////////////
         // Space taken up to here:
-        // x86:   46 bits
-        // amd64: 46 bits
+        // x86:   47 bits
+        // amd64: 47 bits
         // arm:   48 bits
         // arm64: 50 bits
         // loongarch64: 46 bits
@@ -830,8 +833,10 @@ class emitter
 #define ID_EXTRA_BITFIELD_BITS (16)
 #elif defined(TARGET_ARM64)
 #define ID_EXTRA_BITFIELD_BITS (18)
-#elif defined(TARGET_XARCH) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
+#elif defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
 #define ID_EXTRA_BITFIELD_BITS (14)
+#elif defined(TARGET_XARCH)
+#define ID_EXTRA_BITFIELD_BITS (15)
 #else
 #error Unsupported or unset target architecture
 #endif
@@ -866,8 +871,8 @@ class emitter
 
         ////////////////////////////////////////////////////////////////////////
         // Space taken up to here (with/without prev offset, assuming host==target):
-        // x86:   52/48 bits
-        // amd64: 53/48 bits
+        // x86:   53/49 bits
+        // amd64: 54/49 bits
         // arm:   54/50 bits
         // arm64: 57/52 bits
         // loongarch64: 53/48 bits
@@ -1529,6 +1534,19 @@ class emitter
             _idNoGC = val;
         }
 
+#ifdef TARGET_XARCH
+        bool idIsEvexbContext() const
+        {
+            return _idEvexbContext != 0;
+        }
+        void idSetEvexbContext()
+        {
+            assert(_idEvexbContext == 0);
+            _idEvexbContext = 1;
+            assert(_idEvexbContext == 1);
+        }
+#endif
+
 #ifdef TARGET_ARMARCH
         bool idIsLclVar() const
         {
@@ -3655,9 +3673,25 @@ inline unsigned emitter::emitGetInsCIargs(instrDesc* id)
 //
 emitAttr emitter::emitGetMemOpSize(instrDesc* id) const
 {
+
     emitAttr    defaultSize = id->idOpSize();
     instruction ins         = id->idIns();
+    if (id->idIsEvexbContext())
+    {
+        // Assumes EVEX.b denotes the embedded broadcast context here.
+        // Reference: Section 2.7.5 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2.
+        ssize_t inputSize = GetInputSizeInBytes(id);
+        switch (inputSize)
+        {
+            case 4:
+                return EA_4BYTE;
+            case 8:
+                return EA_8BYTE;
 
+            default:
+                unreached();
+        }
+    }
     switch (ins)
     {
         case INS_pextrb:

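The emit.h changes combine two ideas: a one-bit, set-once flag on the instruction descriptor, and a memory operand size that shrinks to the element size when that flag is set. The following is a minimal sketch of that interaction, not the emitter's actual code. `InstrDescSketch` and `memOpSizeBytes` are hypothetical, sizes are plain byte counts rather than `emitAttr` values, and the non-broadcast path is simplified to return the full operand size instead of switching on the instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily reduced stand-in for the emitter's instrDesc.
struct InstrDescSketch
{
    std::uint32_t opSizeBytes;      // full operand size, e.g. 32 for a 256-bit vector
    std::uint32_t inputSizeBytes;   // element size of the instruction's input: 4 or 8
    std::uint32_t evexbContext : 1; // mirrors the new one-bit _idEvexbContext field

    bool isEvexbContext() const
    {
        return evexbContext != 0;
    }

    // Set-once, in the spirit of the asserts in idSetEvexbContext().
    void setEvexbContext()
    {
        assert(evexbContext == 0);
        evexbContext = 1;
    }
};

// Size of the memory operand the instruction actually reads.
std::uint32_t memOpSizeBytes(const InstrDescSketch& id)
{
    if (id.isEvexbContext())
    {
        // Under embedded broadcast only a single element is loaded from memory.
        assert(id.inputSizeBytes == 4 || id.inputSizeBytes == 8);
        return id.inputSizeBytes;
    }
    // Simplification: the real function goes on to inspect the instruction.
    return id.opSizeBytes;
}
```

A descriptor for a 256-bit float add reports a 32-byte memory operand until `setEvexbContext()` is called, after which it reports 4 bytes. The extra bit is also why the diff bumps the x86/amd64 bit counts and `ID_EXTRA_BITFIELD_BITS` by one.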
diff --git a/src/coreclr/jit/emitxarch.cpp b/src/coreclr/jit/emitxarch.cpp
@@ -1231,9 +1231,10 @@ bool emitter::TakesEvexPrefix(const instrDesc* id) const
 #define DEFAULT_BYTE_EVEX_PREFIX_MASK 0xFFFFFFFF00000000ULL
 #define LBIT_IN_BYTE_EVEX_PREFIX 0x0000002000000000ULL
 #define LPRIMEBIT_IN_BYTE_EVEX_PREFIX 0x0000004000000000ULL
+#define EVEX_B_BIT 0x0000001000000000ULL
 
 //------------------------------------------------------------------------
-// AddEvexPrefix: Add default EVEX perfix with only LL' bits set.
+// AddEvexPrefix: Add default EVEX prefix with only LL' bits set.
 //
 // Arguments:
 //    ins -- processor instruction to check.
@@ -1268,6 +1269,22 @@ emitter::code_t emitter::AddEvexPrefix(instruction ins, code_t code, emitAttr at
     return code;
 }
 
+//------------------------------------------------------------------------
+// AddEvexbBit: set the EVEX.b bit if EvexbContext is set in the instruction descriptor.
+//
+// Arguments:
+//    code -- opcode bits.
+//
+// Return Value:
+//    encoded code with Evex.b set if needed.
+//
+emitter::code_t emitter::AddEvexbBit(code_t code)
+{
+    assert(hasEvexPrefix(code));
+    code |= EVEX_B_BIT;
+    return code;
+}
+
 // Returns true if this instruction requires a VEX prefix
 // All AVX instructions require a VEX prefix
 bool emitter::TakesVexPrefix(instruction ins) const
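To see where `EVEX_B_BIT` lands, the sketch below reuses the two constants from the hunk above and checks the bit position. It assumes the four EVEX prefix bytes occupy bits 63..32 of the code value (which the `0xFFFFFFFF00000000` mask suggests), so the last payload byte sits in bits 39..32. The example prefix value, `lastPayloadByte`, `hasEvexPrefixSketch`, and `addEvexbBitSketch` are illustrative assumptions, not the emitter's definitions.

```cpp
#include <cassert>
#include <cstdint>

using code_t = std::uint64_t;

// Values copied from the diff above.
constexpr code_t kEvexPrefixMask = 0xFFFFFFFF00000000ULL;
constexpr code_t kEvexBBit       = 0x0000001000000000ULL;

// Assumed example prefix: 0x62 followed by three payload bytes (illustration only).
constexpr code_t kExampleEvexPrefix = 0x62F07C0800000000ULL;

// Assumed layout: the last EVEX payload byte occupies bits 39..32.
constexpr std::uint8_t lastPayloadByte(code_t code)
{
    return static_cast<std::uint8_t>((code >> 32) & 0xFF);
}

// Treat a code as EVEX-prefixed when its top byte is the 0x62 escape.
constexpr bool hasEvexPrefixSketch(code_t code)
{
    return ((code & kEvexPrefixMask) >> 56) == 0x62;
}

// Same shape as AddEvexbBit: OR the b bit into an already prefixed code.
code_t addEvexbBitSketch(code_t code)
{
    assert(hasEvexPrefixSketch(code));
    return code | kEvexBBit;
}

// In the EVEX encoding, the last payload byte is laid out as z, L'L, b, V', aaa,
// which puts b at bit 4 of that byte.
static_assert(lastPayloadByte(kEvexBBit) == 0x10, "EVEX.b is bit 4 of the last payload byte");
```

Because the operation is a plain OR, applying it twice is harmless, and the low 32 bits that carry the opcode are untouched. The position is also consistent with the neighbouring `LBIT_IN_BYTE_EVEX_PREFIX` and `LPRIMEBIT_IN_BYTE_EVEX_PREFIX` masks, which sit one and two bits higher.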
@@ -6667,7 +6684,8 @@ void emitter::emitIns_R_S_I(instruction ins, emitAttr attr, regNumber reg1, int
     emitCurIGsize += sz;
 }
 
-void emitter::emitIns_R_R_A(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, GenTreeIndir* indir)
+void emitter::emitIns_R_R_A(
+    instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, GenTreeIndir* indir, insOpts instOptions)
 {
     assert(IsAvx512OrPriorInstruction(ins));
     assert(IsThreeOperandAVXInstruction(ins));
@@ -6678,6 +6696,11 @@ void emitter::emitIns_R_R_A(instruction ins, emitAttr attr, regNumber reg1, regN
     id->idIns(ins);
     id->idReg1(reg1);
     id->idReg2(reg2);
+    if (instOptions == INS_OPTS_EVEX_b)
+    {
+        assert(UseEvexEncoding());
+        id->idSetEvexbContext();
+    }
 
     emitHandleMemOp(indir, id, (ins == INS_mulx) ? IF_RWR_RWR_ARD : emitInsModeFormat(ins, IF_RRD_RRD_ARD), ins);
 
@@ -6778,8 +6801,13 @@ void emitter::emitIns_R_AR_R(instruction ins,
     emitCurIGsize += sz;
 }
 
-void emitter::emitIns_R_R_C(
-    instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, CORINFO_FIELD_HANDLE fldHnd, int offs)
+void emitter::emitIns_R_R_C(instruction          ins,
+                            emitAttr             attr,
+                            regNumber            reg1,
+                            regNumber            reg2,
+                            CORINFO_FIELD_HANDLE fldHnd,
+                            int                  offs,
+                            insOpts              instOptions)
 {
     assert(IsAvx512OrPriorInstruction(ins));
     assert(IsThreeOperandAVXInstruction(ins));
@@ -6797,6 +6825,11 @@ void emitter::emitIns_R_R_C(
     id->idReg1(reg1);
     id->idReg2(reg2);
     id->idAddr()->iiaFieldHnd = fldHnd;
+    if (instOptions == INS_OPTS_EVEX_b)
+    {
+        assert(UseEvexEncoding());
+        id->idSetEvexbContext();
+    }
 
     UNATIVE_OFFSET sz = emitInsSizeCV(id, insCodeRM(ins));
     id->idCodeSize(sz);
@@ -6829,7 +6862,8 @@ void emitter::emitIns_R_R_R(instruction ins, emitAttr attr, regNumber targetReg,
     emitCurIGsize += sz;
 }
 
-void emitter::emitIns_R_R_S(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int varx, int offs)
+void emitter::emitIns_R_R_S(
+    instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int varx, int offs, insOpts instOptions)
 {
     assert(IsAvx512OrPriorInstruction(ins));
     assert(IsThreeOperandAVXInstruction(ins));
@@ -6842,6 +6876,11 @@ void emitter::emitIns_R_R_S(instruction ins, emitAttr attr, regNumber reg1, regN
     id->idReg2(reg2);
     id->idAddr()->iiaLclVar.initLclVarAddr(varx, offs);
 
+    if (instOptions == INS_OPTS_EVEX_b)
+    {
+        assert(UseEvexEncoding());
+        id->idSetEvexbContext();
+    }
 #ifdef DEBUG
     id->idDebugOnlyInfo()->idVarRefOffs = emitVarRefOffs;
 #endif
@@ -8134,14 +8173,15 @@ void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber targe
 //    indir     -- The GenTreeIndir used for the memory address
 //
 void emitter::emitIns_SIMD_R_R_A(
-    instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, GenTreeIndir* indir)
+    instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, GenTreeIndir* indir, insOpts instOptions)
 {
     if (UseSimdEncoding())
     {
-        emitIns_R_R_A(ins, attr, targetReg, op1Reg, indir);
+        emitIns_R_R_A(ins, attr, targetReg, op1Reg, indir, instOptions);
     }
     else
     {
+        assert(instOptions == INS_OPTS_NONE);
         emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
         emitIns_R_A(ins, attr, targetReg, indir);
     }
@@ -8159,15 +8199,21 @@ void emitter::emitIns_SIMD_R_R_A(
 //    fldHnd    -- The CORINFO_FIELD_HANDLE used for the memory address
 //    offs      -- The offset added to the memory address from fldHnd
 //
-void emitter::emitIns_SIMD_R_R_C(
-    instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, CORINFO_FIELD_HANDLE fldHnd, int offs)
+void emitter::emitIns_SIMD_R_R_C(instruction          ins,
+                                 emitAttr             attr,
+                                 regNumber            targetReg,
+                                 regNumber            op1Reg,
+                                 CORINFO_FIELD_HANDLE fldHnd,
+                                 int                  offs,
+                                 insOpts              instOptions)
 {
     if (UseSimdEncoding())
     {
-        emitIns_R_R_C(ins, attr, targetReg, op1Reg, fldHnd, offs);
+        emitIns_R_R_C(ins, attr, targetReg, op1Reg, fldHnd, offs, instOptions);
     }
     else
     {
+        assert(instOptions == INS_OPTS_NONE);
         emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
         emitIns_R_C(ins, attr, targetReg, fldHnd, offs);
     }
@@ -8222,14 +8268,15 @@ void emitter::emitIns_SIMD_R_R_R(
 //    offs      -- The offset added to the memory address from varx
 //
 void emitter::emitIns_SIMD_R_R_S(
-    instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int varx, int offs)
+    instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int varx, int offs, insOpts instOptions)
 {
     if (UseSimdEncoding())
     {
-        emitIns_R_R_S(ins, attr, targetReg, op1Reg, varx, offs);
+        emitIns_R_R_S(ins, attr, targetReg, op1Reg, varx, offs, instOptions);
     }
     else
     {
+        assert(instOptions == INS_OPTS_NONE);
         emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
         emitIns_R_S(ins, attr, targetReg, varx, offs);
     }
@@ -15717,7 +15764,7 @@ BYTE* emitter::emitOutputLJ(insGroup* ig, BYTE* dst, instrDesc* i)
 // Return Value:
 //    size in bytes.
 //
-ssize_t emitter::GetInputSizeInBytes(instrDesc* id)
+ssize_t emitter::GetInputSizeInBytes(instrDesc* id) const
 {
     insFlags inputSize = static_cast<insFlags>((CodeGenInterface::instInfo[id->idIns()] & Input_Mask));