Enable EVEX feature: embedded broadcast for Vector128/256/512.Add() in limited cases (#84821)

* Enable EVEX feature: embedded broadcast
Embedded broadcast is enabled in Vector256<float>.Add() in limited cases:
1. Vector256.Add(Vec, Vector256.Create(DCon));
2. Vector256<float> VecCns = Vector256.Create(DCon);
   Vector256.Add(Vec, VecCns);

3. Vector256.Add(Vec, Vector256.Create(LCL_VAR));
4. Vector256<float> VecCns = Vector256.Create(LCL_VAR);
   Vector256.Add(Vec, VecCns);

Note: Cases 2 and 4 can only be optimized when DOTNET_TieredCompilation=0.

* remove some irrelevant changes from previous main.

* Enable containment on the Broadcast intrinsic
to improve embedded broadcast enablement.

* Convert the broadcast check logic into a flag

* bug fixes:
1. fixed the containment logic in lowering to accommodate the situation where
both operands of an EB-compatible node are EB candidates.
2. fixed some unexpected EVEX.b bits set on some non-EVEX instructions on x86

* apply format patch.

* Add "insOpts" data structure to xarch:
 insOpts may contain information on the EVEX.b bit,
 currently only embedded broadcast

* Add "OperIsBroadcastScalar" check:
This check ensures the intrinsic is actually a broadcast scalar
intrinsic. It is needed because gentree flags use overlapping
definitions and GTF_BROADCAST_EMBEDDED conflicts with some of them,
so we must make sure the flag we check does not come from another
overlapping flag.

* rebase the branch and resolve conflicts

* changes based on the reviews:
1. removed the gentree flag GTF_EMBEDDED_BROADCAST.
2. mark the embedded broadcast node by making it contained.
3. improved the logic in GetMemOpSize() to return the correct pointer size
when embedded broadcast is enabled.
4. improved the logic in genOperandDesc() to emit a scalar when a constant
vector operand is found to be created from a scalar.

* apply format patch

* bug fixes

* bug fixes

* apply format patch

* Enable embedded broadcast for Vector128<float>.Add

* Enable embedded broadcast for Vector512<float>.Add

* make double supported for embedded broadcast

* Add EB support to AVX_BroadcastScalarToVector*

* apply format patch

* Enable embedded broadcast for double const vector

* Enable embedded broadcast for integer Add.

* Changes based on the review:
1. Change GenTreeHWIntrinsic::OperIsEmbBroadcastHWIntrinsic
to OperIsEmbBroadcastCompatible
2. removed OperIsBroadcastScalar
3. formatting
4. correct errors in the comments.

* removed the gentree flag: GTF_VECCON_FROMSCALAR

* Bug fixes on embedded broadcast with AVX_Broadcast

* enable embedded broadcast in R_R_A path

* apply format patch

* bug fixes:
re-introduce "OperIsBroadcastScalar":
there are some cases where a non-broadcast node (e.g. Load, Read)
is contained by embedded broadcast and embedded broadcast
is enabled unexpectedly; this method filters out those cases.

* Changes based on reviews:
1. code style improvement
2. fixes typos and errors in the comments.
3. extract the operand swap logic when lowering Create node into
a function: TryCanonizeEmbBroadcastCandicate()

* unfold VecCon node when lowering if this node is
eligible for embedded broadcast.

* apply format patch

* bug fixes:
1. added missing default branch
2. filter out some possible embedded broadcast cases
for some better optimization

* resolve the mishandling for the previous conflict.

* move the unfolding logic to ContainChecks

* Code changes based on the review

* apply format patch

* support embedded broadcast for GT_IND
as the operand of a broadcast node.

* bug fixes:
Long type should only be on 64-bit system.

* apply format patch

* Introduce MakeHWIntrinsicSrcContained():
This function handles the case where a constant vector
is the operand of an embedded broadcast op.
If the constant vector is eligible for embedded broadcast,
it is unfolded into the corresponding broadcast
intrinsic form.

* Code changes based on reviews:
1. a helper function to detect embedded broadcast compatible flag
2. contain logic improvement.
3. typo fixes.

* Code changes based on review

* apply format patch

* Code changes based on review:
1. deleted irrelevant comments.

Move the contain check up to cover more cases.

* Code changes based on review:
1. Update comment to keep up with the changes in InstrDesc.
2. Removed unneeded argument in the irrelevant method.
Ruihan-Yin authored Jun 2, 2023
1 parent e126ca3 commit 1e029d0
Showing 13 changed files with 566 additions and 62 deletions.
8 changes: 7 additions & 1 deletion src/coreclr/jit/codegeninterface.h
@@ -127,7 +127,9 @@ class CodeGenInterface
#define INST_FP 0x01 // is it a FP instruction?
public:
static bool instIsFP(instruction ins);

#if defined(TARGET_XARCH)
static bool instIsEmbeddedBroadcastCompatible(instruction ins);
#endif // TARGET_XARCH
//-------------------------------------------------------------------------
// Liveness-related fields & methods
public:
@@ -764,6 +766,10 @@ class CodeGenInterface

virtual const char* siStackVarName(size_t offs, size_t size, unsigned reg, unsigned stkOffs) = 0;
#endif // LATE_DISASM

#if defined(TARGET_XARCH)
bool IsEmbeddedBroadcastEnabled(instruction ins, GenTree* op);
#endif
};

#endif // _CODEGEN_INTERFACE_H_
44 changes: 39 additions & 5 deletions src/coreclr/jit/emit.h
@@ -781,6 +781,9 @@ class emitter
unsigned _idCallRegPtr : 1; // IL indirect calls: addr in reg
unsigned _idCallAddr : 1; // IL indirect calls: can make a direct call to iiaAddr
unsigned _idNoGC : 1; // Some helpers don't get recorded in GC tables
#if defined(TARGET_XARCH)
unsigned _idEvexbContext : 1; // does EVEX.b need to be set.
#endif // TARGET_XARCH

#ifdef TARGET_ARM64
opSize _idOpSize : 3; // operand size: 0=1 , 1=2 , 2=4 , 3=8, 4=16
@@ -814,8 +817,8 @@ class emitter

////////////////////////////////////////////////////////////////////////
// Space taken up to here:
// x86: 46 bits
// amd64: 46 bits
// x86: 47 bits
// amd64: 47 bits
// arm: 48 bits
// arm64: 50 bits
// loongarch64: 46 bits
@@ -830,8 +833,10 @@ class emitter
#define ID_EXTRA_BITFIELD_BITS (16)
#elif defined(TARGET_ARM64)
#define ID_EXTRA_BITFIELD_BITS (18)
#elif defined(TARGET_XARCH) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
#elif defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
#define ID_EXTRA_BITFIELD_BITS (14)
#elif defined(TARGET_XARCH)
#define ID_EXTRA_BITFIELD_BITS (15)
#else
#error Unsupported or unset target architecture
#endif
@@ -866,8 +871,8 @@ class emitter

////////////////////////////////////////////////////////////////////////
// Space taken up to here (with/without prev offset, assuming host==target):
// x86: 52/48 bits
// amd64: 53/48 bits
// x86: 53/49 bits
// amd64: 54/49 bits
// arm: 54/50 bits
// arm64: 57/52 bits
// loongarch64: 53/48 bits
@@ -1529,6 +1534,19 @@ class emitter
_idNoGC = val;
}

#ifdef TARGET_XARCH
bool idIsEvexbContext() const
{
return _idEvexbContext != 0;
}
void idSetEvexbContext()
{
assert(_idEvexbContext == 0);
_idEvexbContext = 1;
assert(_idEvexbContext == 1);
}
#endif
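
The accessor pair above could be exercised in isolation roughly as follows. This is a minimal sketch, not the JIT's code: `EvexbFlagDesc` is a hypothetical stand-in for `instrDesc` that keeps only the one-bit field, and zero-initializing the flag in the constructor is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical stand-in for the emitter's instrDesc; only the new bit is modeled.
struct EvexbFlagDesc
{
    unsigned evexbContext : 1; // mirrors _idEvexbContext: does EVEX.b need to be set?

    // Assumption of this sketch: a fresh descriptor starts with the flag cleared.
    EvexbFlagDesc() : evexbContext(0)
    {
    }

    bool isEvexbContext() const
    {
        return evexbContext != 0;
    }

    // Like idSetEvexbContext(), the flag is expected to be set at most once.
    void setEvexbContext()
    {
        assert(evexbContext == 0);
        evexbContext = 1;
        assert(evexbContext == 1);
    }
};
```

A caller would create the descriptor, call `setEvexbContext()` once when embedded broadcast is requested, and query `isEvexbContext()` later during encoding.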

#ifdef TARGET_ARMARCH
bool idIsLclVar() const
{
@@ -3655,9 +3673,25 @@ inline unsigned emitter::emitGetInsCIargs(instrDesc* id)
//
emitAttr emitter::emitGetMemOpSize(instrDesc* id) const
{

emitAttr defaultSize = id->idOpSize();
instruction ins = id->idIns();
if (id->idIsEvexbContext())
{
// Assumes that EVEX.b here stands for the embedded broadcast context.
// Reference: Section 2.7.5 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2.
ssize_t inputSize = GetInputSizeInBytes(id);
switch (inputSize)
{
case 4:
return EA_4BYTE;
case 8:
return EA_8BYTE;

default:
unreached();
}
}
switch (ins)
{
case INS_pextrb:
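
The embedded broadcast branch added to emitGetMemOpSize() can be sketched on its own as below. All names here are hypothetical: sizes are plain byte counts rather than `emitAttr` values, and the per-instruction `switch` that follows in the real function is collapsed into returning the default size, which is a simplification made only for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: picks the memory operand size in bytes.
// evexbContext     - whether the descriptor is marked for embedded broadcast
// inputSizeBytes   - element size reported for the instruction (4 or 8 expected)
// defaultSizeBytes - operand size used when no broadcast is involved
inline std::size_t memOpSizeSketch(bool evexbContext, std::size_t inputSizeBytes, std::size_t defaultSizeBytes)
{
    if (evexbContext)
    {
        // With embedded broadcast the memory operand is a single element,
        // so the size follows the element width, not the vector width.
        switch (inputSizeBytes)
        {
            case 4:
                return 4;
            case 8:
                return 8;
            default:
                assert(false && "unexpected element size"); // stands in for unreached()
                return 0;
        }
    }
    // Simplification: the real function continues with instruction-specific cases.
    return defaultSizeBytes;
}
```

For example, a 32-byte vector add over float elements would report a 4-byte memory operand once the broadcast context is set, and 32 bytes otherwise.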
73 changes: 60 additions & 13 deletions src/coreclr/jit/emitxarch.cpp
@@ -1231,9 +1231,10 @@ bool emitter::TakesEvexPrefix(const instrDesc* id) const
#define DEFAULT_BYTE_EVEX_PREFIX_MASK 0xFFFFFFFF00000000ULL
#define LBIT_IN_BYTE_EVEX_PREFIX 0x0000002000000000ULL
#define LPRIMEBIT_IN_BYTE_EVEX_PREFIX 0x0000004000000000ULL
#define EVEX_B_BIT 0x0000001000000000ULL

//------------------------------------------------------------------------
// AddEvexPrefix: Add default EVEX perfix with only LL' bits set.
// AddEvexPrefix: Add default EVEX prefix with only LL' bits set.
//
// Arguments:
// ins -- processor instruction to check.
@@ -1268,6 +1269,22 @@ emitter::code_t emitter::AddEvexPrefix(instruction ins, code_t code, emitAttr at
return code;
}

//------------------------------------------------------------------------
// AddEvexbBit: set the EVEX.b bit if EvexbContext is set in the instruction descriptor.
//
// Arguments:
// code -- opcode bits.
//
// Return Value:
// encoded code with Evex.b set if needed.
//
emitter::code_t emitter::AddEvexbBit(code_t code)
{
assert(hasEvexPrefix(code));
code |= EVEX_B_BIT;
return code;
}
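
As a rough illustration of what AddEvexbBit() does to the code word, the sketch below ORs the same mask into a 64-bit value. The mask value is taken from the `EVEX_B_BIT` define above; everything else is an assumption of the sketch: the names are hypothetical, and the prefix check assumes the `0x62` EVEX escape byte occupies the top byte of the code word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the emitter's code_t.
using code_sketch_t = std::uint64_t;

// Same value as EVEX_B_BIT in the diff: bit 4 of the last EVEX payload byte.
constexpr code_sketch_t kEvexBBit = 0x0000001000000000ULL;

// Assumption: the 0x62 escape byte sits in the most significant byte.
constexpr code_sketch_t kEvexPrefixMask = 0xFF00000000000000ULL;
constexpr code_sketch_t kEvexPrefixCode = 0x6200000000000000ULL;

constexpr bool hasEvexPrefixSketch(code_sketch_t code)
{
    return (code & kEvexPrefixMask) == kEvexPrefixCode;
}

// Sets EVEX.b on a code word that already carries an EVEX prefix.
inline code_sketch_t addEvexbBitSketch(code_sketch_t code)
{
    assert(hasEvexPrefixSketch(code));
    return code | kEvexBBit;
}
```

Because the operation is a plain OR, applying it twice yields the same code word as applying it once.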

// Returns true if this instruction requires a VEX prefix
// All AVX instructions require a VEX prefix
bool emitter::TakesVexPrefix(instruction ins) const
@@ -6667,7 +6684,8 @@ void emitter::emitIns_R_S_I(instruction ins, emitAttr attr, regNumber reg1, int
emitCurIGsize += sz;
}

void emitter::emitIns_R_R_A(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, GenTreeIndir* indir)
void emitter::emitIns_R_R_A(
instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, GenTreeIndir* indir, insOpts instOptions)
{
assert(IsAvx512OrPriorInstruction(ins));
assert(IsThreeOperandAVXInstruction(ins));
@@ -6678,6 +6696,11 @@ void emitter::emitIns_R_R_A(instruction ins, emitAttr attr, regNumber reg1, regN
id->idIns(ins);
id->idReg1(reg1);
id->idReg2(reg2);
if (instOptions == INS_OPTS_EVEX_b)
{
assert(UseEvexEncoding());
id->idSetEvexbContext();
}

emitHandleMemOp(indir, id, (ins == INS_mulx) ? IF_RWR_RWR_ARD : emitInsModeFormat(ins, IF_RRD_RRD_ARD), ins);

@@ -6778,8 +6801,13 @@ void emitter::emitIns_R_AR_R(instruction ins,
emitCurIGsize += sz;
}

void emitter::emitIns_R_R_C(
instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, CORINFO_FIELD_HANDLE fldHnd, int offs)
void emitter::emitIns_R_R_C(instruction ins,
emitAttr attr,
regNumber reg1,
regNumber reg2,
CORINFO_FIELD_HANDLE fldHnd,
int offs,
insOpts instOptions)
{
assert(IsAvx512OrPriorInstruction(ins));
assert(IsThreeOperandAVXInstruction(ins));
@@ -6797,6 +6825,11 @@ void emitter::emitIns_R_R_C(
id->idReg1(reg1);
id->idReg2(reg2);
id->idAddr()->iiaFieldHnd = fldHnd;
if (instOptions == INS_OPTS_EVEX_b)
{
assert(UseEvexEncoding());
id->idSetEvexbContext();
}

UNATIVE_OFFSET sz = emitInsSizeCV(id, insCodeRM(ins));
id->idCodeSize(sz);
@@ -6829,7 +6862,8 @@ void emitter::emitIns_R_R_R(instruction ins, emitAttr attr, regNumber targetReg,
emitCurIGsize += sz;
}

void emitter::emitIns_R_R_S(instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int varx, int offs)
void emitter::emitIns_R_R_S(
instruction ins, emitAttr attr, regNumber reg1, regNumber reg2, int varx, int offs, insOpts instOptions)
{
assert(IsAvx512OrPriorInstruction(ins));
assert(IsThreeOperandAVXInstruction(ins));
@@ -6842,6 +6876,11 @@ void emitter::emitIns_R_R_S(instruction ins, emitAttr attr, regNumber reg1, regN
id->idReg2(reg2);
id->idAddr()->iiaLclVar.initLclVarAddr(varx, offs);

if (instOptions == INS_OPTS_EVEX_b)
{
assert(UseEvexEncoding());
id->idSetEvexbContext();
}
#ifdef DEBUG
id->idDebugOnlyInfo()->idVarRefOffs = emitVarRefOffs;
#endif
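
The same gating block appears in emitIns_R_R_A, emitIns_R_R_C and emitIns_R_R_S. A minimal sketch of that shared pattern follows; `InsOptsSketch`, `GatedDesc` and `applyInsOptsSketch` are hypothetical names, and passing the EVEX availability as a boolean parameter is an assumption standing in for the emitter's UseEvexEncoding() query.

```cpp
#include <cassert>

// Hypothetical mirror of insOpts with only the two values seen in the diff.
enum class InsOptsSketch
{
    None,
    EvexB
};

// Hypothetical descriptor carrying only the broadcast flag.
struct GatedDesc
{
    bool evexbContext = false;
};

// Marks the descriptor only when the caller explicitly asked for EVEX.b.
inline void applyInsOptsSketch(GatedDesc& id, InsOptsSketch opts, bool useEvexEncoding)
{
    (void)useEvexEncoding; // only read by the assert below
    if (opts == InsOptsSketch::EvexB)
    {
        // Requesting EVEX.b without EVEX encoding available is a caller bug.
        assert(useEvexEncoding);
        assert(!id.evexbContext);
        id.evexbContext = true;
    }
}
```

Keeping the default at `None` means existing call sites that do not pass an option leave the descriptor unchanged.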
@@ -8134,14 +8173,15 @@ void emitter::emitIns_SIMD_R_R_I(instruction ins, emitAttr attr, regNumber targe
// indir -- The GenTreeIndir used for the memory address
//
void emitter::emitIns_SIMD_R_R_A(
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, GenTreeIndir* indir)
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, GenTreeIndir* indir, insOpts instOptions)
{
if (UseSimdEncoding())
{
emitIns_R_R_A(ins, attr, targetReg, op1Reg, indir);
emitIns_R_R_A(ins, attr, targetReg, op1Reg, indir, instOptions);
}
else
{
assert(instOptions == INS_OPTS_NONE);
emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
emitIns_R_A(ins, attr, targetReg, indir);
}
@@ -8159,15 +8199,21 @@ void emitter::emitIns_SIMD_R_R_A(
// fldHnd -- The CORINFO_FIELD_HANDLE used for the memory address
// offs -- The offset added to the memory address from fldHnd
//
void emitter::emitIns_SIMD_R_R_C(
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, CORINFO_FIELD_HANDLE fldHnd, int offs)
void emitter::emitIns_SIMD_R_R_C(instruction ins,
emitAttr attr,
regNumber targetReg,
regNumber op1Reg,
CORINFO_FIELD_HANDLE fldHnd,
int offs,
insOpts instOptions)
{
if (UseSimdEncoding())
{
emitIns_R_R_C(ins, attr, targetReg, op1Reg, fldHnd, offs);
emitIns_R_R_C(ins, attr, targetReg, op1Reg, fldHnd, offs, instOptions);
}
else
{
assert(instOptions == INS_OPTS_NONE);
emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
emitIns_R_C(ins, attr, targetReg, fldHnd, offs);
}
@@ -8222,14 +8268,15 @@ void emitter::emitIns_SIMD_R_R_R(
// offs -- The offset added to the memory address from varx
//
void emitter::emitIns_SIMD_R_R_S(
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int varx, int offs)
instruction ins, emitAttr attr, regNumber targetReg, regNumber op1Reg, int varx, int offs, insOpts instOptions)
{
if (UseSimdEncoding())
{
emitIns_R_R_S(ins, attr, targetReg, op1Reg, varx, offs);
emitIns_R_R_S(ins, attr, targetReg, op1Reg, varx, offs, instOptions);
}
else
{
assert(instOptions == INS_OPTS_NONE);
emitIns_Mov(INS_movaps, attr, targetReg, op1Reg, /* canSkip */ true);
emitIns_R_S(ins, attr, targetReg, varx, offs);
}
@@ -15717,7 +15764,7 @@ BYTE* emitter::emitOutputLJ(insGroup* ig, BYTE* dst, instrDesc* i)
// Return Value:
// size in bytes.
//
ssize_t emitter::GetInputSizeInBytes(instrDesc* id)
ssize_t emitter::GetInputSizeInBytes(instrDesc* id) const
{
insFlags inputSize = static_cast<insFlags>((CodeGenInterface::instInfo[id->idIns()] & Input_Mask));
