[IA][RISCV] Support VP loads/stores in InterleavedAccessPass #120490
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-risc-v

Author: Min-Yih Hsu (mshockwave)

Changes

Teach InterleavedAccessPass to recognize the following patterns:
- vp.store an interleaved scalable vector
- Deinterleaving a scalable vector loaded from vp.load

Upon recognizing these patterns, IA will collect the interleaved / deinterleaved operands and delegate them over to their respective newly-added TLI hooks. For RISC-V, these patterns are lowered into segmented loads/stores (except when we're interleaving constant splats, in which case a unit-stride store will be generated).

Right now we only recognize power-of-two (de)interleave cases, in which (de)interleave4/8 are synthesized from a tree of (de)interleave2. Note that #89276 also did a similar thing by recognizing power-of-two (de)interleave expressions in AArch64's TLI, though it matched a pattern for a single (de)interleave factor, while this patch runs a BFS on the tree and supports factors of 2, 4, and 8. Maybe we can consolidate this part of the logic in the future.

Patch is 89.79 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/120490.diff

7 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 3751aac4df8ead..823f8aa8c9a7ef 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -94,6 +94,7 @@ class TargetRegisterClass;
class TargetRegisterInfo;
class TargetTransformInfo;
class Value;
+class VPIntrinsic;
namespace Sched {
@@ -3152,6 +3153,47 @@ class TargetLoweringBase {
return false;
}
+ /// Lower an interleaved load to target specific intrinsics. Return
+ /// true on success.
+ ///
+ /// \p Load is a vp.load instruction.
+ /// \p Mask is a mask value
+ /// \p DeinterleaveIntrin is vector.deinterleave intrinsic
+ /// \p Factor is the interleave factor.
+ /// \p DeinterleaveRes is a list of deinterleaved results.
+ virtual bool lowerInterleavedScalableLoad(
+ VPIntrinsic *Load, Value *Mask, IntrinsicInst *DeinterleaveIntrin,
+ unsigned Factor, ArrayRef<Value *> DeinterleaveRes) const {
+ return false;
+ }
+
+ /// Lower an interleaved store to target specific intrinsics. Return
+ /// true on success.
+ ///
+ /// \p Store is the vp.store instruction.
+ /// \p Mask is a mask value
+ /// \p InterleaveIntrin is vector.interleave intrinsic
+ /// \p Factor is the interleave factor.
+ /// \p InterleaveOps is a list of values being interleaved.
+ virtual bool lowerInterleavedScalableStore(
+ VPIntrinsic *Store, Value *Mask, IntrinsicInst *InterleaveIntrin,
+ unsigned Factor, ArrayRef<Value *> InterleaveOps) const {
+ return false;
+ }
+
+ /// Lower a deinterleave intrinsic to a target specific strided load
+ /// intrinsic. Return true on success.
+ ///
+ /// \p StridedLoad is the vp.strided.load instruction.
+ /// \p DI is the deinterleave intrinsic.
+ /// \p Factor is the interleave factor.
+ /// \p DeinterleaveRes is a list of deinterleaved results.
+ virtual bool lowerDeinterleaveIntrinsicToStridedLoad(
+ VPIntrinsic *StridedLoad, IntrinsicInst *DI, unsigned Factor,
+ ArrayRef<Value *> DeinterleaveRes) const {
+ return false;
+ }
+
/// Lower a deinterleave intrinsic to a target specific load intrinsic.
/// Return true on success. Currently only supports
/// llvm.vector.deinterleave2
diff --git a/llvm/lib/CodeGen/InterleavedAccessPass.cpp b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
index 8b6e3180986c30..0f3b65b8d9af2f 100644
--- a/llvm/lib/CodeGen/InterleavedAccessPass.cpp
+++ b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
@@ -60,6 +60,7 @@
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"
#include "llvm/Support/Casting.h"
@@ -248,6 +249,186 @@ static bool isReInterleaveMask(ShuffleVectorInst *SVI, unsigned &Factor,
return false;
}
+// For an (de)interleave tree like this:
+//
+// A C B D
+// |___| |___|
+// |_____|
+// |
+// A B C D
+//
+// We will get ABCD at the end while the leave operands/results
+// are ACBD, which are also what we initially collected in
+// getVectorInterleaveFactor / getVectorDeinterleaveFactor. But TLI
+// hooks (e.g. lowerInterleavedScalableLoad) expect ABCD, so we need
+// to reorder them by interleaving these values.
+static void interleaveLeafValues(SmallVectorImpl<Value *> &Leaves) {
+ unsigned Factor = Leaves.size();
+ assert(isPowerOf2_32(Factor) && Factor <= 8 && Factor > 1);
+
+ if (Factor == 2)
+ return;
+
+ SmallVector<Value *, 8> Buffer;
+ if (Factor == 4) {
+ for (unsigned SrcIdx : {0, 2, 1, 3})
+ Buffer.push_back(Leaves[SrcIdx]);
+ } else {
+ // Factor of 8.
+ //
+ // A E C G B F D H
+ // |_| |_| |_| |_|
+ // |___| |___|
+ // |_______|
+ // |
+ // A B C D E F G H
+ for (unsigned SrcIdx : {0, 4, 2, 6, 1, 5, 3, 7})
+ Buffer.push_back(Leaves[SrcIdx]);
+ }
+
+ llvm::copy(Buffer, Leaves.begin());
+}
+
+static unsigned getVectorInterleaveFactor(IntrinsicInst *II,
+ SmallVectorImpl<Value *> &Operands) {
+ if (II->getIntrinsicID() != Intrinsic::vector_interleave2)
+ return 0;
+
+ unsigned Factor = 0;
+
+ // Visit with BFS
+ SmallVector<IntrinsicInst *, 8> Queue;
+ Queue.push_back(II);
+ while (!Queue.empty()) {
+ IntrinsicInst *Current = Queue.front();
+ Queue.erase(Queue.begin());
+
+ for (unsigned I = 0; I < 2; ++I) {
+ Value *Op = Current->getOperand(I);
+ if (auto *OpII = dyn_cast<IntrinsicInst>(Op))
+ if (OpII->getIntrinsicID() == Intrinsic::vector_interleave2) {
+ Queue.push_back(OpII);
+ continue;
+ }
+
+ ++Factor;
+ Operands.push_back(Op);
+ }
+ }
+
+ // Currently we only recognize power-of-two factors.
+ // FIXME: should we assert here instead?
+ if (Factor > 1 && isPowerOf2_32(Factor)) {
+ interleaveLeafValues(Operands);
+ return Factor;
+ }
+ return 0;
+}
+
+/// Check the interleaved mask
+///
+/// - if a value within the optional is non-nullptr, the value corresponds to
+/// deinterleaved mask
+/// - if a value within the option is nullptr, the value corresponds to all-true
+/// mask
+/// - return nullopt if mask cannot be deinterleaved
+static std::optional<Value *> getMask(Value *WideMask, unsigned Factor) {
+ using namespace llvm::PatternMatch;
+ if (auto *IMI = dyn_cast<IntrinsicInst>(WideMask)) {
+ SmallVector<Value *, 8> Operands;
+ if (unsigned MaskFactor = getVectorInterleaveFactor(IMI, Operands)) {
+ assert(!Operands.empty());
+ if (MaskFactor == Factor &&
+ std::equal(Operands.begin(), Operands.end(), Operands.begin()))
+ return Operands.front();
+ }
+ }
+ if (match(WideMask, m_AllOnes()))
+ return nullptr;
+ return std::nullopt;
+}
+
+static unsigned getVectorDeInterleaveFactor(IntrinsicInst *II,
+ SmallVectorImpl<Value *> &Results) {
+ using namespace PatternMatch;
+ if (II->getIntrinsicID() != Intrinsic::vector_deinterleave2 ||
+ !II->hasNUses(2))
+ return 0;
+
+ unsigned Factor = 0;
+
+ // Visit with BFS
+ SmallVector<IntrinsicInst *, 8> Queue;
+ Queue.push_back(II);
+ while (!Queue.empty()) {
+ IntrinsicInst *Current = Queue.front();
+ Queue.erase(Queue.begin());
+ assert(Current->hasNUses(2));
+
+ unsigned VisitedIdx = 0;
+ for (User *Usr : Current->users()) {
+ // We're playing safe here and matches only the expression
+ // consisting of a perfectly balanced binary tree in which all
+ // intermediate values are only used once.
+ if (!Usr->hasOneUse() || !isa<ExtractValueInst>(Usr))
+ return 0;
+
+ auto *EV = cast<ExtractValueInst>(Usr);
+ ArrayRef<unsigned> Indices = EV->getIndices();
+ if (Indices.size() != 1 || Indices[0] >= 2)
+ return 0;
+
+ // The idea is that we don't want to have two extractvalue
+ // on the same index. So we XOR (index + 1) onto VisitedIdx
+ // such that if there is any duplication, VisitedIdx will be
+ // zero.
+ VisitedIdx ^= Indices[0] + 1;
+ if (!VisitedIdx)
+ return 0;
+ // We have a legal index. At this point we're either going
+ // to continue the traversal or push the leaf values into Results.
+ // But in either cases we need to follow the order imposed by
+ // ExtractValue's indices and swap with the last element pushed
+ // into Queue/Results if necessary (This is also one of the main
+ // reasons using BFS instead of DFS here, btw).
+
+ // When VisitedIdx equals to 0b11, we're the last visted ExtractValue.
+ // So if the current index is 0, we need to swap. Conversely, when
+ // we're either the first visited ExtractValue or the last operand
+ // in Queue/Results is of index 0, there is no need to swap.
+ bool SwapWithLast = VisitedIdx == 0b11 && Indices[0] == 0;
+
+ // Continue the traversal.
+ if (match(EV->user_back(),
+ m_Intrinsic<Intrinsic::vector_deinterleave2>()) &&
+ EV->user_back()->hasNUses(2)) {
+ auto *EVUsr = cast<IntrinsicInst>(EV->user_back());
+ if (SwapWithLast)
+ Queue.insert(Queue.end() - 1, EVUsr);
+ else
+ Queue.push_back(EVUsr);
+ continue;
+ }
+
+ // Save the leaf value.
+ if (SwapWithLast)
+ Results.insert(Results.end() - 1, EV);
+ else
+ Results.push_back(EV);
+
+ ++Factor;
+ }
+ }
+
+ // Currently we only recognize power-of-two factors.
+ // FIXME: should we assert here instead?
+ if (Factor > 1 && isPowerOf2_32(Factor)) {
+ interleaveLeafValues(Results);
+ return Factor;
+ }
+ return 0;
+}
+
bool InterleavedAccessImpl::lowerInterleavedLoad(
LoadInst *LI, SmallVectorImpl<Instruction *> &DeadInsts) {
if (!LI->isSimple() || isa<ScalableVectorType>(LI->getType()))
@@ -480,6 +661,81 @@ bool InterleavedAccessImpl::lowerInterleavedStore(
bool InterleavedAccessImpl::lowerDeinterleaveIntrinsic(
IntrinsicInst *DI, SmallVectorImpl<Instruction *> &DeadInsts) {
+ using namespace PatternMatch;
+ SmallVector<Value *, 8> DeInterleaveResults;
+ unsigned Factor = getVectorDeInterleaveFactor(DI, DeInterleaveResults);
+
+ if (auto *VPLoad = dyn_cast<VPIntrinsic>(DI->getOperand(0));
+ Factor && VPLoad) {
+ if (!match(VPLoad, m_OneUse(m_Intrinsic<Intrinsic::vp_load>())))
+ return false;
+
+ // Check mask operand. Handle both all-true and interleaved mask.
+ Value *WideMask = VPLoad->getOperand(1);
+ std::optional<Value *> Mask = getMask(WideMask, Factor);
+ if (!Mask)
+ return false;
+
+ LLVM_DEBUG(dbgs() << "IA: Found a deinterleave intrinsic: " << *DI << "\n");
+
+ // Since lowerInterleaveLoad expects Shuffles and LoadInst, use special
+ // TLI function to emit target-specific interleaved instruction.
+ if (!TLI->lowerInterleavedScalableLoad(VPLoad, *Mask, DI, Factor,
+ DeInterleaveResults))
+ return false;
+
+ DeadInsts.push_back(DI);
+ DeadInsts.push_back(VPLoad);
+ return true;
+ }
+
+ // Match
+ // %x = vp.strided.load ;; VPStridedLoad
+ // %y = bitcast %x ;; BitCast
+ // %y' = inttoptr %y
+ // %z = deinterleave %y ;; DI
+ if (Factor && isa<BitCastInst, IntToPtrInst>(DI->getOperand(0))) {
+ auto *BitCast = cast<Instruction>(DI->getOperand(0));
+ if (!BitCast->hasOneUse())
+ return false;
+
+ Instruction *IntToPtrCast = nullptr;
+ if (auto *BC = dyn_cast<BitCastInst>(BitCast->getOperand(0))) {
+ IntToPtrCast = BitCast;
+ BitCast = BC;
+ }
+
+ // Match the type is
+ // <VF x (factor * elementTy)> bitcast to <(VF * factor) x elementTy>
+ Value *BitCastSrc = BitCast->getOperand(0);
+ auto *BitCastSrcTy = dyn_cast<VectorType>(BitCastSrc->getType());
+ auto *BitCastDstTy = cast<VectorType>(BitCast->getType());
+ if (!BitCastSrcTy || (BitCastSrcTy->getElementCount() * Factor !=
+ BitCastDstTy->getElementCount()))
+ return false;
+
+ if (auto *VPStridedLoad = dyn_cast<VPIntrinsic>(BitCast->getOperand(0))) {
+ if (VPStridedLoad->getIntrinsicID() !=
+ Intrinsic::experimental_vp_strided_load ||
+ !VPStridedLoad->hasOneUse())
+ return false;
+
+ LLVM_DEBUG(dbgs() << "IA: Found a deinterleave intrinsic: " << *DI
+ << "\n");
+
+ if (!TLI->lowerDeinterleaveIntrinsicToStridedLoad(
+ VPStridedLoad, DI, Factor, DeInterleaveResults))
+ return false;
+
+ DeadInsts.push_back(DI);
+ if (IntToPtrCast)
+ DeadInsts.push_back(IntToPtrCast);
+ DeadInsts.push_back(BitCast);
+ DeadInsts.push_back(VPStridedLoad);
+ return true;
+ }
+ }
+
LoadInst *LI = dyn_cast<LoadInst>(DI->getOperand(0));
if (!LI || !LI->hasOneUse() || !LI->isSimple())
@@ -502,6 +758,33 @@ bool InterleavedAccessImpl::lowerInterleaveIntrinsic(
if (!II->hasOneUse())
return false;
+ if (auto *VPStore = dyn_cast<VPIntrinsic>(*(II->users().begin()))) {
+ if (VPStore->getIntrinsicID() != Intrinsic::vp_store)
+ return false;
+
+ SmallVector<Value *, 8> InterleaveOperands;
+ unsigned Factor = getVectorInterleaveFactor(II, InterleaveOperands);
+ if (!Factor)
+ return false;
+
+ Value *WideMask = VPStore->getOperand(2);
+ std::optional<Value *> Mask = getMask(WideMask, Factor);
+ if (!Mask)
+ return false;
+
+ LLVM_DEBUG(dbgs() << "IA: Found an interleave intrinsic: " << *II << "\n");
+
+ // Since lowerInterleavedStore expects Shuffle and StoreInst, use special
+ // TLI function to emit target-specific interleaved instruction.
+ if (!TLI->lowerInterleavedScalableStore(VPStore, *Mask, II, Factor,
+ InterleaveOperands))
+ return false;
+
+ DeadInsts.push_back(VPStore);
+ DeadInsts.push_back(II);
+ return true;
+ }
+
StoreInst *SI = dyn_cast<StoreInst>(*(II->users().begin()));
if (!SI || !SI->isSimple())
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index b703eb90e8ef30..2dafbf737512a9 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -22190,6 +22190,416 @@ bool RISCVTargetLowering::lowerInterleaveIntrinsicToStore(
return true;
}
+/// Lower an interleaved vp.load into a vlsegN intrinsic.
+///
+/// E.g. Lower an interleaved vp.load (Factor = 2):
+/// %l = call <vscale x 64 x i8> @llvm.vp.load.nxv64i8.p0(ptr %ptr,
+/// %mask,
+/// i32 %wide.rvl)
+/// %dl = tail call { <vscale x 32 x i8>, <vscale x 32 x i8> }
+/// @llvm.vector.deinterleave2.nxv64i8(
+/// <vscale x 64 x i8> %l)
+/// %r0 = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } %dl, 0
+/// %r1 = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } %dl, 1
+///
+/// Into:
+/// %rvl = udiv %wide.rvl, 2
+/// %sl = call { <vscale x 32 x i8>, <vscale x 32 x i8> }
+/// @llvm.riscv.vlseg2.mask.nxv32i8.i64(<vscale x 32 x i8> undef,
+/// <vscale x 32 x i8> undef,
+/// ptr %ptr,
+/// %mask,
+/// i64 %rvl,
+/// i64 1)
+/// %r0 = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } %sl, 0
+/// %r1 = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } %sl, 1
+///
+/// NOTE: the deinterleave2 intrinsic won't be touched and is expected to be
+/// removed by the caller
+bool RISCVTargetLowering::lowerInterleavedScalableLoad(
+ VPIntrinsic *Load, Value *Mask, IntrinsicInst *DeinterleaveIntrin,
+ unsigned Factor, ArrayRef<Value *> DeInterleaveResults) const {
+ assert(Load->getIntrinsicID() == Intrinsic::vp_load &&
+ "Unexpected intrinsic");
+
+ auto *WideVTy = cast<VectorType>(Load->getType());
+ unsigned WideNumElements = WideVTy->getElementCount().getKnownMinValue();
+ assert(WideNumElements % Factor == 0 &&
+ "ElementCount of a wide load must be divisible by interleave factor");
+ auto *VTy =
+ VectorType::get(WideVTy->getScalarType(), WideNumElements / Factor,
+ WideVTy->isScalableTy());
+ // FIXME: Should pass alignment attribute from pointer, but vectorizer needs
+ // to emit it first.
+ auto &DL = Load->getModule()->getDataLayout();
+ Align Alignment = Align(DL.getTypeStoreSize(WideVTy->getScalarType()));
+ if (!isLegalInterleavedAccessType(
+ VTy, Factor, Alignment,
+ Load->getArgOperand(0)->getType()->getPointerAddressSpace(), DL))
+ return false;
+
+ IRBuilder<> Builder(Load);
+ Value *WideEVL = Load->getArgOperand(2);
+ auto *XLenTy = Type::getIntNTy(Load->getContext(), Subtarget.getXLen());
+ Value *EVL = Builder.CreateZExtOrTrunc(
+ Builder.CreateUDiv(WideEVL, ConstantInt::get(WideEVL->getType(), Factor)),
+ XLenTy);
+
+ static const Intrinsic::ID IntrMaskIds[] = {
+ Intrinsic::riscv_vlseg2_mask, Intrinsic::riscv_vlseg3_mask,
+ Intrinsic::riscv_vlseg4_mask, Intrinsic::riscv_vlseg5_mask,
+ Intrinsic::riscv_vlseg6_mask, Intrinsic::riscv_vlseg7_mask,
+ Intrinsic::riscv_vlseg8_mask,
+ };
+ static const Intrinsic::ID IntrIds[] = {
+ Intrinsic::riscv_vlseg2, Intrinsic::riscv_vlseg3, Intrinsic::riscv_vlseg4,
+ Intrinsic::riscv_vlseg5, Intrinsic::riscv_vlseg6, Intrinsic::riscv_vlseg7,
+ Intrinsic::riscv_vlseg8,
+ };
+
+ unsigned SEW = DL.getTypeSizeInBits(VTy->getElementType());
+ unsigned NumElts = VTy->getElementCount().getKnownMinValue();
+ Type *VecTupTy = TargetExtType::get(
+ Load->getContext(), "riscv.vector.tuple",
+ ScalableVectorType::get(Type::getInt8Ty(Load->getContext()),
+ NumElts * SEW / 8),
+ Factor);
+
+ Value *PoisonVal = PoisonValue::get(VecTupTy);
+ SmallVector<Value *> Operands;
+ Operands.append({PoisonVal, Load->getArgOperand(0)});
+
+ Function *VlsegNFunc;
+ if (Mask) {
+ VlsegNFunc = Intrinsic::getOrInsertDeclaration(
+ Load->getModule(), IntrMaskIds[Factor - 2],
+ {VecTupTy, Mask->getType(), EVL->getType()});
+ Operands.push_back(Mask);
+ } else {
+ VlsegNFunc = Intrinsic::getOrInsertDeclaration(
+ Load->getModule(), IntrIds[Factor - 2], {VecTupTy, EVL->getType()});
+ }
+
+ Operands.push_back(EVL);
+
+ // Tail-policy
+ if (Mask)
+ Operands.push_back(ConstantInt::get(XLenTy, 1));
+
+ Operands.push_back(ConstantInt::get(XLenTy, Log2_64(SEW)));
+
+ CallInst *VlsegN = Builder.CreateCall(VlsegNFunc, Operands);
+
+ SmallVector<Type *, 8> AggrTypes{Factor, VTy};
+ Value *Return =
+ PoisonValue::get(StructType::get(Load->getContext(), AggrTypes));
+ Function *VecExtractFunc = Intrinsic::getOrInsertDeclaration(
+ Load->getModule(), Intrinsic::riscv_tuple_extract, {VTy, VecTupTy});
+ for (unsigned i = 0; i < Factor; ++i) {
+ Value *VecExtract =
+ Builder.CreateCall(VecExtractFunc, {VlsegN, Builder.getInt32(i)});
+ Return = Builder.CreateInsertValue(Return, VecExtract, i);
+ }
+
+ for (auto [Idx, DIO] : enumerate(DeInterleaveResults)) {
+ // We have to create a brand new ExtractValue to replace each
+ // of these old ExtractValue instructions.
+ Value *NewEV =
+ Builder.CreateExtractValue(Return, {static_cast<unsigned>(Idx)});
+ DIO->replaceAllUsesWith(NewEV);
+ }
+ DeinterleaveIntrin->replaceAllUsesWith(
+ UndefValue::get(DeinterleaveIntrin->getType()));
+
+ return true;
+}
+
+/// If we're interleaving 2 constant splats, for instance `<vscale x 8 x i32>
+/// <splat of 666>` and `<vscale x 8 x i32> <splat of 777>`, we can create a
+/// larger splat
+/// `<vscale x 4 x i64> <splat of ((777 << 32) | 666)>` first before casting it
+/// into
+/// `<vscale x 8 x i32>`. This will resuling a simple unit stride store rather
+/// than a segment store, which is more expensive in this case.
+static Value *foldInterleaved2OfConstSplats(IntrinsicInst *InterleaveIntrin,
+ VectorType *VTy,
+ const TargetLowering *TLI,
+ Instruction *VPStore) {
+ // We only handle Factor = 2 for now.
+ assert(InterleaveIntrin->arg_size() == 2);
+ auto *SplatVal0 = dyn_cast_or_null<ConstantInt>(
+ getSplatValue(InterleaveIntrin->getArgOperand(0)));
+ auto *SplatVal1 = dyn_cast_or_null<ConstantInt>(
+ getSplatValue(InterleaveIntrin->getArgOperand(1)));
+ if (!SplatVal0 || !SplatVal1)
+ return nullptr;
+
+ auto &Ctx = VPStore->getContext();
+ auto &DL = VPStore->getModule()->getDataLayout();
+
+ auto *NewVTy = VectorType::getExtendedElementVectorType(VTy);
+ if (!TLI->isTypeLegal(TLI->getValueType(DL, NewVTy)))
+ return nullptr;
+
+ // InterleavedAccessPass will remove VPStore after this but we still want to
+ // preserve it, hence clone another one here.
+ auto *ClonedVPStore = VPStore->clone();
+ ClonedVPStore->insertBefore(VPStore);
+ IRBuilder<> Builder(ClonedVPStore);
+
+ ...
[truncated]
You can test this locally with the following command:

git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' bc74a1edbe5e6a3603e65efe06116fa72747acab 725dfad46b81ec4d1b336fb64c9d211ce825bf07 llvm/test/CodeGen/RISCV/rvv/vp-vector-interleaved-access.ll llvm/include/llvm/CodeGen/TargetLowering.h llvm/lib/CodeGen/InterleavedAccessPass.cpp llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVISelLowering.h

The following files introduce new uses of undef:

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. In tests, avoid using undef.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.
; RV32-NEXT: li a1, 777
; RV32-NEXT: li a2, 666
; RV32-NEXT: sw a2, 8(sp)
; RV32-NEXT: sw a1, 12(sp)
; RV32-NEXT: addi a1, sp, 8
; RV32-NEXT: vsetvli a2, zero, e64, m8, ta, ma
; RV32-NEXT: vlse64.v v8, (a1), zero
Will it be better to use vmerge here?
I'm not sure how vmerge helps. Can you elaborate?
li a1, 666
vsetvli a2, zero, e32, m8
vmv.v.x v8, a1
vmerge.vim v8, v8, 777, <mask of 01010101....>
// to reorder them by interleaving these values.
static void interleaveLeafValues(SmallVectorImpl<Value *> &Leaves) {
  unsigned Factor = Leaves.size();
  assert(isPowerOf2_32(Factor) && Factor <= 8 && Factor > 1);
Hi @mshockwave
I have done a simpler approach to support interleaving of factor 4 by using pattern match.
In my opinion, I think it will be better to do either the simple approach to support factor 8, or your current approach but with support to all power_of_2 factors.
I think the BFS approach is better for all power_of_2 factors, while the simple approach (pattern match) is better for only limited factors like 4 or 8.
And I see that you already did the heavy work, so making it generic for all power_of_2 factors would be better.
That is my opinion :'D
or your current approach but with support to all power_of_2 factors.
so making it generic for all power_of_2 factors would be better.
Oh yeah absolutely, I added Factor <= 8
here simply because RISC-V (and I believe AArch64?) support at most 8. But that said, I'm happy to relax this restriction, in which case I literally only need to remove this condition and it should support all power-of-two out of the box.
Thanks for working on this, this will be important for EVL tail folding. I noticed though that we don't support factor > 2 [de]interleaves for non-VP scalable loads and stores on RISC-V yet. It seems strange that we would support it for VP ops but not for regular ops, could this PR be reworked to maybe add support for the latter first? I.e. consolidating the target-independent BFS here + the AArch64 TLI pattern matching.
// A B C D E F G H
for (unsigned SrcIdx : {0, 4, 2, 6, 1, 5, 3, 7})
  Buffer.push_back(Leaves[SrcIdx]);
}
Based on our conversation above, I think you will need to change this ordering to be generic for all pow_of_2 factors.
Yes, it has been generalized now.
// getVectorInterleaveFactor / getVectorDeinterleaveFactor. But TLI
// hooks (e.g. lowerInterleavedScalableLoad) expect ABCD, so we need
// to reorder them by interleaving these values.
static void interleaveLeafValues(SmallVectorImpl<Value *> &Leaves) {
const reference for Leaves?
Leaves (or SubLeaves as I just pushed another batch of changes) is both the input and the output container.
Oops. Sorry about that. I forget that std::copy has the destination at the end.
// to reorder them by interleaving these values.
static void interleaveLeafValues(SmallVectorImpl<Value *> &Leaves) {
  unsigned Factor = Leaves.size();
  assert(isPowerOf2_32(Factor) && Factor <= 8 && Factor > 1);
what guarantees less than or equal 8?
I just generalized interleaveLeafValues so it now handles all power-of-two cases.
// into Queue/Results if necessary (This is also one of the main
// reasons using BFS instead of DFS here, btw).

// When VisitedIdx equals to 0b11, we're the last visted ExtractValue.
visited*
I'm no longer using VisitedIdx in this version.
I just stacked this PR on top of #123106 so that we now only need to generate the masked version of the segmented load/store.
I'm going to circle back to this later. I have addressed the rest of your comments though.
I have split the generalized power-of-two (de)interleave support for non-vp load/store into #123863. I have also updated this PR to reflect the change. Together, both AArch64 and RISC-V now share the same factor-calculation logic, which is capable of analyzing any power-of-two (de)interleave intrinsics.
Teach InterleavedAccessPass to recognize the following patterns:
- vp.store an interleaved scalable vector
- Deinterleaving a scalable vector loaded from vp.load

Upon recognizing these patterns, IA will collect the interleaved / deinterleaved operands and delegate them over to their respective newly-added TLI hooks. For RISC-V, these patterns are lowered into segmented loads/stores (except when we're interleaving constant splats, in which case a unit-stride store will be generated).

Right now we only recognize power-of-two (de)interleave cases, in which (de)interleave4/8 are synthesized from a tree of (de)interleave2.

Co-authored-by: Nikolay Panchenko <[email protected]>
if (!TLI->lowerInterleavedScalableLoad(VPLoad, *Mask, DI,
                                       DeinterleaveValues))
Is there anything preventing this from being called with a fixed length vector? Or why do we need to restrict it to scalable vectors in the first place?
It looks like RISCVTargetLowering::lowerDeinterleaveIntrinsicToLoad already handles fixed length vectors.
I don't think we need to handle fixed length vectors in this PR necessarily, we could just check for them in the TII hook and return false.
The vectorizer shouldn't generate deinterleave/interleave intrinsics for fixed vectors, but maybe we should guard against handwritten IR?
Done. Now RISCV's TLI implementation will check whether it's a scalable vector or not.
/// \p DeinterleaveIntrin is vector.deinterleave intrinsic
/// \p DeinterleaveRes is a list of deinterleaved results.
virtual bool
lowerInterleavedScalableLoad(VPIntrinsic *Load, Value *Mask,
If this doesn't have to be strictly restricted to scalable vectors, would it make sense to rename this to something like lowerDeinterleaveIntrinsicToVPLoad to more closely match the lowerDeinterleaveIntrinsicToLoad hook?
Yeah, lowerDeinterleaveIntrinsicToVPLoad is more consistent with the existing naming. I've updated it.
I also updated the list of arguments taken by lowerDeinterleaveIntrinsicToVPLoad / lowerInterleaveIntrinsicToVPStore and they no longer take the (de)interleaved intrinsic argument (because we're not really using it)
- Change what getMask returns
- Change the name of TLI hooks
- Limit the TLI hooks to scalable vectors only for now
And add negative test case for fixed vectors.
  }
}
if (match(WideMask, m_AllOnes()))
  return WideMask;
This returns a different sized mask from the interleaved case, but it looks like the interleave intrinsics don't currently check the mask type which is why this works?
E.g. I tried out
%0 = tail call target("riscv.vector.tuple", <vscale x 1 x i8>, 2) @llvm.riscv.vlseg2.mask.triscv.vector.tuple_nxv1i8_2t.nxv2i1(target("riscv.vector.tuple", <vscale x 1 x i8>, 2) undef, ptr %base, <vscale x 4 x i1> %mask, i64 %vl, i64 1, i64 3)
And it seems to compile. I think we're missing a constraint on the intrinsics.
I think we need to return a new constant splat vector with the segment type here
Oh I see, it's because after the tuple types in #97992 we don't have another vector to attach a LLVMScalarOrSameVectorWidth type to. I'm not sure if there's an easy fix for this cc) @4vtomat
A potential solution could be allowing us to supply a custom code fragment to LLVMScalarOrSameVectorWidth to retrieve the vector type we want to match against. For instance, if the first overloading type is a riscv.vector.tuple, we can use this to construct the mask type:

LLVMScalarOrSameVectorWidth<0, [{ return cast<TargetExtType>(Ty)->getTypeParameter(0); }], llvm_i1_ty>;

where Ty is an implicit Type * variable representing the original intrinsic parameter type.
That sounds good. I think that's out of the scope for this PR though, if we can just manually fix WideMask to return the segment type here I think that's fine.
if we can just manually fix WideMask to return the segment type here I think that's fine.
Yeah, I just added that, along with checking if the (non-all-ones) mask has the desired vector length nvm, we don't need such check.
if (isPowerOf2_32(N)) {
  KnownBits KB = llvm::computeKnownBits(V, DL);
  return KB.countMinTrailingZeros() >= Log2_32(N);
} else {
No else after return.
Fixed.
// FIXME: Should pass alignment attribute from pointer, but vectorizer needs
// to emit it first.
auto &DL = Load->getModule()->getDataLayout();
Align Alignment = Align(DL.getTypeStoreSize(WideVTy->getScalarType()));
The loop vectorizer doesn't emit VP interleaved loads and stores currently right? Should we proactively use Load->getParamAlign(0).value_or(DL->getABITypeAlign(WideVTy->getElementType()))
Good point, it's fixed now.
Value *PoisonVal = PoisonValue::get(VecTupTy);
SmallVector<Value *> Operands{PoisonVal, Load->getArgOperand(0)};
Nit, move this down beside the rest of the Operands.push_backs?
Done. I use a normal array instead now.
  Return = Builder.CreateInsertValue(Return, VecExtract, i);
}

for (auto [Idx, DIO] : enumerate(DeInterleaveResults)) {
Nit
- for (auto [Idx, DIO] : enumerate(DeInterleaveResults)) {
+ for (auto [Idx, DIO] : enumerate(DeinterleaveResults)) {
Fixed.
if (OrigEC.getKnownMinValue() % Factor)
  return nullptr;
Is this an invariant? Can this be made into an assert?
Yes, they're invariants, because getVector(De)InterleaveFactor plus the vp.load/store signature make sure it'll be a multiple of Factor. It's fixed now.
LGTM with nits, thanks
…0490)

Teach InterleavedAccessPass to recognize the following patterns:
- vp.store an interleaved scalable vector
- Deinterleaving a scalable vector loaded from vp.load

Upon recognizing these patterns, IA will collect the interleaved / deinterleaved operands and delegate them over to their respective newly-added TLI hooks. For RISC-V, these patterns are lowered into segmented loads/stores.

Right now we only recognize power-of-two (de)interleave cases, in which (de)interleave4/8 are synthesized from a tree of (de)interleave2.

---------

Co-authored-by: Nikolay Panchenko <[email protected]>
Teach InterleavedAccessPass to recognize the following patterns:
- vp.store an interleaved scalable vector
- Deinterleaving a scalable vector loaded from vp.load

Upon recognizing these patterns, IA will collect the interleaved / deinterleaved operands and delegate them over to their respective newly-added TLI hooks.

For RISC-V, these patterns are lowered into segmented loads/stores (except when we're interleaving constant splats, in which case a unit-stride store will be generated).

Right now we only recognize power-of-two (de)interleave cases, in which (de)interleave4/8 are synthesized from a tree of (de)interleave2.
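For reference, below is a minimal IR sketch of the kind of input the pass now recognizes: a factor-4 deinterleave built as a tree of deinterleave2, fed by a vp.load with an all-true mask. The function name, value names, element types, and the @use sink are illustrative only and are not taken from the patch or its tests; on RISC-V such a pattern would be lowered to a vlseg4 whose EVL is the wide EVL divided by the factor.

declare <vscale x 16 x i32> @llvm.vp.load.nxv16i32.p0(ptr, <vscale x 16 x i1>, i32)
declare { <vscale x 8 x i32>, <vscale x 8 x i32> } @llvm.vector.deinterleave2.nxv16i32(<vscale x 16 x i32>)
declare { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32>)
declare void @use(<vscale x 4 x i32>)

define void @deinterleave4_sketch(ptr %p, i32 %evl) {
  ; Wide load of four interleaved fields: a0 b0 c0 d0 a1 b1 c1 d1 ...
  ; The all-true mask is recognized by getMask.
  %wide = call <vscale x 16 x i32> @llvm.vp.load.nxv16i32.p0(ptr %p, <vscale x 16 x i1> splat (i1 true), i32 %evl)
  ; First level splits even/odd lanes: {a0 c0 a1 c1 ...} and {b0 d0 b1 d1 ...}
  %l0 = call { <vscale x 8 x i32>, <vscale x 8 x i32> } @llvm.vector.deinterleave2.nxv16i32(<vscale x 16 x i32> %wide)
  %ac = extractvalue { <vscale x 8 x i32>, <vscale x 8 x i32> } %l0, 0
  %bd = extractvalue { <vscale x 8 x i32>, <vscale x 8 x i32> } %l0, 1
  ; Second level yields the four fields. The leaves are collected as A, C, B, D,
  ; and interleaveLeafValues reorders them to A, B, C, D before the TLI hook is called.
  %l1 = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %ac)
  %l2 = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> %bd)
  %a = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %l1, 0
  %c = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %l1, 1
  %b = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %l2, 0
  %d = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } %l2, 1
  ; Every extractvalue has exactly one use and each deinterleave2 has exactly two
  ; extractvalue users, matching what getVectorDeInterleaveFactor expects.
  call void @use(<vscale x 4 x i32> %a)
  call void @use(<vscale x 4 x i32> %b)
  call void @use(<vscale x 4 x i32> %c)
  call void @use(<vscale x 4 x i32> %d)
  ret void
}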