-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626
Comments
Tagging subscribers to this area: @tannergooding |
CC. @pgovind |
An alternative name which matches the HWIntrinsic APIs would be: |
Would At least we should avoid "HorizontalAdd" as this is a similar but different operation in x86 instrinsics. |
namespace System.Numerics
{
public static partial class Vector
{
public static T Sum<T>(Vector<T> vector) where T : struct;
}
} |
The software implementation of this is fairly trivial, but the hardware acceleration will be slightly more involved. Anyone wanting to pick this up is free to just implement the software side and I can open a separate issue tracking acceleration. |
@tannergooding Can I take this? If so, please assign it to me :) |
@zlatanov sure thing. I've assigned it out. |
@zlatanov, are you still working on this? If so, do you have an ETA on about how much work is left? -- If you aren't working on it, or don't have an ETA, that is also fine. I'm happy to either pick this up myself or wait We are getting closer to when we'll need to start locking down on what can make it into .NET 6 and this is one I'd like to finish pushing through if possible 😄 |
@tannergooding yes, but I didn't have much time the last couple of weeks. I've written the software implementation and started looking around how to add the JIT related (c++) code. I found that probably the implementation should be added here SIMD_AS_HWINTRINSIC_ID(VectorT128, Sum, 1, {NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum, NI_VectorT128_Sum}, SimdAsHWIntrinsicFlag::None)
SIMD_AS_HWINTRINSIC_ID(VectorT256, Sum, 1, {NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum, NI_VectorT256_Sum}, SimdAsHWIntrinsicFlag::None)
and a switch handling inside I tried to debug it but it seems that something is missing or I am doing it wrong, because the vector doesn't use the intrinsic function, only the software implementation that I have in place. I didn't want to bother you and wanted to see if I can figure it myself, but I got busy and here we are :) Can you help me understand what I am doing wrong? I would love to be able to finish this. |
@tannergooding after adding a few traces to see how intrinsics are resolved by the JIT, I understood why what I wrote wasn't working. Basically my unit tests were wrong. I had the following code in the tests: private static object[] TestSum<T>(Func<T[], T> expected) where T : struct
{
Action test = () =>
{
T[] values = GenerateRandomValuesForVector<T>();
Vector<T> vector = new(values);
T sum = Vector.Sum(vector);
Assert.Equal(expected(values), sum);
};
return new object[] { typeof(T), test };
}
[Fact]
public void SumInt32() => TestSum<int>(x => x.Aggregate((a, b) => a + b));
[Fact]
public void SumInt64() => TestSum<long>(x => x.Aggregate((a, b) => a + b));
// etc However the generic use of the Vector in the |
You should ensure that:
Once that is done, then I'd expect your test to hit this line: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L172 and then pass all the checks to reach: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L275. That should then continue down to |
@tannergooding Can you look at this and tell me what I am doing wrong, please: case NI_VectorT256_Sum:
{
// HorizontalAdd combines pairs so we need log2(vectorLength) passes to sum all elements together.
unsigned vectorLength = getSIMDVectorLength(simdSize, simdBaseType);
int haddCount = genLog2(vectorLength);
GenTree* tmp1 = op1;
GenTree* tmp2;
for (int i = 0; i < haddCount; i++)
{
tmp1 = impCloneExpr(tmp1, &tmp2, clsHnd, (unsigned)CHECK_SPILL_ALL,
nullptr DEBUGARG("Clone op1 for Vector<T>.Sum"));
tmp1 = gtNewSimdAsHWIntrinsicNode(simdType, tmp1, tmp2, NI_AVX2_HorizontalAdd, simdBaseJitType,
simdSize);
}
tmp1 = gtNewSimdAsHWIntrinsicNode(TYP_SIMD16, tmp1, gtNewIconNode(0x01, TYP_INT),
NI_AVX_ExtractVector128, simdBaseJitType, simdSize);
return gtNewSimdAsHWIntrinsicNode(retType, tmp1, NI_Vector128_ToScalar, simdBaseJitType, 16);
} I am doing something incorrect in the for loop and I am unsure how to take a asm dump for the jitted code of my tests. |
Something incorrect as in you are getting the wrong answer? My guess is that's the case and you aren't accounting that most V256 operations operating not on 1x256-bit vector, but rather 2x128-bit vectors and so the last operation (of the |
Overview
This issue is about adding a new API to
System.Numerics.Vector
:This API would perform the (unchecked) sum of all
T
items in the givenVector<T>
register.Motivation
Today, the best way to do a horizontal sum of items in a
Vector<T>
register is to useVector.Dot
with a unit vestor (Vector<T>.One
), this has a number of disadvantages:Vector.Dot
requires occupying a second SIMD register that's not actually needed if we only care about doing a horizontal sum. Not having to use that second register at all will give the JIT more room to do other optimizations (and also avoid having to load that register at all).Vector.Dot
also adds the unnecessary multiplication instruction, which we don't need.This new
Vector.Sum
API would basically compile to the same overall code produced byVector.Dot
, mostly just skipping the initial multiplication and only taking one register as input.cc. @tannergooding
The text was updated successfully, but these errors were encountered: