Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626

Sergio0694 · 2020-04-29T20:32:31Z

Overview

This issue is about adding a new API to System.Numerics.Vector:

namespace System.Numerics
{
    public static partial class Vector
    {
        public static T Sum<T>(Vector<T> vector) where T : struct;
    }
}

This API would perform the (unchecked) sum of all T items in the given Vector<T> register.

Motivation

Today, the best way to do a horizontal sum of items in a Vector<T> register is to use Vector.Dot with a unit vestor (Vector<T>.One), this has a number of disadvantages:

It is not intuitive for users (imagine looking for some "sum" methods and not seeing it). Also, many devs might not be familiar with how the dot product works exactly, or wouldn't think of this solution immediately when just going through the IntelliSense results.
Using Vector.Dot requires occupying a second SIMD register that's not actually needed if we only care about doing a horizontal sum. Not having to use that second register at all will give the JIT more room to do other optimizations (and also avoid having to load that register at all).
Using Vector.Dot also adds the unnecessary multiplication instruction, which we don't need.

This new Vector.Sum API would basically compile to the same overall code produced by Vector.Dot, mostly just skipping the initial multiplication and only taking one register as input.

cc. @tannergooding

The text was updated successfully, but these errors were encountered:

ghost · 2020-04-29T20:32:35Z

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

tannergooding · 2020-04-29T20:40:53Z

CC. @pgovind

tannergooding · 2021-01-15T20:43:55Z

An alternative name which matches the HWIntrinsic APIs would be: AddAcross

gfoidl · 2021-01-15T20:52:40Z

AddAcross

Would SumAcross match more? As it results in the sum across the elements of the vector.
I see that AddAcross is used by AdvSimd, but this name makes me think "add across simd-lanes".

At least we should avoid "HorizontalAdd" as this is a similar but different operation in x86 instrinsics.

terrajobst · 2021-01-28T18:12:21Z

Video

Looks good as proposed

namespace System.Numerics
{
    public static partial class Vector
    {
        public static T Sum<T>(Vector<T> vector) where T : struct;
    }
}

tannergooding · 2021-01-28T18:52:30Z

The software implementation of this is fairly trivial, but the hardware acceleration will be slightly more involved.

Anyone wanting to pick this up is free to just implement the software side and I can open a separate issue tracking acceleration.
Likewise, I am happy to help walk anyone through how to add the appropriate hardware acceleration (it involves some work in C++).

zlatanov · 2021-04-28T22:29:45Z

@tannergooding Can I take this? If so, please assign it to me :)

tannergooding · 2021-05-03T23:38:24Z

@zlatanov sure thing. I've assigned it out.

tannergooding · 2021-05-28T14:42:47Z

@zlatanov, are you still working on this? If so, do you have an ETA on about how much work is left? -- If you aren't working on it, or don't have an ETA, that is also fine. I'm happy to either pick this up myself or wait

We are getting closer to when we'll need to start locking down on what can make it into .NET 6 and this is one I'd like to finish pushing through if possible 😄

zlatanov · 2021-05-28T14:52:45Z

@tannergooding yes, but I didn't have much time the last couple of weeks. I've written the software implementation and started looking around how to add the JIT related (c++) code. I found that probably the implementation should be added here simdashwintrinsiclistxarch.h and simdashwintrinsiclistarm64.h and have something like:

SIMD_AS_HWINTRINSIC_ID(VectorT128,  Sum,                                                    1,         {NI_VectorT128_Sum,                             NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum},                             SimdAsHWIntrinsicFlag::None)

SIMD_AS_HWINTRINSIC_ID(VectorT256,  Sum,                                                    1,         {NI_VectorT256_Sum,                         NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum},                         SimdAsHWIntrinsicFlag::None)

and a switch handling inside impSimdAsHWIntrinsicSpecial.

I tried to debug it but it seems that something is missing or I am doing it wrong, because the vector doesn't use the intrinsic function, only the software implementation that I have in place.

I didn't want to bother you and wanted to see if I can figure it myself, but I got busy and here we are :)

Can you help me understand what I am doing wrong? I would love to be able to finish this.

zlatanov · 2021-05-28T17:22:49Z

@tannergooding after adding a few traces to see how intrinsics are resolved by the JIT, I understood why what I wrote wasn't working. Basically my unit tests were wrong.

I had the following code in the tests:

private static object[] TestSum<T>(Func<T[], T> expected) where T : struct
{
    Action test = () =>
    {
        T[] values = GenerateRandomValuesForVector<T>();
        Vector<T> vector = new(values);

        T sum = Vector.Sum(vector);
        Assert.Equal(expected(values), sum);
    };

    return new object[] { typeof(T), test };
}

[Fact]
public void SumInt32() => TestSum<int>(x => x.Aggregate((a, b) => a + b));

[Fact]
public void SumInt64() => TestSum<long>(x => x.Aggregate((a, b) => a + b));

// etc

However the generic use of the Vector in the TestSum method wasn't being considered for hardware acceleration. Can you provide insights why this is so?

tannergooding · 2021-05-28T17:45:09Z

You should ensure that:

The method in Vector.cs is marked [Intrinsic]
The relevant table entry exists in https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsiclistxarch.h and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsiclistarm64.h
Noting that xarch has both a VectorT128 and VectorT256 table, entries need to exist in both
Noting that types which won't be accelerated, ever, should be NI_Illegal
Noting that types which are specially accelerated (either conditionally or via more than one instruction) should be the same as the ID (that is, NI_VectorT128_Sum and NI_VectorT256_Sum, respectively)

Once that is done, then I'd expect your test to hit this line: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L172 and then pass all the checks to reach: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L275. That should then continue down to case 1 (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L616) where it should switch on the intrinsic (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L645) and you can add your custom logic for acceleration.

zlatanov · 2021-05-31T16:34:49Z

@tannergooding Can you look at this and tell me what I am doing wrong, please:

case NI_VectorT256_Sum:
{
    // HorizontalAdd combines pairs so we need log2(vectorLength) passes to sum all elements together.
    unsigned vectorLength = getSIMDVectorLength(simdSize, simdBaseType);
    int      haddCount    = genLog2(vectorLength);

    GenTree* tmp1 = op1;
    GenTree* tmp2;

    for (int i = 0; i < haddCount; i++)
    {
        tmp1 = impCloneExpr(tmp1, &tmp2, clsHnd, (unsigned)CHECK_SPILL_ALL,
                            nullptr DEBUGARG("Clone op1 for Vector<T>.Sum"));
        tmp1 = gtNewSimdAsHWIntrinsicNode(simdType, tmp1, tmp2, NI_AVX2_HorizontalAdd, simdBaseJitType,
                                            simdSize);
    }

    tmp1 = gtNewSimdAsHWIntrinsicNode(TYP_SIMD16, tmp1, gtNewIconNode(0x01, TYP_INT),
                                        NI_AVX_ExtractVector128, simdBaseJitType, simdSize);

    return gtNewSimdAsHWIntrinsicNode(retType, tmp1, NI_Vector128_ToScalar, simdBaseJitType, 16);
}

I am doing something incorrect in the for loop and I am unsure how to take a asm dump for the jitted code of my tests.

tannergooding · 2021-05-31T16:39:38Z

Something incorrect as in you are getting the wrong answer?

My guess is that's the case and you aren't accounting that most V256 operations operating not on 1x256-bit vector, but rather 2x128-bit vectors

and so the last operation (of the log2(vectorLength) ops) needs to be a (GetLower() + GetUpper()).ToScalar()

Dotnet-GitSync-Bot added area-System.Numerics untriaged New issue has not been triaged by the area owner labels Apr 29, 2020

tannergooding added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Apr 29, 2020

tannergooding added this to the Future milestone Jun 23, 2020

tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Jan 15, 2021

tannergooding removed this from the Future milestone Jan 28, 2021

terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 28, 2021

tannergooding added the help wanted [up-for-grabs] Good issue for external contributors label Jan 28, 2021

zlatanov mentioned this issue Apr 28, 2021

WebSocket Compression #49304

Merged

tannergooding assigned zlatanov May 3, 2021

zlatanov mentioned this issue Jun 1, 2021

Vector.Sum(Vector<T>) API implementation for horizontal add. #53527

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 1, 2021

tannergooding closed this as completed in #53527 Jun 11, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 11, 2021

ghost locked as resolved and limited conversation to collaborators Jul 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626

Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626

Sergio0694 commented Apr 29, 2020

ghost commented Apr 29, 2020

tannergooding commented Apr 29, 2020

tannergooding commented Jan 15, 2021

gfoidl commented Jan 15, 2021

terrajobst commented Jan 28, 2021 •

edited by dotnet-api-review bot

Loading

tannergooding commented Jan 28, 2021

zlatanov commented Apr 28, 2021

tannergooding commented May 3, 2021

tannergooding commented May 28, 2021 •

edited

Loading

zlatanov commented May 28, 2021

zlatanov commented May 28, 2021

tannergooding commented May 28, 2021

zlatanov commented May 31, 2021

tannergooding commented May 31, 2021 •

edited

Loading

Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626

Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626

Comments

Sergio0694 commented Apr 29, 2020

Overview

Motivation

ghost commented Apr 29, 2020

tannergooding commented Apr 29, 2020

tannergooding commented Jan 15, 2021

gfoidl commented Jan 15, 2021

terrajobst commented Jan 28, 2021 • edited by dotnet-api-review bot Loading

tannergooding commented Jan 28, 2021

zlatanov commented Apr 28, 2021

tannergooding commented May 3, 2021

tannergooding commented May 28, 2021 • edited Loading

zlatanov commented May 28, 2021

zlatanov commented May 28, 2021

tannergooding commented May 28, 2021

zlatanov commented May 31, 2021

tannergooding commented May 31, 2021 • edited Loading

terrajobst commented Jan 28, 2021 •

edited by dotnet-api-review bot

Loading

tannergooding commented May 28, 2021 •

edited

Loading

tannergooding commented May 31, 2021 •

edited

Loading