Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626

Closed
Sergio0694 opened this issue Apr 29, 2020 · 14 comments · Fixed by #53527
Closed

Proposal: Vector.Sum(Vector<T>) API for horizontal add #35626

Sergio0694 opened this issue Apr 29, 2020 · 14 comments · Fixed by #53527
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Numerics help wanted [up-for-grabs] Good issue for external contributors

Comments

@Sergio0694
Copy link
Contributor

Overview

This issue is about adding a new API to System.Numerics.Vector:

namespace System.Numerics
{
    public static partial class Vector
    {
        public static T Sum<T>(Vector<T> vector) where T : struct;
    }
}

This API would perform the (unchecked) sum of all T items in the given Vector<T> register.

Motivation

Today, the best way to do a horizontal sum of items in a Vector<T> register is to use Vector.Dot with a unit vestor (Vector<T>.One), this has a number of disadvantages:

  • It is not intuitive for users (imagine looking for some "sum" methods and not seeing it). Also, many devs might not be familiar with how the dot product works exactly, or wouldn't think of this solution immediately when just going through the IntelliSense results.
  • Using Vector.Dot requires occupying a second SIMD register that's not actually needed if we only care about doing a horizontal sum. Not having to use that second register at all will give the JIT more room to do other optimizations (and also avoid having to load that register at all).
  • Using Vector.Dot also adds the unnecessary multiplication instruction, which we don't need.

This new Vector.Sum API would basically compile to the same overall code produced by Vector.Dot, mostly just skipping the initial multiplication and only taking one register as input.

cc. @tannergooding

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Numerics untriaged New issue has not been triaged by the area owner labels Apr 29, 2020
@ghost
Copy link

ghost commented Apr 29, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@tannergooding tannergooding added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Apr 29, 2020
@tannergooding
Copy link
Member

CC. @pgovind

@tannergooding tannergooding added this to the Future milestone Jun 23, 2020
@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Jan 15, 2021
@tannergooding
Copy link
Member

An alternative name which matches the HWIntrinsic APIs would be: AddAcross

@gfoidl
Copy link
Member

gfoidl commented Jan 15, 2021

AddAcross

Would SumAcross match more? As it results in the sum across the elements of the vector.
I see that AddAcross is used by AdvSimd, but this name makes me think "add across simd-lanes".

At least we should avoid "HorizontalAdd" as this is a similar but different operation in x86 instrinsics.

@tannergooding tannergooding removed this from the Future milestone Jan 28, 2021
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jan 28, 2021
@terrajobst
Copy link
Contributor

terrajobst commented Jan 28, 2021

Video

  • Looks good as proposed
namespace System.Numerics
{
    public static partial class Vector
    {
        public static T Sum<T>(Vector<T> vector) where T : struct;
    }
}

@tannergooding tannergooding added the help wanted [up-for-grabs] Good issue for external contributors label Jan 28, 2021
@tannergooding
Copy link
Member

The software implementation of this is fairly trivial, but the hardware acceleration will be slightly more involved.

Anyone wanting to pick this up is free to just implement the software side and I can open a separate issue tracking acceleration.
Likewise, I am happy to help walk anyone through how to add the appropriate hardware acceleration (it involves some work in C++).

@zlatanov
Copy link
Contributor

@tannergooding Can I take this? If so, please assign it to me :)

@tannergooding
Copy link
Member

@zlatanov sure thing. I've assigned it out.

@tannergooding
Copy link
Member

tannergooding commented May 28, 2021

@zlatanov, are you still working on this? If so, do you have an ETA on about how much work is left? -- If you aren't working on it, or don't have an ETA, that is also fine. I'm happy to either pick this up myself or wait

We are getting closer to when we'll need to start locking down on what can make it into .NET 6 and this is one I'd like to finish pushing through if possible 😄

@zlatanov
Copy link
Contributor

@tannergooding yes, but I didn't have much time the last couple of weeks. I've written the software implementation and started looking around how to add the JIT related (c++) code. I found that probably the implementation should be added here simdashwintrinsiclistxarch.h and simdashwintrinsiclistarm64.h and have something like:

SIMD_AS_HWINTRINSIC_ID(VectorT128,  Sum,                                                    1,         {NI_VectorT128_Sum,                             NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum,                              NI_VectorT128_Sum},                             SimdAsHWIntrinsicFlag::None)

SIMD_AS_HWINTRINSIC_ID(VectorT256,  Sum,                                                    1,         {NI_VectorT256_Sum,                         NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum,                          NI_VectorT256_Sum},                         SimdAsHWIntrinsicFlag::None)

and a switch handling inside impSimdAsHWIntrinsicSpecial.

I tried to debug it but it seems that something is missing or I am doing it wrong, because the vector doesn't use the intrinsic function, only the software implementation that I have in place.

I didn't want to bother you and wanted to see if I can figure it myself, but I got busy and here we are :)

Can you help me understand what I am doing wrong? I would love to be able to finish this.

@zlatanov
Copy link
Contributor

@tannergooding after adding a few traces to see how intrinsics are resolved by the JIT, I understood why what I wrote wasn't working. Basically my unit tests were wrong.

I had the following code in the tests:

private static object[] TestSum<T>(Func<T[], T> expected) where T : struct
{
    Action test = () =>
    {
        T[] values = GenerateRandomValuesForVector<T>();
        Vector<T> vector = new(values);

        T sum = Vector.Sum(vector);
        Assert.Equal(expected(values), sum);
    };

    return new object[] { typeof(T), test };
}

[Fact]
public void SumInt32() => TestSum<int>(x => x.Aggregate((a, b) => a + b));

[Fact]
public void SumInt64() => TestSum<long>(x => x.Aggregate((a, b) => a + b));

// etc

However the generic use of the Vector in the TestSum method wasn't being considered for hardware acceleration. Can you provide insights why this is so?

@tannergooding
Copy link
Member

You should ensure that:

  1. The method in Vector.cs is marked [Intrinsic]
  2. The relevant table entry exists in https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsiclistxarch.h and https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsiclistarm64.h
  3. Noting that xarch has both a VectorT128 and VectorT256 table, entries need to exist in both
  4. Noting that types which won't be accelerated, ever, should be NI_Illegal
  5. Noting that types which are specially accelerated (either conditionally or via more than one instruction) should be the same as the ID (that is, NI_VectorT128_Sum and NI_VectorT256_Sum, respectively)

Once that is done, then I'd expect your test to hit this line: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L172 and then pass all the checks to reach: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L275. That should then continue down to case 1 (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L616) where it should switch on the intrinsic (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L645) and you can add your custom logic for acceleration.

@zlatanov
Copy link
Contributor

@tannergooding Can you look at this and tell me what I am doing wrong, please:

case NI_VectorT256_Sum:
{
    // HorizontalAdd combines pairs so we need log2(vectorLength) passes to sum all elements together.
    unsigned vectorLength = getSIMDVectorLength(simdSize, simdBaseType);
    int      haddCount    = genLog2(vectorLength);

    GenTree* tmp1 = op1;
    GenTree* tmp2;

    for (int i = 0; i < haddCount; i++)
    {
        tmp1 = impCloneExpr(tmp1, &tmp2, clsHnd, (unsigned)CHECK_SPILL_ALL,
                            nullptr DEBUGARG("Clone op1 for Vector<T>.Sum"));
        tmp1 = gtNewSimdAsHWIntrinsicNode(simdType, tmp1, tmp2, NI_AVX2_HorizontalAdd, simdBaseJitType,
                                            simdSize);
    }

    tmp1 = gtNewSimdAsHWIntrinsicNode(TYP_SIMD16, tmp1, gtNewIconNode(0x01, TYP_INT),
                                        NI_AVX_ExtractVector128, simdBaseJitType, simdSize);

    return gtNewSimdAsHWIntrinsicNode(retType, tmp1, NI_Vector128_ToScalar, simdBaseJitType, 16);
}

I am doing something incorrect in the for loop and I am unsure how to take a asm dump for the jitted code of my tests.

@tannergooding
Copy link
Member

tannergooding commented May 31, 2021

Something incorrect as in you are getting the wrong answer?

My guess is that's the case and you aren't accounting that most V256 operations operating not on 1x256-bit vector, but rather 2x128-bit vectors

and so the last operation (of the log2(vectorLength) ops) needs to be a (GetLower() + GetUpper()).ToScalar()

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 1, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 11, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Numerics help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants