[API Proposal]: AVX-512 IFMA Intrinsics #96476

Open
MineCake147E opened this issue Jan 4, 2024 · 6 comments
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics avx512 Related to the AVX-512 architecture

@MineCake147E
Contributor

MineCake147E commented Jan 4, 2024

Background and motivation

AVX-512 IFMA is supported by Intel in the Cannon Lake and newer architectures, and by AMD in Zen 4.
These instructions are known to be useful for cryptography and large-number arithmetic. They can also serve as a faster, reduced-width (52-bit) substitute for the VPMULLQ instruction, which is about 5x slower on Intel CPUs than on AMD Zen 4, whereas VPMADD52LUQ completes in only 4 clock cycles.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}

API Usage

zmm0 = Avx512Ifma.MultiplyAdd52Low(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.MultiplyAdd52High(zmm1, zmm2, zmm3);
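
For readers unfamiliar with these instructions, the per-element behavior can be sketched in scalar C#. This is only an illustrative model based on Intel's published description of VPMADD52LUQ and VPMADD52HUQ: the low 52 bits of `b` and `c` are multiplied into a 104-bit product, and either the low or the high 52 bits of that product are added to the 64-bit element of `a`. The local functions below are hypothetical stand-ins that borrow the proposed names; they are not the intrinsics themselves. `UInt128` is assumed to be available (.NET 7 or later).

```csharp
using System;

// Scalar model of one 64-bit lane of VPMADD52LUQ (assumed semantics, see lead-in).
static ulong MultiplyAdd52Low(ulong a, ulong b, ulong c)
{
    const ulong Mask52 = (1UL << 52) - 1;
    // Only the low 52 bits of each multiplicand take part in the product.
    UInt128 product = (UInt128)(b & Mask52) * (c & Mask52);
    // Add bits [51:0] of the 104-bit product; the 64-bit add wraps on overflow.
    return unchecked(a + (ulong)(product & Mask52));
}

// Scalar model of one 64-bit lane of VPMADD52HUQ (assumed semantics, see lead-in).
static ulong MultiplyAdd52High(ulong a, ulong b, ulong c)
{
    const ulong Mask52 = (1UL << 52) - 1;
    UInt128 product = (UInt128)(b & Mask52) * (c & Mask52);
    // Add bits [103:52] of the 104-bit product; the 64-bit add wraps on overflow.
    return unchecked(a + (ulong)(product >> 52));
}

Console.WriteLine(MultiplyAdd52Low(1, 3, 5));                   // 1 + 3 * 5 = 16
Console.WriteLine(MultiplyAdd52High(0, 1UL << 51, 1UL << 51));  // 2^102 >> 52 = 2^50
```

Pairing the two results gives the full 104-bit product split into two 52-bit limbs, which is why the instructions suit multi-precision arithmetic.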

An example of vectorized Montgomery reduction implementations using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411

Alternative Designs

Risks

None

@MineCake147E MineCake147E added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Jan 4, 2024
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 4, 2024
@ghost

ghost commented Jan 4, 2024

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

AVX-512 IFMA is supported by Intel in the Cannon Lake and newer architectures, and by AMD in Zen 4.
These instructions are known to be useful for cryptography and large-number arithmetic. They can also serve as a faster, reduced-width (52-bit) substitute for the VPMULLQ instruction, which is about 5x slower on Intel CPUs than on AMD Zen 4, whereas VPMADD52LUQ completes in only 4 clock cycles.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> FusedMultiplyUInt52LowAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> FusedMultiplyUInt52HighAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> FusedMultiplyUInt52LowAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> FusedMultiplyUInt52HighAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52LowAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52HighAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}

API Usage

zmm0 = Avx512Ifma.FusedMultiplyUInt52LowAddUInt64(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.FusedMultiplyUInt52HighAddUInt64(zmm1, zmm2, zmm3);

An example of vectorized Montgomery reduction implementations using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411

Alternative Designs

  • Alternative Names
    • VPMADD52LUQ (FusedMultiplyUInt52LowAddUInt64)
      • FusedMultiplyAddLowUInt52
    • VPMADD52HUQ (FusedMultiplyUInt52HighAddUInt64)
      • FusedMultiplyAddHighUInt52

Risks

None

Author: MineCake147E
Assignees: -
Labels:

api-suggestion, area-System.Runtime.Intrinsics

Milestone: -

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Jan 4, 2024
@BruceForstall
Member

@dotnet/avx512-contrib

@tannergooding
Member

The names here could be "better". I'd expect that simply MultiplyAdd52Low and MultiplyAdd52High, or similar, would be sufficient and would more closely match the "name" portion of the underlying C/C++ intrinsics _mm512_madd52lo_epu64 and _mm512_madd52hi_epu64 (the name portions are madd52lo and madd52hi).

These are not "fused" operations: performing the multiply and the add as separate scalar operations gives the same result as performing them "combined" (which is unlike floating-point).
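
The distinction can be illustrated with a small scalar sketch. It is not part of the proposal, and the input values are chosen purely for illustration. In floating-point, `Math.FusedMultiplyAdd` rounds once while `a * b + c` rounds twice, so the two can disagree. For the 52-bit integer operation there is no rounding step, so a single combined computation and a plain multiply, mask, and add agree.

```csharp
using System;

// Floating-point: fusing changes the result because the product is not rounded first.
double a = 1.0 + Math.ScaleB(1.0, -27);   // 1 + 2^-27
double b = 1.0 - Math.ScaleB(1.0, -27);   // 1 - 2^-27
double c = -1.0;

double separate = a * b + c;                    // exact product 1 - 2^-54 rounds to 1.0 before the add
double fused = Math.FusedMultiplyAdd(a, b, c);  // exact product feeds the add, giving -2^-54

Console.WriteLine($"separate = {separate}, fused = {fused}");

// Integer: the low 52 bits of the product are exact either way.
const ulong Mask52 = (1UL << 52) - 1;
ulong addend = 7;
ulong left = 0x000F_FFFF_FFFF_FFFF;   // largest 52-bit value
ulong right = 0x000A_BCDE_F012_3456;  // arbitrary 52-bit value

// "Combined": form the full 104-bit product, keep bits [51:0], then add.
ulong combined = unchecked(addend + (ulong)(((UInt128)left * right) & Mask52));

// "Separate": an ordinary wrapping 64-bit multiply, then a mask, then an add.
ulong stepwise = unchecked(addend + ((left * right) & Mask52));

Console.WriteLine(combined == stepwise);
```

The wrapping 64-bit multiply is enough for the low half because the low 64 bits of a product do not depend on the bits discarded above them.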

@MineCake147E
Contributor Author

I updated the proposal accordingly.

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation untriaged New issue has not been triaged by the area owner labels Jan 16, 2024
@terrajobst
Contributor

terrajobst commented Feb 29, 2024

Video

  • We should rename the parameters to addend, left, and right
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);

        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        }
    }
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Feb 29, 2024
@stephentoub stephentoub added this to the 10.0.0 milestone Jul 19, 2024
@saucecontrol
Member

saucecontrol commented Nov 6, 2024

Similar to #86849, this should probably be changed to:

namespace System.Runtime.Intrinsics.X86;

// approved in https://github.com/dotnet/runtime/issues/98833
public abstract class AvxIfma : Avx2
{
    // new nested class
    [Intrinsic]
    public new abstract class V512
    {
        public static new bool IsSupported { get => IsSupported; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
    }
}

Since the parent is not yet implemented, we also have the option of changing that name to just Ifma, since it wouldn't directly correlate to the AVX_IFMA cpuid bit any longer.
