[API Proposal]: AVX-512 IFMA Intrinsics #96476

MineCake147E · 2024-01-04T10:07:09Z

Background and motivation

AVX-512 IFMA is supported by Intel in the Cannon Lake and newer architectures, and by AMD in Zen 4.
These instructions are known to be useful for cryptography and large-number arithmetic. They can also serve as a faster, reduced-precision (52-bit) alternative to the VPMULLQ instruction, which is about 5x slower on Intel CPUs than on AMD Zen 4, whereas VPMADD52LUQ completes in only 4 clock cycles.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}

API Usage

zmm0 = Avx512Ifma.MultiplyAdd52Low(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.MultiplyAdd52High(zmm1, zmm2, zmm3);
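
For readers unfamiliar with the instructions, the per-lane arithmetic can be modeled in scalar code. The following is a minimal sketch, not the proposed implementation: the class name `Madd52Reference` and the `Product` helper are hypothetical, it assumes .NET 7 or later for `UInt128`, and it assumes that `a` is the accumulator, as in the C/C++ intrinsics. It follows the documented behavior of VPMADD52LUQ and VPMADD52HUQ: the low 52 bits of `b` and `c` are multiplied, and either the low or the high 52 bits of the 104-bit product are added to `a` with 64-bit wraparound.

```csharp
using System;

// Hypothetical scalar model of one 64-bit lane; not part of the proposed API.
static class Madd52Reference
{
    private const ulong Mask52 = (1UL << 52) - 1;

    // Full 104-bit product of the low 52 bits of each multiplicand.
    // Bits 52..63 of b and c are ignored, as the instructions do.
    private static UInt128 Product(ulong b, ulong c)
        => (UInt128)(b & Mask52) * (c & Mask52);

    // Models VPMADD52LUQ: a + (bits 0..51 of the product), modulo 2^64.
    public static ulong MultiplyAdd52Low(ulong a, ulong b, ulong c)
        => unchecked(a + (ulong)(Product(b, c) & Mask52));

    // Models VPMADD52HUQ: a + (bits 52..103 of the product), modulo 2^64.
    public static ulong MultiplyAdd52High(ulong a, ulong b, ulong c)
        => unchecked(a + (ulong)(Product(b, c) >> 52));
}
```

Under this model, a full 104-bit product of two 52-bit limbs can be recovered by calling both methods with the same multiplicands, which is the pattern large-number code typically relies on.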

An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411

Alternative Designs

Risks

None

ghost · 2024-01-04T10:07:16Z

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

AVX-512 IFMA is supported by Intel in the Cannon Lake and newer architectures, and by AMD in Zen 4.
These instructions are known to be useful for cryptography and large-number arithmetic. They can also serve as a faster, reduced-precision (52-bit) alternative to the VPMULLQ instruction, which is about 5x slower on Intel CPUs than on AMD Zen 4, whereas VPMADD52LUQ completes in only 4 clock cycles.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> FusedMultiplyUInt52LowAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> FusedMultiplyUInt52HighAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> FusedMultiplyUInt52LowAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> FusedMultiplyUInt52HighAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52LowAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52HighAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}

API Usage

zmm0 = Avx512Ifma.FusedMultiplyUInt52LowAddUInt64(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.FusedMultiplyUInt52HighAddUInt64(zmm1, zmm2, zmm3);

An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:

https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411

Alternative Designs

Alternative Names
- VPMADD52LUQ (FusedMultiplyUInt52LowAddUInt64)
  - FusedMultiplyAddLowUInt52
- VPMADD52HUQ (FusedMultiplyUInt52HighAddUInt64)
  - FusedMultiplyAddHighUInt52

Risks

None

Author:	MineCake147E
Assignees:	-
Labels:	`api-suggestion`, `area-System.Runtime.Intrinsics`
Milestone:	-

BruceForstall · 2024-01-04T20:07:15Z

@dotnet/avx512-contrib

tannergooding · 2024-01-04T20:47:18Z

The names here could be "better". I'd expect that simply MultiplyAdd52Low and MultiplyAdd52High, or similar, would be sufficient and would more closely match the "name" portion of the underlying C/C++ intrinsics _mm512_madd52lo_epu64 and _mm512_madd52hi_epu64 (the name portions are madd52lo and madd52hi).

These are not "fused" operations: performing the multiply and the add as separate scalar steps gives the same result as performing them combined. That is unlike floating-point, where fusing removes an intermediate rounding step and can change the result.

MineCake147E · 2024-01-15T09:03:46Z

I updated the proposal accordingly.

terrajobst · 2024-02-29T19:00:42Z

Video

We should rename the parameters to addend, left, and right:

namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);

        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        }
    }
}

saucecontrol · 2024-11-06T02:31:43Z

Similar to #86849, this should probably be changed to:

namespace System.Runtime.Intrinsics.X86;

// approved in https://github.com/dotnet/runtime/issues/98833
public abstract class AvxIfma : Avx2
{
    // new nested class
    [Intrinsic]
    public new abstract class V512
    {
        public static new bool IsSupported { get => IsSupported; }

        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
    }
}

Since the parent is not yet implemented, we also have the option of changing that name to just Ifma, since it wouldn't directly correlate to the AVX_IFMA cpuid bit any longer.

MineCake147E added the `api-suggestion` label on Jan 4, 2024

dotnet-issue-labeler bot added the `area-System.Runtime.Intrinsics` label on Jan 4, 2024

ghost added the `untriaged` label on Jan 4, 2024

BruceForstall added the `avx512` label on Jan 4, 2024

tannergooding added the `api-ready-for-review` label and removed the `api-suggestion` and `untriaged` labels on Jan 16, 2024

DeepakRajendrakumaran mentioned this issue on Feb 22, 2024: [API Proposal]: : AVX-IFMA Intrinsics #98833 (Open)

terrajobst added the `api-approved` label and removed the `api-ready-for-review` label on Feb 29, 2024

tannergooding mentioned this issue on Jun 11, 2024: [API Proposal]: Expose AVX10 converged vector ISA #98069 (Closed)

stephentoub added this to the 10.0.0 milestone on Jul 19, 2024

BruceForstall mentioned this issue on Dec 12, 2024: Intel architecture improvements for .NET 10 #108869 (Open, 37 tasks)
