[API Proposal]: AVX-512 IFMA Intrinsics #96476
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
Issue Details
Background and motivation
API Proposal
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> FusedMultiplyUInt52LowAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public static Vector512<ulong> FusedMultiplyUInt52HighAddUInt64(Vector512<ulong> a, Vector512<ulong> b, Vector512<ulong> c);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> FusedMultiplyUInt52LowAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector256<ulong> FusedMultiplyUInt52HighAddUInt64(Vector256<ulong> a, Vector256<ulong> b, Vector256<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52LowAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
            public static Vector128<ulong> FusedMultiplyUInt52HighAddUInt64(Vector128<ulong> a, Vector128<ulong> b, Vector128<ulong> c);
        }
    }
}
API Usage
zmm0 = Avx512Ifma.FusedMultiplyUInt52LowAddUInt64(zmm0, zmm2, zmm3);
zmm1 = Avx512Ifma.FusedMultiplyUInt52HighAddUInt64(zmm1, zmm2, zmm3);
An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:
Alternative Designs
Risks
None
@dotnet/avx512-contrib
The names here could be "better". I'd expect simply
These are not "fused" operations, as the result of doing the multiply and the add as separate scalar operations is the same as if they are done "combined" (which is unlike floating-point).
I updated the proposal accordingly.
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Avx512Ifma : Avx512F
    {
        public static bool IsSupported { get; }
        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public abstract class VL : Avx512F.VL
        {
            public static new bool IsSupported { get; }
            public static Vector256<ulong> MultiplyAdd52Low(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector256<ulong> MultiplyAdd52High(Vector256<ulong> addend, Vector256<ulong> left, Vector256<ulong> right);
            public static Vector128<ulong> MultiplyAdd52Low(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
            public static Vector128<ulong> MultiplyAdd52High(Vector128<ulong> addend, Vector128<ulong> left, Vector128<ulong> right);
        }
    }
}
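For readers unfamiliar with the underlying instructions, here is a minimal scalar sketch of what one 64-bit lane of `MultiplyAdd52Low` / `MultiplyAdd52High` would compute, based on the documented behavior of VPMADD52LUQ / VPMADD52HUQ. The class name `Ifma52Reference` is hypothetical and is not part of the proposal; the only library call used is `Math.BigMul(ulong, ulong, out ulong)`. Because the arithmetic is exact integer arithmetic, doing the multiply and the add separately gives the same result as the single instruction, which is the point made above about these not being "fused".

```csharp
using System;

// Hypothetical scalar reference model for a single lane; not part of the proposed API.
public static class Ifma52Reference
{
    private const ulong Mask52 = (1UL << 52) - 1;

    // addend + (low 52 bits of the 104-bit product of the low 52 bits of left and right).
    public static ulong MultiplyAdd52Low(ulong addend, ulong left, ulong right)
    {
        // A wrapping 64-bit multiply keeps the low 64 bits, which include the low 52.
        ulong low = unchecked((left & Mask52) * (right & Mask52));
        return unchecked(addend + (low & Mask52));
    }

    // addend + (bits 52..103 of the same 104-bit product).
    public static ulong MultiplyAdd52High(ulong addend, ulong left, ulong right)
    {
        // Math.BigMul returns the high 64 bits and writes the low 64 bits to 'low'.
        ulong high = Math.BigMul(left & Mask52, right & Mask52, out ulong low);
        // The product is below 2^104, so 'high' fits in 40 bits and the shift cannot overflow.
        ulong bits52To103 = (high << 12) | (low >> 52);
        return unchecked(addend + bits52To103);
    }
}
```

For example, `Ifma52Reference.MultiplyAdd52High(0, 1UL << 51, 4)` yields 2, because the product is 2^53 and only bits 52 and above are kept. Bits 52..63 of `left` and `right` are ignored, while the addend is a full 64-bit value and the final addition wraps.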
Similar to #86849, this should probably be changed to:
namespace System.Runtime.Intrinsics.X86;
// approved in https://github.com/dotnet/runtime/issues/98833
public abstract class AvxIfma : Avx2
{
    // new nested class
    [Intrinsic]
    public new abstract class V512
    {
        public static new bool IsSupported { get => IsSupported; }
        public static Vector512<ulong> MultiplyAdd52Low(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
        public static Vector512<ulong> MultiplyAdd52High(Vector512<ulong> addend, Vector512<ulong> left, Vector512<ulong> right);
    }
}
Since the parent is not yet implemented, we also have the option of changing that name to just
Background and motivation
AVX-512 IFMA is supported by Intel in the Cannon Lake and newer architectures, and by AMD in Zen 4.
These instructions are known to be useful for cryptography and large number processing. They also serve as a faster, if compromised, alternative to the VPMULLQ instruction, which finishes 5x slower on Intel CPUs than on AMD Zen 4, whereas VPMADD52LUQ finishes in only 4 clock cycles.
API Proposal
API Usage
An example of a vectorized Montgomery reduction implementation using the equivalent C++ intrinsics:
https://github.com/intel/hexl/blob/2d196fdd71f24511bd7e0e23dc07d37c888f53e7/hexl/util/avx512-util.hpp#L384-L411
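To illustrate why the low/high 52-bit products fit Montgomery reduction, here is a rough single-lane scalar sketch in radix 2^52. It is not a port of the linked HEXL code. The class `Montgomery52` and all of its members are hypothetical names, and `MulAddLo` / `MulAddHi` are scalar stand-ins for one lane of the proposed `MultiplyAdd52Low` / `MultiplyAdd52High`. It assumes an odd modulus `q < 2^52` and an input `T = tHi * 2^52 + tLo` with `tLo < 2^52` and `tHi < q`.

```csharp
using System;

// Hypothetical scalar sketch; names are illustrative and not taken from HEXL or .NET.
public static class Montgomery52
{
    private const ulong Mask52 = (1UL << 52) - 1;

    // Scalar stand-in for one lane of MultiplyAdd52Low.
    private static ulong MulAddLo(ulong addend, ulong a, ulong b)
        => unchecked(addend + (((a & Mask52) * (b & Mask52)) & Mask52));

    // Scalar stand-in for one lane of MultiplyAdd52High.
    private static ulong MulAddHi(ulong addend, ulong a, ulong b)
    {
        ulong high = Math.BigMul(a & Mask52, b & Mask52, out ulong low);
        return unchecked(addend + ((high << 12) | (low >> 52)));
    }

    // Returns -q^-1 mod 2^52 for odd q, using Newton iteration on the inverse mod 2^64.
    public static ulong NegInverse52(ulong q)
    {
        ulong inv = q; // q * q == 1 (mod 8) for odd q, so this is correct to 3 bits
        for (int i = 0; i < 5; i++)
        {
            inv = unchecked(inv * (2 - q * inv)); // correct bits double: 6, 12, 24, 48, 96
        }
        return unchecked(0UL - inv) & Mask52;
    }

    // Returns T * 2^-52 mod q, where T = tHi * 2^52 + tLo.
    public static ulong Reduce(ulong tHi, ulong tLo, ulong q, ulong negQInv)
    {
        ulong m = MulAddLo(0, tLo, negQInv); // m = tLo * (-q^-1) mod 2^52
        ulong t = MulAddHi(tHi, m, q);       // tHi + floor(m * q / 2^52)
        // tLo + low52(m * q) is a multiple of 2^52 below 2^53: it carries exactly when tLo != 0.
        if (tLo != 0)
        {
            t++;
        }
        return t >= q ? t - q : t;           // t < 2q, so one conditional subtraction suffices
    }
}
```

A vectorized version would apply the same two multiply-add steps to every lane at once; the carry and the final conditional subtraction would need compare/mask operations that are outside this proposal.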
Alternative Designs
Risks
None