[API Proposal]: Add support for AVX-512 VNNI hardware instructions #86849
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics

Issue Details

Background and motivation

There already is support for the AVX-VNNI hardware instruction set covering 128-/256-bit vectors, and it would be good to have the same support for 512-bit vectors; 512-bit versions of these instructions are available, see https://en.wikipedia.org/wiki/AVX-512?useskin=vector#VNNI. The motivation is largely the same as for AvxVnni: these instructions are used the same way and give good performance improvements over the multi-instruction sequences that produce the same output. The API is marked as a preview feature to be consistent with the existing AvxVnni API.

API Proposal

namespace System.Runtime.Intrinsics.X86
{
[Intrinsic]
[RequiresPreviewFeatures("Avx512Vnni is in preview.")]
public abstract class Avx512Vnni : Avx512F
{
public static new bool IsSupported { get => IsSupported; }
[Intrinsic]
public new abstract class X64 : Avx512F.X64
{
public static new bool IsSupported { get => IsSupported; }
}
/// <summary>
/// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPBUSD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPWSSD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPBUSDS zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);
/// <summary>
/// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPWSSDS zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
}
}
}

API Usage

// Example ripped from my Adler-32 rolling hash implementation,
// where 256-bit vectors are currently used instead but can easily be widened to 512-bit vectors, as shown below.
// Also, non-relevant stuff has been cut off for brevity.
while (IsAddressLessThan(ref dataRef, ref end)) {
Vector512<byte> bytes = Vector512.LoadUnsafe(ref dataRef);
vadlerBA += vadlerA;
if (Avx512Vnni.IsSupported) {
vadlerBmult = Avx512Vnni.MultiplyWideningAndAdd(vadlerBmult.AsInt32(), bytes, mults_vector).AsUInt32();
} else {
vadlerBmult += Avx512BW.MultiplyAddAdjacent(Avx512BW.MultiplyAddAdjacent(bytes, mults_vector), Vector512<short>.One).AsUInt32();
}
vadlerA += Avx512BW.SumAbsoluteDifferences(bytes, zero).AsUInt64();
dataRef = ref Add(ref dataRef, Vector512<byte>.Count);
}

Alternative Designs

Maybe it would be better to add Vector512 versions of these functions to the existing AvxVnni static class, but I am not sure that would be a good idea, as these instructions use EVEX encoding and may not be available on some Intel processors with hybrid core architecture (Alder Lake and its successors).

Risks

N/A
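For reference, here is a scalar sketch of what one 32-bit lane of VPDPBUSD (the non-saturating MultiplyWideningAndAdd) computes; this is purely illustrative and not part of the proposed surface:

using System;

// Scalar model of one 32-bit lane of VPDPBUSD: four unsigned bytes from
// 'left' are multiplied by the four corresponding signed bytes from
// 'right', the products are widened to 32 bits, summed, and added to
// 'addend'. VPDPBUSDS (the saturating form) clamps the final sum to
// int.MinValue/int.MaxValue instead of wrapping.
static int DpbusdLane(int addend, ReadOnlySpan<byte> left, ReadOnlySpan<sbyte> right)
{
    int sum = addend;
    for (int i = 0; i < 4; i++)
        sum += left[i] * right[i]; // byte zero-extends, sbyte sign-extends
    return sum;
}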
I don't think this will pass; AVX-VNNI and AVX512-VNNI are distinct instruction sets; in fact, the 512-bit instructions came before the 128- and 256-bit ones.
Yes, this would need to be its own class. Provided our CI hardware supports it (and I believe it does), it would not need to be in preview. I don't think
@MadProbe, a couple of fixes are needed....
The latter will be identical to
This issue has been marked
All done, sorry for the late edit.
Looks good as proposed

namespace System.Runtime.Intrinsics.X86
{
[Intrinsic]
public abstract class Avx512Vnni : Avx512F
{
public static new bool IsSupported { get => IsSupported; }
[Intrinsic]
public new abstract class X64 : Avx512F.X64
{
public static new bool IsSupported { get => IsSupported; }
}
[Intrinsic]
public new abstract class VL : Avx512F.VL
{
public static new bool IsSupported { get => IsSupported; }
/// <summary>
/// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
/// VPDPBUSD xmm, xmm, xmm/m128
/// </summary>
public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
/// VPDPWSSD xmm, xmm, xmm/m128
/// </summary>
public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
/// VPDPBUSD ymm, ymm, ymm/m256
/// </summary>
public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
/// VPDPWSSD ymm, ymm, ymm/m256
/// </summary>
public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
/// VPDPBUSDS xmm, xmm, xmm/m128
/// </summary>
public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);
/// <summary>
/// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
/// VPDPWSSDS xmm, xmm, xmm/m128
/// </summary>
public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
/// <summary>
/// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
/// VPDPBUSDS ymm, ymm, ymm/m256
/// </summary>
public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);
/// <summary>
/// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
/// VPDPWSSDS ymm, ymm, ymm/m256
/// </summary>
public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
}
/// <summary>
/// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPBUSD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPWSSD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPBUSDS zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);
/// <summary>
/// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPWSSDS zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
}
}
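As a usage illustration (a hypothetical helper, not part of the proposal), consuming code would pick the widest supported form at runtime and fall back to the 256-bit AvxVnni surface on hardware without AVX512-VNNI:

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class VnniDispatch
{
    // Prefer the single 512-bit VPDPBUSD when AVX512-VNNI is present;
    // otherwise process the two 256-bit halves with AVX-VNNI.
    public static Vector512<int> DotAccumulate(Vector512<int> acc, Vector512<byte> a, Vector512<sbyte> b)
    {
        if (Avx512Vnni.IsSupported)
            return Avx512Vnni.MultiplyWideningAndAdd(acc, a, b);

        if (AvxVnni.IsSupported)
        {
            Vector256<int> lower = AvxVnni.MultiplyWideningAndAdd(acc.GetLower(), a.GetLower(), b.GetLower());
            Vector256<int> upper = AvxVnni.MultiplyWideningAndAdd(acc.GetUpper(), a.GetUpper(), b.GetUpper());
            return Vector512.Create(lower, upper);
        }

        throw new PlatformNotSupportedException();
    }
}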
I am interested in why this proposal is tagged with
It was marked that way because a non-area-owner set the milestone, and the milestone being "correct" needed confirmation from the area owners.
To follow the newer pattern we've established with AVX10, this should probably be changed to

namespace System.Runtime.Intrinsics.X86;
// existing class
public abstract class AvxVnni : Avx2
{
// new nested class
[Intrinsic]
public new abstract class V512
{
public static new bool IsSupported { get => IsSupported; }
/// <summary>
/// __m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPBUSD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m512i _mm512_dpwssd_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPWSSD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAdd(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAdd(addend, left, right);
/// <summary>
/// __m512i _mm512_dpbusds_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPBUSDS zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<byte> left, Vector512<sbyte> right) => MultiplyWideningAndAddSaturate(addend, left, right);
/// <summary>
/// __m512i _mm512_dpwssds_epi32 (__m512i src, __m512i a, __m512i b)
/// VPDPWSSDS zmm, zmm, zmm/m512
/// </summary>
public static Vector512<int> MultiplyWideningAndAddSaturate(Vector512<int> addend, Vector512<short> left, Vector512<short> right) => MultiplyWideningAndAddSaturate(addend, left, right);
}
}

which eliminates the duplication between
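Under that shape, consuming code would light up on the nested class directly; a brief sketch, assuming the names proposed above:

// Sketch only: AvxVnni.V512 is the shape proposed above, not a shipped API.
if (AvxVnni.V512.IsSupported)
{
    acc = AvxVnni.V512.MultiplyWideningAndAdd(acc, bytes, mults);
}
else if (AvxVnni.IsSupported)
{
    // fall back to the existing 128-/256-bit AvxVnni surface
}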
I don't think so. AVX512-VNNI predates AVX-VNNI. There are many CPUs that don't support AVX-VNNI but do support AVX512-VNNI. Even though there's no difference aside from instruction encoding, the name confuses me a lot.
This isn't relevant to, nor impacted by, the proposed API surface change. AVX512 is a legacy and effectively deprecated API set that we likely would not have exposed as-is if we had known in advance about the transition to AVX10.1 and the new scheme that would exist moving forward. Doing this removes duplication without negatively impacting any consumer of the API, minus the minor nuance that the class name differs slightly from the CPUID bit name, and it keeps things consistent with the intended "converged ISA" schema that Intel has defined for the future under Avx10.1.
I think if this were going through review today, it would probably be named just
This is also confusing because you either:
The problem is: AMD.
NOTE: Some of the discussion below is largely based on "reasonable speculation" supported by typically seen timeframes. It shouldn't be taken as fact, a promise of the future, etc.
I'd say that's doubtful. The
AMD doesn't cause any problem here. My statement was that AVX512 is legacy/deprecated and would not be exposed "as is", not that it wouldn't be exposed. Had we known in advance, we likely would've exposed this functionality following a split that more closely models how things functionally exist today while still fitting the generalized schema for the future. That is, rather than defining

Thus, we likely would've had a schema where:
This would have been a divergence from what was formally spec'd in CPUID, but it would in turn hide some of the messy nuance that exists and make it much easier for developers to write code that works in production on both vendors' hardware. With the setup we have today, we instead have this awkward duplication between many of the
This is effectively related to timing, as designing a CPU and integrating all the functionality takes years (for many reasons). You can see this in the gap between when a specification is announced and when it actually shows up in hardware: on average there is a 2-3 year delay between when an ISA specification is revealed and when we first see hardware start to implement it. Sometimes it's a bit less and other times longer, depending on exactly what is required, how similar it is to past support, whether additional or differing silicon is required, etc. In the case of Zen5, it was first announced in some official capacity back in 2018, had confirmation on the fabrication process back in 2022, and it shipped in 2024. FP16 had been announced in mid 2021 (shipping with

This will likely be rectified in Zen6, considering the same overall timing (which, as per the top note, is purely speculation).
Support for

The only thing currently missing is

Particularly with the new

--

As an additional note, one of the likely reasons that
I see, that makes sense. I apologize for my misunderstanding.
I think you meant to write Sapphire Rapids, right?
Yes, too many similar microarchitecture names, and I messed this particular one up 😄