[Arm64] "Move" Intrinsics #35037
Tagging subscribers to this area: @tannergooding |
CC. @echesakovMSFT, @CarolEidt, @TamarChristinaArm |
For |
Same for |
Added the For |
Actually, it looks like the result should always be
This would be incrementing the register by |
Yup, that's correct. |
namespace System.Runtime.Intrinsics.Arm
{
public abstract class AdvSimd
{
/// <summary>
/// Duplicate general-purpose register to vector
/// For each element result[elem] = value
/// Corresponds to vector forms of DUP and VDUP
/// </summary>
Vector64<byte> DuplicateToVector64(byte value);
Vector64<short> DuplicateToVector64(short value);
Vector64<int> DuplicateToVector64(int value);
Vector64<sbyte> DuplicateToVector64(sbyte value);
Vector64<ushort> DuplicateToVector64(ushort value);
Vector64<uint> DuplicateToVector64(uint value);
Vector128<byte> DuplicateToVector128(byte value);
Vector128<short> DuplicateToVector128(short value);
Vector128<int> DuplicateToVector128(int value);
Vector128<sbyte> DuplicateToVector128(sbyte value);
Vector128<ushort> DuplicateToVector128(ushort value);
Vector128<uint> DuplicateToVector128(uint value);
/// <summary>
/// Duplicate vector element to vector
/// For each element result[elem] = value[index]
/// Corresponds to vector forms of DUP and VDUP
/// </summary>
Vector64<byte> DuplicateSelectedScalarToVector64(Vector64<byte> value, byte index);
Vector64<short> DuplicateSelectedScalarToVector64(Vector64<short> value, byte index);
Vector64<int> DuplicateSelectedScalarToVector64(Vector64<int> value, byte index);
Vector64<float> DuplicateSelectedScalarToVector64(Vector64<float> value, byte index);
Vector64<sbyte> DuplicateSelectedScalarToVector64(Vector64<sbyte> value, byte index);
Vector64<ushort> DuplicateSelectedScalarToVector64(Vector64<ushort> value, byte index);
Vector64<uint> DuplicateSelectedScalarToVector64(Vector64<uint> value, byte index);
Vector64<byte> DuplicateSelectedScalarToVector64(Vector128<byte> value, byte index);
Vector64<short> DuplicateSelectedScalarToVector64(Vector128<short> value, byte index);
Vector64<int> DuplicateSelectedScalarToVector64(Vector128<int> value, byte index);
Vector64<float> DuplicateSelectedScalarToVector64(Vector128<float> value, byte index);
Vector64<sbyte> DuplicateSelectedScalarToVector64(Vector128<sbyte> value, byte index);
Vector64<ushort> DuplicateSelectedScalarToVector64(Vector128<ushort> value, byte index);
Vector64<uint> DuplicateSelectedScalarToVector64(Vector128<uint> value, byte index);
Vector128<byte> DuplicateSelectedScalarToVector128(Vector64<byte> value, byte index);
Vector128<short> DuplicateSelectedScalarToVector128(Vector64<short> value, byte index);
Vector128<int> DuplicateSelectedScalarToVector128(Vector64<int> value, byte index);
Vector128<float> DuplicateSelectedScalarToVector128(Vector64<float> value, byte index);
Vector128<sbyte> DuplicateSelectedScalarToVector128(Vector64<sbyte> value, byte index);
Vector128<ushort> DuplicateSelectedScalarToVector128(Vector64<ushort> value, byte index);
Vector128<uint> DuplicateSelectedScalarToVector128(Vector64<uint> value, byte index);
Vector128<byte> DuplicateSelectedScalarToVector128(Vector128<byte> value, byte index);
Vector128<double> DuplicateSelectedScalarToVector128(Vector128<double> value, byte index);
Vector128<short> DuplicateSelectedScalarToVector128(Vector128<short> value, byte index);
Vector128<int> DuplicateSelectedScalarToVector128(Vector128<int> value, byte index);
Vector128<long> DuplicateSelectedScalarToVector128(Vector128<long> value, byte index);
Vector128<float> DuplicateSelectedScalarToVector128(Vector128<float> value, byte index);
Vector128<sbyte> DuplicateSelectedScalarToVector128(Vector128<sbyte> value, byte index);
Vector128<ushort> DuplicateSelectedScalarToVector128(Vector128<ushort> value, byte index);
Vector128<uint> DuplicateSelectedScalarToVector128(Vector128<uint> value, byte index);
Vector128<ulong> DuplicateSelectedScalarToVector128(Vector128<ulong> value, byte index);
public abstract class Arm64
{
/// <summary>
/// Duplicate general-purpose register to vector
/// For each element result[elem] = value
/// Corresponds to vector forms of DUP
/// </summary>
Vector128<long> DuplicateToVector128(long value);
Vector128<ulong> DuplicateToVector128(ulong value);
/// <summary>
/// Insert vector element from another vector element
/// result[resultIndex] = value[valueIndex]
/// Corresponds to vector forms of INS
/// </summary>
Vector128<byte> InsertSelectedScalar(Vector128<byte> result, byte resultIndex, Vector64<byte> value, byte valueIndex);
Vector128<short> InsertSelectedScalar(Vector128<short> result, byte resultIndex, Vector64<short> value, byte valueIndex);
Vector128<int> InsertSelectedScalar(Vector128<int> result, byte resultIndex, Vector64<int> value, byte valueIndex);
Vector128<float> InsertSelectedScalar(Vector128<float> result, byte resultIndex, Vector64<float> value, byte valueIndex);
Vector128<sbyte> InsertSelectedScalar(Vector128<sbyte> result, byte resultIndex, Vector64<sbyte> value, byte valueIndex);
Vector128<ushort> InsertSelectedScalar(Vector128<ushort> result, byte resultIndex, Vector64<ushort> value, byte valueIndex);
Vector128<uint> InsertSelectedScalar(Vector128<uint> result, byte resultIndex, Vector64<uint> value, byte valueIndex);
Vector128<byte> InsertSelectedScalar(Vector128<byte> result, byte resultIndex, Vector128<byte> value, byte valueIndex);
Vector128<double> InsertSelectedScalar(Vector128<double> result, byte resultIndex, Vector128<double> value, byte valueIndex);
Vector128<short> InsertSelectedScalar(Vector128<short> result, byte resultIndex, Vector128<short> value, byte valueIndex);
Vector128<int> InsertSelectedScalar(Vector128<int> result, byte resultIndex, Vector128<int> value, byte valueIndex);
Vector128<long> InsertSelectedScalar(Vector128<long> result, byte resultIndex, Vector128<long> value, byte valueIndex);
Vector128<float> InsertSelectedScalar(Vector128<float> result, byte resultIndex, Vector128<float> value, byte valueIndex);
Vector128<sbyte> InsertSelectedScalar(Vector128<sbyte> result, byte resultIndex, Vector128<sbyte> value, byte valueIndex);
Vector128<ushort> InsertSelectedScalar(Vector128<ushort> result, byte resultIndex, Vector128<ushort> value, byte valueIndex);
Vector128<uint> InsertSelectedScalar(Vector128<uint> result, byte resultIndex, Vector128<uint> value, byte valueIndex);
Vector128<ulong> InsertSelectedScalar(Vector128<ulong> result, byte resultIndex, Vector128<ulong> value, byte valueIndex);
}
}
}
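The per-element semantics in the summaries above can be illustrated with a minimal sketch. It assumes the API ships with the names proposed here; the wrapper class `MoveIntrinsicsSketch` and its method names are hypothetical, and the fallbacks use the portable `Vector128.Create`, `GetElement`, and `WithElement` helpers so the sketch also runs where the intrinsics are not supported.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical wrapper, for illustration only; not part of the proposal.
public static class MoveIntrinsicsSketch
{
    // result[elem] = value for every element (DUP from a general-purpose register).
    public static Vector128<int> Duplicate(int value) =>
        AdvSimd.IsSupported
            ? AdvSimd.DuplicateToVector128(value)
            : Vector128.Create(value); // portable fallback: broadcast

    // result[elem] = value[1] for every element (DUP from a vector element).
    // The index is passed as a constant.
    public static Vector128<int> DuplicateElement1(Vector128<int> value) =>
        AdvSimd.IsSupported
            ? AdvSimd.DuplicateSelectedScalarToVector128(value, 1)
            : Vector128.Create(value.GetElement(1));

    // result[3] = value[0]; the remaining elements of result are unchanged (INS).
    public static Vector128<int> InsertElement0Into3(Vector128<int> result, Vector128<int> value) =>
        AdvSimd.Arm64.IsSupported
            ? AdvSimd.Arm64.InsertSelectedScalar(result, 3, value, 0)
            : result.WithElement(3, value.GetElement(0));
}
```

For example, `MoveIntrinsicsSketch.Duplicate(7)` yields a vector whose four elements are all 7, and `InsertElement0Into3` changes only element 3 of its first argument.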
@echesakovMSFT, @kunalspathak. If neither of you are working on this one, I'm going to pick it up. It will unblock being able to do #35857 for ARM64. |
@tannergooding Please go ahead - I un-assigned myself |
@TamarChristinaArm, could you indicate what instruction is used for:
https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?page=3&search=DUP indicates it is supported on ARM32 but It also looks like the variants that take a |
I am wondering what is the motivation to implement |
Got it. Yeah, I was a little confused as a .NET developer given the set of APIs we expose and the end result of some of them being identical.
The APIs on The set of APIs exposed was based on allowing interaction with the vectors even when HWIntrinsics aren't supported so you can trivially access the various elements from the debugger or to provide a trivial software fallback. |
No, that looks like an oversight. The same register pair trick can be used to implement them on A32.
Could you elaborate? The normal register trick is for dealing with just the upper half, while this requires broadcasting to the entire vector which would require, afaict, a dup and then a move? |
Could you also explain how |
The lets take e.g.
They map to
ACLE doesn't define these because we don't have a if your values are in the
Same as the above. the indices determine which one of the pairs are used. They all become
So it's split about half way in complexity, for the ones that go through GPR there's no difference between that and the user using If it's for ease of use then that may be an argument. |
I'm not connecting how this would achieve a
Did you get these two reversed? It sounds like, similarly to the above, that |
If that is the case, it's something we should definitely raise to API review to make sure they are aware and agree. We haven't really exposed "convenience" methods that don't map to a single hardware instruction (ignoring needing to move data into the correct register or from memory). Instead, all the intrinsics are essentially 1-to-1 mappings which means in the ideal case you get Most of the
You have two versions where you can have a
The instruction has a
No, if you have a Vector128 and you're inserting a long, the only thing you have to do is overwrite the right The problem with the 32-bit variant is you don't have
The
Ah, ok. I think I understand now. I missed that Thanks for the explanation (and sorry for the confusion)! |
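The point made earlier in this exchange, that inserting a long into a Vector128 only needs to overwrite one 64-bit part of the vector, can be sketched as follows. This is an illustration under assumptions, not the implementation discussed in the thread: the class `InsertLongSketch` and the method `InsertUpper` are hypothetical names, and the fallback path uses the portable `WithUpper` helper.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical helper, for illustration only.
public static class InsertLongSketch
{
    // Replaces element 1 (the upper 64 bits) of a Vector128<long>;
    // element 0 (the lower 64 bits) is left untouched.
    public static Vector128<long> InsertUpper(Vector128<long> vector, long data) =>
        AdvSimd.IsSupported
            ? AdvSimd.Insert(vector, 1, data)
            : vector.WithUpper(Vector64.Create(data)); // portable fallback
}
```

Calling `InsertLongSketch.InsertUpper(Vector128.Create(1L, 2L), 9L)` produces a vector holding 1 in element 0 and 9 in element 1.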
Yeah there are quite a lot of VMOV variants :)
Yeah, though you also need a
No worries, glad it's clear :) The helpers do indeed need an API design to see if it's worth it. I suspect we didn't do them in ACLE because it wasn't worth the complication. |