Verify Intel intrinsics against upstream definitions #251

Merged (1 commit) on Dec 29, 2017

Conversation

alexcrichton (Member):

This commit adds a new crate for testing that the intrinsics listed in this
crate do indeed match the upstream definition of each intrinsic. A
pre-downloaded XML description of all Intel intrinsics is checked in which is
then parsed in the stdsimd-verify crate to verify that everything we write
down is matched against the upstream definitions.

Currently the checks are pretty loose to get this compiling but a few intrinsics
were fixed as a result of this. For example:

  • _mm256_extract_epi8 - AVX2 intrinsic erroneously listed under AVX
  • _mm256_extract_epi16 - AVX2 intrinsic erroneously listed under AVX
  • _mm256_extract_epi32 - AVX2 intrinsic erroneously listed under AVX
  • _mm256_extract_epi64 - AVX2 intrinsic erroneously listed under AVX
  • _mm_tzcnt_32 - erroneously had u32 in the name
  • _mm_tzcnt_64 - erroneously had u64 in the name
  • _mm_cvtsi64_si128 - erroneously available on 32-bit platforms
  • _mm_cvtsi64x_si128 - erroneously available on 32-bit platforms
  • _mm_cvtsi128_si64 - erroneously available on 32-bit platforms
  • _mm_cvtsi128_si64x - erroneously available on 32-bit platforms
  • _mm_extract_epi64 - erroneously available on 32-bit platforms
  • _mm_insert_epi64 - erroneously available on 32-bit platforms
  • _mm256_extract_epi16 - erroneously returned i32 instead of i16
  • _mm256_extract_epi8 - erroneously returned i32 instead of i8
  • _mm_shuffle_ps - the mask argument was erroneously i32 instead of u32
  • _popcnt32 - the signedness of the argument and return were flipped
  • _popcnt64 - the signedness of the argument was flipped and the argument
    was too wide
  • _mm_tzcnt_32 - the return value's sign was flipped
  • _mm_tzcnt_64 - the return value's sign was flipped
  • A good number of intrinsics used imm8: i8 or imm8: u8 instead of the imm8: i32 that Intel uses (we were also internally inconsistent).
  • A number of intrinsics working with __m64 were instead working with i64/u64,
    so they're now corrected to operate with the vector types instead.

Currently the verifications performed are:

  • Each name in Rust is defined in the XML document
  • The arguments/return values all agree.
  • The CPUID features listed in the XML document are all enabled in Rust as well.

The type matching right now is pretty loose and has a lot of questionable
changes. Future commits will touch these up to be more strict and require closer
adherence to Intel's own types. Otherwise, types like i32x8 (or any 256-bit
integer vector) all match up to __m256i right now, although this may want to
change in the future.
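For illustration, a minimal sketch of what such loose matching can look like (the types and names here are hypothetical, not the verifier's actual code):

```rust
// Hypothetical sketch: any 256-bit integer vector on the Rust side is
// allowed to stand in for Intel's __m256i, and any 256-bit float vector
// for __m256.
#[derive(PartialEq)]
enum Kind {
    Int,
    Float,
}

struct RustType {
    kind: Kind,
    bits: u32, // total width, e.g. i32x8 => Int, 256
}

fn equate(rust: &RustType, intel: &str) -> bool {
    match intel {
        "__m256i" => rust.kind == Kind::Int && rust.bits == 256,
        "__m256" => rust.kind == Kind::Float && rust.bits == 256,
        _ => false,
    }
}
```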

Finally, we're also not testing the instruction listed in the XML right now.
There are a huge number of discrepancies between the instructions listed in the
XML and the instructions listed in assert_instr; those will need to be taken
care of in a future commit.

Closes #240

gnzlbg (Contributor) commented Dec 22, 2017 via email

alexcrichton force-pushed the verify branch 7 times, most recently from be1b710 to 7b5f300, December 24, 2017 16:02
gnzlbg self-requested a review, December 26, 2017 10:51
@@ -44,16 +44,16 @@ pub unsafe fn _lzcnt_u64(x: u64) -> u64 {
#[inline(always)]
#[target_feature = "+popcnt"]
#[cfg_attr(test, assert_instr(popcnt))]
pub unsafe fn _popcnt32(x: u32) -> u32 {
    x.count_ones()
gnzlbg (Contributor):

This was on purpose: bit-manipulation intrinsics prefer unsigned integers, like the ones in std do. Whatever we decide to do, we should choose a guideline and follow it consistently (many other intrinsics that take __m128i take, for example, u32x4 instead of i32x4).

We should open an issue for this.
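For reference, the std behavior being alluded to: the standard bit-manipulation methods deal in unsigned values, with counts coming back as u32.

```rust
// std's bit-manipulation methods return unsigned counts:
fn main() {
    let x: u32 = 0b1011_0000;
    assert_eq!(x.count_ones(), 3);
    assert_eq!(x.trailing_zeros(), 4);
}
```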

@@ -607,77 +607,77 @@ pub unsafe fn _mm256_xor_ps(a: f32x8, b: f32x8) -> f32x8 {
}

/// Equal (ordered, non-signaling)
pub const _CMP_EQ_OQ: u8 = 0x00;
gnzlbg (Contributor):

The unsigned part was on purpose (bit flags consistently used unsigned integers); why we use 8 instead of 32 bits here, I don't know.

Whatever we do, we should do it consistently, and have a guideline for it.

gnzlbg (Contributor):

So IIRC, as mentioned below, arguments that can only be 8 bits long, 16 bits long, etc. were made _8 (e.g. i8) on purpose. This happens on most intrinsics taking an imm8 or imm16.

The Intel intrinsics use int/long/... very loosely here, so this was part of offering better types by default.

@@ -1138,7 +1096,7 @@ pub unsafe fn _mm_permute_pd(a: f64x2, imm8: i32) -> f64x2 {
#[inline(always)]
#[target_feature = "+avx"]
#[cfg_attr(test, assert_instr(vperm2f128, imm8 = 0x5))]
pub unsafe fn _mm256_permute2f128_ps(a: f32x8, b: f32x8, imm8: i8) -> f32x8 {
gnzlbg (Contributor):

Arguments that can only be 8 bits long, like this imm8, were made i8 or u8 on purpose.

@@ -27,8 +27,8 @@ pub unsafe fn _bextr_u32(a: u32, start: u32, len: u32) -> u32 {
#[target_feature = "+bmi"]
#[cfg_attr(test, assert_instr(bextr))]
#[cfg(not(target_arch = "x86"))]
-pub unsafe fn _bextr_u64(a: u64, start: u64, len: u64) -> u64 {
-    _bextr2_u64(a, (start & 0xff_u64) | ((len & 0xff_u64) << 8_u64))
+pub unsafe fn _bextr_u64(a: u64, start: u32, len: u32) -> u64 {
gnzlbg (Contributor), Dec 26, 2017:

These should probably be u8 instead of u32, since they can only take values in the [0, 64) range.
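A pure-Rust model of the suggested u8 signature (hypothetical, for illustration only; this is not the crate's code):

```rust
// BEXTR extracts `len` bits of `a` starting at bit `start`; both fields
// are single bytes in the instruction encoding, and only values in
// [0, 64) are meaningful, so u8 covers the whole domain.
fn bextr_u64_model(a: u64, start: u8, len: u8) -> u64 {
    if start >= 64 || len == 0 {
        return 0;
    }
    let shifted = a >> start;
    if len >= 64 {
        shifted
    } else {
        shifted & ((1u64 << len) - 1)
    }
}
```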

#[inline(always)]
#[target_feature = "+bmi"]
#[cfg_attr(test, assert_instr(tzcnt))]
pub unsafe fn _tzcnt_u16(x: u16) -> u16 {
gnzlbg (Contributor):

GCC and clang define these:

 #define _tzcnt_u16(a)     (__tzcnt_u16((a)))

E.g. see here: https://clang.llvm.org/doxygen/bmiintrin_8h_source.html

alexcrichton (Member, Author):

Perhaps, but this isn't on Intel's intrinsic list, which is what this crate is for.

gnzlbg (Contributor):

Yep, the Intel intrinsic list only defines _tzcnt_u{32,64}.

MSVC does not define it here: https://msdn.microsoft.com/en-us/library/hh977023.aspx

and the Intel C++ compiler doesn't seem to define it either. GCC does define it, here: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/bmiintrin.h#L38M

> Perhaps, but this isn't on Intel's intrinsic list, which is what this crate is for.

The Intel intrinsics list doesn't define many x86 intrinsics either (e.g. all of the AMD ones).

Anyways, I am fine with removing it, just saying that two widely used C and C++ compilers offer it, which is why it was offered here.

@@ -163,8 +153,8 @@ pub unsafe fn _tzcnt_u64(x: u64) -> u64 {
#[inline(always)]
#[target_feature = "+bmi"]
#[cfg_attr(test, assert_instr(tzcnt))]
-pub unsafe fn _mm_tzcnt_u32(x: u32) -> u32 {
-    x.trailing_zeros()
+pub unsafe fn _mm_tzcnt_32(x: u32) -> i32 {
gnzlbg (Contributor), Dec 26, 2017:

The renaming is correct, but these should return an unsigned integer (it returns a size). This integer is in the range [0, 32), so per what we were doing elsewhere it should be a u8, but I have opened an issue to clarify that.

Also, clang, GCC, and the Intel intrinsic list define

 #define _tzcnt_u32(a)     (__tzcnt_u32((a)))

and in this module the non-_mm prefixed versions are missing (so the typo might have come from here).

@@ -173,8 +163,8 @@ pub unsafe fn _mm_tzcnt_u32(x: u32) -> u32 {
#[inline(always)]
#[target_feature = "+bmi"]
#[cfg_attr(test, assert_instr(tzcnt))]
-pub unsafe fn _mm_tzcnt_u64(x: u64) -> u64 {
-    x.trailing_zeros() as u64
+pub unsafe fn _mm_tzcnt_64(x: u64) -> i64 {
gnzlbg (Contributor):

Same as above: the non-_mm version is missing, and the size and signedness of the result do not match what we were doing elsewhere.

@@ -55,8 +55,8 @@ pub unsafe fn _bzhi_u32(a: u32, index: u32) -> u32 {
#[target_feature = "+bmi2"]
#[cfg_attr(test, assert_instr(bzhi))]
#[cfg(not(target_arch = "x86"))]
-pub unsafe fn _bzhi_u64(a: u64, index: u64) -> u64 {
-    x86_bmi2_bzhi_64(a, index)
+pub unsafe fn _bzhi_u64(a: u64, index: u32) -> u64 {
gnzlbg (Contributor):

index is in the range [0, 64), so the size of the index type might be wrong.

@@ -764,7 +764,7 @@ pub unsafe fn _mm_setzero_ps() -> f32x4 {
#[inline(always)]
#[target_feature = "+sse"]
#[cfg_attr(test, assert_instr(shufps, mask = 3))]
-pub unsafe fn _mm_shuffle_ps(a: f32x4, b: f32x4, mask: i32) -> f32x4 {
+pub unsafe fn _mm_shuffle_ps(a: f32x4, b: f32x4, mask: u32) -> f32x4 {
gnzlbg (Contributor):

It seems that in general the Intel intrinsics "try" to use unsigned integers for bit manipulation because that is what makes sense in C++, where unsigned integers have modulo arithmetic (but signed integers have undefined behavior on overflow).

The Intel intrinsics don't do this consistently (probably because some of them are 20 years old and this guideline wasn't clear then) but we should probably follow what the new intrinsics do.

@@ -906,7 +906,7 @@ pub unsafe fn _mm_movemask_ps(a: f32x4) -> i32 {
assert_instr(unpcklps))]
// TODO: This function is actually not limited to floats, but that's
// what matches the C type most closely: (__m128, *const __m64) -> __m128
-pub unsafe fn _mm_loadh_pi(a: f32x4, p: *const f32) -> f32x4 {
+pub unsafe fn _mm_loadh_pi(a: f32x4, p: *const u8x8) -> f32x4 {
gnzlbg (Contributor):

I think this should be p: *const f32x2 because this loads two floats from p into the upper two floats of a.
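A scalar model of those semantics (illustrative only, not the real intrinsic):

```rust
// The two floats at `p` replace the upper two lanes of `a`, which is
// why *const f32x2 describes the pointee exactly.
unsafe fn loadh_pi_model(a: [f32; 4], p: *const [f32; 2]) -> [f32; 4] {
    let hi = *p;
    [a[0], a[1], hi[0], hi[1]]
}
```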

gnzlbg (Contributor):

Also, in case we can't agree on that, I think __m64 would have been better than u8x8.

@@ -957,7 +957,7 @@ pub unsafe fn _mm_loadh_pi(a: f32x4, p: *const f32) -> f32x4 {
#[cfg_attr(all(test, target_arch = "x86", not(target_feature = "sse2")),
assert_instr(movss))]
// TODO: Like _mm_loadh_pi, this also isn't limited to floats.
-pub unsafe fn _mm_loadl_pi(a: f32x4, p: *const f32) -> f32x4 {
+pub unsafe fn _mm_loadl_pi(a: f32x4, p: *const u8x8) -> f32x4 {
gnzlbg (Contributor):

Same here: p: *const f32x2.

@@ -1070,14 +1070,14 @@ pub unsafe fn _mm_loadr_ps(p: *const f32) -> f32x4 {
// On i586 (no SSE2) it just generates plain MOV instructions.
#[cfg_attr(all(test, any(target_arch = "x86_64", target_feature = "sse2")),
assert_instr(movhpd))]
-pub unsafe fn _mm_storeh_pi(p: *mut u64, a: f32x4) {
+pub unsafe fn _mm_storeh_pi(p: *mut u8x8, a: f32x4) {
gnzlbg (Contributor):

Same here; we should make the storeh functions consistent with the loadh ones. That is: p: *mut f32x2.

-    assert_instr(movnti))]
-pub unsafe fn _mm_stream_pi(mem_addr: *mut i64, a: i64) {
+// #[cfg_attr(test, assert_instr(movntq))] // FIXME: llvm codegens wrong instr
+pub unsafe fn _mm_stream_pi(mem_addr: *mut i8x8, a: i8x8) {
gnzlbg (Contributor), Dec 26, 2017:

We should be using __m64 here (which might well fix the FIXME, since that is the magic x86_mmx type that represents integers in MMX registers).

@@ -61,7 +61,7 @@ pub unsafe fn _mm_blendv_epi8(a: i8x16, b: i8x16, mask: i8x16) -> i8x16 {
#[inline(always)]
#[target_feature = "+sse4.1"]
#[cfg_attr(test, assert_instr(pblendw, imm8 = 0xF0))]
-pub unsafe fn _mm_blend_epi16(a: i16x8, b: i16x8, imm8: u8) -> i16x8 {
+pub unsafe fn _mm_blend_epi16(a: i16x8, b: i16x8, imm8: i32) -> i16x8 {
gnzlbg (Contributor):

Since this is an imm8 used as a bitmask, I think it is more consistent with what we were previously doing to use a u8 (because it is a bitmask, and because of its value range).
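The value-range argument in concrete terms (an illustrative snippet, not crate code):

```rust
// With `imm8: u8` the compiler itself enforces the [0, 255] range.
fn blend_mask(imm8: u8) -> u8 {
    imm8
}

fn main() {
    let _ok = blend_mask(0xF0);
    // let _err = blend_mask(0xFFFF); // error: literal out of range for `u8`
}
```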

@@ -91,7 +91,7 @@ pub unsafe fn _mm_blendv_ps(a: f32x4, b: f32x4, mask: f32x4) -> f32x4 {
#[inline(always)]
#[target_feature = "+sse4.1"]
#[cfg_attr(test, assert_instr(blendpd, imm2 = 0b10))]
-pub unsafe fn _mm_blend_pd(a: f64x2, b: f64x2, imm2: u8) -> f64x2 {
+pub unsafe fn _mm_blend_pd(a: f64x2, b: f64x2, imm2: i32) -> f64x2 {
gnzlbg (Contributor):

The Intel Intrinsics guide calls the last argument imm8 (so u8).

@@ -8,49 +8,49 @@ use stdsimd_test::assert_instr;
use v128::*;

/// String contains unsigned 8-bit characters *(Default)*
-pub const _SIDD_UBYTE_OPS: i8 = 0b0000_0000;
+pub const _SIDD_UBYTE_OPS: i32 = 0b0000_0000;
gnzlbg (Contributor):

These should have been u8 instead of i8.

@@ -36,7 +36,7 @@ extern "C" {
#[inline(always)]
#[target_feature = "+xsave"]
#[cfg_attr(test, assert_instr(xsave))]
-pub unsafe fn _xsave(mem_addr: *mut u8, save_mask: u64) -> () {
+pub unsafe fn _xsave(mem_addr: *mut u8, save_mask: u64) {
gnzlbg (Contributor):

Clippy should have complained about these; it would be nice to know why it didn't and to file an issue upstream.

-// no particular instruction to test
-pub unsafe fn _mm_movepi64_pi64(a: i64x2) -> i64 {
-    a.extract(0)
+// #[cfg_attr(test, assert_instr(movdq2q))] // FIXME: llvm codegens wrong instr?
gnzlbg (Contributor):

We should go through all FIXMEs and TODOs of the x86 intrinsics before stabilization. There was one earlier that said "add test on windows" with little explanation of exactly what was broken there.

-pub unsafe fn _mm_movepi64_pi64(a: i64x2) -> i64 {
-    a.extract(0)
+// #[cfg_attr(test, assert_instr(movdq2q))] // FIXME: llvm codegens wrong instr?
+pub unsafe fn _mm_movepi64_pi64(a: i64x2) -> i8x8 {
gnzlbg (Contributor):

This is wrong: this function returns a single 64-bit integer. It should have been __m64 instead of i64, though (probably __m64 wasn't implemented back then).

We should go over all intrinsics that return __m64 and make sure that we don't make them worse by returning something else. In particular, chaining one intrinsic that returns __m64 with one that takes __m64 as an argument probably generates assembly that operates directly on that MMX register. However, u8x8 is not an x86_mmx LLVM type, so by using a different type that does not translate to x86_mmx we might be incurring a cost that the C compilers don't have.

-pub unsafe fn _mm_movpi64_epi64(a: i64) -> i64x2 {
-    i64x2::new(a, 0)
+// #[cfg_attr(test, assert_instr(movq2dq))] // FIXME: llvm codegens wrong instr?
+pub unsafe fn _mm_movpi64_epi64(a: i8x8) -> i64x2 {
gnzlbg (Contributor):

This function should take a __m64 instead of an i64 or i8x8. That would probably fix the FIXME.

@@ -47,7 +47,7 @@ pub unsafe fn _mm_shuffle_pi8(a: u8x8, b: u8x8) -> u8x8 {
#[inline(always)]
#[target_feature = "+ssse3"]
#[cfg_attr(test, assert_instr(palignr, n = 15))]
-pub unsafe fn _mm_alignr_pi8(a: u8x8, b: u8x8, n: u8) -> u8x8 {
+pub unsafe fn _mm_alignr_pi8(a: u8x8, b: u8x8, n: i32) -> u8x8 {
gnzlbg (Contributor):

n: u8 was by design here as well.

for intrinsic in data.intrinsics.iter() {
    // This intrinsic has multiple definitions in the XML, so just ignore it.
    if intrinsic.name == "_mm_prefetch" {
        continue
gnzlbg (Contributor):

The _mm_prefetch intrinsic does something different depending on whether the target has AVX-512, SSE, or not even SSE. I think it is OK to ignore these for now (AVX-512 overloads some intrinsics to have slightly different behavior).


// Verify that all `#[target_feature]` annotations are correct, ensuring
// that we've actually enabled the right instruction set for this
// intrinsic.
gnzlbg (Contributor):

Could we check here that we have put each intrinsic in the right module (e.g. by saving the name of the module in which the intrinsics are implemented)? There were some AVX2 intrinsics in the AVX module.
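A rough sketch of such a check (names hypothetical; this is not the verifier's actual data model):

```rust
// If each collected intrinsic carried the Rust module it was found in,
// the verifier could compare that against the CPUID feature in the XML.
struct Function {
    name: &'static str,
    module: &'static str, // e.g. "avx2", captured at collection time
}

fn check_module(f: &Function, xml_cpuid: &str) {
    assert_eq!(
        f.module,
        xml_cpuid.to_lowercase(),
        "{} is defined in module `{}` but the XML lists CPUID `{}`",
        f.name,
        f.module,
        xml_cpuid
    );
}
```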

alexcrichton (Member, Author):

Perhaps, but I'll leave that to a future commit.

.collect::<String>();

// Normalize `bmi1` to `bmi` as apparently that's what we're calling
// it.
gnzlbg (Contributor):

We might want to change the name of the target feature here to match Intel's name before stabilization.
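The normalization in question, sketched in simplified form (the real verifier code may differ):

```rust
// Lowercase Intel's CPUID string and rename the features the crate
// spells differently (here just `bmi1` -> `bmi`).
fn normalize_feature(cpuid: &str) -> String {
    match cpuid.to_lowercase().as_str() {
        "bmi1" => "bmi".to_string(),
        other => other.to_string(),
    }
}
```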


_ => panic!("failed to equate: `{}` and {:?} for {}", intel, t, intrinsic),
}
}
gnzlbg (Contributor), Dec 26, 2017:

Nice work here.

The three main points that worry me a bit are:

  • this equates int with a 32-bit integer, but an int in C can be a 64-bit integer (and actually is on some platforms). I think assuming int means a 32-bit integer is fine, but
  • Intel often writes that the argument is an 8-bit integer type even though the argument type is int, or the specification mentions that the value is at most N, meaning that 2-bit, 4-bit, 8-bit, or 16-bit integer types might make more sense to represent these,
  • this assumes that int maps to a signed integer, but in many intrinsics Intel uses int to denote a count or a bit index, which can never be negative.

alexcrichton (Member, Author):

> but an int in C can be a 64-bit integer (and actually is on some platforms).

Not on Intel AFAIK.

gnzlbg (Contributor), Dec 27, 2017:

It doesn't depend on the CPU vendor, but on the platform (which must conform to the C standard). For example:

  • on x86_64-unknown-linux-gnu ints are 32-bit and longs are 64-bit,
  • on x86_64-pc-windows-msvc ints and longs are 32-bit and long longs are 64-bit, and
  • on x86_64-cray-linux-gnu ints are also 64-bit.

All of these conform to the C standard, and there are x86_64-cray-linux-gnu targets with both Intel and AMD CPUs.

alexcrichton (Member, Author):

Er, what I mean is that this doesn't matter for us. Rust doesn't have any platforms where int isn't i32, so we can just cross this bridge when we get there. There's no real need to add hundreds of lines of "portability code" that aren't actually buying us anything.
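Illustrating the claim (on the targets Rust currently supports, C's int is 32 bits):

```rust
// `c_int` is i32 on every platform Rust currently ships for, so mapping
// Intel's `int` to i32 costs nothing in practice.
fn main() {
    assert_eq!(std::mem::size_of::<std::os::raw::c_int>(), 4);
}
```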

gnzlbg (Contributor) commented Dec 26, 2017:

So this looks awesome and it has surfaced lots of bugs and inconsistencies in the conventions that we were trying to follow.

I think we should clarify whether we want to keep following these conventions and improve this a bit to enforce them:

  • we were using unsigned integer types when bit manipulation was intended (including bitflags, bitmasks, etc.): why? I don't really know. If I had to guess, I'd say it is because unsigned integers in C and C++ have modulo arithmetic, and the newer intrinsics tend to follow this convention even though some of the older ones don't. In Rust, however, arithmetic and signedness are orthogonal to each other.
  • we were using appropriately sized types (as long as it was reasonable): why? For increased type safety (e.g. trying to pass 0xFFFF to a flag taking a u8 raises an error). Intel specifies most of these arguments as having type int (which can be anywhere from 16 to 64 bits wide, depending on the platform), but then the specification text adds extra semantic information, like the value always being positive or being in some range [0, {2,4,8,16,...,255,16386,...}). Many intrinsics used this information to translate int not only to i32 but to every combination of {i,u}{8,16,32}.
  • we did not have a properly working __m64 for a long time, so we have a lot of intrinsics with broken codegen that just use i64. We should be extra careful with these intrinsics because __m64 means an MMX register, and LLVM calls this particular type x86_mmx. That is, i8x8 has 64 bits, but it is not necessarily an x86_mmx type. The easiest way to strongly type these is to just transmute internally to __m64, but we should check that chaining these intrinsics doesn't break codegen.

alexcrichton (Member, Author):

In general, as the libs team has discussed SIMD over time, it's becoming more and more apparent that we're not going to deviate at all from Intel's upstream definitions. I expect the uNxN types to all go away on Intel eventually in favor of __m256 and such natively as we stabilize intrinsics. As such I've switched all arguments to exactly what Intel says (i32 where Intel says int). In general I don't think the constant arguments really matter; they're all constants anyway, so the type shouldn't propagate anywhere else in the program.

I'd personally prefer to not have special cases when reading Intel's own definitions and trying to bend over backwards to map those to our own "possibly more correct" types. If we can come up with a strong technical argument for why we'd make such a transformation I think that makes sense, but "we happened to do it yesterday" I don't think is sufficient.

For __m64-related things, I believe i8x8 (or any 64-bit vector) is defined the exact same way as __m64 (or at least that's how I thought it was done). In that sense we'll likely eventually move to just __m64, so that'll get updated naturally. The main change here was switching from a literal i64 to an MMX type.

gnzlbg (Contributor) commented Dec 27, 2017:

I'll be ok with:

    1. offering the __m{64,128,256}{_,i,d} types without any operations (like we do today)
    2. specifying all intrinsics to use those, just like C does
    3. (optional, as an after-step) offering the {i,u,f}{8,16,32,64}x{2,4,8,16,32} types, with conversions from/to __m{64,128,256}{_,i,d} (to avoid stabilizing repr(simd) and the LLVM generic operations like simd_cast).

That is, we have the same intrinsic API that C does, but higher-level libraries and users can use the portable {i,u,f}{8,16,32,64}x{2,4,8,16,32} types in their APIs; internally they will need to convert those to the appropriate __m{64,128,256}{_,i,d}.
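A minimal sketch of point 3 under hypothetical names and layouts (the real types live behind repr(simd); plain arrays stand in here):

```rust
// Portable vector types with cheap conversions to and from the weak
// architectural types; both sides are 16 bytes of plain data, so a
// transmute is the entire cost of the conversion.
#[allow(non_camel_case_types)]
#[derive(Clone, Copy)]
#[repr(C)]
pub struct __m128i([u64; 2]); // stand-in for the real opaque type

#[allow(non_camel_case_types)]
#[derive(Clone, Copy)]
#[repr(C)]
pub struct u32x4(pub [u32; 4]);

impl From<u32x4> for __m128i {
    fn from(v: u32x4) -> Self {
        unsafe { core::mem::transmute(v) }
    }
}

impl From<__m128i> for u32x4 {
    fn from(v: __m128i) -> Self {
        unsafe { core::mem::transmute(v) }
    }
}
```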

Points 1 and 2 are non-controversial because that's what C does and this is a low-level API, so we can deliver those "soon".

Point 3 would be a convenience for libraries like faster. They can build their own portable types on top of __m{64,128,256}{_,i,d}, but that is really painful for library authors, and we really should be using the generic LLVM intrinsics here anyway. In particular, LLVM 5 has experimental support for portable reductions of vector types (sum, max, etc.), and once those stabilize we will want the {i,u,f}{8,16,32,64}x{2,4,8,16,32} types to offer them as well.

All in all, I'll agree to all of this, but it's a huge change API-wise, and it hasn't been discussed much in the open, I think. The quickest thing would be to create an issue here to discuss it, but this is already at a point where we (or I) could write an RFC about it, so maybe we should do that instead.

In the meantime I don't know what to do with this PR. I think we should merge it, but changing all those APIs is a huge breaking change.


> (i32 where Intel says int)

We probably should just add an int type alias somewhere in the library so that, if platform compatibility ends up giving us trouble in the future, we can at least hack around that in a single place.

> I'd personally prefer to not have special cases when reading Intel's own definitions and trying to bend over backwards to map those to our own "possibly more correct" types. If we can come up with a strong technical argument for why we'd make such a transformation I think that makes sense, but "we happened to do it yesterday" I don't think is sufficient.

Basically we would need to review every single intrinsic and motivate why it uses one type instead of another. That's an absurd ton of work. From the POV of "we want to stabilize this someday", trying to deviate from the Intel intrinsics in any way makes this whole thing un-reviewable/un-RFC'able.

> For __m64-related things, I believe i8x8 (or any 64-bit vector) is defined the exact same way as __m64 (or at least that's how I thought it was done).

IIRC, when you added x86_mmx to the compiler, the logic was that it had to be a 64-bit vector containing a single element; see https://github.com/rust-lang/rust/blob/3cc68bac7c89a81ec83cbd8f0aff9db001425c50/src/librustc_trans/type_of.rs#L36.

gnzlbg (Contributor) commented Dec 27, 2017:

@BurntSushi help

AdamNiederer (Contributor) commented Dec 28, 2017:

> Points 1 and 2 are non-controversial because that's what C does and this is a low-level API, so we can deliver those "soon".

It's worth noting, however, that the ambiguity of Intel's vector types isn't necessarily duplicated across all platforms. Having any of the AArch64 intrinsics take an __m128i would be just as wrong as having an SSE2 intrinsic take an i32x4 (well, int32x4_t). MSA defines the same type as v4i32.

I'd be willing to steward a package containing the more stringently-typed definitions of these intrinsics, if that helps with the decision at all. Perhaps we could have a solution similar to the http crate with the "strong" vector types and intrinsic definitions?

> In the meantime I don't know what to do with this PR. I think we should merge it, but changing all those APIs is a huge breaking change.

I maintain 2 of the 5 packages which depend on stdsimd at the moment, and I'd recommend breaking the API as much as possible while we're still figuring this stuff out and few people actually use the crate.

alexcrichton force-pushed the verify branch 2 times, most recently from 4c3473b to 134792b, December 28, 2017 17:16
alexcrichton (Member, Author):

@gnzlbg

> That is, we have the same intrinsic API that C does, but higher-level libraries and users can use the portable {i,u,f}{8,16,32,64}x{2,4,8,16,32} types in their APIs...

Yeah, that's what I'd imagine as well; the "more portable" types would probably lag much further behind in stabilization.

> All in all, I'll agree to all of this, but it's a huge change API-wise ... but changing all those APIs is a huge breaking change.

Agreed, it's a big change! We'd want to be somewhat careful about it, I think, and probably, yes, have a bit of discussion first. FWIW I don't consider this library to have any stability guarantees right now, though.

> We probably should just add an int type alias somewhere in the library so that, if platform compatibility ends up giving us trouble in the future, we can at least hack around that in a single place.

I'd rather not; we have yet to even see from a distance a realistic platform for Rust where this is true. This has lots of complications when moving into libcore as well. Let's just not.

> Basically we would need to review every single intrinsic and motivate why it uses one type instead of another. That's an absurd ton of work. From the POV of "we want to stabilize this someday", trying to deviate from the Intel intrinsics in any way makes this whole thing un-reviewable/un-RFC'able.

Correct; this is why "hey, it's the same as Intel" is so crucial for being able to rubber-stamp APIs and be confident in such a stamp.


@AdamNiederer

> It's worth noting, however, that the ambiguity of Intel's vector types isn't necessarily duplicated across all platforms

Certainly! Something like __m128 would not be portable; it'd only be available on x86. Platforms like AArch64 would have their own whole suite of types.

> I'd be willing to steward a package containing the more stringently-typed definitions of these intrinsics, if that helps with the decision at all. Perhaps we could have a solution similar to the http crate with the "strong" vector types and intrinsic definitions?

That'd be great! It's certainly something we always wanted to see on top of a library like this!

alexcrichton (Member, Author):

> In the meantime I don't know what to do with this PR

I plan on addressing some of the feedback and then merging. I don't think we'll break much usage by switching the types from things like u8 to i32, as they're all constants anyway; otherwise it's mostly just MMX stuff that got fixed, which we've already seen claims shouldn't be exposed at all!

alexcrichton force-pushed the verify branch 2 times, most recently from c884f4e to d8dc256, December 28, 2017 17:35
BurntSushi (Member) commented Dec 28, 2017:

I'm in favor of this. I basically don't think "breakage" at this point is something we should be concerning ourselves with. This crate is still very clearly in an experimental phase while we iron out the API. Folks using it today should be doing it with that knowledge, and with the knowledge that their experience should help inform us of future API decisions. We should semver bump and move on. :)

I don't have too many other thoughts (other than the fact that this PR is awesome). I am slowly coming around to @alexcrichton's view that we should revert our efforts to provide more stringent types on the vector intrinsics. I am pretty disappointed myself since we put so much work into it, but I suppose it's just a sunk cost. I tend to think that having these be checked in an automated fashion is a pretty huge advantage. With that said, I do think we should have a contingency; i.e., the stabilization effort should lay out plans for what to do in the event that we did indeed get some intrinsics wrong.

gnzlbg (Contributor) commented Dec 29, 2017:

@alexcrichton So all in all I'd say this is good to go once it's green again.

The only thing that is a bit sketchy is switching so many APIs to use i8x8 when we already know we want to switch them to __m64 in the near future, but that doesn't need to be done in this PR.


@AdamNiederer

> Perhaps we could have a solution similar to the http crate with the "strong" vector types and intrinsic definitions?

I think we should just provide all vector types in coresimd for all platforms.

The x86 intrinsics are only specified in terms of the "weak" __mXYZw types, while most of the non-x86 intrinsics (ARM, MIPS, PowerPC, ...) are specified in terms of the "strong" txWxL vector types (with different notation, e.g., v4i32, but otherwise equal semantics).

Implementing these types needs unstable intrinsics that we would prefer never to stabilize, and we need to implement both "weak" and "strong" vector types in coresimd anyways.

Given that they are already implemented, and that they currently work with each other (this is not hard), I don't currently see what we would gain from splitting the types into a different crate.

In particular, because coresimd would then need to depend on this crate, and this crate would then need to be part of core.

Maybe my views on this will change as we implement more ARM, MIPS, and PowerPC intrinsics in coresimd and new problems arise, but I expect the differently-named "strong" vector types in these platforms to be just type-aliases to a single set of "strong" vector types.

Also, for stabilizing the intrinsics of a particular platform we only need to stabilize the vector types that the platform uses. That is, to stabilize the x86 intrinsics we only need to stabilize the "weak" vector types, and the "strong" ones can remain unstable.


@BurntSushi

> I am pretty disappointed myself since we put so much work into it, but I suppose it's just a sunk cost.

Well, it's not 100% sunk: we learned something. Once the RFC for this arrives, somebody is going to ask "why aren't the Intel intrinsics more strongly typed?", and now we have a pretty good answer ("they were more strongly typed for a while; it did not work out").

> I tend to think that having these be checked in an automated fashion is a pretty huge advantage.

Agreed.

alexcrichton (Member, Author):

@gnzlbg

> The only thing that is a bit sketchy is switching so many APIs to use i8x8 when we already know we want to switch them to __m64 in the near future, but that doesn't need to be done in this PR.

Oh I think I moved them over to __m64 where I could.

alexcrichton force-pushed the verify branch 4 times, most recently from 88ee2fe to 601f76d, December 29, 2017 17:22