-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arm64jit: Avoid fused multiplies in vcrsp.t #18249
Conversation
With this change, issues in Harvest Moon with teleporting animals seem to disappear. It was causing some differences in signs of zeros in results, and slightly different result values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Fused muls are quite scary, we probably should avoid them since they likely don't happen on the real PSP even internally in ops like this..
I'm pretty convinced by the way prefixes happen on this instruction that it's actually just mapped to a series of vdots. I actually think a lot of instructions are prefix hacks for vdot. In this case, the prefix bits are overridden such that the T swizzle is forced to yxwz, and the x (i.e. second position only) is negated, and this only applies to the third result. No prefixes apply to the first or second result. I think vcrsp.t actually just generates 3 vdots, in reverse order. I even think that it would have overlap hazards on hardware for this reason, but I haven't checked that part. So in other words, I think it does that exponent-line-up thing that vdots do, which might lose precision compared to multiplies rather than gaining as in fused. -[Unknown] |
yeah, agree that they probably utilize that exponent-lining vdot hardware whenever they can in ops like this, since it's gonna be fewer cycles than a sequence of singular ops. I wonder how fast it's possible to get a carefully asm-optimized implementation of those... Though IIRC we still have some slight inaccuracies even in our best emulation of them? |
Maybe I need to check again. It was either accurate to all tests or off by the lowest bit or two in a percentage of them... -[Unknown] |
Maybe we can get fp64 to have a look in the latter case, hehe (which is what I seem to remember...) |
Sure, I can have a look, assuming there's data to look at. On subject of FMA, I'm not sure how important (if at all) floating-point determinism is for most of the C++ code, but |
FWIW, I did an SSE2 reimplementation of
Note: Edit: you can probably get more with AVX2, but maybe not much. Can do a PR if you want this. |
So I went back and checked and my memory was indeed wrong: it's still off on the rounding sometimes, at least per my earlier notes. A challenge here is the range of inputs. A vdot takes 8 total floats as inputs to produce a single output. Part of my validation strategy (since it was all int math anyway) was actually that I simply compiled the emulated vdot as MIPS PSP code, and validated it directly there, without generating huge data files. But that doesn't help anyone without access to a Allegrex processor. I suppose for generating data, it makes the most sense to cover all the exponent combinations with a few mantissa values, although some good mantissa cases are needed to properly exercise rounding. Could also exclude exponents not within 32 of the base, outside a few samples to prove that rule and maybe some random values for coverage. So basically 64K for first lane, then times 1024^3 for the others? That's still a ton... Maybe the better thing is to just export out a chunk of values the current one gets wrong, but that's probably harder to test against...
Most of the code, probably not that critical - a lot of it ends up going to the GPU where we don't have determinism anyway, so it's a losing battle. It does matter for CPU emulation, and would be interesting eventually to have determinism in the software renderer.
Well, this seems good, though not sure if we can rely on the rounding mode or not. But it's actually way more interesting to make neon versions that SSE here, because that's really where it probably has the biggest observed performance impact to use the correct dots. I wonder if adding some Anyway, if it matches all the values, probably worth a PR. If we can prove on MSVC that it's safe to assume nearest then that's almost double the speed, which is good. -[Unknown] |
As for the 8 inputs, I think that can be dimension-reduced a lot, assuming things are symmetric, which they should be - depending on which part we are less confident in. Below I name the two input vectors a and b, and exclude infs and nans from consideration. If it's the actual multiplies, it should be enough to only vary say the a.x and b.x, and hold the other ones at 0. If it's the rounding when aligning the mantissa and shifting bits out before summing, we should be able to create a focused set of inputs for that too.. say, hold a.x and b.x at some values producing a messy product, then set a.y to 1 and vary b.y through a large range. Although may be hard to isolate that from the rounding after the sum? Just some thoughts.. |
Combining the above suggestions with some of my own ideas:
I wonder if it makes sense to
I neither know much NEON, nor have somewhere to test it (other than trying NEON_2_SSE.h or SIMDe locally). I may try pure GCC vector extensions version (which are annoyingly slightly different on GCC vs. clang).
It does vary a lot depending on system (e.g. speed-up is 4% for |
That version actually has botched testing for NaNs, but it looks easy to fix. |
Hmm, clang seems ok at vector extension version of multiply-high, GCC... not so much (on both x86 and ARM). Both autovectorize C++ version poorly. Both look ok for ARM variable per-lane shift. |
See hrydgard#18249. Speedup for this function ranges 10%..100%, depending on system. Updated verification and speed measurements: https://godbolt.org/z/W1z3sj6hz
Note: this is for the old/current arm64 jit, not the new IR-based arm64 jit, which didn't have this problem.
With this change, issues in Harvest Moon with teleporting animals seem to disappear. It was causing some differences in signs of zeros in results, and slightly different result values.
I'm not sure what value was doing it, but none of the differences involved NANs or infinities, so I assume it was one of the above two cases. For posterity and anyone interested, it was:
Harvest Moon (US)
PC=08910b34
vcrsp.t C000, C100, C110
Adding a FlushAll before and after made it easy to confirm it wasn't regcache related. Some of the different results (filtering out cases where the string format is identical, even if the bytes are different):
-[Unknown]