arm64jit: Avoid fused multiplies in vcrsp.t #18249

unknownbrackets · 2023-09-27T03:14:22Z

Note: this is for the old/current arm64 jit, not the new IR-based arm64 jit, which didn't have this problem.

With this change, issues in Harvest Moon with teleporting animals seem to disappear. It was causing some differences in signs of zeros in results, and slightly different result values.

I'm not sure what value was doing it, but none of the differences involved NANs or infinities, so I assume it was one of the above two cases. For posterity and anyone interested, it was:

Harvest Moon (US)
PC=08910b34
vcrsp.t C000, C100, C110

Adding a FlushAll before and after made it easy to confirm it wasn't regcache related. Some of the different results (filtering out cases where the string format is identical, even if the bytes are different):

Different results encountered: 5.791998, 0.000000, -0.000000 (should be 5.791998, 0.000000, 0.000000)
Different results encountered: 2.152147, 0.000000, -0.000000 (should be 2.152147, 0.000000, 0.000000)
Different results encountered: 2.075424, 0.000000, -0.000000 (should be 2.075424, 0.000000, 0.000000)
Different results encountered: -0.000000, 0.000000, -0.000000 (should be -0.000000, 0.000000, 0.000000)
Different results encountered: -0.000000, -0.000000, -0.000000 (should be -0.000000, -0.000000, 0.000000)
Different results encountered: 0.000000, 0.000001, -0.000000 (should be 0.000000, 0.000001, 0.000000)

-[Unknown]

With this change, issues in Harvest Moon with teleporting animals seem to disappear. It was causing some differences in signs of zeros in results, and slightly different result values.

hrydgard

Interesting. Fused muls are quite scary, we probably should avoid them since they likely don't happen on the real PSP even internally in ops like this..

unknownbrackets · 2023-09-27T06:52:58Z

I'm pretty convinced by the way prefixes happen on this instruction that it's actually just mapped to a series of vdots. I actually think a lot of instructions are prefix hacks for vdot.

In this case, the prefix bits are overridden such that the T swizzle is forced to yxwz, and the x (i.e. second position only) is negated, and this only applies to the third result. No prefixes apply to the first or second result. I think vcrsp.t actually just generates 3 vdots, in reverse order. I even think that it would have overlap hazards on hardware for this reason, but I haven't checked that part.

So in other words, I think it does that exponent-line-up thing that vdots do, which might lose precision compared to multiplies rather than gaining as in fused.

-[Unknown]

hrydgard · 2023-09-27T06:59:26Z

yeah, agree that they probably utilize that exponent-lining vdot hardware whenever they can in ops like this, since it's gonna be fewer cycles than a sequence of singular ops.

I wonder how fast it's possible to get a carefully asm-optimized implementation of those... Though IIRC we still have some slight inaccuracies even in our best emulation of them?

unknownbrackets · 2023-09-27T07:08:36Z

Maybe I need to check again. It was either accurate to all tests or off by the lowest bit or two in a percentage of them...

-[Unknown]

hrydgard · 2023-09-27T07:14:39Z

Maybe we can get fp64 to have a look in the latter case, hehe (which is what I seem to remember...)

fp64 · 2023-09-30T19:50:58Z

Sure, I can have a look, assuming there's data to look at.

On subject of FMA, I'm not sure how important (if at all) floating-point determinism is for most of the C++ code, but -ffp-contract=off is not currently set (it can default to fast even without -ffast-math; though it probably only affects things if FMA itself is enabled, e.g. -march=native).

fp64 · 2023-10-01T19:25:48Z

I wonder how fast it's possible to get a carefully asm-optimized implementation of those...

FWIW, I did an SSE2 reimplementation of vfpu_dot, and the speedup is visible, but not that impressive (this is about the best case, the win smaller on some CPUs):

CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Testing correctness... ok.
Testing speed.
vfpu_dot:            58.54 ns/call
vfpu_dot_sse2<0>:    43.70 ns/call
vfpu_dot_sse2<1>:    35.29 ns/call

Note: vfpu_dot_sse2<1> assumes round-to-nearest-ties-to-even, vfpu_dot_sse2<0> does its own rounding (using same code as vfpu_dot).

Edit: you can probably get more with AVX2, but maybe not much.

Can do a PR if you want this.

unknownbrackets · 2023-10-02T06:44:14Z

Sure, I can have a look, assuming there's data to look at.

So I went back and checked and my memory was indeed wrong: it's still off on the rounding sometimes, at least per my earlier notes.

A challenge here is the range of inputs. A vdot takes 8 total floats as inputs to produce a single output. Part of my validation strategy (since it was all int math anyway) was actually that I simply compiled the emulated vdot as MIPS PSP code, and validated it directly there, without generating huge data files. But that doesn't help anyone without access to a Allegrex processor.

I suppose for generating data, it makes the most sense to cover all the exponent combinations with a few mantissa values, although some good mantissa cases are needed to properly exercise rounding. Could also exclude exponents not within 32 of the base, outside a few samples to prove that rule and maybe some random values for coverage. So basically 64K for first lane, then times 1024^3 for the others? That's still a ton...

Maybe the better thing is to just export out a chunk of values the current one gets wrong, but that's probably harder to test against...

On subject of FMA, I'm not sure how important (if at all) floating-point determinism is for most of the C++ code, but -ffp-contract=off is not currently set (it can default to fast even without -ffast-math; though it probably only affects things if FMA itself is enabled, e.g. -march=native).

Most of the code, probably not that critical - a lot of it ends up going to the GPU where we don't have determinism anyway, so it's a losing battle. It does matter for CPU emulation, and would be interesting eventually to have determinism in the software renderer.

FWIW, I did an SSE2 reimplementation of vfpu_dot, and the speedup is visible, but not that impressive (this is about the best case, the win smaller on some CPUs):

Well, this seems good, though not sure if we can rely on the rounding mode or not. But it's actually way more interesting to make neon versions that SSE here, because that's really where it probably has the biggest observed performance impact to use the correct dots. I wonder if adding some vector_size attributes might be enough to coax clang into vectorizing arm (though probably not entirely optimally) for us...

Anyway, if it matches all the values, probably worth a PR. If we can prove on MSVC that it's safe to assume nearest then that's almost double the speed, which is good.

-[Unknown]

hrydgard · 2023-10-02T08:06:05Z

As for the 8 inputs, I think that can be dimension-reduced a lot, assuming things are symmetric, which they should be - depending on which part we are less confident in. Below I name the two input vectors a and b, and exclude infs and nans from consideration.

If it's the actual multiplies, it should be enough to only vary say the a.x and b.x, and hold the other ones at 0. If it's the rounding when aligning the mantissa and shifting bits out before summing, we should be able to create a focused set of inputs for that too.. say, hold a.x and b.x at some values producing a messy product, then set a.y to 1 and vary b.y through a large range. Although may be hard to isolate that from the rounding after the sum?

Just some thoughts..

fp64 · 2023-10-02T12:38:57Z

A challenge here is the range of inputs.

Combining the above suggestions with some of my own ideas:

completely random bitpatterns for a, b
random bitpatterns in small exponent range (both all-[1;0f;8.0f] and all-[2^-65;2^-63] to test denormals)
a=[1,1,1,1], b=random small range (including all-denormals)
a=[x,0,0,0], b=[y,0,0,0], random small-exp-range x,y
a=[x,z,0,0], b=[y,w,0,0], random small-exp-range x,y,z,w (also the case where x*y and z*w overflow, but the result doesn't)
special values (+0,-0,inf,nan,2^-126,2^-139,1,1+(2^-23), etc.), including [1,1,1,1]*[-0,-0,-0,-0]
a=[x,1,0,0], b=[y,z,0,0] (FMA-like)
the values current implementation gets wrong
permutations (same for a and b) to test if result is order-independent (e.g. dpps is specced to (A+B)+(C+D))

not sure if we can rely on the rounding mode or not.

I wonder if it makes sense to if(fegetround()==FE_TONEAREST) or just check a result of adding 2 volatile floats (well, 2 pairs).

But it's actually way more interesting to make neon versions that SSE here

I neither know much NEON, nor have somewhere to test it (other than trying NEON_2_SSE.h or SIMDe locally). I may try pure GCC vector extensions version (which are annoyingly slightly different on GCC vs. clang).

then that's almost double the speed, which is good.

It does vary a lot depending on system (e.g. speed-up is 4% for vfpu_dot_sse2<0>, and 12% for vfpu_dot_sse2<1> on my machine).

fp64 · 2023-10-02T13:50:22Z

That version actually has botched testing for NaNs, but it looks easy to fix.

fp64 · 2023-10-02T14:38:00Z

Hmm, clang seems ok at vector extension version of multiply-high, GCC... not so much (on both x86 and ARM). Both autovectorize C++ version poorly. Both look ok for ARM variable per-lane shift.
https://godbolt.org/z/PvhsP9YhT

See hrydgard#18249. Speedup for this function ranges 10%..100%, depending on system. Updated verification and speed measurements: https://godbolt.org/z/W1z3sj6hz

arm64jit: Avoid fused multiplies in vcrsp.t.

ded18ff

With this change, issues in Harvest Moon with teleporting animals seem to disappear. It was causing some differences in signs of zeros in results, and slightly different result values.

unknownbrackets added the arm64jit Occurs with JIT on 64-bit ARM devices, but not another CPU backend. label Sep 27, 2023

unknownbrackets mentioned this pull request Sep 27, 2023

Chicken flickering / teleporting around #18203

Closed

5 tasks

hrydgard approved these changes Sep 27, 2023

View reviewed changes

hrydgard added this to the v1.16.5 milestone Sep 27, 2023

hrydgard merged commit d6a8bfd into hrydgard:master Sep 27, 2023

unknownbrackets deleted the arm64jit-vcrsp branch September 27, 2023 06:53

fp64 added a commit to fp64/ppsspp that referenced this pull request Oct 2, 2023

Add SSE2 version of vfpu_dot

23e2d0f

See hrydgard#18249. Speedup for this function ranges 10%..100%, depending on system. Updated verification and speed measurements: https://godbolt.org/z/W1z3sj6hz

fp64 mentioned this pull request Oct 2, 2023

Add SSE2 version of vfpu_dot #18289

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arm64jit: Avoid fused multiplies in vcrsp.t #18249

arm64jit: Avoid fused multiplies in vcrsp.t #18249

unknownbrackets commented Sep 27, 2023 •

edited

Loading

hrydgard left a comment

unknownbrackets commented Sep 27, 2023 •

edited

Loading

hrydgard commented Sep 27, 2023

unknownbrackets commented Sep 27, 2023

hrydgard commented Sep 27, 2023

fp64 commented Sep 30, 2023

fp64 commented Oct 1, 2023 •

edited

Loading

unknownbrackets commented Oct 2, 2023

hrydgard commented Oct 2, 2023

fp64 commented Oct 2, 2023 •

edited

Loading

fp64 commented Oct 2, 2023

fp64 commented Oct 2, 2023

arm64jit: Avoid fused multiplies in vcrsp.t #18249

arm64jit: Avoid fused multiplies in vcrsp.t #18249

Conversation

unknownbrackets commented Sep 27, 2023 • edited Loading

hrydgard left a comment

Choose a reason for hiding this comment

unknownbrackets commented Sep 27, 2023 • edited Loading

hrydgard commented Sep 27, 2023

unknownbrackets commented Sep 27, 2023

hrydgard commented Sep 27, 2023

fp64 commented Sep 30, 2023

fp64 commented Oct 1, 2023 • edited Loading

unknownbrackets commented Oct 2, 2023

hrydgard commented Oct 2, 2023

fp64 commented Oct 2, 2023 • edited Loading

fp64 commented Oct 2, 2023

fp64 commented Oct 2, 2023

unknownbrackets commented Sep 27, 2023 •

edited

Loading

unknownbrackets commented Sep 27, 2023 •

edited

Loading

fp64 commented Oct 1, 2023 •

edited

Loading

fp64 commented Oct 2, 2023 •

edited

Loading