Add a FAQ explaining why there is no fast-math mode. #260
Conversation
The other optimization recently mentioned is tree height reduction. My expectation is that WebAssembly producers can do some of this using fairly simple platform-agnostic heuristics (split up really long chains into a small number of chains, especially when there's nothing else going on). Obviously this doesn't cover everything, but the question is: how important is the difference? Also, we must weigh this against the fact that reassociation has the potential to create very significant behavior differences, harming WebAssembly's portability. So while I wouldn't necessarily rule such a feature out, it's reasonable to expect significant motivation before including it.
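To make that heuristic concrete, here is a minimal sketch (the function name and the 4-way split are arbitrary choices for illustration, not anything specified in this thread) of how a producer might split one long accumulation chain into a few independent chains before emitting WebAssembly:

```c
#include <stddef.h>

/* Hypothetical producer-side tree height reduction: four independent
 * partial sums replace one long serial dependency chain, shortening
 * the critical path.  Because the producer performs the reassociation,
 * the emitted WebAssembly itself remains deterministic. */
double sum(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* scalar remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);  /* shallow final reduction */
}
```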
As discussed today: we'll wait for data before coming to a conclusion. Leave bug #148 open, don't change the FAQ with #260 just yet. Re-discuss when @titzer and @pizlonator can discuss over a higher-throughput medium than GitHub issues.
This issue has both qualitative and quantitative aspects. The PR has a qualitative account, and is ready for discussion.

To address the request for data, I've now prepared this clang/LLVM patch, which strips all fast-math flags and settings from LLVM IR after the optimizer runs and before codegen runs (controlled by an environment variable because it was easy). I claim that this is likely to overestimate the effect of omitting fast-math flags in WebAssembly, because optimizations like FMA formation and tree height reduction could still be done on WebAssembly following the methods discussed earlier, because LLVM historically did all of its fast-math optimizations in codegen (due to historic limitations) and has only fairly recently started making use of fast-math flags in the optimizer, and because some of the fast-math optimizations in codegen will be active in the WebAssembly LLVM backend too.

I've now run the LLVM test-suite MultiSource benchmarks on an Intel system, comparing plain -ffast-math to -ffast-math with this experimental option to strip fast-math flags after the optimizer. The geometric mean scores between the two are within noise of each other (even without Bullet, discussed below). The following individual benchmarks had interesting results:

* Olden/power was a case where my experimental patch was over-zealous. LLVM codegen turns divisions by non-constant values into multiplications by reciprocals in cases where the same denominator is used multiple times, and this depends on fast-math flags, but the LLVM WebAssembly backend will also be able to use that same optimization. Re-enabling just that optimization made the performance difference disappear.
* A strange outlier was Bullet, which got about 3x faster when fast-math flags were stripped. Yes, that's right. This appears to be due to the fast-math flags changing the behavior of the program and thereby the progression of the algorithm. This is a good example of a hazard of fast-math optimization, on no less important code than Bullet. Performance portability isn't just about implementations needing to "work harder".
* The most interesting case was TSVC LoopRerolling (TSVC is a microbenchmark suite), in the dot product loop. Other benchmarks that do dot products (e.g. Linpack) didn't see slowdowns, so this was a little surprising. Looking into it, I found:
So to summarize: there were no significant performance losses, after correcting for cases where my experimental patch didn't accurately reflect what we'll be able to do, and ignoring Bullet. Now, this was on Intel; lower-power CPUs are less out-of-order or even not out-of-order at all, and the effect of floating point reassociation would sometimes be greater there. Also, this is not a comprehensive or carefully crafted benchmark suite, though it does contain quite a variety of floating point code. I invite people interested in this topic to participate in the effort. For now, I believe I've shown that this idea is not completely crazy, and that with this data the qualitative argument in the PR makes a strong case.
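To make the Olden/power reciprocal case above concrete, here is a minimal sketch of the transformation (illustrative only; the real rewrite happens inside LLVM codegen, not in source):

```c
/* One division replaces two when the same non-constant denominator is
 * reused.  Multiplying by a reciprocal rounds differently than
 * dividing, which is why the rewrite is gated on fast-math flags. */
void scale2(double *x, double *y, double d) {
    /* before: *x /= d; *y /= d;   (two divisions) */
    double r = 1.0 / d;            /* one division  */
    *x *= r;
    *y *= r;
}
```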
I don't think I was very clear in my summary. I confirmed with others, and the type of data we expected was:
Denormals don't have any meaningful impact on high-level design decisions (it's literally just a bit!), so delaying decisions until we can have real-world data on pre-MVP VMs is much better than setting things in stone (as this PR does).
Did you mean to post in #148? This PR is talking about fast-math flags, which don't entail any mode switching.
Oh, I indeed meant to post in #148. I do think this PR has some good content, but I'm not convinced about a few things, including the conclusion on FMA (I think we may want to ship an FMA opcode, which could be split up if unsupported). I think the same applies here: let's leave it open for now, and once we have VMs we can measure things.
Branch updated from 0959846 to 059681d.
Updated to not go into specifics about FMA here. I have performed quite a few measurements, as detailed above. What other measurements would you like to see here?
The FAQ text under review:

> WebAssembly implementations run on the user side, so there is no opportunity for developers to test the final behavior of the code. Nondeterminism at this level could cause distributed WebAssembly programs to behave differently in different implementations, or change over time. WebAssembly does have [some nondeterminism](Nondeterminism.md) in cases where the tradeoffs warrant it, but fast-math flags are not believed to be important enough:
>
> * Many of the important fast-math optimizations happen in the mid-level optimizer of a compiler, before WebAssembly code is emitted. For example, loop vectorization that depends on floating point reassociation can still be done at this level if the user applies the appropriate fast-math flags, so WebAssembly programs can still enjoy these benefits. As another example, compilers can replace floating point division with floating point multiplication by a reciprocal in WebAssembly programs just as they do for other platforms.
I'm not sure I understand what you're suggesting for vectorization. Would it be done on the dev's machine, or on-device? I think we want to allow both, though maybe at different dates.
It could be on the dev machine, or on-device in a JIT library; both would have the freedom to utilize fast-math flags, and the latter would also be able to feature-test for things like available SIMD features.
Maybe expand on this as we discussed yesterday? Reassociation can be a middle-end thing, but may also be something that requires precise target ISA information (e.g. vector width, number of registers, cache information). The middle end can ask for that information, or wasm backends can implement clever vectorization if they want (but I'd rather have the middle end do it).
Halide would be a great example to use here.
Ok, I added a bullet point for this. It's an interesting question whether the number of registers or cache information are things we want to expose in feature tests (no doubt some developers will require us to provide them, but we have other requirements to consider as well).
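As a hedged illustration of the "middle end can ask for that information" option discussed above: clang's loop pragmas let a producer pin a vectorization width chosen from whatever target knowledge it has (the width 4 here is an arbitrary assumption), rather than leaving that decision to the WebAssembly backend:

```c
/* Hypothetical: the producer bakes in an assumed target vector width
 * via clang's loop pragma instead of deferring to the wasm backend. */
void axpy(float *y, const float *x, float a, int n) {
#pragma clang loop vectorize(enable) vectorize_width(4)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```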
Agreed with the current form (most fast math is performed on the dev's machine, not the user's device), after discussing vectorization some more and punting on math functions until later.
I added a note about vectorization to #265. It's interesting: vectorization is greatly improved by floating point reassociation, magical aliasing/loop directives, and knowledge of the underlying hardware (among other things). The dev-machine side has the first two. The WebAssembly VM has the third. A JIT library opens up another theater, one where we can get all three at the same time. Knowledge of the target machine would be limited by the granularity of feature testing available, but it'll probably be enough to start with, and there are ways we could extend the system in the future.
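As a concrete sketch of the dev-machine side (assuming clang as the producer): with `-ffast-math`, the mid-level loop vectorizer is permitted to reassociate a reduction like the one below into vector partial sums, so the already-reassociated code is what gets lowered to WebAssembly; without the flag, the serial chain is preserved:

```c
/* A floating point reduction: the serial dependence on `s` blocks
 * vectorization unless reassociation is allowed (e.g. -ffast-math). */
double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```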
I like this PR, so lgtm is yes
lgtm, merging!
Add a FAQ explaining why there is no fast-math mode.
This PR documents why WebAssembly has no fast-math mode. x86, ARM, MIPS, and POWER all have no fast-math mode, which allows compiled programs to produce consistent numeric results across different machines; as a virtual ISA, WebAssembly has the same need.
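For a concrete sense of why this matters, here is a small worked example (values chosen only to make the rounding visible) of how reassociation changes results:

```c
#include <stdio.h>

/* IEEE 754 addition is not associative: regrouping the same sum
 * changes the result, which is why licensing reassociation in the
 * ISA would let results vary across implementations. */
int main(void) {
    double a = 1e16, b = 1.0, c = 1.0;
    double left  = (a + b) + c;  /* each +1 lost to rounding: 1e16  */
    double right = a + (b + c);  /* b + c = 2.0 survives: 1e16 + 2  */
    printf("%.0f\n%.0f\n", left, right);
    return 0;
}
```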