Add a FAQ explaining why there is no fast-math mode. #260
Conversation
The other optimization recently mentioned is tree height reduction. My expectation is that WebAssembly producers can do some of this using fairly simple platform-agnostic heuristics (split up really long chains into a small number of chains, especially when there's nothing else going on). Obviously this doesn't cover everything, but the question is: how important is the difference? Also, we must weigh this against the fact that reassociation has the potential to create very significant behavior differences, harming WebAssembly's portability. So while I wouldn't necessarily rule such a feature out, it's reasonable to expect significant motivation before including it.
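To make that heuristic concrete, here is a minimal sketch (the function name and the 4-way split are arbitrary choices for illustration, not anything specified in this thread) of how a producer might split one long accumulation chain into a few independent chains before emitting WebAssembly:

```c
#include <stddef.h>

/* Hypothetical producer-side tree height reduction: four independent
 * partial sums replace one long serial dependency chain, shortening
 * the critical path.  Because the producer performs the reassociation,
 * the emitted WebAssembly itself remains deterministic. */
double sum(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* scalar remainder */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);  /* shallow final reduction */
}
```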
As discussed today: we'll wait for data before coming to a conclusion. Leave bug #148 open, don't change the FAQ with #260 just yet. Re-discuss when @titzer and @pizlonator can discuss over a higher-throughput medium than GitHub issues.
This issue has both qualitative and quantitative aspects. The PR has a qualitative account, and is ready for discussion.

To address the request for data, I've now prepared this clang/LLVM patch, which strips all fast-math flags and settings from LLVM IR after the optimizer runs and before codegen runs (controlled by an environment variable because it was easy). I claim that this is likely to overestimate the effect of omitting fast-math flags in WebAssembly, because optimizations like FMA formation and tree height reduction could still be done on WebAssembly following the methods discussed earlier, because LLVM historically did all of its fast-math optimizations in codegen (due to historic limitations) and has only fairly recently started making use of fast-math flags in the optimizer, and because some of the fast-math optimizations in codegen will be active in the WebAssembly LLVM backend too.

I've now run the LLVM test-suite MultiSource benchmarks on an Intel system, comparing plain -ffast-math to -ffast-math with this experimental option to strip fast-math flags after the optimizer. The geometric mean scores between the two are within noise of each other (even without Bullet, discussed below). The following individual benchmarks had interesting results:

* Olden/power was a case where my experimental patch was over-zealous. LLVM codegen turns divisions by non-constant values into multiplications by reciprocals in cases where the same denominator is used multiple times, and this depends on fast-math flags, but the LLVM WebAssembly backend will also be able to use that same optimization. Re-enabling just that optimization made the performance difference disappear.
* A strange outlier was Bullet, which got about 3x faster when fast-math flags were stripped. Yes, that's right. This appears to be due to the fast-math flags changing the behavior of the program and thereby the progression of the algorithm. This is a good example of a hazard of fast-math optimization, on no less important code than Bullet. Performance portability isn't just about implementations needing to "work harder".
* The most interesting case was TSVC LoopRerolling (TSVC is a microbenchmark suite), in the dot product loop. Other benchmarks that do dot products (e.g. Linpack) didn't see slowdowns, so this was a little surprising. Looking into it, I found:
So to summarize: there were no significant performance losses, after correcting for cases where my experimental patch didn't accurately reflect what we'll be able to do, and ignoring Bullet. Now, this was on Intel; lower-power CPUs are less out-of-order or even not out-of-order at all, and the effect of floating point reassociation would sometimes be greater there. Also, this is not a comprehensive or carefully crafted benchmark suite, though it does contain quite a variety of floating point code. I invite people interested in this topic to participate in the effort. For now, I believe I've shown that this idea is not completely crazy, and that with this data the qualitative argument in the PR makes a strong case.
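To make the Olden/power reciprocal case above concrete, here is a minimal sketch of the transformation (illustrative only; the real rewrite happens inside LLVM codegen, not in source):

```c
/* One division replaces two when the same non-constant denominator is
 * reused.  Multiplying by a reciprocal rounds differently than
 * dividing, which is why the rewrite is gated on fast-math flags. */
void scale2(double *x, double *y, double d) {
    /* before: *x /= d; *y /= d;   (two divisions) */
    double r = 1.0 / d;            /* one division  */
    *x *= r;
    *y *= r;
}
```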
I don't think I was very clear in my summary. I confirmed with others, and the type of data we expected was:
Denormals don't have any meaningful impact on high-level design decisions (it's literally just a bit!), so delaying decisions until we can have real-world data on pre-MVP VMs is much better than setting things in stone (as this PR does).
Did you mean to post in #148? This PR is talking about fast-math flags, which don't entail any mode switching.
Oh, I indeed meant to post in #148. I do think this PR has some good content, but I'm not convinced about a few things, including the conclusion on FMA (I think we may want to ship an FMA opcode, which could be split up if unsupported). I think the same applies here: let's leave it open for now, and once we have VMs we can measure things.
Branch updated from 0959846 to 059681d.
Updated to not go into specifics about FMA here. I have performed quite a few measurements, as detailed above. What other measurements would you like to see here?
The FAQ text under review:

> WebAssembly implementations run on the user side, so there is no opportunity for developers to test the final behavior of the code. Nondeterminism at this level could cause distributed WebAssembly programs to behave differently in different implementations, or change over time. WebAssembly does have [some nondeterminism](Nondeterminism.md) in cases where the tradeoffs warrant it, but fast-math flags are not believed to be important enough:
>
> * Many of the important fast-math optimizations happen in the mid-level optimizer of a compiler, before WebAssembly code is emitted. For example, loop vectorization that depends on floating point reassociation can still be done at this level if the user applies the appropriate fast-math flags, so WebAssembly programs can still enjoy these benefits. As another example, compilers can replace floating point division with floating point multiplication by a reciprocal in WebAssembly programs just as they do for other platforms.
I'm not sure I understand what you're suggesting for vectorization. Would it be done on the dev's machine, or on-device? I think we want to allow both, though maybe at different dates.
It could be on the dev machine, or on-device in a JIT library; both would have the freedom to utilize fast-math flags, and the latter would also be able to feature-test for things like available SIMD features.
Maybe expand on this as we discussed yesterday? Reassociation can be a middle-end thing, but may also be something that requires precise target ISA information (e.g. vector width, number of registers, cache information). The middle end can ask for that information, or wasm backends can implement clever vectorization if they want (but I'd rather have the middle end do it).
Halide would be a great example to use here.
Ok, I added a bullet point for this. It's an interesting question whether the number of registers or cache information are things we want to expose in feature tests (no doubt some developers will require us to provide them, but we have other requirements to consider as well).
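As a hedged illustration of the "middle end can ask for that information" option discussed above: clang's loop pragmas let a producer pin a vectorization width chosen from whatever target knowledge it has (the width 4 here is an arbitrary assumption), rather than leaving that decision to the WebAssembly backend:

```c
/* Hypothetical: the producer bakes in an assumed target vector width
 * via clang's loop pragma instead of deferring to the wasm backend. */
void axpy(float *y, const float *x, float a, int n) {
#pragma clang loop vectorize(enable) vectorize_width(4)
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```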
Agreed with the current form (most fast math is performed on the dev's machine, not the user's device), after discussing vectorization some more and punting on math functions until later.
I added a note about vectorization to #265. It's interesting: vectorization is greatly improved by floating point reassociation, magical aliasing/loop directives, and knowledge of the underlying hardware (among other things). The dev-machine side has the first two. The WebAssembly VM has the third. A JIT library opens up another theater, one where we can get all three at the same time. Knowledge of the target machine would be limited by the granularity of feature testing available, but it'll probably be enough to start with, and there are ways we could extend the system in the future.
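As a concrete sketch of the dev-machine side (assuming clang as the producer): with `-ffast-math`, the mid-level loop vectorizer is permitted to reassociate a reduction like the one below into vector partial sums, so the already-reassociated code is what gets lowered to WebAssembly; without the flag, the serial chain is preserved:

```c
/* A floating point reduction: the serial dependence on `s` blocks
 * vectorization unless reassociation is allowed (e.g. -ffast-math). */
double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```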
I like this PR, so lgtm is yes
lgtm, merging!
Add a FAQ explaining why there is no fast-math mode.
This PR documents why WebAssembly has no fast-math mode. x86, ARM, MIPS, and POWER all have no fast-math mode, which allows compiled programs to produce consistent numeric results across different machines; as a virtual ISA, WebAssembly has the same need.
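For a concrete sense of why this matters, here is a small worked example (values chosen only to make the rounding visible) of how reassociation changes results:

```c
#include <stdio.h>

/* IEEE 754 addition is not associative: regrouping the same sum
 * changes the result, which is why licensing reassociation in the
 * ISA would let results vary across implementations. */
int main(void) {
    double a = 1e16, b = 1.0, c = 1.0;
    double left  = (a + b) + c;  /* each +1 lost to rounding: 1e16  */
    double right = a + (b + c);  /* b + c = 2.0 survives: 1e16 + 2  */
    printf("%.0f\n%.0f\n", left, right);
    return 0;
}
```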