wasm-opt makes bcrypt code slower #5088
This is definitely something we like to get bug reports on, thanks! If we find something to improve, that would be great. My first guess is that … If that's not it, and you can provide the unoptimized wasm and the command to run the benchmark, I could take a look. (I don't have Docker installed, so just the minimal setup for me to optimize and benchmark the file would be best.)
Awesome! 🙌
Here are some stats from building with …

As you can see, the …
I've provided a zip here with the same six files as in the output above, and also a quick benchmark script. It should run with a recent version of Node.js, without any external dependencies. If you want to try to build it yourself, the C source code is available at the following link: … I'm building it with the following clang command:

```sh
clang \
  --sysroot=/share/wasi-sysroot \
  --target=wasm32-unknown-wasi \
  -flto \
  -O3 \
  -o bcrypt.wasm \
  -D__SKIP_GNU \
  -mexec-model=reactor \
  -fvisibility=hidden \
  -Wl,--export=malloc,--export=free,--export=crypt,--export=crypt_gensalt,--strip-all \
  -- crypt_blowfish-1.3/crypt_blowfish.c crypt_blowfish-1.3/crypt_gensalt.c crypt_blowfish-1.3/wrapper.c
```

When building, make sure that …

Thank you for taking a look at this! ❤️
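For readers without the zip, a minimal sketch of what such a Node.js benchmark can look like is below. This is an illustration only, not the actual script from the zip; it assumes the module exports `memory`, `malloc`, `free`, and `crypt` (per the link flags above), that the file is named `bcrypt.wasm`, and that the script is saved as an ES module (e.g. `benchmark.mjs`):

```js
// benchmark.mjs - hypothetical sketch, not the script shipped in the zip.
// Run with: node benchmark.mjs
// (older Node versions may need --experimental-wasi-unstable-preview1)
import { readFile } from 'node:fs/promises';
import { WASI } from 'node:wasi';

const wasi = new WASI({ version: 'preview1' });
const module = await WebAssembly.compile(await readFile('bcrypt.wasm'));
const instance = await WebAssembly.instantiate(module, wasi.getImportObject());
wasi.initialize(instance); // reactor model: runs _initialize, not _start

const { memory, malloc, free, crypt } = instance.exports;

// Copy a NUL-terminated string into the module's linear memory.
function putString(str) {
  const bytes = new TextEncoder().encode(str + '\0');
  const ptr = malloc(bytes.length);
  new Uint8Array(memory.buffer).set(bytes, ptr);
  return ptr;
}

const key = putString('correct horse battery staple');
// A fixed bcrypt setting string: $2b$ prefix, cost 10, 22 salt characters.
const setting = putString('$2b$10$0123456789012345678901');

const iterations = 10;
const start = process.hrtime.bigint();
for (let i = 0; i < iterations; i++) {
  crypt(key, setting);
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`${iterations} hashes in ${elapsedMs.toFixed(1)} ms`);

free(key);
free(setting);
```

Running the same script once against the raw clang output and once against the wasm-opt output gives the before/after comparison.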
Thanks! Ok, I can confirm a slowdown locally. Bisecting on passes, the issue is specific to OptimizeInstructions.
It turns out that the entire slowdown can be reversed with this:

```diff
diff --git a/src/passes/OptimizeInstructions.cpp b/src/passes/OptimizeInstructions.cpp
index e9cf3baaa..8b6c32c67 100644
--- a/src/passes/OptimizeInstructions.cpp
+++ b/src/passes/OptimizeInstructions.cpp
@@ -2251,7 +2251,7 @@ private:
     }
     // Sort by the node id type, if different.
     if (binary->left->_id != binary->right->_id) {
-      if (binary->left->_id > binary->right->_id) {
+      if (binary->left->_id < binary->right->_id) {
        return maybeSwap();
      }
      return;
```

That leads to lots of small reorderings like this:

```diff
 (i32.or
-  (local.get $3)
   (i32.shl
     (local.get $5)
     (i32.const 16)
   )
+  (local.get $3)
 )
```

Binary size is unchanged, but for some reason it affects speed. I suspect this is somehow because of the massive stack of xor-focused work that appears in this benchmark, which those reorderings rewrite quite a bit. Likely that just ends up helping or hurting register allocation somehow.

It may be worth applying that first diff, since we don't really have a reason to go one way or another, and it helps this benchmark. However, we'd need to do some careful benchmarking of other codebases, and also look at compressed size. It may be worth doing, but perhaps first, @LinusU, it would be good to benchmark this in other VMs: if we see it helps them all, that would greatly increase the motivation to do this work. If you can make a benchmark that is runnable in a browser, for example, we could then test it in Firefox and Safari?
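As a possible starting point for that, here is a hedged sketch of how the same measurement could run in a browser. It is hypothetical: it assumes `bcrypt.wasm` is served next to the page (with an `application/wasm` MIME type, as `instantiateStreaming` requires), that the same exports are available, and that whatever WASI imports the module declares are never actually called on the `crypt` path, so throwing stubs are enough:

```js
// Hypothetical browser benchmark sketch (load via <script type="module">).
// Every WASI import is stubbed with a function that throws if actually called.
const wasiStubs = new Proxy({}, {
  get: (_, name) => () => { throw new Error(`WASI import ${String(name)} was called`); },
});

const { instance } = await WebAssembly.instantiateStreaming(
  fetch('bcrypt.wasm'),
  { wasi_snapshot_preview1: wasiStubs },
);
instance.exports._initialize?.(); // reactor-model initializer, if present

const { memory, malloc, crypt } = instance.exports;

// Copy a NUL-terminated string into the module's linear memory.
function putString(str) {
  const bytes = new TextEncoder().encode(str + '\0');
  const ptr = malloc(bytes.length);
  new Uint8Array(memory.buffer).set(bytes, ptr);
  return ptr;
}

const key = putString('correct horse battery staple');
const setting = putString('$2b$10$0123456789012345678901');

const start = performance.now();
for (let i = 0; i < 10; i++) crypt(key, setting);
console.log(`10 hashes in ${(performance.now() - start).toFixed(1)} ms`);
```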
Hmm, it seems to make sense to calculate sub-expression depth and sort operands accordingly from left to right.
Big expressions on the left-hand side correspond to deeper stacks in the Wasm engine. I can easily imagine that deeper stacks lead to more register pressure and spilling in baseline compilers. Maybe we should generally try to minimize stack depth (and reduce the live ranges of values on the stack) by reordering to have smaller subexpressions on the left, as @MaxGraey suggested. (But the further benchmarking @kripken suggested is still a good idea, too.)
Could be that the depth matters, yeah... but it's not obvious to me why in that direction? I'm probably missing something, though. I opened a Chromium bug for this with a focused version of this benchmark on just the ordering: https://bugs.chromium.org/p/v8/issues/detail?id=13347
Oops, I had it backward. Having the larger subtrees on the left produces a shallower stack, not a deeper one. As an example, a left-leaning chain like (i32.add (i32.add (i32.add a b) c) d) never needs more than two values on the stack at once, while the right-leaning (i32.add a (i32.add b (i32.add c d))) has to keep a, b, and c all on the stack before the innermost add can execute.
I guess the deeper stack somehow reuses the same or a scratch register, like …
I see, yes, perhaps we should look into that. It does seem like having deeper things on the left is nicer. We should benchmark that.
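To make the proposed ordering concrete, here is a toy sketch in JavaScript of the depth-based rule being discussed: compute each operand's depth and, for commutative operations, put the deeper subtree first (on the left). This is purely illustrative pseudologic over a generic expression tree, not Binaryen's actual canonicalization code.

```js
// Toy illustration only - not Binaryen code. A node is { op, children: [...] }.
function depth(node) {
  if (!node.children || node.children.length === 0) {
    return 1; // leaves such as local.get or i32.const
  }
  return 1 + Math.max(...node.children.map(depth));
}

// For a commutative binary node, order operands so the deeper subtree comes
// first (left), which keeps the value stack shallower when it is emitted.
function canonicalizeOperands(binary) {
  const [left, right] = binary.children;
  if (depth(left) < depth(right)) {
    binary.children = [right, left];
  }
  return binary;
}

// Example: or(x, shl(y, 16)) becomes or(shl(y, 16), x), matching the
// reordering shown in the diff earlier in this thread.
const expr = {
  op: 'i32.or',
  children: [
    { op: 'local.get', children: [] },
    {
      op: 'i32.shl',
      children: [{ op: 'local.get', children: [] }, { op: 'i32.const', children: [] }],
    },
  ],
};
console.log(canonicalizeOperands(expr).children.map((c) => c.op)); // [ 'i32.shl', 'local.get' ]
```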
@kripken maybe ping someone from the v8/wasm team here? I think they'd have more info about this issue.
I added a link from there to here now. Looks like I forgot that before, thanks.
Friendly reminder. It seems no one from v8 knows? Are there similar tests for other browsers? Maybe wasmtime?
I'm still waiting to hear from v8 people. I think they might experiment with some register allocation options that could help here. Separately from that, it would be good to experiment here with reversing the order to see what effect that has on useful benchmarks. It would also be good to reach out to other VMs and get their thoughts. I might have time for that eventually but it would be great if someone else did.
The v8 issue was closed without any work found to do there. It seems like the difference is not for any obvious reason, but might be due to minor cache differences. We should still investigate here, as mentioned in the last comment and earlier. If we find a small change to canonicalization that looks good on a good set of benchmarks (this testcase + the emscripten test suite, for example, but hopefully even more), then we could move forward with it.
I'm not sure if this is the kind of issue you want to receive, but I ran into this and thought it could be good to report. Feel free to close this if it's not relevant.
I'm compiling Openwall's BCrypt implementation into WASM with `clang`. I'm compiling with `-O3` and `-flto`. Right out of `clang` I get the following performance:

…

When I then run `wasm-opt --enable-bulk-memory -O3` on the resulting wasm file, the performance decreases:

…

I get the same results when passing `-O4` to `wasm-opt`.

You can find my project here: https://github.com/LinusU/cwasm-openwall-bcrypt

There is a `Dockerfile` that builds the entire project, so it's easy to experiment with passing different flags to `clang`. I'm using the lines below to add and run `wasm-opt`. Do note that these need to be added after the `clang` invocation, since `clang` will automatically invoke `wasm-opt` if it's present in the path.

…