Optimize x86/aarch64 MD5 implementation #2137
(Equivalent to openssl/openssl@ebe34f9) As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y & ~z)) in this scenario, because the two terms never have a set bit in common. We can then perform the additions that do not involve 'x' independently, leaving our dependency on 'x' to the final addition. This speeds up MD5 by around 5% on both platforms.
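The equivalence can be checked outside the assembly. The following is a minimal Python sketch, not code from this PR; the function names `g_or` and `g_add` are made up for illustration. It confirms on random 32-bit inputs that the two terms never overlap, so the addition produces no carries and matches the OR.

```python
import random

MASK32 = 0xFFFFFFFF

def g_or(x, y, z):
    """MD5 round-2 auxiliary function in its textbook form (hypothetical name)."""
    return ((x & z) | (y & ~z)) & MASK32

def g_add(x, y, z):
    """Same function with the OR replaced by a 32-bit addition (hypothetical name)."""
    return ((x & z) + (y & ~z)) & MASK32

random.seed(0)
for _ in range(10000):
    x, y, z = (random.getrandbits(32) for _ in range(3))
    # (x & z) only has bits where z is 1; (y & ~z) only where z is 0.
    assert (x & z) & (y & ~z) == 0
    # With no shared bits there are no carries, so OR and ADD agree.
    assert g_or(x, y, z) == g_add(x, y, z)
```

Because addition is associative, the (y & ~z) term can be folded into the accumulator before 'x' is available, which the OR form does not allow.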
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2137   +/-   ##
=======================================
  Coverage   78.95%   78.95%
=======================================
  Files         610      610
  Lines      105293   105293
  Branches    14919    14921    +2
=======================================
  Hits        83136    83136
  Misses      21505    21505
  Partials      652      652
movz x13, #0x2562 // Load lower half of constant 0xf61e2562
movk x13, #0xf61e, lsl #16 // Load upper half of constant 0xf61e2562
add w4, w4, w20 // Add dest value
add w4, w4, w13 // Add constant 0xf61e2562
add w4, w4, w6 // Add aux function result
and x13, x9, x17 // Aux function round 2 (x & z)
Why was this 'and' moved closer to where its result, 'x13', is needed? Or are you relying on the processor's out-of-order execution? I may be missing something.
If I recall correctly, there was no special reason to move it closer; it just felt more natural to group these related operations together. The only reason for the speedup is shortening the dependency path for 'x' (here, 'x13'), which is the longest path. Before, we had three operations (and, or, add). Now we have two (and, add).
Yes, I think this relies on effective out-of-order execution.
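To make the shortened path concrete, here is a rough Python model of one round-2 step in both forms. It is an illustrative sketch only: the helper names and argument order are assumptions, not the layout of the assembly in this PR. The constant 0xf61e2562 and the rotate amount 5 are those of the first round-2 step.

```python
import random

MASK32 = 0xFFFFFFFF

def rotl32(v, s):
    """Rotate a 32-bit value left by s bits (0 < s < 32)."""
    return ((v << s) | (v >> (32 - s))) & MASK32

def step_or(dst, x, y, z, xk, t, s):
    """Old ordering: 'x' passes through and, or, add before the rotate."""
    g = (x & z) | (y & ~z)               # and, then or
    dst = (dst + xk + t + g) & MASK32    # then add
    return (rotl32(dst, s) + x) & MASK32

def step_add(dst, x, y, z, xk, t, s):
    """New ordering: everything not involving 'x' is accumulated first."""
    dst = (dst + xk + t) & MASK32        # independent of x
    dst = (dst + (y & ~z)) & MASK32      # independent of x
    dst = (dst + (x & z)) & MASK32       # x enters here: and, then add
    return (rotl32(dst, s) + x) & MASK32

random.seed(1)
for _ in range(10000):
    dst, x, y, z, xk = (random.getrandbits(32) for _ in range(5))
    assert step_or(dst, x, y, z, xk, 0xF61E2562, 5) == step_add(dst, x, y, z, xk, 0xF61E2562, 5)
```

Both functions return the same value; the difference is only in how many operations sit between the arrival of 'x' and the rotate, which is what an out-of-order core can exploit.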
@@ -52,9 +51,9 @@ sub round2_step
and $x, %r12d /* x & z */
and $y, %r11d /* y & (not z) */
mov $k_next*4(%rsi),%r10d /* (NEXT STEP) X[$k_next] */
or %r11d, %r12d /* (y & (not z)) | (x & z) */
add %r11d, $dst /* dst += (y & (not z)) */
Again, why not do this 'add' second, or swap lines 51 and 52?
Also, for the subsequent 'mov's, are you relying on the processor to delay them through out-of-order execution?
If I understand your question correctly, yes, it's similar to the above: we rely on the processor to order the independent instructions efficiently. I don't think you'd see any performance change from manual reordering.
Call-outs:
I applied the x86 patch manually, since it is trivial. The aarch64 patch applied cleanly after fixing the file name/path.
Testing:
ninja-build run_tests on x86.
bssl speed -filter MD5 on x86 shows the expected speedup.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.