Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize x86/aarch64 MD5 implementation #2137

Merged
merged 4 commits into from
Jan 28, 2025

Conversation

olivergillespie
Copy link
Contributor

Description of changes:

(Equivalent to openssl/openssl@ebe34f9)

As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms.

Call-outs:

I applied the x86 patch manually, it's trivial, and the aarch64 patch applied cleanly after fixing the file name/path.

Testing:

ninja-build run_tests on x86. bssl speed -filter MD5 on x86 shows the expected speedup:

Before
Did 7169250 MD5 (16 bytes) operations in 1000024us (7169077.9 ops/sec): 114.7 MB/s
Did 1826500 MD5 (256 bytes) operations in 1000119us (1826282.7 ops/sec): 467.5 MB/s
Did 433000 MD5 (1350 bytes) operations in 1001295us (432440.0 ops/sec): 583.8 MB/s
Did 76000 MD5 (8192 bytes) operations in 1005792us (75562.3 ops/sec): 619.0 MB/s
Did 39000 MD5 (16384 bytes) operations in 1025281us (38038.4 ops/sec): 623.2 MB/s


After
Did 7457250 MD5 (16 bytes) operations in 1000004us (7457220.2 ops/sec): 119.3 MB/s
Did 1919000 MD5 (256 bytes) operations in 1000057us (1918890.6 ops/sec): 491.2 MB/s
Did 456000 MD5 (1350 bytes) operations in 1001377us (455373.0 ops/sec): 614.8 MB/s
Did 80000 MD5 (8192 bytes) operations in 1004118us (79671.9 ops/sec): 652.7 MB/s
Did 41000 MD5 (16384 bytes) operations in 1022055us (40115.3 ops/sec): 657.2 MB/s 

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

(Equivalent to openssl/openssl@ebe34f9)

As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function,
we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z))
is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform
those additions independently, leaving our dependency on x to the final
addition. This speeds it up around 5% on both platforms.
@olivergillespie olivergillespie requested a review from a team as a code owner January 24, 2025 10:04
@codecov-commenter
Copy link

codecov-commenter commented Jan 24, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.95%. Comparing base (81f138a) to head (553f032).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2137   +/-   ##
=======================================
  Coverage   78.95%   78.95%           
=======================================
  Files         610      610           
  Lines      105293   105293           
  Branches    14919    14921    +2     
=======================================
  Hits        83136    83136           
  Misses      21505    21505           
  Partials      652      652           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@justsmth
Copy link
Contributor

@AWSjswinney

movz x13, #0x2562 // Load lower half of constant 0xf61e2562
movk x13, #0xf61e, lsl #16 // Load upper half of constant 0xf61e2562
add w4, w4, w20 // Add dest value
add w4, w4, w13 // Add constant 0xf61e2562
add w4, w4, w6 // Add aux function result
and x13, x9, x17 // Aux function round 2 (x & z)
Copy link
Contributor

@nebeid nebeid Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this and moved closer to where its result, x13, is needed? Or you're relying on the processor's out-of-order execution? I may be missing something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I recall, there was no special reason to move it closer, it just felt more natural to group these related operations together. The only reason for speedup is shortening the dependency path for 'x' (here, 'x13'), which is the longest path. Before, we had three operations (and, or, add). Now we have two (and, add).

Yes, I think this relies on effective out-of-order execution.

@@ -52,9 +51,9 @@ sub round2_step
and $x, %r12d /* x & z */
and $y, %r11d /* y & (not z) */
mov $k_next*4(%rsi),%r10d /* (NEXT STEP) X[$k_next] */
or %r11d, %r12d /* (y & (not z)) | (x & z) */
add %r11d, $dst /* dst += (y & (not z)) */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again here, why not do this add secondly or swap lines 51 and 52?
Also for the subsequent movs, are you relying on that the processor will delay them in out-of-order processing manner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand your question correctly, yes it's similar to the above - we rely on the processor to order the independent instructions efficiently. I don't think you'd see any performance change from manual reordering.

@torben-hansen torben-hansen merged commit ea58b3f into aws:main Jan 28, 2025
123 of 126 checks passed
manastasova pushed a commit to manastasova/aws-lc that referenced this pull request Jan 30, 2025
As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants