Optimize x86/aarch64 MD5 implementation #2137

olivergillespie · 2025-01-24T10:04:27Z

Description of changes:

(Equivalent to openssl/openssl@ebe34f9)

As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms.

Call-outs:

I applied the x86 patch manually, it's trivial, and the aarch64 patch applied cleanly after fixing the file name/path.

Testing:

ninja-build run_tests on x86. bssl speed -filter MD5 on x86 shows the expected speedup:

Before
Did 7169250 MD5 (16 bytes) operations in 1000024us (7169077.9 ops/sec): 114.7 MB/s
Did 1826500 MD5 (256 bytes) operations in 1000119us (1826282.7 ops/sec): 467.5 MB/s
Did 433000 MD5 (1350 bytes) operations in 1001295us (432440.0 ops/sec): 583.8 MB/s
Did 76000 MD5 (8192 bytes) operations in 1005792us (75562.3 ops/sec): 619.0 MB/s
Did 39000 MD5 (16384 bytes) operations in 1025281us (38038.4 ops/sec): 623.2 MB/s


After
Did 7457250 MD5 (16 bytes) operations in 1000004us (7457220.2 ops/sec): 119.3 MB/s
Did 1919000 MD5 (256 bytes) operations in 1000057us (1918890.6 ops/sec): 491.2 MB/s
Did 456000 MD5 (1350 bytes) operations in 1001377us (455373.0 ops/sec): 614.8 MB/s
Did 80000 MD5 (8192 bytes) operations in 1004118us (79671.9 ops/sec): 652.7 MB/s
Did 41000 MD5 (16384 bytes) operations in 1022055us (40115.3 ops/sec): 657.2 MB/s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

(Equivalent to openssl/openssl@ebe34f9) As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms.

codecov-commenter · 2025-01-24T10:26:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.95%. Comparing base (81f138a) to head (553f032).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2137   +/-   ##
=======================================
  Coverage   78.95%   78.95%           
=======================================
  Files         610      610           
  Lines      105293   105293           
  Branches    14919    14921    +2     
=======================================
  Hits        83136    83136           
  Misses      21505    21505           
  Partials      652      652

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

justsmth · 2025-01-24T14:23:26Z

@AWSjswinney

nebeid · 2025-01-27T16:41:33Z

crypto/fipsmodule/md5/asm/md5-armv8.pl

        movz x13, #0x2562             // Load lower half of constant 0xf61e2562
        movk x13, #0xf61e, lsl #16    // Load upper half of constant 0xf61e2562
        add w4, w4, w20               // Add dest value
        add w4, w4, w13               // Add constant 0xf61e2562
-        add w4, w4, w6                // Add aux function result
+        and x13, x9, x17              // Aux function round 2 (x & z)


Why was this and moved closer to where its result, x13, is needed? Or you're relying on the processor's out-of-order execution? I may be missing something.

If I recall, there was no special reason to move it closer, it just felt more natural to group these related operations together. The only reason for speedup is shortening the dependency path for 'x' (here, 'x13'), which is the longest path. Before, we had three operations (and, or, add). Now we have two (and, add).

Yes, I think this relies on effective out-of-order execution.

nebeid · 2025-01-27T16:53:14Z

crypto/fipsmodule/md5/asm/md5-x86_64.pl

@@ -52,9 +51,9 @@ sub round2_step
 	and	$x,		%r12d		/* x & z */
 	and	$y,		%r11d		/* y & (not z) */
 	mov	$k_next*4(%rsi),%r10d		/* (NEXT STEP) X[$k_next] */
-	or	%r11d,		%r12d		/* (y & (not z)) | (x & z) */
+	add	%r11d,		$dst		/* dst += (y & (not z)) */


Again here, why not do this add secondly or swap lines 51 and 52?
Also for the subsequent movs, are you relying on that the processor will delay them in out-of-order processing manner?

If I understand your question correctly, yes it's similar to the above - we rely on the processor to order the independent instructions efficiently. I don't think you'd see any performance change from manual reordering.

As suggested in https://github.com/animetosho/md5-optimisation?tab=readme-ov-file#dependency-shortcut-in-g-function, we can delay the dependency on 'x' by recognizing that ((x & z) | (y & ~z)) is equivalent to ((x & z) + (y + ~z)) in this scenario, and we can perform those additions independently, leaving our dependency on x to the final addition. This speeds it up around 5% on both platforms.

olivergillespie requested a review from a team as a code owner January 24, 2025 10:04

olivergillespie added 2 commits January 24, 2025 10:34

Fix whitespace

c603417

Fix more whitespace

297751f

justsmth requested review from nebeid and torben-hansen January 24, 2025 14:21

Update build files in generated-src

553f032

nebeid reviewed Jan 27, 2025

View reviewed changes

nebeid approved these changes Jan 27, 2025

View reviewed changes

torben-hansen approved these changes Jan 28, 2025

View reviewed changes

torben-hansen merged commit ea58b3f into aws:main Jan 28, 2025
123 of 126 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize x86/aarch64 MD5 implementation #2137

Optimize x86/aarch64 MD5 implementation #2137

olivergillespie commented Jan 24, 2025

codecov-commenter commented Jan 24, 2025 •

edited

Loading

justsmth commented Jan 24, 2025

nebeid Jan 27, 2025 •

edited

Loading

olivergillespie Jan 27, 2025

nebeid Jan 27, 2025

olivergillespie Jan 27, 2025

Optimize x86/aarch64 MD5 implementation #2137

Optimize x86/aarch64 MD5 implementation #2137

Conversation

olivergillespie commented Jan 24, 2025

Description of changes:

Call-outs:

Testing:

codecov-commenter commented Jan 24, 2025 • edited Loading

Codecov Report

justsmth commented Jan 24, 2025

nebeid Jan 27, 2025 • edited Loading

Choose a reason for hiding this comment

olivergillespie Jan 27, 2025

Choose a reason for hiding this comment

nebeid Jan 27, 2025

Choose a reason for hiding this comment

olivergillespie Jan 27, 2025

Choose a reason for hiding this comment

codecov-commenter commented Jan 24, 2025 •

edited

Loading

nebeid Jan 27, 2025 •

edited

Loading