Faster const-time modinv divsteps #1031

peterdettman · 2021-12-04T18:25:21Z

Changes to _divsteps_59 (_30) that give maybe 4% speed improvement to const-time modinv on 64 bit. I see a larger gain on 32 bit but measured on 64 bit so might not be real.

Start the result matrix scaled by 2^62 (resp. 2^30) and shift q, r down instead of u, v up at each step (should make life easier for vectorization). Since we're always shifting away the LSB of g, q, r, we can avoid doing a full negation for x, y, z (after a few tweaks). Then it makes sense to switch zeta back to delta (I confined this change to the local method for the moment).

sipa · 2021-12-08T19:41:40Z

Wowwow, calm down!

sipa · 2021-12-08T23:33:54Z

See https://github.com/sipa/secp256k1/commits/pr1031 for two commits on top of this. One renames your delta to theta (as the variable operated on is really defined as delta-1/2 compared to delta as defined in the paper). The other updates the writeup to reflect the improvements here.

Concept ACK in any case. I'll look over the code in more detail, but at least this convinces me it's correct.

real-or-random · 2021-12-09T10:07:36Z

As I understand, this should always be faster and we don't need lots of benchmarks to confirm.

The other updates the writeup to reflect the improvements here.

Nice, I wanted to ask for this.

sipa · 2021-12-09T18:09:30Z

Updated my branch with additional assertions and a few more improvements to the documentation.

Code review ACK. I'm happy to do my changes on top in a separate PR, though it's probably better to include them, as a few code comments are out of that as-is in this PR. Feel free to cherry-pick or otherwise include my suggested changes.

peterdettman · 2021-12-10T11:17:21Z

Wowwow, calm down!

@sipa Never!

So, after I PR'ed this, I sat there looking at the code and thinking: you know, there's a lot of low zeros in u/v/q/r... you could almost fit an f/g in there...

(a few moments later)

https://github.com/peterdettman/secp256k1/tree/vec_modinv

I won't spoiler it, but you might like to benchmark const-time inverses in that branch...

sipa · 2021-12-10T15:09:05Z

@peterdettman Nice!

Benchmarks on Ryzen 5950x (default compilation options, GCC 11.2.0, SECP256K1_BENCH_ITERS=1000000 ./bench_internal inverse:

master:
field_inverse                 ,     1.27      ,     1.27      ,     1.28   
field_inverse_var             ,     0.867     ,     0.871     ,     0.874  
pr1031:
field_inverse                 ,     1.26      ,     1.26      ,     1.27   
field_inverse_var             ,     0.846     ,     0.860     ,     0.868  
vec_modinv:
field_inverse                 ,     0.986     ,     0.995     ,     1.01   
field_inverse_var             ,     0.868     ,     0.870     ,     0.873  
vec_modinv, using the const-time matrix in the vartime code:
field_inverse                 ,     1.00      ,     1.00      ,     1.01   
field_inverse_var             ,     0.866     ,     0.871     ,     0.876

This seems to at least suggest the option of throwing out the vartime matrix computation code, and using this new constant-time code for both?

The last option (if someone else wants to benchmark) is simply this on top of https://github.com/peterdettman/secp256k1/tree/vec_modinv:

@@ -568,7 +568,7 @@ static void secp256k1_modinv64_var(secp256k1_modinv64_signed62 *x, const secp256
     while (1) {
         /* Compute transition matrix and new eta after 62 divsteps. */
         secp256k1_modinv64_trans2x2 t;
-        eta = secp256k1_modinv64_divsteps_62_var(eta, f.v[0], g.v[0], &t);
+        eta = secp256k1_modinv64_divsteps_59(eta, f.v[0], g.v[0], &t);
         /* Update d,e using that transition matrix. */

peterdettman · 2021-12-10T16:01:31Z

Wow, that's even more improvement than on my machine. Around 4900 cycles, right?

For the _var version, if copying the vec approach at minimum the vartime version would want to escape at the halfway mark if g1 is 0 (mod remaining steps) already, so I think there would still be a separate version. I haven't really looked at it though.

sipa · 2021-12-10T16:02:48Z

@peterdettman A vartime version of the vec matrix code could perhaps also do more than 59 iterations.

peterdettman · 2021-12-10T16:06:01Z

It could trivially do 60. The current core vec loop tops out at 30. In theory I think it can do 31, but it's not a given until I work it through.

sipa · 2021-12-10T16:28:42Z

Benchmarks on ARM64 (Cortex-A53, default compilation options, GCC 9.3.0, ./bench_internal inverse):

master:
field_inverse                 ,    12.5       ,    12.5       ,    12.5    
field_inverse_var             ,     7.26      ,     7.27      ,     7.27
pr1031:
field_inverse                 ,    11.9       ,    11.9       ,    11.9    
field_inverse_var             ,     7.26      ,     7.27      ,     7.27 
vec_modinv:
field_inverse                 ,    11.1       ,    11.1       ,    11.1    
field_inverse_var             ,     7.26      ,     7.27      ,     7.27   
vec_modinv using const-time matrix for vartime:
field_inverse                 ,    11.2       ,    11.2       ,    11.2    
field_inverse_var             ,     9.83      ,     9.83      ,     9.83

Looks like we shouldn't throw out the vartime matrix code just yet...

sipa · 2021-12-21T16:21:56Z

ACK 6afd499

@robot-dreams Perhaps you're interested in reviewing this too?

@peterdettman Given that your later (and much more significant) improvement builds on top of this one (at least conceptually, using initially-shifted u,v,q,r variables) and doesn't work for 32 bit yet, I'd vote to get this change merged as-is, and do the rest in a follow-up. WDYT?

peterdettman · 2021-12-21T16:50:42Z

@sipa Yeah, we may as well merge this, since the other one is still under experimentation and will in any case require significant write-up and VERIFY checks.

robot-dreams

Thanks for the tag @sipa ! You're right, I'm definitely interested in this change 😄

I took an initial look at the new writeup and it looks correct so far, but I still need to sleep on it and then review the new C implementation carefully.

doc/safegcd_implementation.md

robot-dreams · 2021-12-22T06:57:45Z

doc/safegcd_implementation.md

-    """Compute zeta and transition matrix t after N divsteps (multiplied by 2^N)."""
-    u, v, q, r = 1, 0, 0, 1 # start with identity matrix
+def divsteps_n_matrix(theta, f, g):
+    """Compute delta and transition matrix t after N divsteps (multiplied by 2^N)."""


Nit: theta, since that's what's returned?

doc/safegcd_implementation.md

robot-dreams

ACK 6afd499

I checked the writeup updates, made sure the updated Python code produced the same transition matrices as before (using various small parameters), and then compared the Python vs. C implementations line by line.

Am I understanding correctly that removing 3 subtractions from the inner loop gives such a dramatic speedup?

src/modinv64_impl.h

src/modinv32_impl.h

peterdettman · 2021-12-22T17:22:46Z

Am I understanding correctly that removing 3 subtractions from the inner loop gives such a dramatic speedup?

@sipa gave numbers from a different branch of mine that is more radical. This PR probably only gives a few percent, and maybe nothing if the processor could have executed them in parallel anyway.

Edit: "in parallel" meaning using otherwise idle ALUs/registers, although it's also possible to have f/u/v, g/q/r, x/y/z be vectorized, in which case there might only even be the one operation saved.

peterdettman · 2021-12-24T17:27:57Z

@sipa I'm currently assuming that you will get to the nits that @robot-dreams had above about safegcd_implementation.md, but I'll sort them out next week if not.

sipa · 2022-01-04T18:44:07Z

See my updated branch https://github.com/sipa/secp256k1/commits/pr1031 . I've edited the "Update safegcd writeup to reflect the code" commit to address @robot-dreams's comments.

peterdettman · 2022-01-11T05:24:27Z

Thanks, @sipa. Squashed one spelling fix into my last commit.

- Add explanation for 59 divsteps vs 2^62 scaling

sipa · 2022-11-21T20:52:31Z

Rebased on top of merged #1000: https://github.com/sipa/secp256k1/commits/pr1031

sipa · 2023-01-19T21:02:47Z

I've taken the liberty to open a rebased PR of this #1197.

real-or-random · 2023-05-11T16:19:35Z

Closing in favor of #1197

peterdettman force-pushed the opt_modinv branch from 487af7b to 6afd499 Compare December 21, 2021 13:36

robot-dreams reviewed Dec 22, 2021

View reviewed changes

src/modinv64_impl.h Outdated Show resolved Hide resolved

src/modinv32_impl.h Show resolved Hide resolved

peterdettman force-pushed the opt_modinv branch 2 times, most recently from 169e61c to 9d25987 Compare December 24, 2021 17:21

peterdettman force-pushed the opt_modinv branch 2 times, most recently from b89ec94 to a41fbcd Compare January 11, 2022 05:23

peterdettman and others added 5 commits April 20, 2022 12:31

Faster const-time modinv divsteps

d0a4b21

modinv: introduce theta=delta-1/2

f027e1f

Update safegcd writeup to reflect the code

5271894

Improve matrix computation assertions

ab4900d

Improve verify check and comments

7e2acda

- Add explanation for 59 divsteps vs 2^62 scaling

peterdettman force-pushed the opt_modinv branch from a41fbcd to 7e2acda Compare April 20, 2022 05:38

real-or-random closed this May 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster const-time modinv divsteps #1031

Faster const-time modinv divsteps #1031

peterdettman commented Dec 4, 2021 •

edited

Loading

sipa commented Dec 8, 2021

sipa commented Dec 8, 2021

real-or-random commented Dec 9, 2021

sipa commented Dec 9, 2021

peterdettman commented Dec 10, 2021

sipa commented Dec 10, 2021 •

edited

Loading

peterdettman commented Dec 10, 2021

sipa commented Dec 10, 2021

peterdettman commented Dec 10, 2021

sipa commented Dec 10, 2021 •

edited

Loading

sipa commented Dec 21, 2021

peterdettman commented Dec 21, 2021

robot-dreams left a comment

robot-dreams Dec 22, 2021

robot-dreams left a comment

peterdettman commented Dec 22, 2021 •

edited

Loading

peterdettman commented Dec 24, 2021

sipa commented Jan 4, 2022

peterdettman commented Jan 11, 2022

sipa commented Nov 21, 2022

sipa commented Jan 19, 2023

real-or-random commented May 11, 2023

Faster const-time modinv divsteps #1031

Faster const-time modinv divsteps #1031

Conversation

peterdettman commented Dec 4, 2021 • edited Loading

sipa commented Dec 8, 2021

sipa commented Dec 8, 2021

real-or-random commented Dec 9, 2021

sipa commented Dec 9, 2021

peterdettman commented Dec 10, 2021

sipa commented Dec 10, 2021 • edited Loading

peterdettman commented Dec 10, 2021

sipa commented Dec 10, 2021

peterdettman commented Dec 10, 2021

sipa commented Dec 10, 2021 • edited Loading

sipa commented Dec 21, 2021

peterdettman commented Dec 21, 2021

robot-dreams left a comment

Choose a reason for hiding this comment

robot-dreams Dec 22, 2021

Choose a reason for hiding this comment

robot-dreams left a comment

Choose a reason for hiding this comment

peterdettman commented Dec 22, 2021 • edited Loading

peterdettman commented Dec 24, 2021

sipa commented Jan 4, 2022

peterdettman commented Jan 11, 2022

sipa commented Nov 21, 2022

sipa commented Jan 19, 2023

real-or-random commented May 11, 2023

peterdettman commented Dec 4, 2021 •

edited

Loading

sipa commented Dec 10, 2021 •

edited

Loading

sipa commented Dec 10, 2021 •

edited

Loading

peterdettman commented Dec 22, 2021 •

edited

Loading