Issue 68: Refactor existing modular arithmetic code #80
Conversation
Looks great otherwise! I'll just do a quick `cargo bench u64` once I'm able to build, to double-check for perf changes as you suggested, though it seems like there shouldn't be any.
LGTM! I ran the u64 benches a few times on your branch & master, and saw some 5-10% swings in both directions. I don't think performance was actually affected; most likely my laptop is just being inconsistent.
So after all that, I am now somewhat doubtful that rustc optimises const generic code properly. I finished implementing Montgomery squaring last night and saw that, with const generics, the code was almost twice as slow as Montgomery multiplication, whereas when I simply replaced the const generic parameter with the number 4 (for the Tweedle* curves) in the same code, it took only about 60% of the time of Montgomery multiplication, which is much more in line with expectations (and a pretty great result in itself!). I'll have to investigate a bit further to see whether other people have had performance problems with const generics; I assumed they would work much like non-type template parameters in C++, but there's no way we should be seeing such huge performance differences if that were the case.

Our options might be (i) return to the previous situation with explicit specialisations to limb lengths 4 and 6, or (ii) use macros instead. Let's discuss further at the next catch-up.
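For concreteness, here is a minimal sketch of the two shapes being compared; the function names and the trivial body are made up for illustration, not the actual squaring code:

```rust
// Illustrative only: the same loop written with a const generic bound
// and with a hard-coded limb count. In principle both monomorphise to
// identical machine code; the measurements above suggest that, for the
// squaring code at least, they do not.
fn sum_limbs_generic<const N: usize>(a: &[u64; N]) -> u64 {
    let mut acc = 0u64;
    for i in 0..N {
        acc = acc.wrapping_add(a[i]);
    }
    acc
}

fn sum_limbs_fixed(a: &[u64; 4]) -> u64 {
    let mut acc = 0u64;
    for i in 0..4 {
        acc = acc.wrapping_add(a[i]);
    }
    acc
}
```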
Interesting! Sorry I didn't think of this in advance, but I think the issue may be with loop unrolling. Maybe we could build a proc macro, something like `#[unroll_const_generics(max = 6)]`, that would replace

```rust
for i in 0..n {
    body(i);
}
```

with

```rust
if 0 < n { body(0); }
if 1 < n { body(1); }
if 2 < n { body(2); }
if 3 < n { body(3); }
if 4 < n { body(4); }
if 5 < n { body(5); }
// Fallback in case a loop has more than 6 iterations.
for i in 6..n {
    body(i);
}
```

Then I think we should end up with the unrolled code we want, after the compiler eliminates the dead code and the always-true conditionals. Admittedly it would be a huge hack, and it might slow down compilation (though probably not that badly for small-ish values of `max`).
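Until such a macro exists, the expansion can be written out by hand. As a sketch of what the generated code might look like, here it is applied to a made-up toy function that doubles each limb in place (this compiles today with const generics):

```rust
// Hand-written version of the expansion sketched above. The const-folded
// `i < N` guards are what we would rely on the optimiser to eliminate.
fn double_in_place<const N: usize>(limbs: &mut [u64; N]) {
    if 0 < N { limbs[0] = limbs[0].wrapping_shl(1); }
    if 1 < N { limbs[1] = limbs[1].wrapping_shl(1); }
    if 2 < N { limbs[2] = limbs[2].wrapping_shl(1); }
    if 3 < N { limbs[3] = limbs[3].wrapping_shl(1); }
    if 4 < N { limbs[4] = limbs[4].wrapping_shl(1); }
    if 5 < N { limbs[5] = limbs[5].wrapping_shl(1); }
    // Fallback in case a loop has more than 6 iterations.
    for i in 6..N {
        limbs[i] = limbs[i].wrapping_shl(1);
    }
}
```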
We actually started with macros, but eventually got rid of them here. As I recall, we ran into a couple of little annoyances. One was that some of our macros would produce dead code given certain parameters; e.g. in subtraction, we had a sometimes-empty for loop to check limbs which only existed in one operand. (Of course we could have written a separate macro for subtraction with different limb counts, or suppressed the warnings.) Another little thing was that we ended up with lists of macro invocations which were slightly annoying to maintain. Nothing that major though, so we could definitely consider returning to that approach.

Perhaps another option to consider would be just removing the BLS curve, so that we only need to deal with ~256-bit fields. There's a chance that we might go back to BLS (Aztec and Matter Labs are going in that direction), but I think it's relatively low.
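For reference, a rough reconstruction of what that macro-based approach tends to look like; the macro and function names here are made up, not the old code:

```rust
// Illustrative reconstruction: one macro defining the function body,
// invoked once per supported limb count.
macro_rules! impl_add_assign {
    ($name:ident, $n:expr) => {
        fn $name(a: &mut [u64; $n], b: &[u64; $n]) {
            let mut carry = false;
            for i in 0..$n {
                let (sum, c1) = a[i].overflowing_add(b[i]);
                let (sum, c2) = sum.overflowing_add(carry as u64);
                a[i] = sum;
                carry = c1 | c2;
            }
        }
    };
}

// The per-width invocation lists mentioned above:
impl_add_assign!(add_assign_4, 4);
impl_add_assign!(add_assign_6, 6);
```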
I think this seems OK to merge. I was a bit confused since our main method (…)
I investigated a bit more on this topic. It seems rustc does unroll loops with const generic parameters to some extent, but for some reason my squaring code does not pass the compiler's internal threshold for code that's suitable to unroll. I'm fairly confident that the simple functions like … Indeed, the …
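If the unrolling heuristics can't be coaxed, one possible middle ground (a sketch only, with hypothetical helper names and placeholder bodies, not a worked implementation) is to keep the const-generic entry point but dispatch the hot limb counts to fixed-size paths, so the compiler sees a literal bound where it matters:

```rust
use std::convert::TryInto;

// `square_generic` stands in for the const-generic fallback and
// `square_fixed_4` for a hand-tuned 4-limb path; both bodies are
// placeholders so the sketch compiles.
fn square_generic<const N: usize>(a: &[u64; N]) -> [u64; N] {
    *a
}

fn square_fixed_4(a: &[u64; 4]) -> [u64; 4] {
    *a
}

fn square<const N: usize>(a: &[u64; N]) -> [u64; N] {
    if N == 4 {
        // The check on N makes these conversions infallible.
        let fixed: &[u64; 4] = a.as_slice().try_into().unwrap();
        let mut out = [0u64; N];
        out.copy_from_slice(&square_fixed_4(fixed));
        return out;
    }
    square_generic(a)
}
```

Since `N` is a constant at monomorphisation time, the `if N == 4` branch should be resolved by the compiler, so the dispatch itself costs nothing at runtime.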
This PR addresses the easy parts of #68. Almost all of the functions that were specialised to 4 or 6 limbs have been replaced with a single version that takes a const generic parameter `N` that can be any positive integer. There are a lot of changes, but they are all very repetitive and straightforward, I think.

I say "almost all" of the functions because it became clear that more significant restructuring will be necessary to refactor all the repeated `montgomery_multiplication` implementations. Specifically, the dependence on the representation is more explicit in that function (though it is of course present in the other arithmetic functions too). I have a solution in mind for tidying this all up that I will implement eventually (it's already partly done), but at this stage I think it's a better idea for me to do some actual performance work on the arithmetic. :)

Speaking of performance, it's important that we don't introduce any performance regressions with these refactoring PRs. I couldn't detect any performance issues with this PR on my setup, but the reviewer should double-check this on their setup!
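To give reviewers a feel for the shape of the change, here is a hedged illustration with a made-up function, not the crate's actual API:

```rust
// Before (illustrative): one copy of each arithmetic function per limb count.
fn is_zero_4(a: &[u64; 4]) -> bool {
    a.iter().all(|&limb| limb == 0)
}
fn is_zero_6(a: &[u64; 6]) -> bool {
    a.iter().all(|&limb| limb == 0)
}

// After (illustrative): a single const-generic version covering every width.
fn is_zero<const N: usize>(a: &[u64; N]) -> bool {
    a.iter().all(|&limb| limb == 0)
}
```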
(The commit log incorporates all the commits from my previous PR, because this one was built on top of it; the messages can be ignored, and it will merge fine. Sorry for the noise; I'll rebase properly next time.)