-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Signed-digit multi-comb for ecmult_gen (by peterdettman) #693
Conversation
Note that I have not reviewed the multiplication code in detail yet. |
@sipa do you think it would make more sense to benchmark all the blocks and teeth settings, and then find the efficient frontier for best performance as a function of memory, and then have the setting just offer a couple options along that frontier? As this stands, by my count this creates 306 build configurations which should at least get minimal testing occasionally. Probably just a couple sizes are sufficient for exposed options. Usage probably comes in "embedded device, as small as possible please" "embedded device, I care about performance but small please" "desktop, meh whatever but don't blow out my cache too bad" "signing network service GIVE ME ALL YOUR MEMORY", flavour and not too many others. |
@gmaxwell Yeah, I was thinking of doing something like that. I hoped to reduce the number of combinations by looking at the number of adds/doubles, and removing things that don't improve on either but have a larger table size. The problem is that the number of cmovs grows exponentially with the teeth settings, and at teeth=8 every point add is preceded by iterating over 8 kB of data to select the right entry, at which point the cmov cost is certainly nontrivial as well. |
Anyway, ignoring the costs of scalar operations and cmovs, and counting a point add as 260 ns and a doubling as 159 ns, these are the optimal configurations: blocks=1 teeth=2 tblsize=192 adds=127 doubles=127 |
Here is a more extensive list of all potentially-useful combinations, taking number of cmovs into account (and not making any assumptions about the relative costs of adds vs doubles vs cmovs): teeth=1 blocks=1 tblsize=128 adds=255 doubles=255 cmovs=256 teeth=2 blocks=1 tblsize=192 adds=127 doubles=127 cmovs=256 teeth=3 blocks=1 tblsize=256 adds=85 doubles=85 cmovs=344 teeth=4 blocks=1 tblsize=576 adds=63 doubles=63 cmovs=512 teeth=5 blocks=1 tblsize=1024 adds=51 doubles=51 cmovs=832 teeth=6 blocks=1 tblsize=2048 adds=42 doubles=42 cmovs=1376 teeth=7 blocks=1 tblsize=4096 adds=36 doubles=36 cmovs=2368 teeth=8 blocks=1 tblsize=8256 adds=31 doubles=31 cmovs=4096 |
Here is the efficient frontier I found running bench_sign on an AMD 7742 (using sha256+sha-ni as the nonce function).
Here are all the configurations visualized: Most points off the frontier aren't even close. The current 64kb table is 43.7µs on my test setup. I think the logic options are {11,6} (22528b), {2,5} (2048b) and then one of the smallest three options. I would say {1,3} (256b) but I assume that {1,1} (64b) has the benefit that a considerable amount of code could also be omitted which might understate its space advantage. Presumably only someone who was extremely memory starved would consider using something smaller than {2,5}. One possibility would be to just offer the 22kb choice and 2kb choice and leave it to the user to pick their best trade-off if they really want to go smaller, with a copy of these benchmark numbers in some integration documentation. You also might want to consider how this changes with endomorphism, since after next year the library can probably drop the non-endomorphism configuration. Under the old algorithm endomorphism would halve the table size for only slightly less performance. I might also want to run the figures for schnorr signing, since the constant time modular inversion of K is probably a big fraction of the total time in these figures and it doesn't exist there. The extra constant time chunk diminishes the magnitude of the speedup. Edit: LOL, yes, it's a 1.4x speedup to avoid that inversion. Maybe someone who wants to use this library to sign on a very memory constrained device might want to comment? The code itself is some tens of kilobytes, so I don't know how much anyone cares about saving a hundred bytes in exchange for running a tiny fraction of the speed. I wouldn't be surprised that if the slower configurations were also somewhat easier to attack with differential power analysis too. |
How small does a table need to be before it's reasonable to include in the source explicitly? Not having to run gen_context at build time would be nice. In #614 I considered 1024 bytes obviously OK. |
Just for fun, I ran the same test using a more optimized signer that doesn't use any slow inversions. The efficient frontier is slightly different, but I think my suggested options would be the same.
|
@douglasbakkum @jonasschnelli @ddustin @benma you've commented before about memory usage in signing. |
@gmaxwell I'm not sure what the best strategy is:
It'd be very useful to hear from people on embedded systems how much they care about small table sizes. |
I dislike leaving comb/teeth configurable in the build system. It would increase the number of build configurations that the project should be testing by hundreds of times for no clear gain... Plus if the project can't figure out how to set them correctly, it's doubtful that someone just using the build system is going to know what to do. I wouldn't see any problem with having hidden options to override-- no promises made that those configurations were in any way tested. Obviously I'm a fan of small number of presets but there needs to be feedback on which options people want. I'm hoping to hear something like the size of the library itself means that no one cares too much about 2048 bytes vs 256 bytes vs 64 bytes-- if so then choosing a small option will be much easier. Taking a size and solving for the best value that fits would be my second choice... but only because it results in a testing headache. Things like the peak stack usage and other behavioural differences wouldn't be monotone with the size... so I don't think e.g. testing the smallest and largest would be sufficient. Otherwise I would probably prefer this one. |
Excited that this is getting worked on!
If the code itself is tens of kilobytes than making it small may be a moot issue. What's making the code so big -- is there any low hanging fruit for shrinking it down? I've been using the mbed tls library: https://tls.mbed.org/api/group__encdec__module.html The main perk there being it uses probably under 1 kilobyte of ram in total -- though it is super slow. It also doesn't seem to get the attention that secp256k1 gets, so perhaps some vulnerabilities could slip through the cracks. I had a hell of a time figuring out that I needed to flip negative values to their positive variants in some places -- a task that libsecp256k1 mostly handles for you automatically. So perhaps there's an argument for libsecp256k1 getting more embed usable from an ease of use standpoint. I don't really know what's up with the project other than I guess ARM bought it? It actually looks fairly active: https://github.com/ARMmbed/mbedtls Surprisingly in the last few years some chips have been adding hardware optimized curves. Though support, documentation, etc can be sparse to non-existent. |
While adjusting the size for speed with granularity would be nice I don't think it's every really going to be needed. Steps would be great though -- xlarge, large, mediumlarge, medium, mediumsmall, small, xtrasmall, tiny would be plenty. Often you end up in this situation where you're just out of memory and you're hunting for things to make smaller. Having an option for that would be great -- but I can't imagine a scenario where it needs to be granular. |
@ddustin The library isn't optimized for code size: Correctness, testability, performance, safe interfaces..., sure. To some extent there are size/performance trade-offs: Things like the constant time inverse and the sqrt could be implemented using a small state machine, instead of an entirely unrolled function. Instead of 5 different field element normalize functions for different combinations of strength and constant-timeness we could have just one that was the strongest and constant time. Instead of implementing things that give big speedups like WNAF but take a bit of code ... we could ... just not do that. Rather than having an overcomplete field with carefully hand optimize multiplies and squares we could have just a plain bignum with schoolbook multiplies and squaring accomplished using the multiply instruction... On a desktop computer, all these decisions favouring speed at the expense of size are almost certainly good ones-- in fact it would probably be well worth it to increase the code size another 50% if it made it a few percent faster. Even on a mobile phone or similar, the same applies. On some tiny embedded device? Probably not-- it depends on the application. Of course, these devices are also slow-- so a tradeoff that made it 50% slower just to save a few hundred bytes of code probably wouldn't be a win. Some of these tradeoffs like the multi-fold normalizes could easily be made ifdef switchable between existing code and are also not that huge a performance cost. Other tradeoffs would be big performance hits or would require extensive rewrites. But it's hard to justify expending effort doing that, even where its easy, without a lot of help or at least feedback from someone that really cares. Otherwise, it's probably more interesting spending time on additional validation, other optimizations that are useful on big desktop/server cpus-- things like ADX or AVX2-FMA or GMP-less fast jacobi/inverse, because there the objectives are more clear: Is it faster enough to justify the effort testing and reviewing it. :) or on implementing additional cryptographic functions, since those at least enable new and interesting applications. Some things could probably be made a lot smaller without being any slower too (I think the aforementioned constant time sqrt and inverses are likely candidates), but just no one has tried. I think to the extent that this project is active at all the contributors to it are interested in working with people to make size optimizations at least ones that don't come at a substantial cost to other applications. But without at least people who care about them helping to drive the tradeoffs, I wouldn't expect any to happen quickly. |
@gmaxwell That all makes sense. Targeting super small chips may be foolish. Medium size chips might be interesting. I've been throwing around the idea of a cheap and low power as possible embedded full node. In that situation power usage is probably the main concern, though turning on flash cells also takes power. As kind of an aside to all of this -- I wonder if having an easy way to throw in backend hardware optimized curves would be worthwhile. Keep the safe interfaces while leveraging hardware optimization where it's available. Also hardware optimized curves are just, well, cool! |
413713c
to
1eb38a5
Compare
Rebased. I've left the configurability for now - we can decide on a set of supported configurations later if all tests still pass. I also noticed that COMB_NEGATION wasn't configurable before; added that. |
@gmaxwell If you feel inclined, there are now 2x more configurations to choose from (with COMB_NEGATION=0, the table size is doubled but negations are avoided... I'm not sure that's ever a good tradeoff, but with benchmarks we could tell for sure). |
Sounds fun. I can't imagine disabling negation will ever help, but lets see. :) |
Mentioning @peterdettman here, because I'm not sure he's aware that people are working on this. |
Yep, was aware, but it's in capable hands. Regarding the optional COMB_NEGATION, I recall I was thinking more about power analysis than performance. |
1eb38a5
to
4a165e8
Compare
Rebased. |
4a165e8
to
026cbfb
Compare
@peterdettman In what way is avoiding negations useful for power analysis? If there isn't a good reason to support it, I'd rather remove it from the codebase so there are fewer combinations to test and maintain. |
|
See e.g. One&Done: A Single-Decryption EM-Based Attack on OpenSSL’s Constant-Time Blinded RSA, by Alam et. al. The code for calculating windows in this PR was written to somewhat obscure the bits during window assembly by keeping more bits involved at each step (but it's not perfect). OTOH, the sign bit is isolated and turned into an all 0s or all 1s value. The paper is about attacking RSA, so negation of lookup values doesn't arise, but I think it's reasonable to extrapolate that isolating the sign bit and using it as a mask should be visible in the same way. It's possible to conceive of mitigations. e.g. the precomputed values could be possibly-pre-negated according to a random mask (as part of blinding), then the sign bit could be XORed with that mask before being isolated. EDIT: Removed section mistakenly based on scalar blinding cost being related to precomp size; that's only true with #570). |
4935a46
to
937a8c2
Compare
Rebased, but I still intend to write an explanation of the algorithm. |
What's the status here? Concretely, I'm wondering whether it makes sense to work on a PR that sticks with the existing gen_context code but prebuilds the table and adds a to the repo, in the same way as in #956. Then we would get rid of the common cross-compilation issues entirely, we'd be in the position to reconsider some things about contexts (flags, etc.) Pro: I could get this done pretty quickly. It's mostly changes in the build system and we have the boilerplate in #956 already. Con: This would conflict a lot of with the changes in this PR. |
I'm ok with moving forward with making the ecmult_gen context compile-time/precomputed only, and then rebasing this. |
I'm reworking this, and am considering a few changes. In particular:
@peterdettman Thoughts? |
@sipa The recoding is into 9*32bit, not 8. This is also why it can't just be a scalar. It's also the reason for the 288 limit. It's true that only 1 bit is ever set in the highest limb of the recoding, but it simplified the core loop to just be able to read into that limb instead of treating it specially. COMB_OFFSET is a specific optimization to let the comb cover exactly 256 bits since that has nice factors; there's still an implied extra bit for 257 total which is why we need the initial value in that case. If it can be merged with the blind somehow, that's all good. I'm a bit unsure of the rest since it's been a while. Concepts around initial point and where adjustments are done sound fine. |
@peterdettman If there is a benefit to the 9*32 representation that could easily be kept, but I think it's only useful for COMB_GROUPED configurations really, and none of those appear to be among the best candidates. |
@sipa The comb is always larger than 256 bits (although only 1 higher bit is ever set), e.g. the current default config covers 260 bits. The inner loop of _ecmul_gen collects the "teeth" at each position: e.g. bit_pos is in the range [0, 260) for the default config. So having the 9*32 representation avoids dealing specially with the bit positions >= 256. It may be possible to always use an offset point (secp256k1_ecmult_gen_context.offset) to account for the high bit (currently only done in the case of a comb config that covers exactly 256 bits) and just a scalar for the remaining bits. It would require dealing with those high comb positions somehow - probably just some const-time bit-twiddling can ensure we always get 0 when bit_pos >= 256:
Depending on the comb configuration that is probably only really needed for the highest block and maybe only the highest tooth of that block. Just to be clear: unless the offset point is merged with other initialization/finalization there is effectively an extra addition incurred which the comb usually does "for free". Also if the comb reads the high-bit as '0' it's subtracting 2^(bits-1) instead of adding, so the offset would then actually be 2^bits instead. I think that COMB_GROUPED should be removed. It only makes possible sense for 4 or 8 teeth and the fast options are actually 5,6,7. Even at 4 or 8 (even 2), using spacing of 1 would be extravagant use of memory, since you could halve that usage at the cost of a single doubling per sig. Removing it will simplify dealing with the above "high-tooth" issue. Once that is all resolved, then the ideas in #693 (comment) seem reasonable things to investigate. |
@peterdettman See https://github.com/sipa/secp256k1/commits/202112_sdmc I think the design decisions fall into three questions:
|
There's also a more formal way of restricting the comb, which is to not have all the blocks be uniform. (This options is mentioned in the original paper already, BTW: https://eprint.iacr.org/2012/309). One general pattern for this might be distinguishing "lower blocks" and "upper blocks" (even though it may be a common case to have only one upper block). If our current default scheme is 11B * 6T * 4S (==264), therefore because of the extra 8 bits, we're doing in a vague sense 1 useless addition (6 of those bits), and 1 excessive lookup (the other 2); there may be table entries that are unused in the top block (or let's say, at least, less-frequently used). So we can somehow save some combination of an addition, or memory, or lookup cmovs. e.g.
There are also implied cmov lookup savings in the upper blocks of those options. The single upper block of 4 bits ones are maybe especially attractive since they are easily added with a single addition at the end of the comb and are likely to simplify things for non-uniform addition. |
Is there any hope that dedicated asm could speed up the cmov table lookups? After all we have a very nice data shape and just want to blit it all with some masking. Can explicit prefetch add anything beyond what the compiler and/or hardware are doing automatically (given that we are just scanning forward over the whole precomputed data in each pass). Is there any possibility that consolidating the lookups (one-per-block) before performing the additions can be faster? |
No idea if it will help here but I sometimes considered an "arbitray-data" cmov that works on char arrays. Advantages will include not only the possibility to cmov data of arbitrary size. Also, we'll need to get only one cmov function right and consolidate some improvements/issues (#777, #457). So we could spend our time more efficiently on getting that single function constant-time on all platforms (the compiler could still specialize it for some sizes but I still think it's a good idea). We could then write an x86 ASM version, which could be beneficial not only for performance but using native CMOVs will guarantee that the compiler can't make it variable time. (In spirit it's similar to my unfinished #636 on data cleaning. IIRC, this implements all the |
Am I right that we could still use the rescaling blinding that we currently use (e.g., by applying rescaling the new first point instead of |
Yes, it's already in the new PR: https://github.com/sipa/secp256k1/blob/c914fcc130a04f7df174aa14b9ba134fcf79efa3/src/ecmult_gen_impl.h#L187 . |
@peterdettman Good to point out that non-uniform approaches are possible too, and if we're going in the direction of actually optimizing the fetch/add loop, it's probably better to just do nonuniform blocks (as they can avoid post-bit-256 issues entirely, and are probably much easier to emit efficient code for, as the loops are more predictable). I think I'll try that. My preference would be to keep the spacing uniform for simplicity reasons (the outer loop can stay), and then have lower blocks with T teeth en upper blocks with T-1 teeth. The upper blocks can use the same table and fetch loop, but forego conditional negation (or halve the fetch loop but keep conditional negation). This doesn't save additions, but isn't too big of a change either, and my intuition is that the optimal performance will be had with solutions that don't have too much difference between upper and lower blocks. EDIT: What I said here isn't very accurate; the upper blocks of course have separate tables to begin with. So in that case, I think the most reasonable approach to try is halving the upper block tables in size, and keep conditional negation for them. |
@sipa I think you may come to appreciate the virtues of just a single upper block of 4 (in the default case) bits as I mentioned previously. I actually find Upper(1B * 4T * 1S), Lower(7B * 6T * 6S) quite attractive. The extra 2 doubles are (with latest formulae) cheaper than the saved addition, so it will actually be faster than the current one, using 2/3 the table memory (which will also make it faster), and it saves an entire lookup PLUS replaces another 6T lookup (32 entries) with a 4T lookup (8 entries), so let's say 1.75 lookups saved (faster again). All that and the awkward upper block can just be added at the end outside the loop (basically it joins the final iteration), so it doesn't complicate that. The lower blocks only cover 252 bits (i.e. an odd scalar in (-2^252, 2^252) and it seems likely that could all be non-uniform additions. It's even possible to add the final blind/offset value to all 8 of the upper block table entries in precomp and save another (uniform) addition (need context space for that though). We should check for a similar option in a small-memory config too I guess. Since maximizing the number of teeth (in the face of the cmov costs) is where most of the speed comes from I realise it's intuitive to say lower@T then upper@(T-1), but I played around with the available configurations, and (trying to target exactly 256 bits) it's not all that important in the end and mostly the optimum is a single block at the top. If the spacing is less in the upper blocks, it just means join the loop after a few doubles, but yes that might end up being a separate loop. OTOH if it's a single block with a spacing of one it can just be a single addition after the loop. |
@peterdettman I'm starting to see the appeal of having a single upper block only, and giving it spacing 1 would be extra nice. Thinking about how one would specify the parametrization for that?
We wouldn't want the number of teeth in the upper configuration to exceed T (because going beyond the user's wish for T has a large impact on table size and cmov cost). So that would mean a requirement that (256 mod ((B-1)T)) <= T. Do you think that could be overly restrictive? Another possibility may be specifying configurations as (max adds, max table size) pairs, which probably more accurately corresponds to real-world costs (the entire table is read in cmovs, and most of the cost beyond that is adds (and the bit fetching for it)). It's not obvious to me how to translate these to actual table configurations though. To be clear, I do expect we'll have just 2 or 3 precomputed configurations to choose from for production, which could easily be curated, but it'd be nice to be able to test the code by hand for a wide variety of configurations easily. |
Here are a list of configurations (blocks=number of "full" blocks excluding the upper one if any, fteeth=teeth for final block which has spacing one, tblsize=in bytes, groupops=adds+doubles, bitops=number of get_bits or equivalent operations to construct the table entries). Excluded any configuration that is not better in at least one of the metrics:
If I also include conditional negations as a metric, 72 configurations remain. If I additionally split doubles/adds into separate metrics, 95. |
This may just be a rephrasing, but for some choice of B,T, use the highest S such that For 7T we need to hit 252 == For 6T we need to hit 252 == For 5T we need to hit 255 == For 4T we can hit 256 == We're already down at 512 byte, so I doubt fewer teeth are relevant, but in case 4T is somehow too many: So, yeah, overall it feels a little restrictive to me. This is basically why the original PR just let you aim for 256 plus a bit and we just accepted that the price of simplicity was probably going to be an extra addition. I'm not saying that's better or worse, it's just that the comb configurations aren't as flexible as we might like. Edit: sorry, of course it's the number of doubles is 1 less than the spacing, so all the above are 1 too high. |
Just give the additions a relative weight of 2 vs the doubles (even 2.1). Edit: Then again, what's the weight for a non-uniform addition? |
Also there seems to be no restriction on fteeth, that will cut your numbers down. |
I realized that there is no need for restricting fteeth: its "badness" is captured perfectly by the increase in tablesize (both memory usage and cmovs are affected by it, but those scale 1:1 with table size). I'll run more analysis to compare with the configurations supported by the current code, but my thinking now is that this design space is getting too big, and we should focus on getting some form of SDMC in first. I'll probably remove some optimizations from my new PR and perhaps reintroduce the uint32[9] array to simplify some edge cases. |
Yes I think that it would be wise to rein it in and get a basic version in place. |
See #1058 for the newest iteration. |
This is a rebase of #546. See the original PR for a full description, but in short, this introduces a new constant-time multiplication algorithm with precomputation, with better speed/size tradeoffs. It is more flexible, allowing both better speeds with the same table size, or smaller table sizes for the same speed. It permits extrmely small tables with still reasonable speeds.
In addition to the original PR, this also:
configure
, and tests a few combinations in Travis.