Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Paranoia settings are a bit confused #261

Closed
SeeSpotRun opened this issue Nov 5, 2017 · 15 comments
Closed

Paranoia settings are a bit confused #261

SeeSpotRun opened this issue Nov 5, 2017 · 15 comments

Comments

@SeeSpotRun
Copy link
Collaborator

Current hashes for different paranoia levels are:

-PP:       xxhash   (64 bit, non-cryptographic)
-P:        bastard  (256 bit, murmur128+city128)
(default): blake2b  (512 bit, cryptographic)
-p:        sha512   (512 bit, cryptographic) if available else sha256
-pp:       bitwise comparison

The -p option is actually less secure than the default option, due to the default being upgraded to blake2b.
Branch https://github.com/SeeSpotRun/rmlint/tree/paranoia proposes a new sequence for discussion. This includes a re-write of bastard hash to provide a more secure option than blake2b:

-PP:       spooky   (128 bit, non-cryptographic)
-P:        sha256   (256 bit, cryptographic)
(default): blake2b  (512 bit, cryptographic)
-p:        bastard* (1024 bit, cryptographic, blake2b + sha3-512)
-pp:       bitwise comparison
@sahib
Copy link
Owner

sahib commented Nov 5, 2017

You're right there. Forgot to change it in that place.
(Although I'd see blake2b and sha512 somewhat equal - but confusing nevertheless)

This includes a re-write of bastard hash to provide a more secure option than blake2b:

Truth be told, I'm still no big fan of that bastard hash. From my point of view it neither adds real security, nor performance advantages. Personally I'd even vote to make a single -p already do the bitwise comparison with no step in between. Would also make the explanation in the docs a bit simpler.

@SeeSpotRun
Copy link
Collaborator Author

Personally I'd even vote to make a single -p already do the bitwise comparison with no step in between

Agreed, was thinking this also; anyone who can't live with the risk of 512 bit hash collisions isn't going to be satisfied by 1024 bits either.

Let's drop -p and -a bastard.

@SeeSpotRun
Copy link
Collaborator Author

Here's a comparison table based on our standard /usr test with warm caches, R5 1600 cpu:

                                           BITS
 HASH        TIME   CPU    RAM(kB)  32  64  128 160 256 384 512  ∞
 ----------------------------------------------------------
 spooky32    3.95   313    169       X
 murmur      3.96   311    169               X
 spooky64    3.96   313    169           X
 spooky128   3.96   312    169               X
 xxhash      3.96   299    167           X
 farmhash    3.97   314    170           X
 city128     3.97   331    172               X
~MD5         4.05   331    217               X
 murmur256   4.06   359    180                       X
~SHA1        4.11   359    220                   X
 city256     4.20   393    182                       X
*SHA512      4.39   429    223                               X
 murmur512   4.43   435    188                               X
*SHA256      4.56   462    224                       X
 city512     4.68   486    189                               X
 paranoid    5.23   331   1438                                   X
*BLAKE2B     5.90   636    222                               X
*BLAKE2BP    6.63   675    429                               X
*BLAKE2S     7.26   738    206                       X
*BLAKE2SP    8.38   773    428                       X
*SHA3-256   21.45  1008    221                       X
*SHA3-384   26.55  1042    221                           X
*SHA3-512   36.05  1086    221                               X

Another thing I came across in the process when reading about hash collision attacks: if I'm reading them correctly, if two files have identical murmur hashes then they will also have identical seeded murmur hashes (http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/). So our MURMUR256 and MURMUR512 hashes (which are just multiple MURMUR128 hashes with different seeds) actually offer no better collision resistance than MURMUR128. I suspect the same may also be true for our CITY256 and CITY512 concoctions...

@sahib
Copy link
Owner

sahib commented Nov 7, 2017

Here's a comparison table based on our standard /usr test with warm caches, R5 1600 cpu:

Cool table! 👍

I guess your point here is also that paranoid is pretty much as fast as blake2b? 😏
Good opportunity to reevaluate -P and -PP. What do you suggest?

Another thing I came across in the process when reading about hash collision attacks: if I'm reading them correctly, if two files have identical murmur hashes then they will also have identical seeded murmur hashes

You're right here, but that's due the way we seeded the initial hash. In the end it's 4 times the same hash xor'd with the initial seed value. We'd have to do the seeding like for the sha1 digests (i.e. feed murmur with a constant seed in the beginning for the second half in case of murmur256).

But after all, I also don't mind if we remove those "creations" from our codebase. I guess we support by far enough established hash functions.

@SeeSpotRun
Copy link
Collaborator Author

Checksum code cleanup well underway: https://github.com/SeeSpotRun/rmlint/tree/checksum-dev
Will revisit paranoia scale shortly...

@SeeSpotRun
Copy link
Collaborator Author

Hmmm, seems that none of our non-cryptographic hashes have correct streaming implementations. Feeding the data to the hasher in several increments gives a different checksum to when the same data is hashed all at once.

The current strategy for these hashes is to use the result from the previous increment as seed for the next increment. That's probably still going to give a hash result with the desirable properties of randomness and collision resistance, but there's really no way to be sure.

For murmur in particular there is an issue because it only uses the first 32 bits of the previous increment as seed for the next. So files which differ at the start but have identical endings will have a 1 in 2^32 chance of hash collision instead of the expected 1 in 2^128.

My thought is we should drop most of these hashes from rmlint, maybe preserving one non-cryptographic hash but rewriting its code to the more classical streaming format:

state = state_init();
while data {
    update(state, data, len);
}
checksum = finalise(state)

The excellent analysis at https://github.com/rurban/smhasher gives some good clues on which non-crypt hash to go with.

@sahib
Copy link
Owner

sahib commented Nov 13, 2017

The current strategy for these hashes is to use the result from the previous increment as seed for the next increment. That's probably still going to give a hash result with the desirable properties of randomness and collision resistance, but there's really no way to be sure.

I generally agree, but should note that hash functions generally work blockwise anyways. Streaming implementations usually do nothing more than a XOR (or similar) of the previous block value with the current block value.

For murmur in particular there is an issue because it only uses the first 32 bits of the previous increment as seed for the next. So files which differ at the start but have identical endings will have a 1 in 2^32 chance of hash collision instead of the expected 1 in 2^128.

True, that's a problem.

My thought is we should drop most of these hashes from rmlint, maybe preserving one non-cryptographic hash but rewriting its code to the more classical streaming format: [...]

I don't mind dropping some less used ones (like farmhash/city e.g.). But instead of doing implementing "streaming" (which is not a native concept and merely an API concept, as said above) we could do also the following:

  1. Take the previous block value and feed it really to the hash function.
  2. XOR the previous block value to the new block value.

Both options are pretty general and work for each non-crypotgraphic hash we have.

The excellent analysis at https://github.com/rurban/smhasher gives some good clues on which non-crypt hash to go with.

Very nice link.

@SeeSpotRun
Copy link
Collaborator Author

I've taken this further in https://github.com/SeeSpotRun/rmlint/tree/checksum-dev.

I converted murmur and xxhash to streaming form. They have fairly small state structures (40 & 88 bytes respectively).

I ditched spooky, city and farmhash because they required large state structs and didn't offer any particular advantages.

Based on https://github.com/rurban/smhasher the best-performed 128-bit hash is metro hash, so I have added that to rmlint. Also since it offers two "statistically unique" variants, it's acceptable to cat the two variants together to create a 256-bit variant.

Test results for a synthetic test (256 pairs of duplicate ~4MB sparse files grouped into 32 size groups of 16 files each; 2 cpu cores):

HASH          TIME     CPU     RAM     BITS  CRYPTO
metrocrc      0.888    173     57600   128     N
metro         1.022    177     57061   128     N
xxhash        1.170    179     47501    64     N
paranoid      1.492    183    247795    ∞      -
murmur        1.562    186     56775   128     N
metro256      1.763    187     55740   256     N
metrocrc256   1.790    187     56261   256     N
md5           1.900    189     58320   128     ~
sha1          2.030    189     54388   160     ~
highway256    3.080    194     56756   256     Y?
highway128    3.095    193     54786   128     Y?
highway64     3.110    192     57126    64     Y?
sha512        3.670    194     52410   512     Y
sha256        4.325    195     49090   256     Y
cumulative    7.60     197     51040   128     N
blake2b      13.63     198     56784   512     Y
blake2bp     14.01     198     46204   512     Y
blake2s      22.22     199     53388   256     Y
blake2sp     22.62     199     49244   256     Y
sha3-256     85.20     199     53884   256     Y
sha3-384    108.83     199     54064   384     Y
sha3-512    155.59     199     53296   512     Y

The metro hashes live up to the promise of https://github.com/rurban/smhasher. The 256 and 128 bit variants are my choices for -P and -PP options.

The sha3 results are disappointingly slow, despite the speed results at https://www.linkedin.com/pulse/sha-3-shake-keccak-final-version-william-buchanan. I suspect there is a flaw in our implementation (either the hash function itself or the way we manage threads and buffering). Running rmlint --hash -a sha3-512 <file> is over 10 times slower than running rhash --sha3-512 <file>

Sha512 is way faster than blake and looks like a good candidate for default hash. In which case there is no good candidate for -p other than paranoid.

The paranoid hash would be a good candidate for the default hash except that:
(a) on rotational disks it loses out on speed because it has to sacrifice seek order to conserve RAM
(b) with large sets of same-size files is slows down drastically due to algorithm inefficiency.

SeeSpotRun added a commit to SeeSpotRun/rmlint that referenced this issue Nov 14, 2017
@SeeSpotRun
Copy link
Collaborator Author

SeeSpotRun commented Nov 14, 2017

Well even after copying the sha3 code directly from rhash source, sha3 was still extremely slow. So I had a look at the compiler settings and realised that no compiler optimisation is applied with DEBUG=1..
Is that the intention?

rmlint/SConstruct

Lines 613 to 618 in dcf23a0

if ARGUMENTS.get('DEBUG') == "1":
conf.env.Append(CCFLAGS=['-ggdb3'])
else:
# Generic compiler:
conf.env.Append(CCFLAGS=['-Os'])
conf.env.Append(LINKFLAGS=['-s'])

Anyway with -Os the times come down to something more reasonable, and can match the rhash sha3 times if I compile with -O2.

For the same synthetic benchmark as above:

HASH          TIME   CLANG   CPU     RAM     BITS  CRYPTO
xxhash        0.468  0.464    147     19566    64     N
metro         0.527  0.535    152     22298   128     N      <-
metrocrc      0.582  0.609    155     27419   128     N
metrocrc256   0.679  0.716    160     32684   256     N
metro256      0.689  0.728    161     29272   256     N      <-
cumulative    0.698  0.651    161     32838   128     X
murmur        0.720  0.669    162     35540   128     N
paranoid      0.989  0.993    168    144731    ∞      -      <-
highway64     1.980  2.214    183     53963    64     Y?
highway128    1.986  2.217    183     53457   128     Y?
highway256    1.988  2.226    183     54101   256     Y?     <-
md5           2.351  2.263    183     55438   128     Y(broken)
blake2b       2.754  3.133    186     53251   512     Y      <-
blake2bp      3.087  3.362    187     52477   512     Y
sha1          3.620  3.630    187     52468   160     Y(broken)
blake2s       4.112  5.150    188     51048   256     Y
blake2sp      4.632  5.410    188     49001   256     Y
sha512        6.447  6.152    189     51745   512     Y
sha256        8.473  8.523    190     50892   256     Y
sha3-256     11.935  6.610    191     50134   256     Y
sha3-384     15.735  8.507    192     49314   384     Y
sha3-512     22.130 12.085    192     55236   512     Y

The 5 hashes highlighted with <- appear to offer best "value for money" i.e. collision resistance vs resource usage, and would suggest paranoia levels metro < metro256 < highway256 < blake2b < paranoid.

EDIT: significantly better sha3 performance if compiled with clang vs gcc (clang times added above)

@sahib
Copy link
Owner

sahib commented Nov 14, 2017

Is that the intention?

Definitely. Using -Os removes debug symbols, making rmlint harder to debug.
If you want to benchmark stuff, always use DEBUG=0 since even -ggdb3 will add some runtime.

Anyway with -Os the times come down to something more reasonable, and can match the rhash sha3 times if I compile with -O2.

I like your benchmarks, but keep in mind that the performance here is pretty hard to measure. It also depends a lot on the CPU that you are using (especially the instruction set), the chose compiler (you noticed that), what settings you pass to the compiler and also how optimized the implementation is (you tied different ones, but one might perform better on Intel than on AMD). Still you're doing very good work to get the much needed "feeling" for performance here.

The 5 hashes highlighted with <- appear to offer best "value for money" i.e. collision resistance vs resource usage, and would suggest paranoia levels metro < metro256 < highway256 < blake2b < paranoid.

This already sounds very reasonable to me (with blake2b being the default I guess?), although I never really heard of metro and highway yet. Feels like there's a new hash function every morning.
I was confused/surprised earlier that sha512 was faster than blake2b since being fast was one of the claims made by the blake authors...

@SeeSpotRun
Copy link
Collaborator Author

Using -Os removes debug symbols, making rmlint harder to debug

Ah ok thanks for clarifying

@sahib
Copy link
Owner

sahib commented Nov 29, 2017

I guess we can close this since #265 was merged.

@sahib sahib closed this as completed Nov 29, 2017
@p1839251
Copy link

Hello,
I'd like to know if -pp (byte by byte comparison) is gone forever?

$ rmlint -pp

ERROR: Only up to -p or down to -PPP flags allowed.

@sahib
Copy link
Owner

sahib commented Jul 14, 2018

It's still there. paranoid was changed to -p from -pp, due to the discussion in this thread.
I noticed however that some docs still mention -pp, which I fixed as of 703bfda.

From the manpage:

       -p --paranoid / -P --less-paranoid (default)
              Increase  or decrease the paranoia of rmlint's duplicate algorithm.  
              Use -p if you want byte-by-byte comparison without any hashing.

              · -p is equivalent to --algorithm=paranoid
              · -P is equivalent to --algorithm=highway256
              · -PP is equivalent to --algorithm=metro256
              · -PPP is equivalent to --algorithm=metro

@p1839251
Copy link

thank you, it was confusing since -pp is still in the archlinux manpage:
http://jlk.fjfi.cvut.cz/arch/manpagesman/community/rmlint/rmlint.1.en

v 2.7.0-1

-p --paranoid / -P --less-paranoid (default)
Increase or decrease the paranoia of rmlint's duplicate algorithm. Use -pp if you want byte-by-byte comparison without any hashing.
• -p is equivalent to --algorithm=paranoid
• -P is equivalent to --algorithm=highway256
• -PP is equivalent to --algorithm=metro256
• -PPP is equivalent to --algorithm=metro

It sounded like -pp was more secure than -p (paranoid)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants