Benchmark faster hash functions #90
The `t1ha` family: these functions could be the fastest, too. Double-check with https://rurban.github.io/smhasher/
Note that "FarmHash" is not portable and is too machine-specific, as it operates differently on 64-bit and 32-bit systems. |
From the preliminary results (very limited data) on these benchmarks, I could rank the functions like this:
@tareknaser how could we approach realistic benchmarking? Something like building an index of a folder with 10K files from 100K to 2MB each? Does it make sense to generate such a test dataset (see the sketch below)? I'll also run the benches against my own real data (using the branches you've created). We could also extract collision ratios from such a benchmark.
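A sketch of a dataset generator along those lines, assuming the `rand` crate; the file count, the 100 KiB–2 MiB size range, and the naming scheme are placeholders:

```rust
use rand::{Rng, RngCore};
use std::fs;
use std::io::Write;
use std::path::Path;

/// Hypothetical helper: fill `dir` with `count` files whose sizes are drawn
/// uniformly from 100 KiB..=2 MiB, mimicking the proposed test dataset.
fn generate_dataset(dir: &Path, count: usize) -> std::io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut rng = rand::thread_rng();
    for i in 0..count {
        let size: usize = rng.gen_range(100 * 1024..=2 * 1024 * 1024);
        let mut buf = vec![0u8; size];
        rng.fill_bytes(&mut buf);
        let mut file = fs::File::create(dir.join(format!("file_{i:05}.bin")))?;
        file.write_all(&buf)?;
    }
    Ok(())
}
```

Calling something like `generate_dataset(Path::new("bench_data"), 10_000)` once before the run would give every hash function the same inputs within that run.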
Benchmarks using this data:
I'm a bit confused about these 2 points:
Looks like we need profiles, too.
Up:
Looking at the benchmark results, here's what I think:
For …
For …
We should think about the trade-off between fewer collisions and faster lookups when choosing hash sizes.
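Since collision counts would drive that decision, a collision ratio could be computed from the same dataset run. A hypothetical helper, assuming each file's hash has been truncated or widened to `u64`:

```rust
use std::collections::HashMap;

/// Fraction of files whose hash value is shared with at least one other file.
fn collision_ratio(hashes: &[u64]) -> f64 {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &h in hashes {
        *counts.entry(h).or_insert(0) += 1;
    }
    let colliding: usize = counts.values().filter(|&&c| c > 1).sum();
    colliding as f64 / hashes.len() as f64
}
```

CRC-32's full 32-bit value fits in the `u64` directly; wider digests would need truncating first, which itself affects the collision behaviour being measured.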
Our current benchmarks index the … Profiling can help identify performance bottlenecks within arklib. Benchmarks tell us "how fast", profiling tells us "why". However, if we opt for large datasets for benchmarks, we can store them separately from the project code on GitHub. Then, use tools like …
Do you think this could be the reason on 64-bit architecture?
Totally. But CRC-32 has zero collisions, too. If I remember correctly, for
Agree. But the strange part here is that the difference in hashing performance between CRC-32 and Blake-3 is the same as the difference in indexing performance...
Agree.
Do you think there are faster cryptographic functions? We could try, but let's first implement this for easier experimentation:
That's all correct. I just think we also need some heavy automated indexing test which will only be run manually. We could also collect some huge dataset and store it with some cloud storage provider, but generating random data might be quicker for an arbitrary person than downloading the dataset from the cloud. What do you think?
I think it contributes, yes. But maybe not the sole reason.
I think that generating random data would be more straightforward to implement and maintain, but using the same dataset is better for benchmarking as it provides a more accurate representation of real-world data. Generating random data would be the better choice for now; we can always switch to a fixed dataset later.
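To keep the convenience of generated data while still benchmarking identical bytes everywhere, the generator could be seeded. A sketch, assuming `rand`'s `StdRng`; the seed value is arbitrary:

```rust
use rand::rngs::StdRng;
use rand::{RngCore, SeedableRng};

/// Deterministic "random" file contents: for a given `rand` version, the same
/// seed yields the same bytes on every machine, so results stay comparable
/// without shipping a dataset.
fn seeded_file_bytes(seed: u64, size: usize) -> Vec<u8> {
    let mut rng = StdRng::seed_from_u64(seed);
    let mut buf = vec![0u8; size];
    rng.fill_bytes(&mut buf);
    buf
}
```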
Apparently,
We are interested in overall performance for index construction, i.e. both pure hashing speed and the number of collisions are important. Note that there are at least 2 kinds of `t1ha` function: `t1ha0` (the fastest) and `t1ha1` (portable). We should also measure `t1ha2` and `t1ha3` since they should have fewer collisions.

https://github.com/PositiveTechnologies/t1ha