Benchmark faster hash functions #90
The `t1ha` family: these functions could be the fastest, too. Double-check with https://rurban.github.io/smhasher/
Note that "FarmHash" is not portable and is too machine-specific, as it operates differently on 64-bit and 32-bit systems. |
From the preliminary results (very limited data) on these benchmarks, I could rank the functions like this:
@tareknaser how could we approach realistic benchmarking? Something like building an index of a folder with 10K files from 100K to 2MB each? Does it make sense to generate such a test dataset (see the sketch below)? I'll also run the benches against my own real data (using the branches you've created). We could also extract collision ratios from such a benchmark.
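A sketch of a dataset generator along those lines, assuming the `rand` crate; the file count, the 100 KiB–2 MiB size range, and the naming scheme are placeholders:

```rust
use rand::{Rng, RngCore};
use std::fs;
use std::io::Write;
use std::path::Path;

/// Hypothetical helper: fill `dir` with `count` files whose sizes are drawn
/// uniformly from 100 KiB..=2 MiB, mimicking the proposed test dataset.
fn generate_dataset(dir: &Path, count: usize) -> std::io::Result<()> {
    fs::create_dir_all(dir)?;
    let mut rng = rand::thread_rng();
    for i in 0..count {
        let size: usize = rng.gen_range(100 * 1024..=2 * 1024 * 1024);
        let mut buf = vec![0u8; size];
        rng.fill_bytes(&mut buf);
        let mut file = fs::File::create(dir.join(format!("file_{i:05}.bin")))?;
        file.write_all(&buf)?;
    }
    Ok(())
}
```

Calling something like `generate_dataset(Path::new("bench_data"), 10_000)` once before the run would give every hash function the same inputs within that run.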
Benchmarks using this data:
I'm a bit confused about these 2 points:
Looks like we need profiles, too.
Up:
Looking at the benchmark results, here's what I think:
For …
For …
We should think about the trade-off between fewer collisions and faster lookups when choosing hash sizes.
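Since collision counts would drive that decision, a collision ratio could be computed from the same dataset run. A hypothetical helper, assuming each file's hash has been truncated or widened to `u64`:

```rust
use std::collections::HashMap;

/// Fraction of files whose hash value is shared with at least one other file.
fn collision_ratio(hashes: &[u64]) -> f64 {
    let mut counts: HashMap<u64, usize> = HashMap::new();
    for &h in hashes {
        *counts.entry(h).or_insert(0) += 1;
    }
    let colliding: usize = counts.values().filter(|&&c| c > 1).sum();
    colliding as f64 / hashes.len() as f64
}
```

CRC-32's full 32-bit value fits in the `u64` directly; wider digests would need truncating first, which itself affects the collision behaviour being measured.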
Our current benchmarks index the … Profiling can help identify performance bottlenecks within arklib. Benchmarks tell us "how fast", profiling tells us "why". However, if we opt for large datasets for benchmarks, we can store them separately from the project code on GitHub. Then, use tools like …
Do you think this could be the reason on 64-bit architecture?
Totally. But CRC-32 has zero collisions, too. If I remember correctly, for
Agree. But the strange part here is that the difference in hashing performance between CRC-32 and Blake-3 is the same as the difference in indexing performance...
Agree.
Do you think there are faster cryptographic functions? We could try, but let's first implement this for easier experimentation:
That's all correct. I just think we also need some heavy automated indexing test which will only be run manually. We could also collect some huge dataset and store it with some cloud storage provider, but generating random data might be quicker for an arbitrary person than downloading the dataset from the cloud. What do you think?
I think it contributes, yes. But maybe not the sole reason.
I think that generating random data would be more straightforward to implement and maintain, but using the same dataset is better for benchmarking as it provides a more accurate representation of real-world data. Generating random data would be the better choice for now; we can always switch to a fixed dataset later.
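To keep the convenience of generated data while still benchmarking identical bytes everywhere, the generator could be seeded. A sketch, assuming `rand`'s `StdRng`; the seed value is arbitrary:

```rust
use rand::rngs::StdRng;
use rand::{RngCore, SeedableRng};

/// Deterministic "random" file contents: for a given `rand` version, the same
/// seed yields the same bytes on every machine, so results stay comparable
/// without shipping a dataset.
fn seeded_file_bytes(seed: u64, size: usize) -> Vec<u8> {
    let mut rng = StdRng::seed_from_u64(seed);
    let mut buf = vec![0u8; size];
    rng.fill_bytes(&mut buf);
    buf
}
```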
Apparently,
We are interested in overall performance for index construction, i.e. both pure hashing speed and the number of collisions are important. Note that there are at least 2 kinds of `t1ha` function: `t1ha0` (the fastest) and `t1ha1` (portable). We should also measure `t1ha2` and `t1ha3` since they should have fewer collisions.

https://github.com/PositiveTechnologies/t1ha