-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hash/crc32: use faster method for CRC32-C #16107
Labels
Milestone
Comments
If you can make it faster while producing the same results, by all means go for it. Thanks. |
Working on a decent initial version. Sneak peek:
|
CL https://golang.org/cl/24471 mentions this issue. |
CL https://golang.org/cl/27931 mentions this issue. |
gopherbot
pushed a commit
that referenced
this issue
Aug 28, 2016
The algorithm is explained in the comments. The improvement in throughput is about 1.4x for buffers between 500b-4Kb and 2.5x-2.6x for larger buffers. Additionally, we no longer initialize the software tables if SSE4.2 is available. Adding a test for the SSE implementation (restricted to amd64 and amd64p32). Benchmarks on a Haswell i5-4670 @ 3.4 GHz: name old time/op new time/op delta CastagnoliCrc15B-4 21.9ns ± 1% 22.9ns ± 0% +4.45% CastagnoliCrc15BMisaligned-4 22.6ns ± 0% 23.4ns ± 0% +3.43% CastagnoliCrc40B-4 23.3ns ± 0% 23.9ns ± 0% +2.58% CastagnoliCrc40BMisaligned-4 25.4ns ± 0% 26.1ns ± 0% +2.86% CastagnoliCrc512-4 72.6ns ± 0% 52.8ns ± 0% -27.33% CastagnoliCrc512Misaligned-4 76.3ns ± 1% 56.3ns ± 0% -26.18% CastagnoliCrc1KB-4 128ns ± 1% 89ns ± 0% -30.04% CastagnoliCrc1KBMisaligned-4 130ns ± 0% 88ns ± 0% -32.65% CastagnoliCrc4KB-4 461ns ± 0% 187ns ± 0% -59.40% CastagnoliCrc4KBMisaligned-4 463ns ± 0% 191ns ± 0% -58.77% CastagnoliCrc32KB-4 3.58µs ± 0% 1.35µs ± 0% -62.22% CastagnoliCrc32KBMisaligned-4 3.58µs ± 0% 1.36µs ± 0% -61.84% name old speed new speed delta CastagnoliCrc15B-4 684MB/s ± 1% 655MB/s ± 0% -4.32% CastagnoliCrc15BMisaligned-4 663MB/s ± 0% 641MB/s ± 0% -3.32% CastagnoliCrc40B-4 1.72GB/s ± 0% 1.67GB/s ± 0% -2.69% CastagnoliCrc40BMisaligned-4 1.58GB/s ± 0% 1.53GB/s ± 0% -2.82% CastagnoliCrc512-4 7.05GB/s ± 0% 9.70GB/s ± 0% +37.59% CastagnoliCrc512Misaligned-4 6.71GB/s ± 1% 9.09GB/s ± 0% +35.43% CastagnoliCrc1KB-4 7.98GB/s ± 1% 11.46GB/s ± 0% +43.55% CastagnoliCrc1KBMisaligned-4 7.86GB/s ± 0% 11.70GB/s ± 0% +48.75% CastagnoliCrc4KB-4 8.87GB/s ± 0% 21.80GB/s ± 0% +145.69% CastagnoliCrc4KBMisaligned-4 8.83GB/s ± 0% 21.39GB/s ± 0% +142.25% CastagnoliCrc32KB-4 9.15GB/s ± 0% 24.22GB/s ± 0% +164.62% CastagnoliCrc32KBMisaligned-4 9.16GB/s ± 0% 24.00GB/s ± 0% +161.94% Fixes #16107. Change-Id: Ibe50ea76574674ce0571ef31c31015e0ed66b907 Reviewed-on: https://go-review.googlesource.com/27931 Run-TryBot: Brad Fitzpatrick <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Brad Fitzpatrick <[email protected]>
This was referenced Aug 28, 2016
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
The hardware (SSE4.2) assisted CRC32-C implementation can be made faster (from my experience about 2.5 - 2.7 times faster if checksumming enough bytes e.g. 512+),
The gist of the idea is that (on recent Intel platforms at least) the CRC32 instruction takes three cycles but three of them can be pipelined in parallel if they don't depend on each other; so we can split the buffer into three, checksum the three pieces in "parallel" (interleaving the CRC32 instructions) and then combine the results using some tables.
This article goes into more depth: http://www.drdobbs.com/parallel/fast-parallelized-crc-computation-using/229401411
I can take a stab at implementing it at some point when I have some time of my hands, unless there are objections.
The text was updated successfully, but these errors were encountered: