Add avx512 implementation #79
base: main
Conversation
```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
            │ /tmp/old.results │           /tmp/new.results           │
            │      sec/op      │    sec/op      vs base               │
Sum64/4B          2.295n ± 5%     3.018n ± 2%  +31.53% (p=0.000 n=10)
Sum64/16B         3.103n ± 5%     4.168n ± 3%  +34.32% (p=0.000 n=10)
Sum64/100B        9.865n ± 4%     8.515n ± 3%  -13.68% (p=0.000 n=10)
Sum64/4KB         201.4n ± 3%     133.1n ± 3%  -33.91% (p=0.000 n=10)
Sum64/10MB        489.8µ ± 4%     384.4µ ± 3%  -21.52% (p=0.000 n=10)
geomean           92.93n          88.67n        -4.58%

            │ /tmp/old.results │           /tmp/new.results           │
            │       B/s        │      B/s       vs base               │
Sum64/4B         1.623Gi ± 4%    1.234Gi ± 2%  -23.96% (p=0.000 n=10)
Sum64/16B        4.802Gi ± 5%    3.575Gi ± 3%  -25.56% (p=0.000 n=10)
Sum64/100B       9.441Gi ± 4%   10.937Gi ± 3%  +15.85% (p=0.000 n=10)
Sum64/4KB        18.49Gi ± 3%    27.99Gi ± 3%  +51.33% (p=0.000 n=10)
Sum64/10MB       19.01Gi ± 4%    24.23Gi ± 3%  +27.41% (p=0.000 n=10)
geomean          7.631Gi         7.998Gi        +4.81%
```
I've tried to optimize the numbers for small sizes but I don't think I can, since a huge part of that slowdown is checking the `useAvx512` global. I think that's fine: 4ns is still extremely fast for a single hash operation.
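For context, a minimal sketch of the kind of feature gate being discussed (not the PR's actual code; the detection expression is an assumption, using `golang.org/x/sys/cpu`):

```
//go:build amd64

package xxhash

import "golang.org/x/sys/cpu"

// useAvx512 gates the vector path and is read on every Sum64 call.
// VPMULLQ needs AVX-512DQ, and the 128/256-bit forms also need AVX-512VL,
// so all three feature bits are required before taking the AVX-512 path.
var useAvx512 = cpu.X86.HasAVX512F && cpu.X86.HasAVX512DQ && cpu.X86.HasAVX512VL
```

The branch itself is nearly free on large inputs; it only shows up on 4–16 B calls where the whole hash is a handful of nanoseconds.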
```
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                  │ /tmp/old.results │           /tmp/new.results           │
                  │      sec/op      │    sec/op      vs base               │
DigestBytes/4B          9.358n ± 2%     9.377n ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B         11.48n ± 3%     11.48n ± 1%        ~ (p=0.469 n=10)
DigestBytes/100B        15.97n ± 2%     20.41n ± 3%  +27.80% (p=0.000 n=10)
DigestBytes/4KB         212.4n ± 2%     167.7n ± 3%  -21.09% (p=0.000 n=10)
DigestBytes/10MB        493.2µ ± 3%     380.3µ ± 2%  -22.90% (p=0.000 n=10)
geomean                 178.2n          169.6n        -4.87%

                  │ /tmp/old.results │           /tmp/new.results           │
                  │       B/s        │      B/s       vs base               │
DigestBytes/4B         407.6Mi ± 2%    406.8Mi ± 2%        ~ (p=0.481 n=10)
DigestBytes/16B        1.298Gi ± 3%    1.298Gi ± 1%        ~ (p=0.529 n=10)
DigestBytes/100B       5.831Gi ± 2%    4.563Gi ± 3%  -21.75% (p=0.000 n=10)
DigestBytes/4KB        17.54Gi ± 2%    22.22Gi ± 3%  +26.72% (p=0.000 n=10)
DigestBytes/10MB       18.88Gi ± 2%    24.49Gi ± 2%  +29.69% (p=0.000 n=10)
geomean                3.979Gi         4.183Gi        +5.12%
```
```
> benchstat /tmp/{old,new,extra}.results
goos: linux
goarch: amd64
pkg: github.com/cespare/xxhash/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
                  │  no avx512   │             avx512                   │           avx512+extra               │
                  │    sec/op    │    sec/op      vs base               │    sec/op      vs base               │
DigestBytes/4B       9.358n ± 2%     9.377n ± 2%        ~ (p=0.481 n=10)    5.873n ± 5%  -37.24% (p=0.000 n=10)
DigestBytes/16B     11.485n ± 3%    11.485n ± 1%        ~ (p=0.469 n=10)    7.292n ± 5%  -36.51% (p=0.000 n=10)
DigestBytes/100B     15.97n ± 2%     20.41n ± 3%  +27.80% (p=0.000 n=10)    20.31n ± 3%  +27.18% (p=0.000 n=10)
DigestBytes/4KB      212.4n ± 2%     167.7n ± 3%  -21.09% (p=0.000 n=10)    163.5n ± 3%  -23.02% (p=0.000 n=10)
DigestBytes/10MB     493.2µ ± 3%     380.3µ ± 2%  -22.90% (p=0.000 n=10)    375.1µ ± 2%  -23.94% (p=0.000 n=10)
geomean              178.2n          169.6n        -4.87%                   139.8n       -21.57%

                  │  no avx512   │             avx512                   │           avx512+extra               │
                  │     B/s      │      B/s       vs base               │      B/s       vs base               │
DigestBytes/4B      407.6Mi ± 2%    406.8Mi ± 2%        ~ (p=0.481 n=10)   649.6Mi ± 5%  +59.35% (p=0.000 n=10)
DigestBytes/16B     1.298Gi ± 3%    1.298Gi ± 1%        ~ (p=0.529 n=10)   2.044Gi ± 5%  +57.49% (p=0.000 n=10)
DigestBytes/100B    5.831Gi ± 2%    4.563Gi ± 3%  -21.75% (p=0.000 n=10)   4.586Gi ± 3%  -21.35% (p=0.000 n=10)
DigestBytes/4KB     17.54Gi ± 2%    22.22Gi ± 3%  +26.72% (p=0.000 n=10)   22.78Gi ± 3%  +29.90% (p=0.000 n=10)
DigestBytes/10MB    18.88Gi ± 2%    24.49Gi ± 2%  +29.69% (p=0.000 n=10)   24.83Gi ± 2%  +31.48% (p=0.000 n=10)
geomean             3.979Gi         4.183Gi        +5.12%                  5.074Gi       +27.51%
```
Overall LGTM. A few notes:

A) Test on Intel as well if possible. Alder Lake has terrible latency on VPMULLQ compared to AMD. Your switch-over point may be different.

B) wrt small sizes, I would do the size check in Go code and fall back for sizes <32 (or a similar threshold). That also eliminates all the initial branching: https://github.com/cespare/xxhash/pull/79/files#diff-c74c7d4d55252632fe997e1642f175859870550d550e8a96351e1782a6793bf2R82-R89

Actually, if I do fall back to pure Go like this:

```
func Sum64(b []byte) uint64 {
	if len(b) < 32 {
		var h uint64
		h = prime5 + uint64(len(b))
		for ; len(b) >= 8; b = b[8:] {
			k1 := round(0, u64(b[:8]))
			h ^= k1
			h = rol27(h)*prime1 + prime4
		}
		if len(b) >= 4 {
			h ^= uint64(u32(b[:4])) * prime1
			h = rol23(h)*prime2 + prime3
			b = b[4:]
		}
		for ; len(b) > 0; b = b[1:] {
			h ^= uint64(b[0]) * prime5
			h = rol11(h) * prime1
		}
		h ^= h >> 33
		h *= prime2
		h ^= h >> 29
		h *= prime3
		h ^= h >> 32
		return h
	}
	if useAvx512 {
		return sum64Avx512(b)
	}
	return sum64Scalar(b)
}
```

... it is actually faster than using scalar asm. Not as fast as before, but the regression is certainly less. Can't test AVX-512.

```
                   main           avx-512        with above
BenchmarkSum64/4B  1391.82 MB/s | 942.91 MB/s  | 1281.78 MB/s
```

or...

C) Try doing the branching in the assembly function. The branch should be fully predictable in the benchmark, so not affecting it too much. The function call is probably what makes the difference. Keeping … (I don't really like any of your other solutions - maybe except ABIInternal as a last resort)
I'll try to make time to review this but it might be a while (pretty busy with work stuff). Just to set expectations, though:
By the way, are you making this change just for its own sake or do you have a use case where this matters a lot? If so, can you describe it? Thanks.
I am trying to improve the performance of @klauspost's zstd decompressor. zstd benchmarks are very hit and miss: there are a fair number of 1.2x improvements but just as many 0.8x regressions; it depends a lot on the files being processed. I'll try some of @klauspost's suggestions to make it faster on small inputs.
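For a sense of the use case (my own sketch, not zstd's or klauspost's actual code): a zstd decoder verifies the frame's content checksum, which is the low 32 bits of the XXH64 of the decompressed data, by streaming the output through a `Digest`, so `Digest.Write` throughput on block-sized inputs is what matters.

```
package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2"
)

func main() {
	d := xxhash.New() // streaming XXH64 state, implements hash.Hash64

	// A decoder feeds decompressed output block by block.
	for _, block := range [][]byte{
		[]byte("decompressed block 1"),
		[]byte("decompressed block 2"),
	} {
		_, _ = d.Write(block) // Write never returns an error
	}

	// Compare the low 32 bits against the checksum stored in the frame.
	fmt.Printf("content checksum: %#08x\n", uint32(d.Sum64()))
}
```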
I think I can get …* One of the issues is moving data from the SIMD-integer complex to the GP-integer complex; the latency on …
I can add that, but it's probable the …
Gotcha, thanks, that is useful motivation. Speeding up zstd is certainly an important and worthy goal.
This will effectively mean it will not be used. But it doesn't really change the fundamental problem - the regression on smaller sizes.
Blocks this small are a rare use case, and usually not a benefit in terms of compression anyway. I have forked xxhash to remove the use of unsafe, but it is still a good call to add this here to benefit everyone. It does mean we can tweak the usage of AVX-512 eventually.
Yeah. Thought of that, but forgot to add it. I tend to have a …

If you have extra time, one thing I saw is that you could load 64 bytes and do the … It seems like VPMULLQ can only execute on one port at a time, so even if … Since the … But.... Just a theory.
I tried that; it changes nothing. The critical dependency is on the state: the whole pre-state-mixing load and multiply doesn't even show up in the profile.
@Jorropo I was mainly thinking of Intel here. On AMD it can "hide" most of the (next loop) mul behind the 2-cycle latency of the add and rotate as well as the branch check. I assume you also checked processing speed. Generally be careful about trusting the profiles on an instruction level - the CPU often "stalls" in other places than the actual bottleneck. I often see unreasonably high load attributed to single instructions, where it is actually "collapsing" all of the unresolved OoO instructions rather than just waiting for that single instruction.
I did try that, not just looking at perf. For Intel I don't see how any of this makes sense: the scalar loop computes 4 multiplies in 6 cycles of latency with just one multiply pipe (the fourth IMUL issues on cycle 3 and, at 3-cycle latency, retires on cycle 6; with two pipes it is 4 cycles). It needs to share with the first multiply, but that is still faster than one AVX-512 multiply.
Ran the benchmark on Intel. AVX512 is strictly worse :/
Ran your branch without any modification, except …
It's actually terrible; I'm using the …
Yeah. Zen4+ is a rather narrow target right now. The latest timings I can find on Intel are for Rocket Lake. Still 15 cycles.
Keeps avo out of the dependency tree.
force-pushed from 4fb1336 to 24dee90
@Jorropo What are the numbers comparing non-avx512 use to master?
@klauspost
This is to work around the Go compiler's prove-pass limitations. It compiles to nothing, but it requires Go 1.17 since casting slices to array pointers was only added in Go 1.17.
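Roughly the kind of hint that commit describes (illustrative only; the helper name is invented): converting a slice to an array pointer does one length check up front and then lets the compiler drop per-index bounds checks.

```
package xxhash

// readChunk reinterprets the first 32 bytes of b as a fixed-size array.
// The slice-to-array-pointer conversion is a Go 1.17+ feature: it panics
// if len(b) < 32, and after that the prove pass knows the length, so
// loads through the returned pointer need no further bounds checks.
func readChunk(b []byte) *[32]byte {
	return (*[32]byte)(b)
}
```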
This didn't impact anything since the process would exit right away, but we would leak the file if w.WriteTo failed.
This makes the slide faster than anything else on most inputs. Extremely small inputs are slower, but only by a little.
I implemented a slide sum in pure Go in an effort to improve the regressions on small sizes, but I can't seem to beat … I'm gonna do proper benchmarks later.
For example, various pieces of the algorithm benefit from adding LUTs, but this only impacts extremely small buffers like …
I feel a tingling in the repos. :)
I fixed the performance regressions on small inputs (for both avx512 and non-avx512 with runtime detection) and improved the avx512 performance even more.
Great news: with the latest commits there is no performance regression (beyond inexplicable block-layout randomness) on my Zen2 (non-avx512):
The only outlier is … Here is the throughput on Zen4, improvements everywhere (*except 16B, which is tied):
I think I addressed all the issues that were brought up:
|
*I just noticed that …
This removes the extra overhead and is cycle-for-cycle tied with the pre-avx512-decision code on non-avx512 CPUs. The PCALIGN in the avx512 path improves performance from 28.5GiB/s to 30GiB/s on 4K, and from 25GiB/s to 28GiB/s on 10M. This also removes avo because it was getting in my way (I couldn't get it to jump to another function) and cespare expressed reservations about using it.
PCALIGN gives a ~12% performance improvement, so we really want it; if you are using an older version of Go, too bad, please upgrade.
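Not the PR's routine, just a minimal sketch of where the directive sits (the function and loop body are invented): PCALIGN emits NOP padding so that the following instruction starts on the requested boundary, which is why it goes immediately before the hot-loop label. On amd64 the directive needs Go 1.21+.

```
// Paired Go declaration (assumed): func sumLoop(n int64) int64
TEXT ·sumLoop(SB), NOSPLIT, $0-16
	MOVQ  n+0(FP), AX
	XORQ  CX, CX
	TESTQ AX, AX
	JZ    done
	PCALIGN $64 // pad with NOPs so the loop entry is 64-byte aligned
loop:
	ADDQ  AX, CX
	DECQ  AX
	JNZ   loop
done:
	MOVQ  CX, ret+8(FP)
	RET
```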
force-pushed from ec61def to 1bea645
```
CMPQ n, $3
JBE  loop_1
CMPQ n, $7
JBE  do_4
```
It seems like this would make more sense just before loop_8, with the branch below going to that, so this would also get run when the avx loop exits. Plus it would make the >31 start much faster and you avoid the CMPQ n, $8 when exiting the avx512 loop.

Maybe not have this at all? It seems like something that is faster in benchmarks only, where all the sizes are the same, so branches can be fully predicted.
Here are results on two different kinds of CPU (Zen2 without AVX512 support and Zen4 with AVX512 support):
For big sizes it's either no change or way faster.
I don't like how smaller sizes are impacted; all the things I tried made it worse because the overhead of meta-algorithm selection and function calls is very significant for these small buffers (50~25%).
The main solutions I see:

- Make the input escape to the heap.
- `GOAMD64=v4`. Not as accessible to end users, and it prevents testing the scalar impl when `GOAMD64=v4` is enabled (because it works by using a `const useAvx512 bool` override that can't be turned off in tests). A sketch of what this would look like is below.
- `ABIInternal`. Can break in the future, not covered by the compat promise.
- `Sum64_13(seed uint64, a uint64, b uint32, c uint8)`. Very cumbersome to use.
- ¯\_(ツ)_/¯
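For the `GOAMD64=v4` option mentioned above, a sketch of what the constant override could look like (file contents invented for illustration; `amd64.v4` is the standard build constraint satisfied when building with `GOAMD64=v4`):

```
//go:build amd64.v4

package xxhash

// x86-64-v4 guarantees AVX-512F/BW/CD/DQ/VL, so no runtime detection is
// needed and the compiler can delete the scalar branch at compile time.
const useAvx512 = true
```

The non-v4 build would keep a `var useAvx512` set from runtime CPU feature detection, which is exactly what makes the scalar path untestable once the constant version is selected.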