Audit and optimize usage of bit shifting #30674
Comments
Any update on this?
Could we define the bit shifts of the usual types to always mask this, so we don't have to do it throughout the entire codebase?
That would be a somewhat breaking change, since the behavior is different. That said, for constant-sized shifts, I'd expect our compiler to fix this already.
Why would this be breaking? Oh, we return 0 for values above 63, got it.
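To make the behavioral difference concrete, here is a minimal sketch comparing the two semantics on `UInt64`. The helper names `shl_plain` and `shl_masked` are hypothetical and exist only for illustration; they are not part of Base.

```julia
# Hypothetical helpers for illustration; not part of Base.
shl_plain(x::UInt64, k::Int)  = x << k          # current Julia semantics
shl_masked(x::UInt64, k::Int) = x << (k & 63)   # count reduced modulo 64

x = UInt64(1)
for k in (3, 63, 64, 65, -1)
    # For 0 <= k <= 63 both agree; outside that range they diverge.
    println("k = ", k, ": plain = ", shl_plain(x, k), ", masked = ", shl_masked(x, k))
end
```

The two versions agree for counts in `0:63`. They differ for larger counts (plain gives 0, masked wraps around) and for negative counts (plain shifts in the opposite direction, masked treats `-1` as `63`), which is why masking by default would change observable behavior.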
At least on 1.8 it seems we leave about 10% on the table. I will try on latest master. But if we do leave something here we might want to figure out what.
On the M1 we are missing about 10% here, on master.
@btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd($A[n], $A[n+1]) end; s)
10.521 ms (0 allocations: 0 bytes)
753392
@btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd_masked($A[n], $A[n+1]) end; s)
9.300 ms (0 allocations: 0 bytes)
The IR difference is as follows. On master:
%81 = lshr i64 %79, %80
%82 = icmp ugt i64 %80, 63
%83 = select i1 %82, i64 0, i64 %81
with the mask
%81 = and i64 %80, 63
%82 = lshr i64 %79, %81
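One way to reproduce this kind of comparison on a smaller scale is to print the LLVM IR of two one-line functions. This is only a sketch: `shr_plain` and `shr_masked` are hypothetical names, and the exact IR printed depends on the Julia and LLVM versions in use.

```julia
using InteractiveUtils  # exports code_llvm; loaded automatically in the REPL

# Hypothetical one-liners isolating the shift; not taken from Base.
shr_plain(x::UInt64, k::UInt64)  = x >> k          # expected to need a range check on k
shr_masked(x::UInt64, k::UInt64) = x >> (k & 63)   # count is provably in 0:63

# Print the optimized IR without debug-info comments for easier diffing.
code_llvm(stdout, shr_plain,  Tuple{UInt64,UInt64}; debuginfo=:none)
code_llvm(stdout, shr_masked, Tuple{UInt64,UInt64}; debuginfo=:none)
```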
As discussed in this Discourse thread, bit shifting in Julia differs from how it's done natively: Julia supports shift counts larger than the word size in bits, whereas 64-bit CPUs natively mask the shift count with 63. This means that Julia's shift operator doesn't translate very well to native code:
Luckily, there's an easy way to improve this; masking the count to 6 bits (`n << (k&63)`) generates efficient native code:
A (very artificial) benchmark showing how this can improve performance:
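As a rough illustration of such a micro-benchmark, the following sketch times a loop of variable shifts with and without the mask, using only Base. The function names and array sizes are assumptions chosen for illustration, and the measured difference (if any) will depend on the CPU and Julia version.

```julia
# Hypothetical micro-benchmark; names and sizes are illustrative only.
function sum_shifts(xs::Vector{UInt64}, ks::Vector{Int})
    s = zero(UInt64)
    @inbounds for i in eachindex(xs, ks)
        s += xs[i] << ks[i]            # Julia semantics: any count is handled
    end
    return s
end

function sum_shifts_masked(xs::Vector{UInt64}, ks::Vector{Int})
    s = zero(UInt64)
    @inbounds for i in eachindex(xs, ks)
        s += xs[i] << (ks[i] & 63)     # count masked to 6 bits
    end
    return s
end

xs = rand(UInt64, 10^6)
ks = rand(0:63, 10^6)                  # in-range counts, so both versions agree

sum_shifts(xs, ks); sum_shifts_masked(xs, ks)   # warm up (compile) before timing
t_plain  = @elapsed sum_shifts(xs, ks)
t_masked = @elapsed sum_shifts_masked(xs, ks)
println("plain:  ", t_plain, " s")
println("masked: ", t_masked, " s")
```

Because the counts are drawn from `0:63`, both functions return the same sum, so only the generated code differs.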
To test if this can lead to improvements in actual code, I grepped for the string `>>` in the base source, looked for functions that did non-masked shifting with a variable, and arbitrarily chose `gcd` in `intfuncs.jl`:
The shifts can be masked like this:
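As a minimal sketch of what such masking could look like, here is a binary GCD for `Int64` with every variable shift masked. It is an illustration and not necessarily the exact code benchmarked in this issue; the name `gcd_masked` is borrowed from the comments, and overflow handling for `typemin(Int64)` is simplified.

```julia
# Sketch of a binary (Stein) GCD with masked shift counts.
# Masking is safe here: trailing_zeros of a nonzero Int64/UInt64 is at most 63.
function gcd_masked(a::Int64, b::Int64)
    a == 0 && return abs(b)
    b == 0 && return abs(a)
    za = trailing_zeros(a)
    zb = trailing_zeros(b)
    k = min(za, zb)                       # shared power of two
    u = unsigned(abs(a >> (za & 63)))     # odd part of |a|
    v = unsigned(abs(b >> (zb & 63)))     # odd part of |b|
    while u != v
        if u > v
            u, v = v, u
        end
        v -= u                            # v is now even and nonzero
        v >>= (trailing_zeros(v) & 63)    # strip factors of two
    end
    r = u << (k & 63)
    r > typemax(Int64) && throw(OverflowError("gcd($a, $b) overflows"))
    return r % Int64
end
```

For inputs where Base's `gcd` does not overflow, the sketch is intended to return the same result, for example `gcd_masked(12, 18) == gcd(12, 18)`.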
Comparing performance with some random ints:
From ~191 ns per call to ~98 ns per call, or a 1.95x improvement. So, unless I'm missing something in this benchmark, there seems to be some room for improvement in base code.
The idea for this issue is to audit the usage of bit shifting and identify places where major improvements can be made (such as `gcd`). I don't know if it makes sense to actually do any code changes as part of this issue, or if that's better left to separate issues. There's also this suggestion from the Discourse thread:
Note: The above benchmark and improvement in `gcd` were observed on a 2.9 GHz Skylake CPU. Rerunning the same benchmark on a 2.6 GHz Broadwell CPU, both versions ran in about 28 ms, with almost no improvement for the masked one. I haven't looked into why.