
Audit and optimize usage of bit shifting #30674

Open
maxbennedich opened this issue Jan 10, 2019 · 5 comments
Labels
performance Must go faster

Comments

@maxbennedich
Contributor

As discussed in this discourse thread, bit shifting in Julia differs from native shift instructions: Julia supports shift counts larger than the word bit size (returning 0, or -1 for arithmetic right shifts of negative numbers), whereas natively the shift count is masked with 63 on 64-bit CPUs. This means that Julia's shift operators don't translate very well to native code:

julia> code_native((n,k) -> n << k, (Int,Int); syntax=:intel)
	xor    eax, eax
	cmp    rsi, 63
	shlx   rcx, rdi, rsi
	cmovbe rax, rcx
	mov    rcx, rsi
	neg    rcx
	cmp    rcx, 63
	jb     L29
	mov    cl, 63
L29:
	test   rsi, rsi
	sarx   rcx, rdi, rcx
	cmovs  rax, rcx
	ret
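The extra compare/branch instructions above implement Julia's out-of-range semantics, which are easy to confirm at the REPL (a minimal sketch, assuming 64-bit Int; the "hardware" behavior shown is what x86-64 and AArch64 do for 64-bit shifts):

```julia
# Julia defines out-of-range shift counts: the result is 0 (or -1 for an
# arithmetic right shift of a negative number), not whatever the CPU does.
@assert (1 << 64) == 0            # count >= 64: everything shifted out
@assert (-8 >> 64) == -1          # arithmetic shift saturates to the sign bit

# Masking the count reproduces the hardware behavior instead:
@assert (1 << (64 & 63)) == 1     # 64 & 63 == 0, so no shift at all
@assert (1 << (65 & 63)) == 2     # 65 & 63 == 1, so shift by one
```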

Luckily, there's an easy way to improve this: masking the count to 6 bits (n << (k & 63)) generates efficient native code:

julia> code_native((n,k) -> n << (k&63), (Int,Int); syntax=:intel)
	shlx	rax, rdi, rsi
	ret

A (very artificial) benchmark showing how this can improve performance:

julia> A = rand(0:63, 100_000);

julia> @btime (s = 0; @inbounds @simd for k = 1:length($A); s += k << $A[k]; end; s)
  42.788 μs (0 allocations: 0 bytes)
-2271119849451809947

julia> @btime (s = 0; @inbounds @simd for k = 1:length($A); s += k << ($A[k] & 63); end; s)
  15.195 μs (0 allocations: 0 bytes)
-2271119849451809947

To test whether this can lead to improvements in actual code, I grepped for the string >> in the base source, looked for functions that did non-masked shifting with a variable count, and arbitrarily chose gcd in intfuncs.jl:

# binary GCD (aka Stein's) algorithm
# about 1.7x (2.1x) faster for random Int64s (Int128s)
function gcd(a::T, b::T) where T<:Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64,Int128,UInt128}
    @noinline throw1(a, b) = throw(OverflowError("gcd($a, $b) overflows"))
    a == 0 && return abs(b)
    b == 0 && return abs(a)
    za = trailing_zeros(a)
    zb = trailing_zeros(b)
    k = min(za, zb)
    u = unsigned(abs(a >> za))
    v = unsigned(abs(b >> zb))
    while u != v
        if u > v
            u, v = v, u
        end
        v -= u
        v >>= trailing_zeros(v)
    end
    r = u << k
    # T(r) would throw InexactError; we want OverflowError instead
    r > typemax(T) && throw1(a, b)
    r % T
end

The shifts can be masked like this:

function gcd_masked(a::T, b::T) where T<:Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64,Int128,UInt128}
    @noinline throw1(a, b) = throw(OverflowError("gcd($a, $b) overflows"))
    a == 0 && return abs(b)
    b == 0 && return abs(a)
    za = trailing_zeros(a)
    zb = trailing_zeros(b)
    k = min(za, zb)
    bits = sizeof(T) << 3
    u = unsigned(abs(a >> (za % bits)))
    v = unsigned(abs(b >> (zb % bits)))
    while u != v
        if u > v
            u, v = v, u
        end
        v -= u
        v >>= (trailing_zeros(v) % bits)
    end
    r = u << (k % bits)
    # T(r) would throw InexactError; we want OverflowError instead
    r > typemax(T) && throw1(a, b)
    r % T
end
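A quick sanity check (hypothetical, not part of the proposal) that the masking above is a no-op: for any nonzero value, trailing_zeros is strictly less than the bit width, so the masked count always equals the unmasked one:

```julia
# The `% bits` mask in gcd_masked never changes a value, because for any
# nonzero x we have 0 <= trailing_zeros(x) < 8*sizeof(x).
for T in (Int8, UInt16, Int32, UInt64)
    bits = 8 * sizeof(T)
    for _ in 1:1_000
        x = rand(T)
        iszero(x) && continue        # trailing_zeros(0) == bits, handled earlier in gcd
        tz = trailing_zeros(x)
        @assert 0 <= tz < bits       # count already in range
        @assert tz % bits == tz      # masking is a no-op
    end
end
```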

Comparing performance with some random ints:

julia> A = rand(Int, 100_000);

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd($A[n], $A[n+1]) end; s)
  19.087 ms (0 allocations: 0 bytes)
1946053

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd_masked($A[n], $A[n+1]) end; s)
  9.785 ms (0 allocations: 0 bytes)
1946053

From ~191 ns per call to ~98 ns per call, or a 1.95x improvement. So, unless I'm missing something in this benchmark, there seems to be some room for improvement in base code.

The idea for this issue is to audit the usage of bit shifting and identify places where major improvements can be made (such as gcd). I don't know if it makes sense to actually do any code changes as part of this issue, or if that's better left to separate issues. There's also this suggestion from the discourse thread:

Can we put that into the docstring of the shift operators? Especially x>>trailing_zeros(x) is an important idiom.
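As a concrete illustration of that idiom (a sketch using the suggested mask; remove_trailing_zeros is a hypothetical helper name, and the literals assume 64-bit Int):

```julia
# The mask is safe because trailing_zeros(x) <= 63 for any nonzero Int64,
# and for x == 0 both the masked and unmasked shifts yield 0 anyway.
remove_trailing_zeros(x::Int64) = x >> (trailing_zeros(x) & 63)

@assert remove_trailing_zeros(40) == 5     # 40 == 5 * 2^3
@assert remove_trailing_zeros(-24) == -3   # arithmetic shift, sign preserved
@assert remove_trailing_zeros(7) == 7      # already odd
@assert remove_trailing_zeros(0) == 0      # trailing_zeros(0) == 64; 64 & 63 == 0
```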

Note: The above benchmark and improvement in gcd was observed on a 2.9 GHz Skylake CPU. Rerunning the same benchmark on a 2.6 GHz Broadwell CPU, both versions ran in about ~28 ms, with almost no improvement for the masked one. I haven't looked into why.

@JeffBezanson added the performance Must go faster label Jan 10, 2019
@stevengj
Member

stevengj commented Jan 4, 2023

Any update on this?

@gbaraldi
Member

gbaraldi commented Jan 4, 2023

Could we define the bit shifts of the usual types to always mask the count, so we don't have to do this throughout the codebase? @oscardssmith

@oscardssmith
Member

That would be a somewhat breaking change, since the behavior is different. That said, for constant shift counts, I'd expect our compiler to fix this already.
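That expectation is easy to spot-check (a sketch; the exact native code varies by CPU, but with a literal count the compiler can prove k < 64, so the range check folds away and only variable counts pay for the branch):

```julia
# Constant shift count: LLVM folds the `count >= 64` check at compile time,
# so this compiles to a single shift instruction, just like the masked form.
shift3(n::Int) = n << 3

@assert shift3(5) == 40
@assert shift3(-1) == -8
# code_native(shift3, (Int,)) shows a bare shift, with no compare/select.
```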

@gbaraldi
Member

gbaraldi commented Jan 4, 2023

Why would this be breaking?

Oh, we return 0 for values above 63, got it.

@gbaraldi
Member

gbaraldi commented Jan 4, 2023

At least on 1.8 it seems we leave about 10% on the table; I will try on latest master. But if we do leave something here, we might want to figure out what.

On the M1 we are missing about 10% here, on master.

@btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd($A[n], $A[n+1]) end; s)
  10.521 ms (0 allocations: 0 bytes)
753392

@btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd_masked($A[n], $A[n+1]) end; s)
  9.300 ms (0 allocations: 0 bytes)

The IR difference is:

master:
  %81 = lshr i64 %79, %80
  %82 = icmp ugt i64 %80, 63
  %83 = select i1 %82, i64 0, i64 %81

with the mask:
  %81 = and i64 %80, 63
  %82 = lshr i64 %79, %81
