
Audit and optimize usage of bit shifting #30674

Open
maxbennedich opened this issue Jan 10, 2019 · 5 comments
Labels
performance Must go faster

Comments

@maxbennedich
Contributor

As discussed in this discourse thread, bit shifting in Julia differs from native shift instructions: Julia supports shift counts larger than the word bit size (returning 0, or -1 for arithmetic right shifts of negative numbers), whereas natively the shift count is masked with 63 on 64-bit CPUs. This means that Julia's shift operators don't translate very well to native code:

julia> code_native((n,k) -> n << k, (Int,Int); syntax=:intel)
	xor    eax, eax
	cmp    rsi, 63
	shlx   rcx, rdi, rsi
	cmovbe rax, rcx
	mov    rcx, rsi
	neg    rcx
	cmp    rcx, 63
	jb     L29
	mov    cl, 63
L29:
	test   rsi, rsi
	sarx   rcx, rdi, rcx
	cmovs  rax, rcx
	ret
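The extra compare/branch instructions above implement Julia's out-of-range semantics, which are easy to confirm at the REPL (a minimal sketch, assuming 64-bit Int; the "hardware" behavior shown is what x86-64 and AArch64 do for 64-bit shifts):

```julia
# Julia defines out-of-range shift counts: the result is 0 (or -1 for an
# arithmetic right shift of a negative number), not whatever the CPU does.
@assert (1 << 64) == 0            # count >= 64: everything shifted out
@assert (-8 >> 64) == -1          # arithmetic shift saturates to the sign bit

# Masking the count reproduces the hardware behavior instead:
@assert (1 << (64 & 63)) == 1     # 64 & 63 == 0, so no shift at all
@assert (1 << (65 & 63)) == 2     # 65 & 63 == 1, so shift by one
```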

Luckily, there's an easy way to improve this: masking the count to 6 bits (n << (k & 63)) generates efficient native code:

julia> code_native((n,k) -> n << (k&63), (Int,Int); syntax=:intel)
	shlx	rax, rdi, rsi
	ret

A (very artificial) benchmark showing how this can improve performance:

julia> A = rand(0:63, 100_000);

julia> @btime (s = 0; @inbounds @simd for k = 1:length($A); s += k << $A[k]; end; s)
  42.788 μs (0 allocations: 0 bytes)
-2271119849451809947

julia> @btime (s = 0; @inbounds @simd for k = 1:length($A); s += k << ($A[k] & 63); end; s)
  15.195 μs (0 allocations: 0 bytes)
-2271119849451809947

To test whether this can lead to improvements in actual code, I grepped for the string >> in the base source, looked for functions that did non-masked shifting with a variable count, and arbitrarily chose gcd in intfuncs.jl:

# binary GCD (aka Stein's) algorithm
# about 1.7x (2.1x) faster for random Int64s (Int128s)
function gcd(a::T, b::T) where T<:Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64,Int128,UInt128}
    @noinline throw1(a, b) = throw(OverflowError("gcd($a, $b) overflows"))
    a == 0 && return abs(b)
    b == 0 && return abs(a)
    za = trailing_zeros(a)
    zb = trailing_zeros(b)
    k = min(za, zb)
    u = unsigned(abs(a >> za))
    v = unsigned(abs(b >> zb))
    while u != v
        if u > v
            u, v = v, u
        end
        v -= u
        v >>= trailing_zeros(v)
    end
    r = u << k
    # T(r) would throw InexactError; we want OverflowError instead
    r > typemax(T) && throw1(a, b)
    r % T
end

The shifts can be masked like this:

function gcd_masked(a::T, b::T) where T<:Union{Int8,UInt8,Int16,UInt16,Int32,UInt32,Int64,UInt64,Int128,UInt128}
    @noinline throw1(a, b) = throw(OverflowError("gcd($a, $b) overflows"))
    a == 0 && return abs(b)
    b == 0 && return abs(a)
    za = trailing_zeros(a)
    zb = trailing_zeros(b)
    k = min(za, zb)
    bits = sizeof(T) << 3
    u = unsigned(abs(a >> (za % bits)))
    v = unsigned(abs(b >> (zb % bits)))
    while u != v
        if u > v
            u, v = v, u
        end
        v -= u
        v >>= (trailing_zeros(v) % bits)
    end
    r = u << (k % bits)
    # T(r) would throw InexactError; we want OverflowError instead
    r > typemax(T) && throw1(a, b)
    r % T
end
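A quick sanity check (hypothetical, not part of the proposal) that the masking above is a no-op: for any nonzero value, trailing_zeros is strictly less than the bit width, so the masked count always equals the unmasked one:

```julia
# The `% bits` mask in gcd_masked never changes a value, because for any
# nonzero x we have 0 <= trailing_zeros(x) < 8*sizeof(x).
for T in (Int8, UInt16, Int32, UInt64)
    bits = 8 * sizeof(T)
    for _ in 1:1_000
        x = rand(T)
        iszero(x) && continue        # trailing_zeros(0) == bits, handled earlier in gcd
        tz = trailing_zeros(x)
        @assert 0 <= tz < bits       # count already in range
        @assert tz % bits == tz      # masking is a no-op
    end
end
```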

Comparing performance with some random ints:

julia> A = rand(Int, 100_000);

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd($A[n], $A[n+1]) end; s)
  19.087 ms (0 allocations: 0 bytes)
1946053

julia> @btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd_masked($A[n], $A[n+1]) end; s)
  9.785 ms (0 allocations: 0 bytes)
1946053

From ~191 ns per call to ~98 ns per call, or a 1.95x improvement. So, unless I'm missing something in this benchmark, there seems to be some room for improvement in base code.

The idea for this issue is to audit the usage of bit shifting and identify places where major improvements can be made (such as gcd). I don't know if it makes sense to actually do any code changes as part of this issue, or if that's better left to separate issues. There's also this suggestion from the discourse thread:

Can we put that into the docstring of the shift operators? Especially x>>trailing_zeros(x) is an important idiom.
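As a concrete illustration of that idiom (a sketch using the suggested mask; remove_trailing_zeros is a hypothetical helper name, and the literals assume 64-bit Int):

```julia
# The mask is safe because trailing_zeros(x) <= 63 for any nonzero Int64,
# and for x == 0 both the masked and unmasked shifts yield 0 anyway.
remove_trailing_zeros(x::Int64) = x >> (trailing_zeros(x) & 63)

@assert remove_trailing_zeros(40) == 5     # 40 == 5 * 2^3
@assert remove_trailing_zeros(-24) == -3   # arithmetic shift, sign preserved
@assert remove_trailing_zeros(7) == 7      # already odd
@assert remove_trailing_zeros(0) == 0      # trailing_zeros(0) == 64; 64 & 63 == 0
```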

Note: The above benchmark and improvement in gcd was observed on a 2.9 GHz Skylake CPU. Rerunning the same benchmark on a 2.6 GHz Broadwell CPU, both versions ran in about ~28 ms, with almost no improvement for the masked one. I haven't looked into why.

@JeffBezanson added the performance Must go faster label Jan 10, 2019
@stevengj
Member

stevengj commented Jan 4, 2023

Any update on this?

@gbaraldi
Member

gbaraldi commented Jan 4, 2023

Could we define the bit shifts of the usual types to always mask the count, so we don't have to do this throughout the codebase? @oscardssmith

@oscardssmith
Member

That would be a somewhat breaking change, since the behavior is different. That said, for constant shift counts, I'd expect our compiler to fix this already.
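That expectation is easy to spot-check (a sketch; the exact native code varies by CPU, but with a literal count the compiler can prove k < 64, so the range check folds away and only variable counts pay for the branch):

```julia
# Constant shift count: LLVM folds the `count >= 64` check at compile time,
# so this compiles to a single shift instruction, just like the masked form.
shift3(n::Int) = n << 3

@assert shift3(5) == 40
@assert shift3(-1) == -8
# code_native(shift3, (Int,)) shows a bare shift, with no compare/select.
```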

@gbaraldi
Member

gbaraldi commented Jan 4, 2023

Why would this be breaking?

Oh, we return 0 for values above 63, got it.

@gbaraldi
Member

gbaraldi commented Jan 4, 2023

At least on 1.8 it seems we leave about 10% on the table; I will try on latest master. But if we do leave something here, we might want to figure out what.

On the M1 we are missing about 10% here, on master.

@btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd($A[n], $A[n+1]) end; s)
  10.521 ms (0 allocations: 0 bytes)
753392

@btime (s = 0; @inbounds for n = 1:length($A)-1 s += gcd_masked($A[n], $A[n+1]) end; s)
  9.300 ms (0 allocations: 0 bytes)

The IR difference is:

master:
  %81 = lshr i64 %79, %80
  %82 = icmp ugt i64 %80, 63
  %83 = select i1 %82, i64 0, i64 %81

with the mask:
  %81 = and i64 %80, 63
  %82 = lshr i64 %79, %81
