fast broadcast over combinations of tuples and scalars #20817
Conversation
Could the same trick as in #19879 be used for the specialization on the type as the first argument? |
Possibly, yes. Will have a look later (unless someone beats me to it! :) ). (Edit: A naive equivalent doesn't serve unfortunately. Perhaps @pabloferz has an idea?) Best! |
Something like this should close the final gap:
newtuplebc(f, A, Bs...) = ntuple(k -> f(Base.Broadcast._broadcast_getindex(A, k), _tbcgetargs(Bs, k)...), _tbcreslength(A, Bs))
newtuplebc{T}(f, ::Type{T}, Bs...) = ntuple(k -> f(Base.Broadcast._broadcast_getindex(T, k), _tbcgetargs(Bs, k)...), _tbcreslength(T, Bs))
@inline _tbcgetargs(As, k) = (Base.Broadcast._broadcast_getindex(first(As), k), _tbcgetargs(Base.tail(As), k)...)
@inline _tbcgetargs(::Tuple{}, k) = ()
_tbcreslength(A, Bs) = (Base.@_noinline_meta; _tbcmaxlength(_tbclength(A), Bs))
@inline _tbcmaxlength(l, As) = _tbcmaxlength(max(l, _tbclength(first(As))), Base.tail(As))
@inline _tbcmaxlength(l, ::Tuple{}) = l
@inline _tbclength(t::Tuple) = length(t)
@inline _tbclength(s) = 1
The problem is that the arguments to the anonymous function don't specialize for a |
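For illustration, the sketch above could be exercised along these lines (a hypothetical usage example, assuming the definitions above are loaded; the tuple values are arbitrary and not from the original comment):

tup = (1.5, 2.5, 3.5)

newtuplebc(+, tup, 1)        # broadcast over a tuple and a scalar -> (2.5, 3.5, 4.5)
newtuplebc(round, Int, tup)  # type-first-argument form, as benchmarked below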
The CI failure is #20818 |
Added a method explicitly specialized for type-first-arguments similar to the sketch above:
julia> @benchmark newtuplebc(round, Int, $tup)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 27.367 ns (0.00% GC)
median time: 27.806 ns (0.00% GC)
mean time: 32.810 ns (10.01% GC)
maximum time: 2.976 μs (96.52% GC)
--------------
samples: 10000
evals/sample: 995
time tolerance: 5.00%
memory tolerance: 1.00%
Thanks! |
base/broadcast.jl
Outdated
@inline _tbcmaxlength(l, As) = _tbcmaxlength(max(l, _tbclength(first(As))), tail(As))
@inline _tbcmaxlength(l, ::Tuple{}) = l
@inline _tbclength(t::Tuple) = length(t)
@inline _tbclength(s) = 1
please do not use acronyms / abbreviations in method names
Suggestions? _tuplebroadcast_maxlength etc. seemed a bit long :).
Seems clear to me!
As in _tbc[...] seems clear to you, or _tuplebroadcast_[...] seems clear to you and not too long? Thanks!
Sorry, I meant the long version. Too long isn't that bad for complex code that appears infrequently and internally - hopefully the greater the number of people who can follow the workings of broadcast, the better maintained it will be. |
Restarted travis. |
(Only AV failures were #20818.) |
I put a benchmark tag here because it doesn't seem that there are any real tuple broadcasting benchmarks in BaseBenchmarks right now. |
Names expanded. What say you, @nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels |
I could not reproduce any of the possible regressions locally, but for good measure: @nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels |
(The combined results suggest the identified potential regressions are noise.) |
@inline _tuplebroadcast_getargs(::Tuple{}, k) = ()
@inline _tuplebroadcast_getargs(As, k) =
    (_broadcast_getindex(first(As), k), _tuplebroadcast_getargs(tail(As), k)...)
@noinline _tuplebroadcast_reslength(As) =
If inlining the length is the problem, then just defining
@noinline _tuplebroadcast_reslength(As) = length(broadcast_indices(As...)[1])
instead of these should do (it might be even faster).
If inlining the length is the problem, then just defining
@noinline _tuplebroadcast_reslength(As) = length(broadcast_indices(As...)[1])
instead of these should do (it might be even faster).
Inlining of the length determination code was most but not all of the problem. The length determination code in this pull request is substantially faster than the length(broadcast_indices(...)) construction. For example,
using Base.tail
using Base.Broadcast: _broadcast_getindex, broadcast_indices
@inline _tuplebroadcast_getargs(::Tuple{}, k) = ()
@inline _tuplebroadcast_getargs(As, k) =
(_broadcast_getindex(first(As), k), _tuplebroadcast_getargs(tail(As), k)...)
tuplebc(f, ::Type{Tuple}, As...) =
ntuple(k -> f(_tuplebroadcast_getargs(As, k)...), _tuplebroadcast_reslength(As))
@noinline _tuplebroadcast_reslength(As) =
_tuplebroadcast_maxlength(_tuplebroadcast_length(first(As)), tail(As))
@inline _tuplebroadcast_maxlength(l, As) =
_tuplebroadcast_maxlength(max(l, _tuplebroadcast_length(first(As))), tail(As))
@inline _tuplebroadcast_maxlength(l, ::Tuple{}) = l
@inline _tuplebroadcast_length(t::Tuple) = length(t)
@inline _tuplebroadcast_length(s) = 1
tuplebc2(f, ::Type{Tuple}, As...) =
ntuple(k -> f(_tuplebroadcast_getargs(As, k)...), _tuplebroadcast_reslength2(As))
@noinline _tuplebroadcast_reslength2(As) = length(broadcast_indices(As...)[1])
tup = (1., 2., 3.)
using BenchmarkTools
@benchmark tuplebc(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
@benchmark tuplebc2(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
yields
julia> @benchmark tuplebc(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 36.073 ns (0.00% GC)
median time: 37.772 ns (0.00% GC)
mean time: 43.407 ns (7.76% GC)
maximum time: 3.230 μs (95.74% GC)
--------------
samples: 10000
evals/sample: 993
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark tuplebc2(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 70.027 ns (0.00% GC)
median time: 70.797 ns (0.00% GC)
mean time: 75.237 ns (0.00% GC)
maximum time: 523.711 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 976
time tolerance: 5.00%
memory tolerance: 1.00%
i.e. the length determination code in this pull request reduces runtime by roughly a factor of two. Best!
(Making the length determination still faster might be possible by excluding (at compile time) non-tuples from the length comparisons. I wrote the length determination that way originally, but later simplified that code to what you see now, not being certain the performance benefit (if any; I did not benchmark it carefully) justifies the additional code complexity.) Edit: Just benchmarked the alternative length determination approach mentioned in this comment. The performance difference (if any) does not seem appreciable (in the noise). Best! |
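For concreteness, a rough sketch of what such compile-time exclusion of non-tuples could look like (hypothetical helper names; this is not the earlier implementation referenced above, just an illustration of the idea):

using Base: tail

# Only tuple arguments contribute to the length comparison; scalars are dropped
# by dispatch rather than being compared via max on a placeholder length of 1.
@inline _skiplength(l, t::Tuple) = max(l, length(t))
@inline _skiplength(l, s) = l
@inline _maxlength_skipping(l, ::Tuple{}) = l
@inline _maxlength_skipping(l, As) = _maxlength_skipping(_skiplength(l, first(As)), tail(As))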
If there's no difference in performance, I'd go with the simpler approach, but I leave it to your criterion ;). |
Agreed :). (To make certain we're on the same page: There is no appreciable performance difference between the implementation in this pull request and an earlier, more complex implementation described in #20817 (comment) and similar to the |
Yep :) |
This is just a performance improvement and should be good to merge, right? |
Made a PR to benchmark this: JuliaCI/BaseBenchmarks.jl#66 |
Thanks all! |
For reference:
julia> t1, t2, t3, t4 = (rand(3)...), (rand(5)...), (rand(10)...), (rand(15)...);
julia> f_round(v) = round.(Int, v);
julia> for t in (t1,t2,t3,t4)
@btime f_round($t)
end
13.865 ns (1 allocation: 32 bytes)
26.595 ns (1 allocation: 48 bytes)
408.035 ns (3 allocations: 192 bytes)
890.682 ns (5 allocations: 368 bytes)
This is likely just the standard long-tuple problem, though. |
This pull request improves the performance of broadcast over combinations of tuples and scalars, reducing runtime and allocation count/volume by as much as multiple orders of magnitude. (In the benchmarks above, newtuplebc is the implementation in this pull request.) This pull request's implementation now seems limited by the performance of ntuple. Though this pull request substantially improves the test case from #20802, some performance remains left on the table, which I conjecture is due to specialization behavior with type arguments. (Notably, the forced inlining of broadcast_indices and broadcast_shape in the existing implementation kills performance; achieving consistently high performance required forcing non-inlining of the result-length-determining code. I wonder whether forced inlining of broadcast_indices and broadcast_shape might be hurting small-input performance for some other broadcast methods?) Best!
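As a quick illustration of the call patterns this pull request targets (a usage sketch only, not part of the pull request; the values are arbitrary):

tup = (1.0, 2.0, 3.0)

broadcast(+, tup, 1)       # tuple and scalar     -> (2.0, 3.0, 4.0)
broadcast(*, tup, tup, 2)  # tuples and a scalar  -> (2.0, 8.0, 18.0)
round.(Int, tup)           # type-first-argument case, cf. the benchmarks above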