
fast broadcast over combinations of tuples and scalars #20817

Merged (1 commit) on Mar 6, 2017

Conversation

@Sacha0 (Member) commented Feb 26, 2017

This pull request improves the performance of broadcast over combinations of tuples and scalars, reducing runtime and allocation count/volume by as much as multiple orders of magnitude. For example, with newtuplebc denoting this pull request's implementation:

julia> tup = (1., 2., 3.);

julia> using BenchmarkTools

julia> @benchmark broadcast(+, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  7.56 kb
  allocs estimate:  167
  minimum time:     13.36 μs (0.00% GC)
  median time:      14.32 μs (0.00% GC)
  mean time:        15.79 μs (5.22% GC)
  maximum time:     3.34 ms (96.46% GC)

julia> @benchmark newtuplebc(+, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     993
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  32.00 bytes
  allocs estimate:  1
  minimum time:     34.00 ns (0.00% GC)
  median time:      34.00 ns (0.00% GC)
  mean time:        37.59 ns (4.93% GC)
  maximum time:     2.35 μs (95.75% GC)

This pull request's implementation now seems limited by the performance of ntuple:

julia> ntuplelimit(f, n, As...) = ntuple(k -> f(_tbcgetargs(As, k)...), n)
ntuplelimit (generic function with 1 method)

julia> @benchmark ntuplelimit(+, 3, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     993
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  32.00 bytes
  allocs estimate:  1
  minimum time:     33.00 ns (0.00% GC)
  median time:      33.00 ns (0.00% GC)
  mean time:        37.25 ns (5.31% GC)
  maximum time:     2.48 μs (96.05% GC)

Though this pull request substantially improves the test case from #20802,

julia> @benchmark broadcast(round, Int, $tup) # test case from #20802
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     5
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.05 kb
  allocs estimate:  31
  minimum time:     6.19 μs (0.00% GC)
  median time:      6.40 μs (0.00% GC)
  mean time:        6.81 μs (2.24% GC)
  maximum time:     895.49 μs (96.90% GC)

julia> @benchmark newtuplebc(round, Int, $tup) # test case from #20802
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     186
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  96.00 bytes
  allocs estimate:  5
  minimum time:     554.00 ns (0.00% GC)
  median time:      559.00 ns (0.00% GC)
  mean time:        588.48 ns (1.15% GC)
  maximum time:     14.21 μs (92.76% GC)

some performance remains on the table, which I conjecture is due to specialization behavior with type arguments:

julia> specntuple1(f, n, As...) = ntuple(k -> f(As[1], As[2][k]), n)
specntuple1 (generic function with 1 method)

julia> specntuple2(f, n, As...) = ntuple(k -> f(Int, As[2][k]), n)
specntuple2 (generic function with 1 method)

julia> @benchmark specntuple1(round, 3, Int, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     183
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  96.00 bytes
  allocs estimate:  5
  minimum time:     586.00 ns (0.00% GC)
  median time:      592.00 ns (0.00% GC)
  mean time:        626.54 ns (1.19% GC)
  maximum time:     15.39 μs (93.39% GC)

julia> @benchmark specntuple2(round, 3, Int, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     997
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  48.00 bytes
  allocs estimate:  2
  minimum time:     22.00 ns (0.00% GC)
  median time:      23.00 ns (0.00% GC)
  mean time:        28.05 ns (13.35% GC)
  maximum time:     2.73 μs (96.63% GC)

(Notably, the forced inlining of broadcast_indices and broadcast_shape in the existing implementation kills performance; achieving consistently high performance required forcing non-inlining of the result-length-determining code. I wonder whether forced inlining of broadcast_indices and broadcast_shape might be hurting small-input performance for some other broadcast methods?) Best!
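(Not part of the original PR; a hedged, simplified sketch of the inlining arrangement described in the parenthetical above: the result-length computation is kept out-of-line with `@noinline`, while the per-element accessors stay `@inline`. The helper names here are hypothetical, not the PR's actual code.)

```julia
# Hypothetical names; a simplified sketch of the inlining split described
# above, not the PR's actual implementation.
@inline _getarg(a::Tuple, k) = a[k]   # tuples index elementwise
@inline _getarg(a, k) = a             # scalars broadcast to every position
# Keeping the result-length computation out-of-line is what restored
# consistently high performance in the PR's measurements.
@noinline _reslength(As) = maximum(a isa Tuple ? length(a) : 1 for a in As)
sketchbc(f, As...) = ntuple(k -> f(map(a -> _getarg(a, k), As)...), _reslength(As))

sketchbc(+, (1, 2, 3), 10)  # (11, 12, 13)
```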

@Sacha0 Sacha0 added performance Must go faster broadcast Applying a function over a collection labels Feb 26, 2017
@KristofferC (Member) commented Feb 26, 2017

Could the same trick as in #19879 be used for the specialization on the type as the first argument?

@Sacha0 (Member, Author) commented Feb 26, 2017

> Could the same trick as in #19879 be used for the specialization on the type as the first argument?

Possibly, yes. Will have a look later (unless someone beats me to it! :) ). (Edit: A naive equivalent doesn't serve unfortunately. Perhaps @pabloferz has an idea?) Best!

@pabloferz (Contributor) commented Feb 26, 2017

Something like this should close the final gap

newtuplebc(f, A, Bs...) = ntuple(k -> f(Base.Broadcast._broadcast_getindex(A, k), _tbcgetargs(Bs, k)...), _tbcreslength(A, Bs))
newtuplebc{T}(f, ::Type{T}, Bs...) = ntuple(k -> f(Base.Broadcast._broadcast_getindex(T, k), _tbcgetargs(Bs, k)...), _tbcreslength(T, Bs))
@inline _tbcgetargs(As, k) = (Base.Broadcast._broadcast_getindex(first(As), k), _tbcgetargs(Base.tail(As), k)...)
@inline _tbcgetargs(::Tuple{}, k) = ()
_tbcreslength(A, Bs) = (Base.@_noinline_meta; _tbcmaxlength(_tbclength(A), Bs))
@inline _tbcmaxlength(l, As) = _tbcmaxlength(max(l, _tbclength(first(As))), Base.tail(As))
@inline _tbcmaxlength(l, ::Tuple{}) = l
@inline _tbclength(t::Tuple) = length(t)
@inline _tbclength(s) = 1

The problem is that the arguments to the anonymous function don't specialize for a Type (a DataType is inferred instead).
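(A minimal illustration of the specialization issue described above, using hypothetical names: when the type reaches the closure through a `Vararg` slot, it is inferred as `DataType`, so the call is not specialized on it; hoisting it into a `::Type{T}` static parameter restores specialization. Written in the post-0.6 `where` syntax.)

```julia
# Hypothetical names illustrating the specialization issue described above.
# Here Int flows through a Vararg slot, so inside the closure it is seen as
# a DataType and f(As[1], x) is not specialized on Int:
through_varargs(f, n, As...) = ntuple(k -> f(As[1], As[2][k]), n)
# A dedicated ::Type{T} method (a static parameter) restores specialization:
through_typeparam(f, ::Type{T}, n, B) where {T} = ntuple(k -> f(T, B[k]), n)

tup = (1.0, 2.0, 3.0)
through_varargs(round, 3, Int, tup)   # (1, 2, 3), but unspecialized and slower
through_typeparam(round, Int, 3, tup) # (1, 2, 3), specialized on Int
```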

@KristofferC (Member):

CI is #20818

@Sacha0 (Member, Author) commented Feb 27, 2017

Added a method explicitly specialized for type first arguments, similar to @pabloferz's suggestion above. Result:

julia> @benchmark newtuplebc(round, Int, $tup)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     27.367 ns (0.00% GC)
  median time:      27.806 ns (0.00% GC)
  mean time:        32.810 ns (10.01% GC)
  maximum time:     2.976 μs (96.52% GC)
  --------------
  samples:          10000
  evals/sample:     995
  time tolerance:   5.00%
  memory tolerance: 1.00%

Thanks!

@inline _tbcmaxlength(l, As) = _tbcmaxlength(max(l, _tbclength(first(As))), tail(As))
@inline _tbcmaxlength(l, ::Tuple{}) = l
@inline _tbclength(t::Tuple) = length(t)
@inline _tbclength(s) = 1
Review comment (Member):

please do not use acronyms / abbreviations in method names

@Sacha0 (Member, Author) replied:

Suggestions? _tuplebroadcast_maxlength etc seemed a bit long :).

Review comment (Member):
Seems clear to me!

@Sacha0 (Member, Author) replied Feb 28, 2017:

As in _tbc[...] seems clear to you, or _tuplebroadcast_[...] seems clear to you and not too long? Thanks!

Review comment (Member):

Sorry, I meant the long version. Too long isn't that bad for complex code that appears infrequently and only internally; hopefully, the more people who can follow the workings of broadcast, the better maintained it will be.

@KristofferC (Member):

6.19 μs -> 27 ns... preeetty good...

@KristofferC (Member):

Restarted travis.

@Sacha0 (Member, Author) commented Feb 27, 2017

(Only AV failures were #20818.)

@KristofferC (Member):

I put a benchmark tag here because it doesn't seem that there are any real tuple broadcasting benchmarks in BaseBenchmarks right now.

@KristofferC KristofferC added the potential benchmark Could make a good benchmark in BaseBenchmarks label Feb 27, 2017
@Sacha0 (Member, Author) commented Mar 2, 2017

Names expanded. What say you, @nanosoldier runbenchmarks(ALL, vs = ":master") ?

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@Sacha0 (Member, Author) commented Mar 3, 2017

I could not reproduce any of the possible regressions locally, but for good measure: @nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@Sacha0 (Member, Author) commented Mar 4, 2017

(The combined results suggest the identified potential regressions are noise.)

@inline _tuplebroadcast_getargs(::Tuple{}, k) = ()
@inline _tuplebroadcast_getargs(As, k) =
(_broadcast_getindex(first(As), k), _tuplebroadcast_getargs(tail(As), k)...)
@noinline _tuplebroadcast_reslength(As) =
@pabloferz (Contributor) commented Mar 4, 2017:

If inlining the length is the problem, then just defining

@noinline _tuplebroadcast_reslength(As) = length(broadcast_indices(As...)[1])

instead of these should do (it might be even faster).

@Sacha0 (Member, Author) replied:

> If inlining the length is the problem, then just defining
> `@noinline _tuplebroadcast_reslength(As) = length(broadcast_indices(As...)[1])`
> instead of these should do (it might be even faster).

Inlining of the length determination code was most but not all of the problem. The length determination code in this pull request is substantially faster than the length(broadcast_indices(...)) construction. For example,

using Base.tail
using Base.Broadcast: _broadcast_getindex, broadcast_indices

@inline _tuplebroadcast_getargs(::Tuple{}, k) = ()
@inline _tuplebroadcast_getargs(As, k) =
    (_broadcast_getindex(first(As), k), _tuplebroadcast_getargs(tail(As), k)...)

tuplebc(f, ::Type{Tuple}, As...) =
    ntuple(k -> f(_tuplebroadcast_getargs(As, k)...), _tuplebroadcast_reslength(As))
@noinline _tuplebroadcast_reslength(As) =
    _tuplebroadcast_maxlength(_tuplebroadcast_length(first(As)), tail(As))
@inline _tuplebroadcast_maxlength(l, As) =
    _tuplebroadcast_maxlength(max(l, _tuplebroadcast_length(first(As))), tail(As))
@inline _tuplebroadcast_maxlength(l, ::Tuple{}) = l
@inline _tuplebroadcast_length(t::Tuple) = length(t)
@inline _tuplebroadcast_length(s) = 1

tuplebc2(f, ::Type{Tuple}, As...) =
    ntuple(k -> f(_tuplebroadcast_getargs(As, k)...), _tuplebroadcast_reslength2(As))
@noinline _tuplebroadcast_reslength2(As) = length(broadcast_indices(As...)[1])

tup = (1., 2., 3.)

using BenchmarkTools

@benchmark tuplebc(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
@benchmark tuplebc2(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)

yields

julia> @benchmark tuplebc(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     36.073 ns (0.00% GC)
  median time:      37.772 ns (0.00% GC)
  mean time:        43.407 ns (7.76% GC)
  maximum time:     3.230 μs (95.74% GC)
  --------------
  samples:          10000
  evals/sample:     993
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark tuplebc2(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     70.027 ns (0.00% GC)
  median time:      70.797 ns (0.00% GC)
  mean time:        75.237 ns (0.00% GC)
  maximum time:     523.711 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     976
  time tolerance:   5.00%
  memory tolerance: 1.00%

That is, the length determination code in this pull request reduces runtime by roughly a factor of two. Best!

@Sacha0 (Member, Author) commented Mar 4, 2017

(Making the length determination still faster might be possible by excluding non-tuples from the length comparisons at compile time. I originally wrote the length determination that way, but later simplified it to what you see now, not being certain that the performance benefit (if any; I did not benchmark carefully) justified the additional code complexity.)

Edit: Just benchmarked the alternative length determination approach mentioned in this comment. The performance difference (if any) does not seem appreciable (in the noise). Best!
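(A hedged sketch, with hypothetical names, of the alternative length determination mentioned in the comment above: dispatch drops non-tuple arguments before the comparison, so the maximum runs only over actual tuple lengths.)

```julia
# Hypothetical names; sketch of the alternative length determination:
# dispatch filters out non-tuple (scalar) arguments at compile time,
# then the maximum is taken over tuple lengths only.
@inline _picktuples() = ()
@inline _picktuples(t::Tuple, As...) = (t, _picktuples(As...)...)
@inline _picktuples(s, As...) = _picktuples(As...)
@noinline _reslength_tuplesonly(As) = maximum(map(length, _picktuples(As...)))

_reslength_tuplesonly(((1, 2, 3), 1, (1, 2), 2))  # 3
```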

@pabloferz (Contributor) commented Mar 4, 2017

If there's no difference in performance, I'd go with the simpler approach, but I leave it to your criterion ;).

@Sacha0 (Member, Author) commented Mar 4, 2017

> If there's no difference in performance, I'd go with the simpler approach, but I leave it to your criterion ;).
Agreed :). (To make certain we're on the same page: There is no appreciable performance difference between the implementation in this pull request and an earlier, more complex implementation described in #20817 (comment) and similar to the max(tuplens(As)...) construction in #20802 (comment). There is, however, a substantial performance difference between the implementation in this pull request and the length(broadcast_indices(...)) construction. So the implementation in this pull request strikes me as best for now, being both simple and achieving best-known performance. Are we on the same page? :) ). Thanks!

@pabloferz (Contributor):

> Are we on the same page?

Yep :)

@Sacha0 Sacha0 added this to the 0.6.x milestone Mar 5, 2017
@KristofferC (Member) commented Mar 6, 2017

This is just a performance improvement and should be good to merge, right?

@pabloferz pabloferz removed this from the 0.6.x milestone Mar 6, 2017
@pabloferz pabloferz merged commit 015cc63 into JuliaLang:master Mar 6, 2017
@KristofferC (Member):

Made a PR to benchmark this: JuliaCI/BaseBenchmarks.jl#66

@Sacha0 (Member, Author) commented Mar 6, 2017

Thanks all!

@KristofferC (Member):

For reference:

julia> t1, t2, t3, t4 = (rand(3)...), (rand(5)...), (rand(10)...), (rand(15)...);

julia> f_round(v) = round.(Int, v);

julia> for t in (t1,t2,t3,t4)
         @btime f_round($t)
       end

  13.865 ns (1 allocation: 32 bytes)
  26.595 ns (1 allocation: 48 bytes)
  408.035 ns (3 allocations: 192 bytes)
  890.682 ns (5 allocations: 368 bytes)

This is likely the standard long-tuple problem, though.

@KristofferC KristofferC removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 31, 2018