
fast broadcast over combinations of tuples and scalars #20817

Merged (1 commit) on Mar 6, 2017

Conversation

@Sacha0 (Member) commented Feb 26, 2017

This pull request improves the performance of broadcast over combinations of tuples and scalars, reducing runtime and allocation count/volume by as much as multiple orders of magnitude. For example, with newtuplebc denoting this pull request's implementation:

julia> tup = (1., 2., 3.);

julia> using BenchmarkTools

julia> @benchmark broadcast(+, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  7.56 kb
  allocs estimate:  167
  minimum time:     13.36 μs (0.00% GC)
  median time:      14.32 μs (0.00% GC)
  mean time:        15.79 μs (5.22% GC)
  maximum time:     3.34 ms (96.46% GC)

julia> @benchmark newtuplebc(+, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     993
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  32.00 bytes
  allocs estimate:  1
  minimum time:     34.00 ns (0.00% GC)
  median time:      34.00 ns (0.00% GC)
  mean time:        37.59 ns (4.93% GC)
  maximum time:     2.35 μs (95.75% GC)

This pull request's implementation now seems limited by the performance of ntuple:

julia> ntuplelimit(f, n, As...) = ntuple(k -> f(_tbcgetargs(As, k)...), n)
ntuplelimit (generic function with 1 method)

julia> @benchmark ntuplelimit(+, 3, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     993
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  32.00 bytes
  allocs estimate:  1
  minimum time:     33.00 ns (0.00% GC)
  median time:      33.00 ns (0.00% GC)
  mean time:        37.25 ns (5.31% GC)
  maximum time:     2.48 μs (96.05% GC)

Though this pull request substantially improves the test case from #20802,

julia> @benchmark broadcast(round, Int, $tup) # test case from #20802
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     5
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.05 kb
  allocs estimate:  31
  minimum time:     6.19 μs (0.00% GC)
  median time:      6.40 μs (0.00% GC)
  mean time:        6.81 μs (2.24% GC)
  maximum time:     895.49 μs (96.90% GC)

julia> @benchmark newtuplebc(round, Int, $tup) # test case from #20802
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     186
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  96.00 bytes
  allocs estimate:  5
  minimum time:     554.00 ns (0.00% GC)
  median time:      559.00 ns (0.00% GC)
  mean time:        588.48 ns (1.15% GC)
  maximum time:     14.21 μs (92.76% GC)

some performance remains on the table, which I conjecture is due to specialization behavior with type arguments:

julia> specntuple1(f, n, As...) = ntuple(k -> f(As[1], As[2][k]), n)
specntuple1 (generic function with 1 method)

julia> specntuple2(f, n, As...) = ntuple(k -> f(Int, As[2][k]), n)
specntuple2 (generic function with 1 method)

julia> @benchmark specntuple1(round, 3, Int, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     183
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  96.00 bytes
  allocs estimate:  5
  minimum time:     586.00 ns (0.00% GC)
  median time:      592.00 ns (0.00% GC)
  mean time:        626.54 ns (1.19% GC)
  maximum time:     15.39 μs (93.39% GC)

julia> @benchmark specntuple2(round, 3, Int, $tup)
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     997
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  48.00 bytes
  allocs estimate:  2
  minimum time:     22.00 ns (0.00% GC)
  median time:      23.00 ns (0.00% GC)
  mean time:        28.05 ns (13.35% GC)
  maximum time:     2.73 μs (96.63% GC)

(Notably, the forced inlining of broadcast_indices and broadcast_shape in the existing implementation kills performance; achieving consistently high performance required forcing non-inlining of the result-length-determining code. I wonder whether forced inlining of broadcast_indices and broadcast_shape might be hurting small-input performance for some other broadcast methods?) Best!
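(Not part of the original PR; a hedged, simplified sketch of the inlining arrangement described in the parenthetical above: the result-length computation is kept out-of-line with `@noinline`, while the per-element accessors stay `@inline`. The helper names here are hypothetical, not the PR's actual code.)

```julia
# Hypothetical names; a simplified sketch of the inlining split described
# above, not the PR's actual implementation.
@inline _getarg(a::Tuple, k) = a[k]   # tuples index elementwise
@inline _getarg(a, k) = a             # scalars broadcast to every position
# Keeping the result-length computation out-of-line is what restored
# consistently high performance in the PR's measurements.
@noinline _reslength(As) = maximum(a isa Tuple ? length(a) : 1 for a in As)
sketchbc(f, As...) = ntuple(k -> f(map(a -> _getarg(a, k), As)...), _reslength(As))

sketchbc(+, (1, 2, 3), 10)  # (11, 12, 13)
```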

@Sacha0 Sacha0 added performance Must go faster broadcast Applying a function over a collection labels Feb 26, 2017
@KristofferC (Member) commented Feb 26, 2017

Could the same trick as in #19879 be used for the specialization on the type as the first argument?

@Sacha0 (Member, Author) commented Feb 26, 2017

> Could the same trick as in #19879 be used for the specialization on the type as the first argument?

Possibly, yes. Will have a look later (unless someone beats me to it! :) ). (Edit: A naive equivalent doesn't serve unfortunately. Perhaps @pabloferz has an idea?) Best!

@pabloferz (Contributor) commented Feb 26, 2017

Something like this should close the final gap

newtuplebc(f, A, Bs...) = ntuple(k -> f(Base.Broadcast._broadcast_getindex(A, k), _tbcgetargs(Bs, k)...), _tbcreslength(A, Bs))
newtuplebc{T}(f, ::Type{T}, Bs...) = ntuple(k -> f(Base.Broadcast._broadcast_getindex(T, k), _tbcgetargs(Bs, k)...), _tbcreslength(T, Bs))
@inline _tbcgetargs(As, k) = (Base.Broadcast._broadcast_getindex(first(As), k), _tbcgetargs(Base.tail(As), k)...)
@inline _tbcgetargs(::Tuple{}, k) = ()
_tbcreslength(A, Bs) = (Base.@_noinline_meta; _tbcmaxlength(_tbclength(A), Bs))
@inline _tbcmaxlength(l, As) = _tbcmaxlength(max(l, _tbclength(first(As))), Base.tail(As))
@inline _tbcmaxlength(l, ::Tuple{}) = l
@inline _tbclength(t::Tuple) = length(t)
@inline _tbclength(s) = 1

The problem is that the arguments to the anonymous function don't specialize for a Type (a DataType is inferred instead).
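(A minimal illustration of the specialization issue described above, using hypothetical names: when the type reaches the closure through a `Vararg` slot, it is inferred as `DataType`, so the call is not specialized on it; hoisting it into a `::Type{T}` static parameter restores specialization. Written in the post-0.6 `where` syntax.)

```julia
# Hypothetical names illustrating the specialization issue described above.
# Here Int flows through a Vararg slot, so inside the closure it is seen as
# a DataType and f(As[1], x) is not specialized on Int:
through_varargs(f, n, As...) = ntuple(k -> f(As[1], As[2][k]), n)
# A dedicated ::Type{T} method (a static parameter) restores specialization:
through_typeparam(f, ::Type{T}, n, B) where {T} = ntuple(k -> f(T, B[k]), n)

tup = (1.0, 2.0, 3.0)
through_varargs(round, 3, Int, tup)   # (1, 2, 3), but unspecialized and slower
through_typeparam(round, Int, 3, tup) # (1, 2, 3), specialized on Int
```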

@KristofferC (Member):

CI is #20818

@Sacha0 (Member, Author) commented Feb 27, 2017

Added a method explicitly specialized for type first arguments, similar to @pabloferz's suggestion above. Result:

julia> @benchmark newtuplebc(round, Int, $tup)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     27.367 ns (0.00% GC)
  median time:      27.806 ns (0.00% GC)
  mean time:        32.810 ns (10.01% GC)
  maximum time:     2.976 μs (96.52% GC)
  --------------
  samples:          10000
  evals/sample:     995
  time tolerance:   5.00%
  memory tolerance: 1.00%

Thanks!

@inline _tbcmaxlength(l, As) = _tbcmaxlength(max(l, _tbclength(first(As))), tail(As))
@inline _tbcmaxlength(l, ::Tuple{}) = l
@inline _tbclength(t::Tuple) = length(t)
@inline _tbclength(s) = 1
Review comment (Member):

please do not use acronyms / abbreviations in method names

@Sacha0 (Member, Author) replied:

Suggestions? _tuplebroadcast_maxlength etc seemed a bit long :).

Review comment (Member):
Seems clear to me!

@Sacha0 (Member, Author) replied Feb 28, 2017:

As in _tbc[...] seems clear to you, or _tuplebroadcast_[...] seems clear to you and not too long? Thanks!

Review comment (Member):

Sorry, I meant the long version. Too long isn't that bad for complex code that appears infrequently and only internally; hopefully, the more people who can follow the workings of broadcast, the better maintained it will be.

@KristofferC (Member):

6.19 μs -> 27 ns... preeetty good...

@KristofferC (Member):

Restarted travis.

@Sacha0 (Member, Author) commented Feb 27, 2017

(Only AV failures were #20818.)

@KristofferC (Member):

I put a benchmark tag here because it doesn't seem that there are any real tuple broadcasting benchmarks in BaseBenchmarks right now.

@KristofferC KristofferC added the potential benchmark Could make a good benchmark in BaseBenchmarks label Feb 27, 2017
@Sacha0 (Member, Author) commented Mar 2, 2017

Names expanded. What say you, @nanosoldier runbenchmarks(ALL, vs = ":master") ?

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@Sacha0 (Member, Author) commented Mar 3, 2017

I could not reproduce any of the possible regressions locally, but for good measure: @nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier (Collaborator):

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@Sacha0 (Member, Author) commented Mar 4, 2017

(The combined results suggest the identified potential regressions are noise.)

@inline _tuplebroadcast_getargs(::Tuple{}, k) = ()
@inline _tuplebroadcast_getargs(As, k) =
(_broadcast_getindex(first(As), k), _tuplebroadcast_getargs(tail(As), k)...)
@noinline _tuplebroadcast_reslength(As) =
@pabloferz (Contributor) commented Mar 4, 2017:

If inlining the length is the problem, then just defining

@noinline _tuplebroadcast_reslength(As) = length(broadcast_indices(As...)[1])

instead of these should do (it might be even faster).

@Sacha0 (Member, Author) replied:

> If inlining the length is the problem, then just defining
> `@noinline _tuplebroadcast_reslength(As) = length(broadcast_indices(As...)[1])`
> instead of these should do (it might be even faster).

Inlining of the length determination code was most but not all of the problem. The length determination code in this pull request is substantially faster than the length(broadcast_indices(...)) construction. For example,

using Base.tail
using Base.Broadcast: _broadcast_getindex, broadcast_indices

@inline _tuplebroadcast_getargs(::Tuple{}, k) = ()
@inline _tuplebroadcast_getargs(As, k) =
    (_broadcast_getindex(first(As), k), _tuplebroadcast_getargs(tail(As), k)...)

tuplebc(f, ::Type{Tuple}, As...) =
    ntuple(k -> f(_tuplebroadcast_getargs(As, k)...), _tuplebroadcast_reslength(As))
@noinline _tuplebroadcast_reslength(As) =
    _tuplebroadcast_maxlength(_tuplebroadcast_length(first(As)), tail(As))
@inline _tuplebroadcast_maxlength(l, As) =
    _tuplebroadcast_maxlength(max(l, _tuplebroadcast_length(first(As))), tail(As))
@inline _tuplebroadcast_maxlength(l, ::Tuple{}) = l
@inline _tuplebroadcast_length(t::Tuple) = length(t)
@inline _tuplebroadcast_length(s) = 1

tuplebc2(f, ::Type{Tuple}, As...) =
    ntuple(k -> f(_tuplebroadcast_getargs(As, k)...), _tuplebroadcast_reslength2(As))
@noinline _tuplebroadcast_reslength2(As) = length(broadcast_indices(As...)[1])

tup = (1., 2., 3.)

using BenchmarkTools

@benchmark tuplebc(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
@benchmark tuplebc2(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)

yields

julia> @benchmark tuplebc(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     36.073 ns (0.00% GC)
  median time:      37.772 ns (0.00% GC)
  mean time:        43.407 ns (7.76% GC)
  maximum time:     3.230 μs (95.74% GC)
  --------------
  samples:          10000
  evals/sample:     993
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark tuplebc2(+, Tuple, $tup, 1, $tup, 1, $tup, 1, $tup)
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     70.027 ns (0.00% GC)
  median time:      70.797 ns (0.00% GC)
  mean time:        75.237 ns (0.00% GC)
  maximum time:     523.711 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     976
  time tolerance:   5.00%
  memory tolerance: 1.00%

That is, the length determination code in this pull request reduces runtime by roughly a factor of two. Best!

@Sacha0 (Member, Author) commented Mar 4, 2017

(Making the length determination still faster might be possible by excluding non-tuples from the length comparisons at compile time. I originally wrote the length determination that way, but later simplified it to what you see now, not being certain that the performance benefit (if any; I did not benchmark carefully) justified the additional code complexity.)

Edit: Just benchmarked the alternative length determination approach mentioned in this comment. The performance difference (if any) does not seem appreciable (in the noise). Best!
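(A hedged sketch, with hypothetical names, of the alternative length determination mentioned in the comment above: dispatch drops non-tuple arguments before the comparison, so the maximum runs only over actual tuple lengths.)

```julia
# Hypothetical names; sketch of the alternative length determination:
# dispatch filters out non-tuple (scalar) arguments at compile time,
# then the maximum is taken over tuple lengths only.
@inline _picktuples() = ()
@inline _picktuples(t::Tuple, As...) = (t, _picktuples(As...)...)
@inline _picktuples(s, As...) = _picktuples(As...)
@noinline _reslength_tuplesonly(As) = maximum(map(length, _picktuples(As...)))

_reslength_tuplesonly(((1, 2, 3), 1, (1, 2), 2))  # 3
```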

@pabloferz (Contributor) commented Mar 4, 2017

If there's no difference in performance, I'd go with the simpler approach, but I leave it to your criterion ;).

@Sacha0 (Member, Author) commented Mar 4, 2017

> If there's no difference in performance, I'd go with the simpler approach, but I leave it to your criterion ;).
Agreed :). (To make certain we're on the same page: There is no appreciable performance difference between the implementation in this pull request and an earlier, more complex implementation described in #20817 (comment) and similar to the max(tuplens(As)...) construction in #20802 (comment). There is, however, a substantial performance difference between the implementation in this pull request and the length(broadcast_indices(...)) construction. So the implementation in this pull request strikes me as best for now, being both simple and achieving best-known performance. Are we on the same page? :) ). Thanks!

@pabloferz (Contributor):

> Are we on the same page?

Yep :)

@Sacha0 Sacha0 added this to the 0.6.x milestone Mar 5, 2017
@KristofferC (Member) commented Mar 6, 2017

This is just a performance improvement and should be good to merge, right?

@pabloferz pabloferz removed this from the 0.6.x milestone Mar 6, 2017
@pabloferz pabloferz merged commit 015cc63 into JuliaLang:master Mar 6, 2017
@KristofferC (Member):

Made a PR to benchmark this: JuliaCI/BaseBenchmarks.jl#66

@Sacha0 (Member, Author) commented Mar 6, 2017

Thanks all!

@KristofferC (Member):

For reference:

julia> t1, t2, t3, t4 = (rand(3)...), (rand(5)...), (rand(10)...), (rand(15)...);

julia> f_round(v) = round.(Int, v);

julia> for t in (t1,t2,t3,t4)
         @btime f_round($t)
       end

  13.865 ns (1 allocation: 32 bytes)
  26.595 ns (1 allocation: 48 bytes)
  408.035 ns (3 allocations: 192 bytes)
  890.682 ns (5 allocations: 368 bytes)

This is likely the standard long-tuple problem, though.

@KristofferC KristofferC removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Oct 31, 2018