
Don't unbroadcast some cases which don't need broadcasting #973

Merged 7 commits on May 21, 2021

Conversation

mcabbott
Member

This is a small optimisation of what I think are fairly common broadcasts:

julia> @btime gradient((x,y) -> sum(x .* y), $(rand(100,100)), pi);
  20.125 μs (7 allocations: 234.62 KiB)
  13.750 μs (3 allocations: 78.22 KiB)  # this PR

julia> @btime gradient((x,y) -> sum(x ./ y), $(rand(100,100)), pi);
  19.667 μs (7 allocations: 234.62 KiB)
  13.958 μs (3 allocations: 78.22 KiB)  # this PR

For best effect it will need JuliaLang/julia#39053, which you can simulate via

@eval Base function mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...)
    get(kw, :dims, :) === (:) && return mapreduce(A -> f(A...), op, zip(A...); kw...)
    return reduce(op, map(f, A...); kw...)
end
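For reference, the behavioural change that patch makes can be checked directly. This is a hedged illustration in plain Julia; `naive` and `fused` are just local names, not part of any API:

```julia
# Without the patch, multi-array mapreduce materialises map(f, A, B) first;
# the zip form fuses the map into the reduction, avoiding that allocation.
A = rand(100)
B = rand(100)
naive = reduce(+, map(*, A, B))                 # allocates a length-100 intermediate
fused = mapreduce(t -> *(t...), +, zip(A, B))   # no intermediate array
```

Both give the same result as `sum(A .* B)`; only the allocation behaviour differs.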

@DhairyaLGandhi
Member

Yeah, this seems like too specific an optimisation to be acceptable. It also seems to need some testing with the JuliaLang PR you mentioned.

@mcabbott
Member Author

> It also seems to need some testing with the JuliaLang PR you mentioned.

Great, if you have a GPU handy, this would be a useful thing to do.

test/cuda.jl (review thread, resolved)
@CarloLucibello
Member

a 30% speedup for such common operations with a few lines of code seems great

@mcabbott
Member Author

mcabbott commented May 16, 2021

It looks like CUDA.jl supports mapreducedim fully and efficiently, which is great -- it's ahead of Base here. The interaction with FillArrays wasn't great, and only matters on the CPU in the end... but after fixing that, this is a 3x improvement in both time and memory:

julia> @btime gradient((x,y) -> sum(x .* y), $(rand(100,100)), pi);
  20.125 μs (7 allocations: 234.62 KiB) # master
  6.108 μs (3 allocations: 78.22 KiB)   # this PR + FillArrays PR

julia> @btime gradient((x,y) -> sum(x ./ y), $(rand(100,100)), pi);
  19.667 μs (7 allocations: 234.62 KiB)
  6.781 μs (3 allocations: 78.22 KiB)

This is trimmed down from a bigger attempt to do this for more broadcasts. Things like ones(10,10) ./ ones(10) could also benefit (and are also common, e.g. normalising columns) but for things like ones(1,10) .* ones(10) you'd need something more general than mapreducedim, so it needs a bit of logic to avoid that. Maybe later.
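To see why ones(1,10) .* ones(10) needs something more general: neither argument has the same shape as the 10×10 gradient, so the mapreduce(dot, +, ...) trick below can't apply, and you fall back to materialising the broadcast and then summing over the broadcast dimensions. A hedged sketch of that dims-based reduction, assuming only Base Julia; `unbroadcast_sum` is a hypothetical helper, not Zygote's internal function:

```julia
# Sum Δ over every dimension along which x was broadcast
# (where x has size 1, or fewer dimensions than Δ).
function unbroadcast_sum(x::AbstractArray, Δ::AbstractArray)
    dims = ntuple(d -> size(x, d) == 1 ? d : ndims(Δ) + 1, ndims(Δ))
    reshape(sum(Δ; dims=dims), size(x))
end

x = ones(1, 10); y = ones(10)
Δ = ones(10, 10)                 # upstream gradient of sum(x .* y)
Δx = unbroadcast_sum(x, Δ .* y)  # reduces over rows    -> size (1, 10)
Δy = unbroadcast_sum(y, Δ .* x)  # reduces over columns -> size (10,)
```

The cost is the intermediate Δ .* y and Δ .* x arrays, which is exactly what a fused multi-array reduction would avoid.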

Edit -- here's what the full mapreduce idea would look like for 2-arg .*, FWIW. Not too bad, but it would be better to do this once inside unbroadcast, rather than for each function. Base's mapreduce(f, op, A, B; dims) calls map(f, A, B), so this is neither faster nor slower than the present Δ .* conj.(y) way.

@adjoint function broadcasted(::typeof(*), x::AbstractArray{<:Number}, y::AbstractArray{<:Number})
  x .* y, Δ -> begin
    init = zero(promote_type(eltype(Δ), eltype(x), eltype(y)))
    Δx = if length(x) != length(Δ) && size(y) == size(Δ) # ideally size == up to trailing 1s
      mapreduce(dot, +, y, Δ; dims = ntuple(d -> size(x,d)==1 ? d : ndims(Δ)+1, ndims(Δ)), init=init)
    else
      unbroadcast(x, Δ .* conj.(y))
    end
    Δy = if length(y) != length(Δ) && size(x) == size(Δ)
      mapreduce(dot, +, x, Δ; dims = ntuple(d -> size(y,d)==1 ? d : ndims(Δ)+1, ndims(Δ)), init=init)
    else
      unbroadcast(y, Δ .* conj.(x))
    end
    (nothing, Δx, Δy)
  end
end
@adjoint function broadcasted(::typeof(*), x::Numeric, y::Numeric) # only cases with at least one scalar
  z, back = pullback(*, x, y)
  z, Δ -> (nothing, back(Δ)...)
end

Simple test case:

julia> @btime gradient((x,y) -> sum(x .* y), $(rand(100,100)), $(rand(100)));
  25.208 μs (14 allocations: 235.72 KiB)  # with or without this
  21.625 μs (12 allocations: 157.52 KiB)  # with this + mapreduce implementation

julia> @btime copy( $(rand(100,100)));  # thus 3 matrix copies, 1 avoidable
  2.708 μs (2 allocations: 78.20 KiB)

@mcabbott
Member Author

In fact the scalar .* array cases are much easier; I was misled by thinking about the different-array cases. For a scalar argument we can just call the existing rules for un-broadcasted multiplication etc., which use dot instead of broadcasting plus reduction. This will be faster than the previous benchmarks (at least once some specialisations in FillArrays are added), faster in other cases too, and it's simpler.
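A hedged sketch of that simpler scalar-times-array rule; `scalar_mul_pullback` is an illustrative name, not Zygote's actual rule, and the complex case would need conj on the array gradient too:

```julia
using LinearAlgebra: dot

function scalar_mul_pullback(x::Number, y::AbstractArray)
    z = x * y                       # same result as x .* y, no broadcast machinery
    back(Δ) = (dot(y, Δ), x .* Δ)   # ∂x is one dot product (conjugating y); ∂y is a scaling (real x)
    return z, back
end

z, back = scalar_mul_pullback(2.0, [1.0, 2.0, 3.0])
Δx, Δy = back(ones(3))              # pull back the gradient of sum(z)
```

The gradient for the scalar collapses to a single dot product, with no intermediate broadcast array and no dims-reduction at all.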

@mcabbott mcabbott changed the title unbroadcast using mapreduce, sometimes Don't unbroadcast some cases which don't need broadcasting May 16, 2021