
Don't unbroadcast some cases which don't need broadcasting #973

Merged 7 commits on May 21, 2021

Conversation

mcabbott
Member

This is a small optimisation of what I think are fairly common broadcasts:

julia> @btime gradient((x,y) -> sum(x .* y), $(rand(100,100)), pi);
  20.125 μs (7 allocations: 234.62 KiB)
  13.750 μs (3 allocations: 78.22 KiB)  # this PR

julia> @btime gradient((x,y) -> sum(x ./ y), $(rand(100,100)), pi);
  19.667 μs (7 allocations: 234.62 KiB)
  13.958 μs (3 allocations: 78.22 KiB)  # this PR

For best effect it will need JuliaLang/julia#39053, which you can simulate via

@eval Base function mapreduce(f, op, A::AbstractArrayOrBroadcasted...; kw...)
    get(kw, :dims, :) === (:) && return mapreduce(A -> f(A...), op, zip(A...); kw...)
    return reduce(op, map(f, A...); kw...)
end
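For reference, the behavioural change that patch makes can be checked directly. This is a hedged illustration in plain Julia; `naive` and `fused` are just local names, not part of any API:

```julia
# Without the patch, multi-array mapreduce materialises map(f, A, B) first;
# the zip form fuses the map into the reduction, avoiding that allocation.
A = rand(100)
B = rand(100)
naive = reduce(+, map(*, A, B))                 # allocates a length-100 intermediate
fused = mapreduce(t -> *(t...), +, zip(A, B))   # no intermediate array
```

Both give the same result as `sum(A .* B)`; only the allocation behaviour differs.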

@DhairyaLGandhi
Member

Yeah, this seems like too specific an optimisation to be acceptable. It also seems to need some testing with the JuliaLang PR you mentioned.

@mcabbott
Member Author

> It also seems to need some testing with the JuliaLang PR you mentioned.

Great, if you have a GPU handy, this would be a useful thing to do.

test/cuda.jl (review thread, resolved)
@CarloLucibello
Member

a 30% speedup for such common operations with a few lines of code seems great

@mcabbott
Member Author

mcabbott commented May 16, 2021

It looks like CUDA.jl supports mapreducedim fully and efficiently, which is great -- it's ahead of Base here. The interaction with FillArrays wasn't great, and only matters on the CPU in the end... but after fixing that, this is a 3x improvement in both time and memory:

julia> @btime gradient((x,y) -> sum(x .* y), $(rand(100,100)), pi);
  20.125 μs (7 allocations: 234.62 KiB) # master
  6.108 μs (3 allocations: 78.22 KiB)   # this PR + FillArrays PR

julia> @btime gradient((x,y) -> sum(x ./ y), $(rand(100,100)), pi);
  19.667 μs (7 allocations: 234.62 KiB)
  6.781 μs (3 allocations: 78.22 KiB)

This is trimmed down from a bigger attempt to do this for more broadcasts. Things like ones(10,10) ./ ones(10) could also benefit (and are also common, e.g. normalising columns) but for things like ones(1,10) .* ones(10) you'd need something more general than mapreducedim, so it needs a bit of logic to avoid that. Maybe later.
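To see why ones(1,10) .* ones(10) needs something more general: neither argument has the same shape as the 10×10 gradient, so the mapreduce(dot, +, ...) trick below can't apply, and you fall back to materialising the broadcast and then summing over the broadcast dimensions. A hedged sketch of that dims-based reduction, assuming only Base Julia; `unbroadcast_sum` is a hypothetical helper, not Zygote's internal function:

```julia
# Sum Δ over every dimension along which x was broadcast
# (where x has size 1, or fewer dimensions than Δ).
function unbroadcast_sum(x::AbstractArray, Δ::AbstractArray)
    dims = ntuple(d -> size(x, d) == 1 ? d : ndims(Δ) + 1, ndims(Δ))
    reshape(sum(Δ; dims=dims), size(x))
end

x = ones(1, 10); y = ones(10)
Δ = ones(10, 10)                 # upstream gradient of sum(x .* y)
Δx = unbroadcast_sum(x, Δ .* y)  # reduces over rows    -> size (1, 10)
Δy = unbroadcast_sum(y, Δ .* x)  # reduces over columns -> size (10,)
```

The cost is the intermediate Δ .* y and Δ .* x arrays, which is exactly what a fused multi-array reduction would avoid.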

Edit -- here's what the full mapreduce idea would look like for 2-arg .*, FWIW. Not too bad, but it would be better to do this once inside unbroadcast, rather than for each function. Base's mapreduce(f, op, A, B; dims) calls map(f, A, B), so this is neither faster nor slower than the present Δ .* conj.(y) way.

@adjoint function broadcasted(::typeof(*), x::AbstractArray{<:Number}, y::AbstractArray{<:Number})
  x .* y, Δ -> begin
    init = zero(promote_type(eltype(Δ), eltype(x), eltype(y)))
    Δx = if length(x) != length(Δ) && size(y) == size(Δ) # ideally size == up to trailing 1s
      mapreduce(dot, +, y, Δ; dims = ntuple(d -> size(x,d)==1 ? d : ndims(Δ)+1, ndims(Δ)), init=init)
    else
      unbroadcast(x, Δ .* conj.(y))
    end
    Δy = if length(y) != length(Δ) && size(x) == size(Δ)
      mapreduce(dot, +, x, Δ; dims = ntuple(d -> size(y,d)==1 ? d : ndims(Δ)+1, ndims(Δ)), init=init)
    else
      unbroadcast(y, Δ .* conj.(x))
    end
    (nothing, Δx, Δy)
  end
end
@adjoint function broadcasted(::typeof(*), x::Numeric, y::Numeric) # only cases with at least one scalar
  z, back = pullback(*, x, y)
  z, Δ -> (nothing, back(Δ)...)
end

Simple test case:

julia> @btime gradient((x,y) -> sum(x .* y), $(rand(100,100)), $(rand(100)));
  25.208 μs (14 allocations: 235.72 KiB)  # with or without this
  21.625 μs (12 allocations: 157.52 KiB)  # with this + mapreduce implementation

julia> @btime copy( $(rand(100,100)));  # thus 3 matrix copies, 1 avoidable
  2.708 μs (2 allocations: 78.20 KiB)

@mcabbott
Member Author

In fact the scalar .* array cases are much easier; I was misled by thinking about the different-array cases. For a scalar argument we can just call the existing rules for un-broadcasted multiplication etc., which use dot instead of broadcasting plus reduction. This will be faster than the previous benchmarks (at least once some specialisations in FillArrays are added), faster in other cases too, and it's simpler.
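A hedged sketch of that simpler scalar-times-array rule; `scalar_mul_pullback` is an illustrative name, not Zygote's actual rule, and the complex case would need conj on the array gradient too:

```julia
using LinearAlgebra: dot

function scalar_mul_pullback(x::Number, y::AbstractArray)
    z = x * y                       # same result as x .* y, no broadcast machinery
    back(Δ) = (dot(y, Δ), x .* Δ)   # ∂x is one dot product (conjugating y); ∂y is a scaling (real x)
    return z, back
end

z, back = scalar_mul_pullback(2.0, [1.0, 2.0, 3.0])
Δx, Δy = back(ones(3))              # pull back the gradient of sum(z)
```

The gradient for the scalar collapses to a single dot product, with no intermediate broadcast array and no dims-reduction at all.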

@mcabbott mcabbott changed the title unbroadcast using mapreduce, sometimes Don't unbroadcast some cases which don't need broadcasting May 16, 2021