More efficient mapreduce(f, op, A, B, C; init) #41001

Open · wants to merge 10 commits into base: master

Conversation

@mcabbott (Contributor) commented on May 29, 2021

This is an attempt to make at least some cases of mapreduce(f, op, A, B, C) faster. At present all cases just call reduce(op, map(f, A, B, C)).

This PR generalises _mapreducedim! to allow several arrays, which is enough to make reductions with init fast. When init is not given, generalising reducedim_init looks delicate and largely orthogonal, so it is left for future work.

For scalar reductions, it appears to be faster to call only(_mapreducedim!(...)) when init is given. When init is not given, it uses zip, exactly as in #39053. That isn't as fast: it saves memory, but not always time, compared to the existing reduce(op, map(f, ...)). More could be done here too.
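To illustrate the approach, here is a rough standalone sketch with a hypothetical name, not the PR's actual internals: when init is supplied, the destination can be pre-filled with init and a single loop can reduce f applied elementwise to several same-shape arrays, avoiding the intermediate array that map would allocate.

function mapreducedim_sketch(f, op, init, dims, A::AbstractArray, Bs::AbstractArray...)
    # size 1 along the reduced dimensions, otherwise the size of A
    sz = ntuple(d -> d in dims ? 1 : size(A, d), ndims(A))
    R = fill(init, sz)
    for I in CartesianIndices(A)
        J = CartesianIndex(ntuple(d -> d in dims ? 1 : I[d], ndims(A)))
        @inbounds R[J] = op(R[J], f(A[I], map(B -> B[I], Bs)...))
    end
    return R
end

mapreducedim_sketch(/, +, 0.0, 1, rand(100, 100), rand(100, 100))  # ≈ mapreduce(/, +, A, B; dims=1, init=0.0)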

Times, with dims:

julia> @btime reduce(+, map(+, $(rand(100,100)), $(rand(100,100))), dims=1);
  4.315 μs (3 allocations: 79.08 KiB)

julia> @btime mapreduce(/, +, $(rand(100,100)), $(rand(100,100)), dims=1); # dims=1 uses fast linear path
  4.202 μs (3 allocations: 79.08 KiB)  # unchanged, without init
  9.708 μs (3 allocations: 79.08 KiB)  # test without mapreduce_impl option, to compare

julia> @btime mapreduce(/, +, $(rand(100,100)), $(rand(100,100)), dims=1, init=0.0);
  266.750 μs (40090 allocations: 627.27 KiB)  # with vararg mapreduce_impl, first attempt
  4.869 μs (301 allocations: 5.56 KiB)        # with explicit N==2 version of mapreduce_impl
  2.273 μs (1 allocation: 896 bytes)          # with ith_all --  this PR

julia> @btime mapreduce(/, +, $(rand(100,100)), $(rand(100,100)), dims=2); # dims=2 uses a CartesianIndices path
  5.910 μs (7 allocations: 79.16 KiB)

julia> @btime mapreduce(/, +, $(rand(100,100)), $(rand(100,100)), dims=2, init=0.0);
  2.755 μs (5 allocations: 976 bytes)       # with explicit if N==2 ...
  2.750 μs (5 allocations: 976 bytes)       # with ith_all --  this PR

Times, scalar reduction:

julia> @btime reduce(+, map(/, $(rand(100,100)), $(rand(100,100))));  # behaviour on master
  5.736 μs (2 allocations: 78.20 KiB)

julia> @btime mapreduce(/, +, $(rand(100,100)), $(rand(100,100)));
  9.291 μs (0 allocations: 0 bytes)    # with zip --  this PR, or #39053
  5.340 μs (1 allocation: 16 bytes)    # with _mapreduce, ultimately mapreduce_impl, removed for now

julia> @btime mapreduce(/, +, $(rand(100,100)), $(rand(100,100)); init=0.0);
  2.713 μs (9 allocations: 368 bytes)  # with only(..., dims=all), ultimately mapreduce_impl
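
The last timing relies on reducing over every dimension and then unwrapping with only. Paraphrased with public API (so, unlike the PR, the intermediate array from map is not avoided here):

A, B = rand(100, 100), rand(100, 100)
only(reduce(+, map(/, A, B); dims=(1, 2), init=0.0))  # same value as mapreduce(/, +, A, B; init=0.0), up to summation order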

Times on the example from #38558, on my computer:

julia> summary(x)
"10240-element Vector{Float64}"

julia> @btime reduce(+, map(==, $x, $y));  # behaviour of f(x,y) on master
  5.423 μs (1 allocation: 10.19 KiB)

julia> @btime f($x,$y);  # mapreduce, turned into zip by this PR (or #39053)
  5.028 μs (0 allocations: 0 bytes)

julia> @btime f2($x,$y);  # explicit loop
  1.975 μs (0 allocations: 0 bytes)

julia> @btime f3($x,$y);  # using MappedArrays
  2.042 μs (0 allocations: 0 bytes)

julia> @btime f4($x,$y);  # with zip
  5.028 μs (0 allocations: 0 bytes)

julia> @btime mapreduce(==, +, $x, $y, init=0);  # with init, for this PR
  2.528 μs (6 allocations: 288 bytes)
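
For context, here are plausible definitions of the helpers benchmarked above; these are my guesses reconstructed from the comments, and #38558 may define them differently (f3 assumes the MappedArrays.jl package is installed):

using MappedArrays: mappedarray                     # only needed for f3

f(x, y) = mapreduce(==, +, x, y)                    # the mapreduce this PR turns into zip

function f2(x, y)                                   # explicit loop
    s = 0
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] == y[i]
    end
    return s
end

f3(x, y) = sum(mappedarray(==, x, y))               # lazy wrapper from MappedArrays.jl
f4(x, y) = sum(a == b for (a, b) in zip(x, y))      # generator over zip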

I haven't checked tests closely, nor thought about whether more are needed.

Comment on lines +362 to +401
_mapreduce_dim_vararg(f, op, init, As::Tuple, dims) =
_mapreduce_dim_vararg(f, op, init, As, dims, IteratorSize(Generator(f, As...)))
@mcabbott (Contributor, Author) commented:
The only purpose of Generator(f, As...) is for the traits HasShape etc. Perhaps there's a better way.

Some weird test cases:

mapreduce(/,+,[1 2 3; 4 5 6],[7,8,9,10],dims=2,init=0.0)
mapreduce(/,+,[1 2 3; 4 5 6],[7,8,9,10],dims=1,init=0.0)
mapreduce(/,+,[1 2 3; 4 5 6],rand(2,3,1),dims=1,init=0.0)
mapreduce(/,+,rand(2,3),rand(2,3,1),dims=2,init=0.0)
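
For reference, the trait lookup mentioned above; the printed value reflects my understanding of current Base behaviour, not output produced with this PR:

julia> Base.IteratorSize(Base.Generator(/, rand(2, 3), rand(2, 3)))
Base.HasShape{2}()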

@mcabbott (Contributor, Author) commented:
A bolder possibility would be to declare those to be errors. This would fit with restricting map(f, A, B) to size(A) == size(B), instead of this zip behaviour. For tuples, this is still how map acts; in the discussion at #42216, most people seem to prefer that to the zip behaviour.
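
The behaviours being contrasted, with abbreviated output (these reflect current behaviour as I understand it, not this PR):

julia> map(+, [1, 2, 3], [10, 20])   # arrays: zip-like, silently truncates
2-element Vector{Int64}:
 11
 22

julia> [1, 2, 3] .+ [10, 20]         # broadcast: mismatched sizes are an error
ERROR: DimensionMismatch: ...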

Comment on lines +349 to +388
_mapreduce_dim_vararg(f, op, init, As::Tuple, ::Colon) =
only(_mapreduce_dim_vararg(f, op, init, As, 1:maximum(ndims, As)))
@mcabbott (Contributor, Author) commented:
What I've called _mapreduce_dim_vararg could simply be methods of _mapreduce_dim. It would be tidiest to re-order its arguments from _mapreduce_dim(f, op, init, A, dims) to _mapreduce_dim(f, op, init, dims, A, Bs...). Would this be acceptable?

According to https://juliahub.com/ui/Search?q=_mapreduce_dim&type=code there are 3 packages which overload _mapreduce_dim, so perhaps it's not a disaster to change it.
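
A toy illustration of the proposed argument order, with a hypothetical name rather than the PR's code: once dims comes before the arrays, a single Vararg method covers any number of arrays.

demo_mapreduce_dim(f, op, init, ::Colon, A, Bs...) =
    mapfoldl(Base.splat(f), op, zip(A, Bs...); init = init)

demo_mapreduce_dim(/, +, 0.0, :, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 1/4 + 2/5 + 3/6 = 1.15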

Comment on lines +259 to +270
# vararg version
@noinline function mapreduce_impl(f::F, op, A::AbstractArrayOrBroadcasted,
ifirst::Integer, ilast::Integer, blksize::Int,
Bs::Vararg{AbstractArrayOrBroadcasted,N}) where {F,N}
Xs = (A, Bs...)
if ifirst == ilast
@inbounds x1 = ith_all(ifirst, Xs)
@mcabbott (Contributor, Author) commented:
Could this be combined with the original version? It's a bit of an ugly signature, mapreduce_impl(f, op, A, i, j, b, Bs...); I'm not sure how hard it would be to change it so that all the arrays come last.
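
For reference, ith_all is essentially the recursive tuple-indexing helper already used by map! in Base; roughly (paraphrased, so treat the exact definition as an approximation):

ith_all_sketch(i, ::Tuple{}) = ()
Base.@propagate_inbounds ith_all_sketch(i, as::Tuple) =
    (as[1][i], ith_all_sketch(i, Base.tail(as))...)

ith_all_sketch(3, ([10, 20, 30], [1.0, 2.0, 3.0]))  # (30, 3.0)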

@dkarrasch added the performance (Must go faster) label on May 29, 2021
@oscardssmith (Member) commented:
is this ready to merge?

@mcabbott (Contributor, Author) commented:
I believe it works, but it's quite a thicket of functions, and it's possible that someone who knows it better may prefer a different approach.

For instance, mapreduce(f, g, A, B, C) could go to reduce(g, broadcasted(f, A, B, C)), maybe that would allow more re-use, including the case when init is not given. IIRC complete reductions of lazy Broadcasted objects were made fast, but with dims they are still slow.

If we do want this approach, I highlighted places above where I wasn't sure this was the best choice of signature.
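
A sketch of the reduce(g, broadcasted(f, A, B, C)) alternative mentioned above, using the existing lazy-broadcast reduction path (its behaviour here is my understanding of current Base, not something this PR adds):

using Base.Broadcast: broadcasted, instantiate

A, B = rand(100, 100), rand(100, 100)
sum(instantiate(broadcasted(/, A, B)))            # complete reduction, no materialised intermediate
# sum(instantiate(broadcasted(/, A, B)); dims=1)  # the dims case, which is the part that is still slow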

@oscardssmith (Member) commented:
As I understand it, this isn't valid, since map doesn't do the same things as broadcasted.

@mcabbott (Contributor, Author) commented:
It might be easy to first check for cases that broadcasting allows and map does not (like extending dimensions). Cases where map stops early, like zip, would be harder, although it's not so clear to me whether those are intentional anyway; see #41001 (comment).

@N5N3 (Member) commented on Jan 10, 2022:
Fusing into a Broadcasted here seems like the better choice to me. But for performance reasons, I guess the final result would be a lightweight MappedArrays.jl in Base.
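
As a sketch of the kind of lightweight wrapper that suggests (hypothetical code, not part of Base or this PR):

struct MappedPair{T,N,F,A<:AbstractArray,B<:AbstractArray} <: AbstractArray{T,N}
    f::F
    a::A
    b::B
end
MappedPair(f, a::AbstractArray{<:Any,N}, b::AbstractArray{<:Any,N}) where {N} =
    MappedPair{typeof(f(first(a), first(b))), N, typeof(f), typeof(a), typeof(b)}(f, a, b)
Base.size(m::MappedPair) = size(m.a)
Base.@propagate_inbounds Base.getindex(m::MappedPair, i::Int...) = m.f(m.a[i...], m.b[i...])

# reductions then reuse the ordinary single-array machinery, with no copy
x, y = rand(10_240), rand(10_240)
sum(MappedPair(==, x, y))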
