
Mapreduce is 10x slower than loop. #38558

Open · oscardssmith opened this issue Nov 24, 2020 · 11 comments · May be fixed by #55301
Labels: fold (sum, maximum, reduce, foldl, etc.), performance (Must go faster)

Comments

@oscardssmith (Member) commented Nov 24, 2020

A Slack conversation led me to realize that the following two functions currently have very different speeds.

function f(x, y)
    return @inbounds mapreduce(==, +, x, y)
end

function f2(x, y)
    total = 0
    @inbounds for i in 1:length(x)
        total += x[i] == y[i]
    end
    return total
end
x = randn(10240); y = similar(x)
@btime f2(x,y)
  1.255 μs (0 allocations: 0 bytes)

@btime f(x,y)
  9.207 μs (1 allocation: 10.19 KiB)

This is unfortunate, since the mapreduce version is much clearer. Is there any way we can make this case of mapreduce avoid the O(n) allocation (or otherwise be faster)?
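For context, the allocation comes from Base's specialized multi-array method, which (roughly; this is a sketch, not the verbatim source) materializes the mapped array before reducing:

# A sketch of Base's multi-array path: map(f, A, B) allocates a full
# intermediate array (here a Vector{Bool} of length(x)) that is then reduced.
mapreduce_sketch(f, op, A, B) = reduce(op, map(f, A, B))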

@JeffBezanson JeffBezanson added the performance Must go faster label Nov 24, 2020
@jishnub (Contributor) commented Nov 24, 2020

Perhaps this is because map allocates an intermediate array? The performance improves noticeably using MappedArrays:

julia> @btime f($x,$y);
  22.352 μs (1 allocation: 10.19 KiB)

julia> @btime f2($x,$y);
  2.178 μs (0 allocations: 0 bytes)

julia> using MappedArrays

julia> function f3(x,y)
           reduce(+, mappedarray(==, x, y))
       end
f3 (generic function with 1 method)

julia> @btime f3($x,$y);
  4.934 μs (0 allocations: 0 bytes)

Still not quite as good as the loop, but almost there.

@timholy (Member) commented Nov 24, 2020

If Generators were faster we could just delete the specialization for multiple arrays and use

mapreduce(f, op, itrs...; kw...) = reduce(op, Generator(f, itrs...); kw...)

Generators are like generalized MappedArrays, but unfortunately they have never quite reached the same performance. Working on their performance might be the best way to tackle this issue.
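For anyone wanting to try this at the call site today: passing several iterators to Base.Generator zips them, so the fallback above can be written directly (a sketch; f_gen is just an illustrative name):

# Base.Generator(==, x, y) zips x and y and applies == lazily,
# so no intermediate array is allocated.
f_gen(x, y) = reduce(+, Base.Generator(==, x, y))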

@antoine-levitt (Contributor)
> Generators are like generalized MappedArrays, but unfortunately they have never quite reached the same performance. Working on their performance might be the best way to tackle this issue.

Yes please. Generators are super useful and I use them all the time (much more than higher-order functions).

@JeffBezanson (Member)

I get 1.3 μs for the loop, 10.4 μs for mapreduce, and 7.2 μs for the generator version (no allocations). So generators are already faster than the current mapreduce in this case, but yes, they need to be faster still.

@timholy (Member) commented Nov 24, 2020

As a reasonable goalpost, here's one more timing comparison:

  • Generator version: 5.969 μs (0 allocations: 0 bytes)
  • MappedArrays version: 1.631 μs (0 allocations: 0 bytes)
  • f2 version: 1.430 μs (0 allocations: 0 bytes)

For me the gap between the explicit loop and MappedArrays is very small.

@stevengj (Member)

f4(x, y) = mapreduce(((x, y),) -> x == y, +, zip(x, y)) also has no allocations and is faster, but it's still not as fast as the loop.

@oscardssmith (Member, Author)

That would be pretty easy to do automatically, right?
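A sketch of what that automatic rewrite could look like (an illustration, not the actual patch; my_mapreduce is a hypothetical name):

# Hypothetical rewrite: forward the multi-iterator case to the
# single-iterator method over zip, splatting each tuple into f.
# Base.splat(f) turns f into args -> f(args...).
my_mapreduce(f, op, itrs...; kw...) = mapreduce(Base.splat(f), op, zip(itrs...); kw...)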

oscardssmith added a commit to oscardssmith/julia that referenced this issue Dec 30, 2020
@oscardssmith (Member, Author)

@timholy is there any good reason why Generators are slow? Is there low-hanging fruit? If you have a place to point me, I'd be glad to take a look.

albert-oliver added a commit to albert-oliver/NewRivaraProductions.jl that referenced this issue May 6, 2022
   - A faster version is implemented.

   - mapreduce seems slow, see:
     JuliaLang/julia#38558

   - Now it returns an SVector
albert-oliver added a commit to albert-oliver/NewRivaraProductions.jl that referenced this issue May 6, 2022
   - A new function get_xyz_uvw(m) has been implemented. It preallocates
     the matrices xyz and uvw and then does a manual (threaded) loop to
     fill the matrices. As seen in the previous commit, mapreduce is
     slow (JuliaLang/julia#38558), so we have
     chosen this "manual" approach.

   - A new function get_VTKconec(m) has been implemented. It follows the
     same ideas as in get_xyz_uvw.

   - Finally, the function write_vtk(m, filename) has been modified
     accordingly.
@wheeheee commented Dec 1, 2023

It's been a while, but I think I get reasonable performance from this:

f1(x, y) = mapreduce(==, +, x, y)

function f2(x, y)
    total = 0
    for i in eachindex(x, y)
        total += @inbounds x[i] == y[i]
    end
    return total
end

function mapreduce_same(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    mapreduce(tup_f, op, eachindex(A...); kw...)
end

f3(x, y) = mapreduce_same(==, +, x, y; init=0)

My benchmarks give

julia> @btime f1($x, $y);
  2.489 μs (1 allocation: 10.19 KiB)
julia> @btime f2($x, $y);
  1.250 μs (0 allocations: 0 bytes)
julia> @btime f3($x, $y);
  1.380 μs (0 allocations: 0 bytes)

There are a lot of problems with this, like breaking current behaviour when (at least) some of the Vectors have lengths different from the other arrays, and not working with arbitrary indices. Aside from that, if I try to use LinearIndices like this

function mapreduce_cart(f, op, A::Vararg{AbstractArray, N}; kw...) where N
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    inds = eachindex(A...)
    mapreduce(tup_f, op, LinearIndices(inds); kw...)
end
f4(x, y) = mapreduce_cart(==, +, x, y; init=0)

Performance is degraded and the function now allocates.

julia> @btime f4($x, $y);
  1.550 μs (3 allocations: 80 bytes)

Why does this happen?

@mcabbott (Contributor)

@wheeheee that's not a bad idea, similar to MappedArrays but in a few lines.

I'm not sure what LinearIndices is there for; I think that instead you want to reshape in cases where eachindex gives a UnitRange, so that inds always has the same size as the first array. Then reductions with dims will also work:

function mapreduce_same_reshape(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    # Reshaping a UnitRange to the axes of A[1] allows reductions with dims=2 etc.;
    # when eachindex gives CartesianIndices, reshaping makes it slow and isn't necessary.
    ei = eachindex(A...)
    inds = ei isa AbstractUnitRange ? reshape(ei, axes(A[1])) : ei
    mapreduce(tup_f, op, inds; kw...)
end

mapreduce_same_reshape(*, +, [1 2; 3 4], [5 6; 7 8]; dims=1) == [26 44]

Benchmarks for all the above functions here:

https://gist.github.com/mcabbott/4746e69f321909c3ba209518dc0447bb

@wheeheee commented Feb 24, 2024

For the life of me, I can't remember what I used the LinearIndices for...
Anyway, omitting @inbounds in mapreduce_same, for this function at least, makes it faster. I don't know what compiler magic does this, but auto-vectorization is still quite brittle.
Also, I remember noticing that sometimes adding type parameters to f and op made the function faster for >= 3 arrays.
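A sketch of that type-parameter variant (an illustration of the idiom, with a hypothetical name): annotating f::F and op::OP forces Julia to specialize on the function arguments, which it may otherwise skip when they are only captured by a closure.

# Hypothetical sketch: ::F and ::OP force specialization on the function
# arguments; @inbounds is omitted, per the observation above that it can be faster.
function mapreduce_same_spec(f::F, op::OP, A::Vararg{AbstractArray,N}; kw...) where {F,OP,N}
    tup_f(i) = f(ntuple(j -> A[j][i], Val(N))...)
    mapreduce(tup_f, op, eachindex(A...); kw...)
end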
