
Mapreduce is 10x slower than loop. #38558

Open · oscardssmith opened this issue Nov 24, 2020 · 11 comments · May be fixed by #55301
Labels: fold (sum, maximum, reduce, foldl, etc.), performance (Must go faster)

Comments

@oscardssmith (Member) commented Nov 24, 2020

A Slack conversation led me to realize that the following two functions currently have very different speeds.

function f(x, y)
    return @inbounds mapreduce(==, +, x, y)
end

function f2(x, y)
    total = 0
    @inbounds for i in 1:length(x)
        total += x[i] == y[i]
    end
    return total
end
x = randn(10240); y = similar(x)
@btime f2(x,y)
  1.255 μs (0 allocations: 0 bytes)

@btime f(x,y)
  9.207 μs (1 allocation: 10.19 KiB)

This is unfortunate, since the mapreduce version is much clearer. Is there any way we can make this case of mapreduce avoid the O(n) allocation (or otherwise be faster)?
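For context, the allocation comes from Base's specialized multi-array method, which (roughly; this is a sketch, not the verbatim source) materializes the mapped array before reducing:

# A sketch of Base's multi-array path: map(f, A, B) allocates a full
# intermediate array (here a Vector{Bool} of length(x)) that is then reduced.
mapreduce_sketch(f, op, A, B) = reduce(op, map(f, A, B))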

@JeffBezanson JeffBezanson added the performance Must go faster label Nov 24, 2020
@jishnub (Contributor) commented Nov 24, 2020

Perhaps this is because map allocates an intermediate array? The performance improves noticeably using MappedArrays:

julia> @btime f($x,$y);
  22.352 μs (1 allocation: 10.19 KiB)

julia> @btime f2($x,$y);
  2.178 μs (0 allocations: 0 bytes)

julia> using MappedArrays

julia> function f3(x,y)
           reduce(+, mappedarray(==, x, y))
       end
f3 (generic function with 1 method)

julia> @btime f3($x,$y);
  4.934 μs (0 allocations: 0 bytes)

Still not quite as good as the loop, but almost there.

@timholy (Member) commented Nov 24, 2020

If Generators were faster we could just delete the specialization for multiple arrays and use

mapreduce(f, op, itrs...; kw...) = reduce(op, Generator(f, itrs...); kw...)

Generators are like generalized MappedArrays, but unfortunately they have never quite reached the same performance. Working on their performance might be the best way to tackle this issue.
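For anyone wanting to try this at the call site today: passing several iterators to Base.Generator zips them, so the fallback above can be written directly (a sketch; f_gen is just an illustrative name):

# Base.Generator(==, x, y) zips x and y and applies == lazily,
# so no intermediate array is allocated.
f_gen(x, y) = reduce(+, Base.Generator(==, x, y))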

@antoine-levitt (Contributor)
> Generators are like generalized MappedArrays, but unfortunately they have never quite reached the same performance. Working on their performance might be the best way to tackle this issue.

Yes please. Generators are super useful and I use them all the time (much more than higher-order functions).

@JeffBezanson (Member)

I get 1.3 μs for the loop, 10.4 μs for mapreduce, and 7.2 μs for the generator version (no allocations). So generators are already faster than the current mapreduce in this case, but yes, they need to be faster still.

@timholy (Member) commented Nov 24, 2020

As a reasonable goalpost, here's one more timing comparison:

  • Generator version: 5.969 μs (0 allocations: 0 bytes)
  • MappedArrays version: 1.631 μs (0 allocations: 0 bytes)
  • f2 version: 1.430 μs (0 allocations: 0 bytes)

For me the gap between the explicit loop and MappedArrays is very small.

@stevengj (Member)

f4(x, y) = mapreduce(((x, y),) -> x == y, +, zip(x, y)) also has no allocations and is faster, but it's still not as fast as the loop.

@oscardssmith (Member, Author)

That would be pretty easy to do automatically, right?
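A sketch of what that automatic rewrite could look like (an illustration, not the actual patch; my_mapreduce is a hypothetical name):

# Hypothetical rewrite: forward the multi-iterator case to the
# single-iterator method over zip, splatting each tuple into f.
# Base.splat(f) turns f into args -> f(args...).
my_mapreduce(f, op, itrs...; kw...) = mapreduce(Base.splat(f), op, zip(itrs...); kw...)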

oscardssmith added a commit to oscardssmith/julia that referenced this issue Dec 30, 2020
@oscardssmith (Member, Author)

@timholy is there any good reason why Generators are slow? Is there low-hanging fruit? If you have a place to point me, I'd be glad to take a look.

albert-oliver added a commit to albert-oliver/NewRivaraProductions.jl that referenced this issue May 6, 2022
   - A faster version is implemented.

   - mapreduce seems slow, see:
     JuliaLang/julia#38558

   - Now it returns an SVector
albert-oliver added a commit to albert-oliver/NewRivaraProductions.jl that referenced this issue May 6, 2022
   - A new function get_xyz_uvw(m) has been implemented. It preallocates
     the matrices xyz and uvw and then does a manual (threaded) loop to
     fill the matrices. As seen in the previous commit, mapreduce is
     slow (JuliaLang/julia#38558), so we have
     chosen this "manual" approach.

   - A new function get_VTKconec(m) has been implemented. It follows the
     same ideas as in get_xyz_uvw.

   - Finally, the function write_vtk(m, filename) has been modified
     accordingly.
@wheeheee commented Dec 1, 2023

It's been a while, but I think I get reasonable performance from this:

f1(x, y) = mapreduce(==, +, x, y)

function f2(x, y)
    total = 0
    for i in eachindex(x, y)
        total += @inbounds x[i] == y[i]
    end
    return total
end

function mapreduce_same(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    mapreduce(tup_f, op, eachindex(A...); kw...)
end

f3(x, y) = mapreduce_same(==, +, x, y; init=0)

My benchmarks give

julia> @btime f1($x, $y);
  2.489 μs (1 allocation: 10.19 KiB)
julia> @btime f2($x, $y);
  1.250 μs (0 allocations: 0 bytes)
julia> @btime f3($x, $y);
  1.380 μs (0 allocations: 0 bytes)

There are a lot of problems with this, like breaking current behaviour when (at least) some of the Vectors have lengths different from the other arrays, and not working with arbitrary indices. Aside from that, if I try to use LinearIndices like this

function mapreduce_cart(f, op, A::Vararg{AbstractArray, N}; kw...) where N
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    inds = eachindex(A...)
    mapreduce(tup_f, op, LinearIndices(inds); kw...)
end
f4(x, y) = mapreduce_cart(==, +, x, y; init=0)

Performance is degraded and the function now allocates.

julia> @btime f4($x, $y);
  1.550 μs (3 allocations: 80 bytes)

Why does this happen?

@mcabbott (Contributor)

@wheeheee that's not a bad idea, similar to MappedArrays but in a few lines.

I'm not sure what LinearIndices is there for; I think that instead you want to reshape in cases where eachindex gives a UnitRange, so that inds always has the same size as the first array. Then reductions with dims will also work:

function mapreduce_same_reshape(f, op, A::Vararg{AbstractArray,N}; kw...) where {N}
    tup_f(i) = f(ntuple(j -> @inbounds(A[j][i]), Val(N))...)
    # Reshaping a UnitRange to the axes of A[1] allows reductions with dims=2 etc.;
    # when eachindex gives CartesianIndices, reshaping makes it slow and isn't necessary.
    ei = eachindex(A...)
    inds = ei isa AbstractUnitRange ? reshape(ei, axes(A[1])) : ei
    mapreduce(tup_f, op, inds; kw...)
end

mapreduce_same_reshape(*, +, [1 2; 3 4], [5 6; 7 8]; dims=1) == [26 44]

Benchmarks for all the above functions here:

https://gist.github.com/mcabbott/4746e69f321909c3ba209518dc0447bb

@wheeheee commented Feb 24, 2024

For the life of me, I can't remember what I used the LinearIndices for...
Anyway, omitting @inbounds in mapreduce_same, for this function at least, makes it faster. I don't know what compiler magic does this, but auto-vectorization is still quite brittle.
Also, I remember noticing that sometimes adding type parameters to f and op made the function faster for >= 3 arrays.
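A sketch of that type-parameter variant (an illustration of the idiom, with a hypothetical name): annotating f::F and op::OP forces Julia to specialize on the function arguments, which it may otherwise skip when they are only captured by a closure.

# Hypothetical sketch: ::F and ::OP force specialization on the function
# arguments; @inbounds is omitted, per the observation above that it can be faster.
function mapreduce_same_spec(f::F, op::OP, A::Vararg{AbstractArray,N}; kw...) where {F,OP,N}
    tup_f(i) = f(ntuple(j -> A[j][i], Val(N))...)
    mapreduce(tup_f, op, eachindex(A...); kw...)
end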
