by does not generate correct results #2208

y1my1 · 2020-04-24T12:48:30Z

After seeing this, I tried to use the new by syntax. An MWE is:

using DataFrames
using Statistics
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = 1:8);

julia> df
 8×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 2     │ 1     │ 2     │
│ 3   │ 3     │ 2     │ 3     │
│ 4   │ 4     │ 1     │ 4     │
│ 5   │ 1     │ 2     │ 5     │
│ 6   │ 2     │ 1     │ 6     │
│ 7   │ 3     │ 2     │ 7     │
│ 8   │ 4     │ 1     │ 8     │
julia> by(df, :, :a, :c=>mean)
 8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 1.0     │
│ 2   │ 2     │ 1     │ 2     │ 2.0     │
│ 3   │ 3     │ 2     │ 3     │ 3.0     │
│ 4   │ 4     │ 1     │ 4     │ 4.0     │
│ 5   │ 1     │ 2     │ 5     │ 5.0     │
│ 6   │ 2     │ 1     │ 6     │ 6.0     │
│ 7   │ 3     │ 2     │ 7     │ 7.0     │
│ 8   │ 4     │ 1     │ 8     │ 8.0     │

or

julia> by(df, :, [:a], :c=>mean)
 8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 1.0     │
│ 2   │ 2     │ 1     │ 2     │ 2.0     │
│ 3   │ 3     │ 2     │ 3     │ 3.0     │
│ 4   │ 4     │ 1     │ 4     │ 4.0     │
│ 5   │ 1     │ 2     │ 5     │ 5.0     │
│ 6   │ 2     │ 1     │ 6     │ 6.0     │
│ 7   │ 3     │ 2     │ 7     │ 7.0     │
│ 8   │ 4     │ 1     │ 8     │ 8.0     │

Note: DataFrames has been updated to [a93c6f00] DataFrames v0.20.0 #master (https://github.com/JuliaData/DataFrames.jl.git)

What I would expect is:

julia> by(df, :, [:a], :c=>mean)
 8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3.0     │
│ 2   │ 2     │ 1     │ 2     │ 4.0     │
│ 3   │ 3     │ 2     │ 3     │ 5.0     │
│ 4   │ 4     │ 1     │ 4     │ 6.0     │
│ 5   │ 1     │ 2     │ 5     │ 3.0     │
│ 6   │ 2     │ 1     │ 6     │ 4.0     │
│ 7   │ 3     │ 2     │ 7     │ 5.0     │
│ 8   │ 4     │ 1     │ 8     │ 6.0     │
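For reference, the expected c_mean column can be reproduced without DataFrames. The following is only a minimal sketch in plain Julia, and group_means is a hypothetical helper introduced for illustration, not a DataFrames function:

```julia
using Statistics

# Same data as the a and c columns of the MWE.
a = repeat([1, 2, 3, 4], outer=2)
c = collect(1:8)

# Hypothetical helper: for every row, the mean of vals over all rows
# that share that row's key. The result keeps the original row order.
function group_means(keys, vals)
    return [mean(vals[findall(==(k), keys)]) for k in keys]
end

c_mean = group_means(a, c)
println(c_mean)
```

Rows 1 and 5 both have a == 1, so both receive mean(1, 5) == 3.0, matching the table above.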

BTW: I can track down this issue, though maybe a little later. Thanks.

bkamins · 2020-04-24T12:53:59Z

You are using the wrong order of arguments; the grouping column should come first:

julia> by(df, :a, :, :c=>mean)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3.0     │
│ 2   │ 1     │ 2     │ 5     │ 3.0     │
│ 3   │ 2     │ 1     │ 2     │ 4.0     │
│ 4   │ 2     │ 1     │ 6     │ 4.0     │
│ 5   │ 3     │ 2     │ 3     │ 5.0     │
│ 6   │ 3     │ 2     │ 7     │ 5.0     │
│ 7   │ 4     │ 1     │ 4     │ 6.0     │
│ 8   │ 4     │ 1     │ 8     │ 6.0     │
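One way to see why the original call returned c_mean equal to c: assuming the second positional argument of by is read as the grouping columns, passing : there groups by a, b, and c together, so every row forms its own group. A minimal sketch in plain Julia (no DataFrames; group_means is a hypothetical helper, not a library function):

```julia
using Statistics

a = repeat([1, 2, 3, 4], outer=2)
b = repeat([2, 1], outer=4)
c = collect(1:8)

# Hypothetical helper: per-row mean of vals over rows sharing the same key.
function group_means(keys, vals)
    return [mean(vals[findall(==(k), keys)]) for k in keys]
end

# Grouping by all columns: each (a, b, c) tuple is unique, so each group
# contains exactly one row and its mean is just that row's c.
all_keys = collect(zip(a, b, c))
mean_by_all = group_means(all_keys, c)

# Grouping by a alone: rows i and i + 4 share a key and are averaged.
mean_by_a = group_means(a, c)

println(mean_by_all)
println(mean_by_a)
```

The first result reproduces the c_mean column reported in the original post, the second the expected one.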

However, this does not produce exactly what you want, because the rows come out in a different order. This can be fixed by calling, e.g.:

julia> sort!(by(df, :a, :, :c=>mean), :c)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3.0     │
│ 2   │ 2     │ 1     │ 2     │ 4.0     │
│ 3   │ 3     │ 2     │ 3     │ 5.0     │
│ 4   │ 4     │ 1     │ 4     │ 6.0     │
│ 5   │ 1     │ 2     │ 5     │ 3.0     │
│ 6   │ 2     │ 1     │ 6     │ 4.0     │
│ 7   │ 3     │ 2     │ 7     │ 5.0     │
│ 8   │ 4     │ 1     │ 8     │ 6.0     │

Admittedly, in some cases you could expect the order to be preserved without having to call sort!.
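Note that sorting on :c restores the original order here only because c happens to be 1:8, i.e. it doubles as a row index. A minimal sketch of the same reordering in plain Julia, using the grouped output shown above as hard-coded vectors:

```julia
# Columns c and c_mean in the grouped row order returned by by(df, :a, ...).
c_grouped      = [1, 5, 2, 6, 3, 7, 4, 8]
c_mean_grouped = [3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 6.0]

# Permutation that sorts by c; since c encodes the original row number,
# applying it puts the rows back in their original order.
p = sortperm(c_grouped)
c_restored      = c_grouped[p]
c_mean_restored = c_mean_grouped[p]

println(c_restored)
println(c_mean_restored)
```

For data without such a column, one option (an assumption, not something suggested in this thread) would be to add an explicit row-number column before grouping and sort on that afterwards.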

The latter is tracked in #2172, so I am closing this issue (but please comment if I have missed something from your original post).

y1my1 · 2020-04-24T13:15:16Z

I see. The original post on Discourse suggested the wrong order. Thanks.

bkamins closed this as completed Apr 24, 2020
