by does not generate correct results #2208

Closed
y1my1 opened this issue Apr 24, 2020 · 2 comments

Comments

@y1my1

y1my1 commented Apr 24, 2020

After seeing this, I tried to use the new by syntax. An MWE is:

using DataFrames
using Statistics
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
               b = repeat([2, 1], outer=[4]),
               c = 1:8);

julia> df
8×3 DataFrame
│ Row │ a     │ b     │ c     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2     │ 1     │
│ 2   │ 2     │ 1     │ 2     │
│ 3   │ 3     │ 2     │ 3     │
│ 4   │ 4     │ 1     │ 4     │
│ 5   │ 1     │ 2     │ 5     │
│ 6   │ 2     │ 1     │ 6     │
│ 7   │ 3     │ 2     │ 7     │
│ 8   │ 4     │ 1     │ 8     │

julia> by(df, :, :a, :c=>mean)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 1.0     │
│ 2   │ 2     │ 1     │ 2     │ 2.0     │
│ 3   │ 3     │ 2     │ 3     │ 3.0     │
│ 4   │ 4     │ 1     │ 4     │ 4.0     │
│ 5   │ 1     │ 2     │ 5     │ 5.0     │
│ 6   │ 2     │ 1     │ 6     │ 6.0     │
│ 7   │ 3     │ 2     │ 7     │ 7.0     │
│ 8   │ 4     │ 1     │ 8     │ 8.0     │

or

julia> by(df, :, [:a], :c=>mean)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 1.0     │
│ 2   │ 2     │ 1     │ 2     │ 2.0     │
│ 3   │ 3     │ 2     │ 3     │ 3.0     │
│ 4   │ 4     │ 1     │ 4     │ 4.0     │
│ 5   │ 1     │ 2     │ 5     │ 5.0     │
│ 6   │ 2     │ 1     │ 6     │ 6.0     │
│ 7   │ 3     │ 2     │ 7     │ 7.0     │
│ 8   │ 4     │ 1     │ 8     │ 8.0     │

Note: DataFrames has been updated to [a93c6f00] DataFrames v0.20.0 #master (https://github.com/JuliaData/DataFrames.jl.git)

What I would expect is:

julia> by(df, :, [:a], :c=>mean)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3.0     │
│ 2   │ 2     │ 1     │ 2     │ 4.0     │
│ 3   │ 3     │ 2     │ 3     │ 5.0     │
│ 4   │ 4     │ 1     │ 4     │ 6.0     │
│ 5   │ 1     │ 2     │ 5     │ 3.0     │
│ 6   │ 2     │ 1     │ 6     │ 4.0     │
│ 7   │ 3     │ 2     │ 7     │ 5.0     │
│ 8   │ 4     │ 1     │ 8     │ 6.0     │

BTW: I can track down this issue, but maybe a little later. Thanks.

@bkamins
Member

bkamins commented Apr 24, 2020

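You are using the wrong order of arguments. The grouping columns come right after the data frame, so by(df, :, :a, :c=>mean) groups by all columns, which puts every row in its own group and makes c_mean equal to c. Use this instead: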

julia> by(df, :a, :, :c=>mean)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3.0     │
│ 2   │ 1     │ 2     │ 5     │ 3.0     │
│ 3   │ 2     │ 1     │ 2     │ 4.0     │
│ 4   │ 2     │ 1     │ 6     │ 4.0     │
│ 5   │ 3     │ 2     │ 3     │ 5.0     │
│ 6   │ 3     │ 2     │ 7     │ 5.0     │
│ 7   │ 4     │ 1     │ 4     │ 6.0     │
│ 8   │ 4     │ 1     │ 8     │ 6.0     │

This does not produce exactly what you want, though, because the rows come out in a different order. This can be fixed by calling, e.g.:

julia> sort!(by(df, :a, :, :c=>mean), :c)
8×4 DataFrame
│ Row │ a     │ b     │ c     │ c_mean  │
│     │ Int64 │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 2     │ 1     │ 3.0     │
│ 2   │ 2     │ 1     │ 2     │ 4.0     │
│ 3   │ 3     │ 2     │ 3     │ 5.0     │
│ 4   │ 4     │ 1     │ 4     │ 6.0     │
│ 5   │ 1     │ 2     │ 5     │ 3.0     │
│ 6   │ 2     │ 1     │ 6     │ 4.0     │
│ 7   │ 3     │ 2     │ 7     │ 5.0     │
│ 8   │ 4     │ 1     │ 8     │ 6.0     │

Admittedly, in some cases one would expect the order to be preserved without having to call sort!.
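To make the expected c_mean column concrete, here is a minimal sketch in plain Julia that computes the per-group mean of c and writes it back to every row in the original row order. It deliberately avoids DataFrames, so it says nothing about how by is implemented; the names group_mean and c_mean are chosen here for illustration only.

```julia
using Statistics

# Same columns as in the MWE, held as plain vectors (no DataFrames needed).
a = repeat([1, 2, 3, 4], outer=2)
c = collect(1:8)

# Mean of c within each group of a.
group_mean = Dict(g => mean(c[a .== g]) for g in unique(a))

# Look up each row's group mean, keeping the original row order.
c_mean = [group_mean[g] for g in a]
```

Because the lookup walks over a in its original order, no sort! is needed afterwards. Later DataFrames.jl releases are understood to offer transform(groupby(df, :a), :c => mean) for the same order-preserving result, but that concerns versions not covered in this thread and should be checked against the installed version.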

The latter is tracked in #2172, so I am closing this issue (but please comment if I have missed something from your original post).

@bkamins bkamins closed this as completed Apr 24, 2020
@y1my1
Author

y1my1 commented Apr 24, 2020

I see. The original post on Discourse suggested the wrong order. Thanks.
