Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add select, select!, transform and transform! for GroupedDataFrame #2172

Closed
bkamins opened this issue Apr 6, 2020 · 9 comments
Closed

Add select, select!, transform and transform! for GroupedDataFrame #2172

bkamins opened this issue Apr 6, 2020 · 9 comments
Labels
grouping non-breaking The proposed change is not breaking
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Apr 6, 2020

Allow to pass GroupedDataFrame to select, select!, transform and transform!. It should work like combine but returne a GroupedDataFrame.

@bkamins bkamins added grouping non-breaking The proposed change is not breaking labels Apr 6, 2020
@bkamins bkamins added this to the 1.x milestone Apr 6, 2020
@bkamins
Copy link
Member Author

bkamins commented Apr 8, 2020

Relevant discussion: #1727 (comment)

@bkamins
Copy link
Member Author

bkamins commented Apr 8, 2020

@nalimilan

Well it's up to the caller to ensure the function returns rows in the same order as the input. We do that all the time: even transform(df, :x => f) returns something meaningless if f(x) = shuffle(x).

Sorry for not being precise - I just wanted a quick example.

assume the following options of f return value in by(df, :g, :x => f => :outcol):

  • f returns a single value - then it is not a problem - we create a row related to a first occurence of grouping value in df and this is a sensible thing to do
  • f returns number of rows exactly the same as in group - here potentially again it is not ambiguous
  • f possibly returns sometimes less/more rows than it is in group - then we have a problem - there is no way to determine what should be done when combining groups into a data frame (we can come up with "some" rule - but it will not be obvious what to do - how to order the rows then - as you do not know which rows were dropped or where the rows were inserted)

My opinion is that our API was not designed to handle this case.

My current thinking is, that if we want such a functionality we should have another function that could be called e.g. rowwise (I am not attached to the name), with the following API (I write here a simplified form as it would have to be worked out in detail - in particular I write row_fun and agg_fun but we could use all syntax we now use in select here):

rowwise(df::AbstractDataFrame, grouping_col, row_fun, agg_fun)

and agg_fun be used to calculate group aggregates and row_fun would be passed two things:

  1. values from a single row
  2. values produced by aggregation function for the group this row belongs to

I do not see other way to avoid the ambiguity in corner cases (but I might be missing something, so please comment here).

However, in general - I came to the conclusion that adding an :id column and then sort!-ing the result is easy enough. That is why originally I recommended just to stop thinking of supporting it.

@nalimilan
Copy link
Member

I still don't really see the problem. For by, rows should be ordered according to the group which generated them (in the same order as keys(gd)), and then within each group the most natural choice is to keep them in the order in which they have been returned by the user-provided function (whatever the number of rows). (Of course, when : was passed as a selector, newly computed columns are required to have the same number of rows as the input data frame.)

What I agree is more tricky is things like select! and transform!, where returned values must be matched with original rows. But in these cases we should require the returned number of rows to be the same as the input (just like for transform). Maybe your problem is that this would disallow patterns that you think are useful?

@bkamins
Copy link
Member Author

bkamins commented Apr 8, 2020

For by I agree.

But select! and transform! would not be supported for GroupDataFrame (it is immutable so we do not allow mutating it). I know I have written in the top that I want to include select! and transform! - but this simply means that I have not thought about it enough 😢.

Let us just recap if we are on the same page what select and transform should do on GroupedDataFrame. I understood that this would be an extension of map on GroupedDataFrame, i.e. they take a GroupedDataFrame and return a GroupedDataFrame, but allow all transformations that combine allows.

So in short - select and transform on a GroupedDataFrame is combine on it but returning a GroupedDataFrame and not a DataFrame. Do you see it the same way?

If yes - then - the discussion about keeping of the row order of the parent data frame is immaterial for select - in the result of calling it you will not see it anyway.

But I think I see a solution to what we discuss in changing a kwarg to combine and possibly DataFrame constructor taking a GroupedDataFrame. This kwarg, call it grouporder, would have two values:

  • true (it does what we have now) - i.e. produces a grouped data frame by merging the return values in grouping order
  • false - that looks back which rows in the original data frame given groups had and restores that order (this is doable as GroupedDataFrame remembers the original indices so we can restore them when doing combine or DataFrame)

The only thing there is that in map, select and transform we would have to make sure that the new parent data frame also "inherits" this original order. Currently this is not assured (i.e. the parent after map is reordered):

julia> df = DataFrame(g=[1,2,1,2]);

julia> gdf = groupby(df, :g);

julia> mgdf = map(x -> (a=x.g,), gdf);

julia> parent(mgdf)
4×2 DataFrame
│ Row │ g     │ a     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 1     │ 1     │
│ 3   │ 2     │ 2     │
│ 4   │ 2     │ 2     │

@nalimilan
Copy link
Member

But select! and transform! would not be supported for GroupDataFrame (it is immutable so we do not allow mutating it). I know I have written in the top that I want to include select! and transform! - but this simply means that I have not thought about it enough cry.

We could apply the transformation to the parent. That's the most useful behavior anyway, as that's the only way to add a column based on grouped operations without copying other columns.

Let us just recap if we are on the same page what select and transform should do on GroupedDataFrame. I understood that this would be an extension of map on GroupedDataFrame, i.e. they take a GroupedDataFrame and return a GroupedDataFrame, but allow all transformations that combine allows.

So in short - select and transform on a GroupedDataFrame is combine on it but returning a GroupedDataFrame and not a DataFrame. Do you see it the same way?

If yes - then - the discussion about keeping of the row order of the parent data frame is immaterial for select - in the result of calling it you will not see it anyway.

Yes, I agree. Though I think the row order is significant as you can observe it via DataFrame(gd) and parent(gd), right?

But I think I see a solution to what we discuss in changing a kwarg to combine and possibly DataFrame constructor taking a GroupedDataFrame. This kwarg, call it grouporder, would have two values:

* `true` (it does what we have now) - i.e. produces a grouped data frame by merging the return values in grouping order

* `false` - that looks back which rows in the original data frame given groups had and restores that order (this is doable as `GroupedDataFrame` remembers the original indices so we can restore them when doing `combine` or `DataFrame`)

The only thing there is that in map, select and transform we would have to make sure that the new parent data frame also "inherits" this original order. Currently this is not assured (i.e. the parent after map is reordered):
[...]

Maybe that could be the difference between combine and select/transform? combine is useful when you want to summarize information for each group; so it makes sense to get results ordered by group. select and transform preserve the grouping information, so they are only useful if you get several rows for each group, with the most frequent case being as many rows as the input (this is even required for transform, including the situation where scalars are broadcasted); so it makes sense to preserve the original order if possible. This would be similar to what dplyr does with summarize vs. mutate.

Of course it could be useful to have a keyword argument to all functions -- but with a different default. Not sure about map.

@bkamins
Copy link
Member Author

bkamins commented Apr 8, 2020

Yes - having a different default would make sense, especially as grouporder=true (current combine) will be faster than grouporder=false (target of select/transform). I understand that if someone passes grouporder=false we will require that function strictly returns the number of rows per group equal to the original number of groups or a scalar which will be broadcasted to that number of rows. Right?

@nalimilan
Copy link
Member

I understand that if someone passes grouporder=false we will require that function strictly returns the number of rows per group equal to the original number of groups or a scalar which will be broadcasted to that number of rows. Right?

That's what dplyr does with mutate, so I guess that's OK in practice. We can do that at least as a first step and see later whether we want to be more flexible.

Throwing an error will protect from the annoying case in which you return a variable number of rows, and for some reason it happens to match the original number of rows.

@bkamins
Copy link
Member Author

bkamins commented Apr 9, 2020

Throwing an error will protect from the annoying case in which you return a variable number of rows, and for some reason it happens to match the original number of rows.

Yes - this is what I meant (actually we have exactly this bug in 0.20.2 and now it got fixed on master)

This was referenced Apr 24, 2020
@bkamins
Copy link
Member Author

bkamins commented May 5, 2020

closed via #2214

@bkamins bkamins closed this as completed May 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
grouping non-breaking The proposed change is not breaking
Projects
None yet
Development

No branches or pull requests

2 participants