-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API for groupwise column transformation #1727
Comments
It would be great to be able to do this kind of things. I have three comments:
|
I think it would make sense to support But the question is that there are two orthogonal dimensions for this kind of function:
So regarding your point 1, we don't really need to difference between |
If we introduce
it will not be super fast, but I think it will be reasonably convenient. One just has to remember that the order of rows in the result will be changed. |
That doesn't sound ideal: aside from being a bit long to type, it kills type stability and requires making copies. Maybe we could allow something like |
So this is what I feel today we will be converging to :). I would even say that this might ideally be:
i.e. we would allow group-by operations in Then, if we settled for Then we would have a really SQL-like select. (if we added |
I like the idea of
But love the inclusion of |
After discussing this again with @bkamins, there are four dimensions to take into account:
The principle I think we should respect is that each of these options should be enabled (either via separate functions or view keyword arguments) in the same way for Below are two possible solutions which offer all of these combinations in a consistent way. A. Use
The advantage of this approach is that B. Use
The advantage of this approach is that it's very systematic: all operations work on whole columns by default, and there's a general way to switch to element-wise. Its drawback is that it's less convenient to apply element-wise operations with
|
Thank you for an excellent write-up. Some comments:
Also comments for "design 1":
I think we have to solve this issue within
But if we went this way In summary - personally I am in favor of option 1. It would mean the following work to do:
But, as this is opinionated other feedback is welcome (CC @pdeffebach). |
Good catch for the typo in the bottom-left cell. I agree with most of your points... except on the conclusion. :-)
Well that would go against the definition of data frames as collections of rows, so I don't think that would be a good idea. Why would you want to do that?
I would find it really weird to have
This is uncontroversial I think. |
If we say that functions on DF and GDF should behave the same then this is a natural consequence. I agree it goes against the rule. So maybe simply we should not define EDIT actually if we have
I meant them to operate consistently in both scenarios (i.e. element-wise). However, I would not strongly oppose for then working on whole columns, as, as you have noted, you can always broadcast in whole-column approach to get an element-wise code (with multiple columns it is a bit more tricky, but it is also essentially a broadcast). So - in short - I am OK to go this way. We have to decide two things:
|
I don't have strong opinions here, and definitely do not oppose strongly :) Overall, it feels like "element-based" is more natural if the data is ungrouped, and "column-based" is more natural if the data is grouped. JuliaDB works like this, i.e. However, element-based transform is much more relevant for JuliaDB for two reasons:
To exemplify point 2, in In the end, the case where this API is most essential (that is to say "less easily replaced by just writing the obvious thing") for The only concern that I have is that |
@bkamins The main reason to define @piever Yes, I agree JuliaDB is less suited to whole-column operations than DataFrames. But I'm not sure I agree with the idea that row-based is more natural for ungrouped data: in some cases it's quite natural to apply
I think broadcasting scalars would indeed make sense. That would be especially useful when applying |
It would make sense, but we currently disallow it in The rule here should be the same we have in DataFrames.jl now:
@nalimilan - so could I ask you to write down the summary what you think is best given this discussion (and hopefully we all will be in agreement)? I can then fill-in the gaps in the rules if there are any. |
I'm not sure how completely you're aiming to be able to cover the SQL select spec, but beyond GROUPBY SELECT, SQL also specifies window functions which can operate on multiple partitions (groups) in the same selection statement.
Depending on the objective here this would really require a reconsideration for the |
@dgkf - I think that the reason why @nalimilan wants to keep the whole-column approach as the default is to make this kind of operation easy if I understand what you raise correctly (at least for a single partition). Actually the key challenge we face is the following - if we allowed grouping in |
I see. The reason I raised this particular example is that SQL's window functions can apply to multiple groupings at once, so the select(df, :,
Col(:a, by = :subgrouping1) => standardize => :a_s,
Col(:x, :y, by = :subgrouping2) => + => :z,
by = :grouping) The point being that the groupings might have to be able to be defined for each "select => transform => as" input individually (similar to SQL's A very general point that I'm trying to touch on is that the SQL spec is quite involved and if the target is to have good coverage of that space, then I really think that a lot more of the spec needs to be considered when crafting a Julian API. That objective is ambitious and exciting, especially because it could very easily be used as a front end to database queries instead of the common practice of writing SQL in a string and passing it to a connection object, unifying syntax between operating on local DataFrames and with a database connection. Is the goal to eventually have this sort of coverage or is it to strike a balance between syntactic simplicity and implementing some SQL basics? |
Thank you for this comment. My thinking was that it would be currently doable by:
i.e. essentially you could pass the sub-grouping column to the function and rely on it to perform proper transformations (however, this would not be convenient and your proposal looks much nicer as you do not have to define custom functions).
Here I can tell you what my goal is (as people might differ in opinions): try to keep syntactic simplicity but have an internal design that allows us to gradually add more features in the future (possibly having a full coverage of some SQL flavor is a nice long term target). Even I think we are not yet trying to add However, thank you for raising this point as currently the design @nalimilan proposes relies on the assumption that by default Having written all this:
|
@nalimilan - after some time when I look at your original tables actually I think that we can drop |
Given what I learned when implementing #2080 (comment) maybe we can go for option 2 (i.e. |
See tidyverse/dplyr#3279 about row ordering. |
What we decide here is potentially breaking so adding 1.0 milestone to it. |
I am closing this issue as currently we settled on #2172 (the rest has already landed). The only thing not covered in current design is:
for each row in the source data frame. This pattern does not fit in the current design well and will have to be done in two steps:
However, if someone feels that we need a special functionality for this please open a new issue with a proposal of the API. |
I think this kind of pattern will be possible with |
In general this will be problematic because the moment you do The problem is that assume you run:
and now you can write
or
What you have to do is (there are more options for this, this is the simplest I think):
which I think is not that bad. Essentially the problem is that our current API allows "one row" or "multiple rows" returned by the function in apply step (and I actually like this flexibility). But this means that in the case of "multiple rows" you do not have control if apply function has not reordered the rows. |
Well it's up to the caller to ensure the function returns rows in the same order as the input. We do that all the time: even |
I am moving the discussion to #2172. |
This Follows what @nalimilan identified in a discussion on Discourse.
A common operation when transforming a column is groupwise standardization of data.
Assume you have a data frame with grouping column
:g
and data column:x
you want to mutate:x
by subtracting groupwise mean of:x
and divide by groupwise standard deviation of:x
.There are several options for this API. One of the ideas is the following change to
by
andcombine
functions:groupwise::Bool
keyword argument which defaults tofalse
and then we have a current behavior;groupwise
istrue
the following changes to the behavior happen:sort=false
then the existing rows are returned in the order present in the original data frame; ifsort=true
then they are sorted by groupsf
are assumed to return a single column with the length equal to the group length; this can be simply checked because what is proposed here behaves as-if user manually added the following entries to transformation listcol = :col => identity
for all non-grouping columns:col
in the parent data frameThen the standardization operation described above would be written e.g. as:
as opposed to what we have to do now:
(note that the order of the result is currently destroyed in comparison to the parent data frame and in the proposed implementation it would be retained if we set
sort
tofalse
)This proposal is intended to be minimally disruptive. Please comment as you might have other ideas what would be a good way to allow for this kind of operation. Currently users need to use
@transform
from DataFramesMeta.jl:its drawback is that it always destroys the order of the parent (and I think it is nice to be able to guarantee that we can preserve the order of the parent - but also please comment if you would find it useful).
The text was updated successfully, but these errors were encountered: