Higher Order Methods API #41112
Comments
#12653 (API: ban mutation within groupby.apply) might fit on this list.
Thanks @jbrockmendel - agreed and added.
On the call today it was mentioned how agg and apply use each other (e.g. #42833), and examples were requested where the results differ. After looking through the code again, they end up with the same result in more cases than I had originally thought, but here are two differences I found. I believe there are more (and I'm happy to find some if these are deemed insignificant).
Thanks for finding the example. So if you made the changes discussed on the call, which of the behaviors in that example would change?
This got a bit long, so I'll just say first that the short version is that I think the desired behavior of agg is pretty clear, but apply sometimes isn't. I'm working on a PoC but it probably won't be ready for some time. The long version is... I don't think any of these should change. The problem is mostly with list-likes.
gives
which is inconsistent with the example above. This is because apply on lists is implemented by just calling agg, so we need to implement apply on lists. What you can't do is "just do the same thing as agg on lists, but use apply with each element of the list instead". The way agg currently works on lists is to break a DataFrame up into individual Series, and then use agg with each element of the list on each Series. This poses a problem because apply and agg behave differently on DataFrames and Series.
agg (when provided with a reducer) will always reduce dimension: DataFrame -> Series -> scalar. On the other hand, while df.apply(foo) (when provided with a reducer) behaves just like agg (applying foo to each column), ser.apply(foo) will not (in most cases). It will attempt to apply foo to each row of the Series individually.
The upshot (if the above paragraph was even understandable) is that if you were to implement apply with lists similar to how agg with lists is implemented,
With this, I think the right thing to do is to just implement apply with lists as concatenating [df.apply(a) for a in arg] (where arg here is a list). But then you also want to do this with agg. Otherwise the MultiIndex columns that come out of agg and apply will have different orderings of levels (aggregator names in level 0 vs level 1), and even if you swap the levels in the MultiIndex, the ordering within the levels is not the same. I've looked into this and am convinced that if we allow partial failure and duplicate column names, it is not possible to reorder reliably. The information you need, namely which columns have disappeared due to partial failure, just isn't there.
Changing gears, the current problem with Series.agg is that it first tries apply, and then falls back to applying the given UDF to the entire Series.
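To make the df.apply versus ser.apply distinction and the concatenation idea above concrete, here is a minimal sketch. The helper names (got_series, double, negate) are made up for illustration, non-reducing UDFs are used so that MultiIndex columns actually appear, and df.transform with a list stands in for the existing list-like path. It is not how pandas implements any of this, and the exact output of the list-like path may differ between versions.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
ser = df["a"]


def got_series(x):
    # Reports what the UDF was handed: a whole Series or a single value.
    return isinstance(x, pd.Series)


# df.apply passes each column to the UDF as a Series ...
per_column = df.apply(got_series)
# ... whereas ser.apply passes each value of the Series on its own.
per_value = ser.apply(got_series)


def double(x):
    # Hypothetical non-reducing UDF.
    return x * 2


def negate(x):
    # Hypothetical non-reducing UDF.
    return -x


funcs = [double, negate]

# Sketch of "concatenate [df.apply(a) for a in arg]": the function names
# become the outer column level, the original column labels the inner one.
concatenated = pd.concat(
    [df.apply(f) for f in funcs],
    axis=1,
    keys=[f.__name__ for f in funcs],
)

# Existing list-like path, for comparing the order of the column levels.
listlike = df.transform(funcs)

print(per_column.tolist(), per_value.tolist())
print(list(concatenated.columns))
print(list(listlike.columns))
```

Comparing the two printed column lists shows which level the function names end up in for each approach, which is the level-ordering mismatch described above.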
For numeric types when the UDF is an aggregator, I can't come up with any examples where the "try apply first" is actually successful. Using object dtypes is easy though:
This results in
where I think the right result is the list
Finally, the examples:
appear to me to be correct, but are inconsistent with
Context
For methods that accept UDFs, I've been looking at a number of issues that would involve long-standing behavior changes, some I feel are quite significant:
For some of these, it is difficult to emit a FutureWarning. For example, I'm not sure how we would emit a warning when changing the behavior of agg for UDFs. I've been working toward consolidating the methods for apply/agg/transform et al. in pandas.core.apply, resolving inconsistencies as they are encountered. This is very much a work in progress, and I expect to encounter more.
Proposal
Add a new submodule pandas.core.homs, with HOMs standing for "Higher Order Methods". These are methods that take a callable as their primary argument. Also add a new, experimental implementation behind the option use_hom_api. Whenever possible, any changes will be contained within the pandas.core.homs module. Progress can then be made without worrying about deprecation warnings and changing behaviors. When it's ready, we can then progress as:
Goals
if get_option('use_homs_api'):
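As a rough illustration of the option-gated approach in the proposal, the sketch below dispatches between an existing and an experimental code path. Everything in it is hypothetical: the tiny _OPTIONS registry stands in for the pandas options machinery, the option name is taken from the get_option fragment above, and the two implementations are placeholders rather than real pandas behavior.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for an options registry (not the pandas one).
_OPTIONS: Dict[str, bool] = {"use_homs_api": False}


def get_option(name: str) -> bool:
    return _OPTIONS[name]


def set_option(name: str, value: bool) -> None:
    if name not in _OPTIONS:
        raise KeyError(name)
    _OPTIONS[name] = value


def _legacy_apply(values: List[Any], func: Callable[[Any], Any]) -> Any:
    # Placeholder for the existing behavior: call func on each value.
    return [func(v) for v in values]


def _homs_apply(values: List[Any], func: Callable[[Any], Any]) -> Any:
    # Placeholder for the experimental behavior: call func on the whole object.
    return func(values)


def apply(values: List[Any], func: Callable[[Any], Any]) -> Any:
    # Single public entry point; the option selects the implementation,
    # so the experimental path can change without touching the default one.
    if get_option("use_homs_api"):
        return _homs_apply(values, func)
    return _legacy_apply(values, func)


default_result = apply([1, 2, 3], str)
set_option("use_homs_api", True)
experimental_result = apply([1, 2, 3], str)
set_option("use_homs_api", False)

print(default_result)
print(experimental_result)
```

Keeping the switch in one entry point means users who never set the option see no change, while the experimental module can evolve without deprecation warnings.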