Higher Order Methods API #41112

Open
rhshadrach opened this issue Apr 23, 2021 · 5 comments
Labels: API Design · Apply (Apply, Aggregate, Transform, Map) · Master Tracker (High level tracker for similar issues)

Comments


rhshadrach commented Apr 23, 2021

Context

For methods that accept UDFs, I've been looking at a number of issues that would involve changes to long-standing behavior, some of which I feel are quite significant:

For some of these, it is difficult to emit a FutureWarning. For example, I'm not sure how we would emit a warning for changing the behavior of agg with UDFs. I've been working toward consolidating the methods for apply/agg/transform et al. in pandas.core.apply, resolving inconsistencies as they are encountered. This is very much a work in progress, and I expect to encounter more.

Proposal

Add a new submodule pandas.core.homs, with HOMs standing for "Higher Order Methods": methods that take a callable as their primary argument. Also add a new, experimental implementation behind the option use_hom_api. Whenever possible, any changes will be contained within the pandas.core.homs module. Progress can then be made without worrying about deprecation warnings and changing behaviors. When it's ready, we can proceed as follows:

  • Add docs on the option, classifying it as experimental.
  • Classify the option as stable.
  • Change the default of the option to no-default, emitting a FutureWarning when it is not set that the behavior of many methods will change and that users should set it to True (a rough sketch of this stage follows the list).
  • Change the default of the option to True, keeping the option so users can switch back if necessary.
  • Remove the option and the old implementation.
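
To make the no-default stage concrete, here is a minimal sketch of how the warning could be emitted; the option name follows the proposal above, and registering it with a default of None to mean "not set" is purely an assumption of this sketch:

import warnings

import pandas as pd
from pandas._config import config as cf

# Hypothetical registration; None stands in for "the user never set the option".
cf.register_option("use_hom_api", None, "Use the experimental Higher Order Methods API")

def _use_hom_api() -> bool:
    # Sketch of the no-default stage: warn users who never chose a value that
    # the behavior of many UDF-accepting methods is going to change.
    val = pd.get_option("use_hom_api")
    if val is None:
        warnings.warn(
            "The behavior of apply/agg/transform with UDFs will change in a "
            "future version; set pd.set_option('use_hom_api', True) to opt in "
            "now, or False to keep the current behavior and silence this warning.",
            FutureWarning,
        )
        return False
    return bool(val)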

Goals

  • Consistent behaviors between different Higher Order Methods
  • Consistent API between Higher Order Methods
  • Excisability: Should the project die out, it needs to be easy to excise the experimental code. All changes need to be behind if get_option('use_homs_api'): (a minimal sketch of this gating pattern is below).
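
As a sketch of what that gating could look like inside an existing method (all names other than get_option are hypothetical, and pandas.core.homs does not exist yet, so this is illustrative only):

from pandas import get_option

def _dispatch_apply(obj, func):
    # Every behavior change is reachable only behind the option, so deleting
    # pandas.core.homs and this branch excises the experiment cleanly.
    if get_option("use_homs_api"):
        from pandas.core.homs import hom_apply  # hypothetical new implementation
        return hom_apply(obj, func)
    return obj.apply(func)  # stand-in for the existing, unchanged code path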
@jbrockmendel

Might fit on this list: #12653 API: ban mutation within groupby.apply

@rhshadrach

Thanks @jbrockmendel - agreed and added.


rhshadrach commented Aug 12, 2021

Mentioned on the call today was how agg and apply use each other (e.g. #42833), and I was asked for some examples where the results differ. After looking through the code again, they end up with the same result in more cases than I had originally thought, but here are two differences I found. I do believe there are more (and I'm happy to find some if these are deemed insignificant).

ser = pd.Series([], dtype=float)
print(ser.agg('sum'))  # Gives: 0.0
print(ser.apply('sum'))  # Gives: Series([], dtype: float64)

ser = pd.Series([1, 2, 3])
print(ser.agg(lambda x: x.sum()))  # Gives: 6
print(ser.apply(lambda x: x.sum()))  # Raises AttributeError

cc @jbrockmendel

@jbrockmendel

Thanks for finding the example. So if you made the changes discussed on the call, which of the behaviors in that example would change?


rhshadrach commented Aug 20, 2021

This got a bit long, so I'll give the short version first: I think the desired behavior of agg is pretty clear, but the desired behavior of apply sometimes isn't. I'm working on a PoC, but it probably won't be ready for some time. The long version is...

I don't think any of these should change. The problem is mostly with list-likes.

ser = pd.Series([1, 2, 3])
print(ser.apply([lambda x: x.sum()]))

gives

<lambda>    6
dtype: int64

which is inconsistent with the example above. This is because apply on lists is implemented by just calling agg.

So we need to implement apply on lists. What you can't do is "just do the same thing as agg on lists, but use apply with each element of the list instead". This is because the way agg currently works on lists is to break up a DataFrame into individual Series, and then use agg with each element of the list on each Series. This poses a problem because apply and agg have different behaviors on DataFrames and Series. agg (when provided with a reducer) will always reduce dimension, DataFrame -> Series -> Scalar. On the other hand, while df.apply(foo) (when provided with a reducer) will behave just like agg (applying foo to each column), ser.apply(foo) will not (in most cases). It will instead attempt to apply foo to each element of the Series individually.
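
A small illustration of that difference (the frame and values here are arbitrary):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
ser = pd.Series([1, 2, 3])

print(df.agg("sum"))                    # reduces DataFrame -> Series: a=3, b=7
print(df.apply(lambda col: col.sum()))  # same: the UDF receives each column (a Series)
print(ser.apply(lambda x: x * 10))      # Gives: 10, 20, 30 - the UDF receives each element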

The upshot (if the above paragraph was even understandable) is that if you were to implement apply with lists similar to how agg with lists is implemented, df.apply([foo]) today would not behave at all like df.apply(foo). The former would do ser.apply(foo) to each Series that makes up df, which in turn passes each cell of df to foo individually, whereas df.apply(foo) passes each Series to foo.

With this, I think the right thing to do is to just implement apply with lists as concatenating [df.apply(a) for a in arg] (where arg here is a list); a rough sketch is below. But then you also want to do this with agg, otherwise the results of agg and apply will have MultiIndex columns with their levels in a different order (aggregator names in level 0 vs level 1), and even if you swap the levels in the MultiIndex, the ordering within the levels is not the same. I've looked into this and am convinced that if we allow partial failure and duplicate column names, it is not possible to reorder reliably. The information you need, namely which columns have disappeared due to partial failure, just isn't there.
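
A rough sketch of that approach, ignoring the partial-failure and duplicate-column complications mentioned above (the helper name is hypothetical):

import pandas as pd

def apply_listlike(obj, funcs):
    # Run apply once per callable and concatenate, so that obj.apply([f])
    # agrees with obj.apply(f) for each f in the list.
    results = {getattr(f, "__name__", str(f)): obj.apply(f) for f in funcs}
    return pd.concat(results, axis=1)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(apply_listlike(df, [lambda col: col.sum()]))  # one '<lambda>' column holding the column sums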

Changing gears, the current problem with Series.agg is that it first tries apply, and then falls back to applying the given UDF to the entire Series. For numeric dtypes, when the UDF is an aggregator, I can't come up with any examples where the "try apply first" step actually succeeds. With object dtype it is easy though:

ser = pd.Series([[1], [2], [3]])
print(ser.agg(lambda x: sum(x)))

This results in

0    1
1    2
2    3
dtype: int64

where I think the right result is the list [1, 2, 3]. This is easy to fix - just don't use apply at all in Series.agg.
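
A minimal sketch of that fix, with a hypothetical helper that hands the whole Series straight to the UDF:

import pandas as pd

def series_agg(ser, func):
    # No elementwise apply attempt first; the UDF always sees the entire Series.
    return func(ser)

print(series_agg(pd.Series([1, 2, 3]), lambda x: x.sum()))  # Gives: 6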

Finally, the examples:

ser = pd.Series([], dtype=float)
print(ser.agg('sum'))  # Gives: 0.0
print(ser.apply('sum'))  # Gives: Series([], dtype: float64)

appear to me to be correct, but are inconsistent with ser.apply('sum') when the Series is non-empty. When it is non-empty, apply is identical to agg. To me it is also inconsistent that ser.apply(lambda x: x.sum()) applies to each element but ser.apply('sum') does not. But what should ser.apply('sum') do? The only answer I have is that it should be equivalent to ser.apply(lambda x: x.sum()), but that seems quite useless. Maybe that's okay though?
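
For contrast, and to illustrate the statement above, with a non-empty Series the two currently give the same result:

ser = pd.Series([1, 2, 3])
print(ser.agg('sum'))    # Gives: 6
print(ser.apply('sum'))  # Gives: 6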
