Higher Order Methods API #41112

Open
rhshadrach opened this issue Apr 23, 2021 · 5 comments
Labels: API Design · Apply (Apply, Aggregate, Transform, Map) · Master Tracker (High level tracker for similar issues)

Comments


rhshadrach commented Apr 23, 2021

Context

For methods that accept UDFs, I've been looking at a number of issues that would involve changes to long-standing behavior, some of which I feel are quite significant:

For some of these, it is difficult to emit a FutureWarning. For example, I'm not sure how we would emit a warning for changing the behavior of agg with UDFs. I've been working toward consolidating the methods for apply/agg/transform et al. in pandas.core.apply, resolving inconsistencies as they are encountered. This is very much a work in progress, and I expect to encounter more.

Proposal

Add a new submodule pandas.core.homs, with HOMs standing for "Higher Order Methods": methods that take a callable as their primary argument. Also add a new, experimental implementation behind the option use_hom_api. Whenever possible, any changes will be contained within the pandas.core.homs module. Progress can then be made without worrying about deprecation warnings and changing behaviors. When it's ready, we can proceed as follows:

  • Add docs on the option, classifying it as experimental.
  • Classify the option as stable.
  • Change the default of the option to no-default, emitting a FutureWarning when it is not set that the behavior of many methods will change and that users should set it to True (a rough sketch of this stage follows the list).
  • Change the default of the option to True, keeping the option so users can switch back if necessary.
  • Remove the option and the old implementation.
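
To make the no-default stage concrete, here is a minimal sketch of how the warning could be emitted; the option name follows the proposal above, and registering it with a default of None to mean "not set" is purely an assumption of this sketch:

import warnings

import pandas as pd
from pandas._config import config as cf

# Hypothetical registration; None stands in for "the user never set the option".
cf.register_option("use_hom_api", None, "Use the experimental Higher Order Methods API")

def _use_hom_api() -> bool:
    # Sketch of the no-default stage: warn users who never chose a value that
    # the behavior of many UDF-accepting methods is going to change.
    val = pd.get_option("use_hom_api")
    if val is None:
        warnings.warn(
            "The behavior of apply/agg/transform with UDFs will change in a "
            "future version; set pd.set_option('use_hom_api', True) to opt in "
            "now, or False to keep the current behavior and silence this warning.",
            FutureWarning,
        )
        return False
    return bool(val)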

Goals

  • Consistent behaviors between different Higher Order Methods
  • Consistent API between Higher Order Methods
  • Excisability: Should the project die out, it needs to be easy to excise the experimental code. All changes need to be behind if get_option('use_homs_api'): (a minimal sketch of this gating pattern is below).
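
As a sketch of what that gating could look like inside an existing method (all names other than get_option are hypothetical, and pandas.core.homs does not exist yet, so this is illustrative only):

from pandas import get_option

def _dispatch_apply(obj, func):
    # Every behavior change is reachable only behind the option, so deleting
    # pandas.core.homs and this branch excises the experiment cleanly.
    if get_option("use_homs_api"):
        from pandas.core.homs import hom_apply  # hypothetical new implementation
        return hom_apply(obj, func)
    return obj.apply(func)  # stand-in for the existing, unchanged code path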
@jbrockmendel

Might fit on this list: #12653 API: ban mutation within groupby.apply

@rhshadrach

Thanks @jbrockmendel - agreed and added.


rhshadrach commented Aug 12, 2021

Mentioned on the call today was how agg and apply use each other (e.g. #42833), and I was asked for some examples where the results differ. After looking through the code again, they end up with the same result in more cases than I had originally thought, but here are two differences I found. I do believe there are more (and I'm happy to find some if these are deemed insignificant).

ser = pd.Series([], dtype=float)
print(ser.agg('sum'))  # Gives: 0.0
print(ser.apply('sum'))  # Gives: Series([], dtype: float64)

ser = pd.Series([1, 2, 3])
print(ser.agg(lambda x: x.sum()))  # Gives: 6
print(ser.apply(lambda x: x.sum()))  # Raises AttributeError

cc @jbrockmendel

@jbrockmendel

Thanks for finding the example. So if you made the changes discussed on the call, which of the behaviors in that example would change?


rhshadrach commented Aug 20, 2021

This got a bit long, so I'll give the short version first: I think the desired behavior of agg is pretty clear, but the desired behavior of apply sometimes isn't. I'm working on a PoC, but it probably won't be ready for some time. The long version is...

I don't think any of these should change. The problem is mostly with list-likes.

ser = pd.Series([1, 2, 3])
print(ser.apply([lambda x: x.sum()]))

gives

<lambda>    6
dtype: int64

which is inconsistent with the example above. This is because apply on lists is implemented by just calling agg.

So we need to implement apply on lists. What you can't do is "just do the same thing as agg on lists, but use apply with each element of the list instead". This is because the way agg currently works on lists is to break up a DataFrame into individual Series, and then use agg with each element of the list on each Series. This poses a problem because apply and agg have different behaviors on DataFrames and Series. agg (when provided with a reducer) will always reduce dimension, DataFrame -> Series -> Scalar. On the other hand, while df.apply(foo) (when provided with a reducer) will behave just like agg (applying foo to each column), ser.apply(foo) will not (in most cases). It will instead attempt to apply foo to each element of the Series individually.
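
A small illustration of that difference (the frame and values here are arbitrary):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
ser = pd.Series([1, 2, 3])

print(df.agg("sum"))                    # reduces DataFrame -> Series: a=3, b=7
print(df.apply(lambda col: col.sum()))  # same: the UDF receives each column (a Series)
print(ser.apply(lambda x: x * 10))      # Gives: 10, 20, 30 - the UDF receives each element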

The upshot (if the above paragraph was even understandable) is that if you were to implement apply with lists similar to how agg with lists is implemented, df.apply([foo]) today would not behave at all like df.apply(foo). The former would do ser.apply(foo) to each Series that makes up df, which in turn passes each cell of df to foo individually, whereas df.apply(foo) passes each Series to foo.

With this, I think the right thing to do is to just implement apply with lists as concatenating [df.apply(a) for a in arg] (where arg here is a list); a rough sketch is below. But then you also want to do this with agg, otherwise the results of agg and apply will have MultiIndex columns with their levels in a different order (aggregator names in level 0 vs level 1), and even if you swap the levels in the MultiIndex, the ordering within the levels is not the same. I've looked into this and am convinced that if we allow partial failure and duplicate column names, it is not possible to reorder reliably. The information you need, namely which columns have disappeared due to partial failure, just isn't there.
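
A rough sketch of that approach, ignoring the partial-failure and duplicate-column complications mentioned above (the helper name is hypothetical):

import pandas as pd

def apply_listlike(obj, funcs):
    # Run apply once per callable and concatenate, so that obj.apply([f])
    # agrees with obj.apply(f) for each f in the list.
    results = {getattr(f, "__name__", str(f)): obj.apply(f) for f in funcs}
    return pd.concat(results, axis=1)

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(apply_listlike(df, [lambda col: col.sum()]))  # one '<lambda>' column holding the column sums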

Changing gears, the current problem with Series.agg is that it first tries apply, and then falls back to applying the given UDF to the entire Series. For numeric dtypes, when the UDF is an aggregator, I can't come up with any examples where the "try apply first" step actually succeeds. With object dtype it is easy though:

ser = pd.Series([[1], [2], [3]])
print(ser.agg(lambda x: sum(x)))

This results in

0    1
1    2
2    3
dtype: int64

where I think the right result is the list [1, 2, 3]. This is easy to fix - just don't use apply at all in Series.agg.
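
A minimal sketch of that fix, with a hypothetical helper that hands the whole Series straight to the UDF:

import pandas as pd

def series_agg(ser, func):
    # No elementwise apply attempt first; the UDF always sees the entire Series.
    return func(ser)

print(series_agg(pd.Series([1, 2, 3]), lambda x: x.sum()))  # Gives: 6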

Finally, the examples:

ser = pd.Series([], dtype=float)
print(ser.agg('sum'))  # Gives: 0.0
print(ser.apply('sum'))  # Gives: Series([], dtype: float64)

appear to me to be correct, but are inconsistent with ser.apply('sum') when the Series is non-empty. When it is non-empty, apply is identical to agg. To me it is also inconsistent that ser.apply(lambda x: x.sum()) applies to each element but ser.apply('sum') does not. But what should ser.apply('sum') do? The only answer I have is that it should be equivalent to ser.apply(lambda x: x.sum()), but that seems quite useless. Maybe that's okay though?
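
For contrast, and to illustrate the statement above, with a non-empty Series the two currently give the same result:

ser = pd.Series([1, 2, 3])
print(ser.agg('sum'))    # Gives: 6
print(ser.apply('sum'))  # Gives: 6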
