-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: fix Series.apply(..., by_row), v2. #53601
BUG: fix Series.apply(..., by_row), v2. #53601
Conversation
CC: @rhshadrach. |
I don't see how this addresses #53584 (comment). The issue is with DataFrame.apply:
Users have no argument to specify |
Puff, I was more focused on the fact than in main we currently have: >>> pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])
a b
<lambda> 1 1 While in v2.0 we had: >>> pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}).apply([lambda x: 1])
a b
<lambda> <lambda>
0 1 1
1 1 1
2 1 1 but looks like you're right wrt. deprecation process, if we want to deprecate the current behavior for |
I've made a new version, where we can deprecate the old behavior by deprecating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the removal of by_row=True would also apply in the Series case, is that right? If not, a lot of my requests below are invalid.
pandas/core/apply.py
Outdated
kwargs, | ||
) -> None: | ||
if by_row is not False and by_row != "compat": | ||
raise NotImplementedError(f"by_row={by_row} not implemented") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the docs, I think NotImplementedError
signifies the implementation is currently incomplete, and that users can expect this to be supported once we "get around to it". Can this be a ValueError instead.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this comment somehow. I've changed it now.
pandas/core/frame.py
Outdated
by_row : False or "compat", default "compat" | ||
If "compat", will if possible first translate the func into pandas | ||
methods (e.g. ``Series().apply(np.sum)`` will be translated to | ||
``Series().sum()``). If that doesn't work, will try call to apply again with | ||
``by_row=True`` and if that fails, will call apply again with | ||
``by_row=False`` | ||
If False, the funcs will be passed the whole Series at once. | ||
``by_row`` only has effect when ``func`` is a listlike or dictlike of funcs | ||
and the func isn't a string. | ||
``by_row=True`` has not been implemented, and will raise an | ||
``NotImplenentedError``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to have the callout on this only applying to list/dict-likes at the beginning, and adding in that this is compatible with previous versions. What do you think about being more vague about the compat behavior instead of trying to detail it out? Something like
by_row : False or "compat", default "compat"
Only has effect an when ``func`` is a listlike or dictlike of funcs
on the values that aren't NumPy functions (e.g. ``np.sum``) or
string-aliases for operations (e.g. ``"sum"``).
"compat" is backwards compatible with previous versions and will
sometimes operate by row and sometimes operate on the whole Series at once.
If False, the funcs will be passed the whole Series at once.
I'm also okay with keeping the more detailed description of compat if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference is for the other version, if that's ok. I changed it a bit though.
Unfortunately not. For >>> ser = pd.Series([1, np.nan, 2])
>>> ser.apply(np.sum, by_row=True)
0 1.0
1 NaN
2 2.0
dtype: float64
>>> ser.apply(np.sum, by_row="compat")
3.0 for
These two cases can't be combined into one parameter option, so I don't see another way forward here myself. I'll look into your detailed questions later today. |
In pandas 3.0, we will just have the by_row=False behavior, right? Why does the user need to be able to specify by_row=True? |
I've changed the
This means that for both Series and DataFrame I think this is better? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I see now why you need the third state "_compat"
. We're now calling apply
instead of agg
from agg_or_apply_list_like
when op_name is apply. Agreed this naming is better.
I’ve updated. Yeah, it’s needed for backward compatability, unfortunately. After by_row=“compat” has been removed in v.3.0, this will become a whole lot simpler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just the two open requests from earlier on.
Yes, though that parameter will be deprecated and it's default value changed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @topper-123! |
Nice. I'll scan to see if there are other changes needed for Series/DataFrame.apply/agg, but if not, this concludes the work for these methods, until the deprecation is activated. Series/DataFrame.transform still need some work to get in line with the others, though. I have not intended to work on the groupby methods, because you're taking care of those, right? |
Yes - my next step is to refactor groupby methods to by-and-large not share code with |
Fixes #53400 (comment) by making the
by_row
param take"compat"
as a parameter and using that internally that whenapply
is given dicts og lists.The compatability path is now to deprecate
by_rows=(True|"compat")
at some point, soby_rows=False
will become the default in v3.0.Supercedes #53584.
Code example: