-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Decouple Series.apply from Series.agg #53400
REF: Decouple Series.apply from Series.agg #53400
Conversation
Certainly agreed with where this is going, but I'm not sure this is the right way to get there. After the deprecation of the default value, the plan is to then deprecate this argument in its entirety, right? So all uses of this method would need to be changed twice
That can only be merged after the default value of The following is dependent on me being correct above, and if I'm not then it can probably be ignored. The change here and #53325 by itself wouldn't be concerning, but we've been talking about making many changes to the apply/agg/transform API, some of which hinges on the issues here being fixed. It seems complex, noisy for users, and slow to do all this piecemeal to me. This is why I prefer something like #41112. I think we should not attempt both routes simultaneously (piecemeal + something like #41112), so we should decide on a route to move forward. |
It is actually intended to be merged right after this. If you check the code path in this PR for a hypothetical This means that calls to So this PR and #53325 can be implemented as-is, AFICS (though this is obviously quite complex, and could definitely use more eyes on it to verify whether I'm missing something). Implementing #52140 will however require deprecating |
Above I just discuss your first half (before "The following is dependent..."). I think we can maybe discuss the later part in a follow-up, as I guess that discussion can depend on the conclusions for the first part. |
Thanks for the correction; I agree with your assessment. Still, I think my concerns from my previous comment remain. In particular
Do you agree with this line here? |
I may not be experienced enough with the groupby and window methods to know, but I think it's worth looking into if the TBH, this has always been a very complex area and it was only after doing #53362 ( i.e. very recently) that I have started thinking it could be possible to do this with a normal deprecation process. I may still have missed something and be proven wrong of course, but that's part of the discussion... The way I see it that after this PR and #53325 for example def agg(self):
result = super().agg()
if result is None:
obj = self.obj
f = self.f
try:
result = obj.apply(f)
except (ValueError, AttributeError, TypeError):
result = f(self.obj)
else:
msg = (
f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
f"has been deprecated. Use {type(obj).__name__}.transform to "
f"keep behavior unchanged."
)
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
return result The above just deprecates calling But if we change the above to: def agg(self):
result = super().agg()
if result is None:
obj = self.obj
f = self.f
try:
result = obj.apply(f)
except (ValueError, AttributeError, TypeError):
result = f(self.obj)
if not self._is_aggregate_value(result): # aside: how do we know something is an aggregate value?
msg = (
f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
f"has been deprecated. Use {type(obj).__name__}.transform to "
f"keep behavior unchanged."
)
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
return result then returning a non-aggregate value will emit a warning. So the question is if the above is backward compatible and, if the user fix their code to not emit warnings, their code will be compatible with v.3.0.: In pandas v3.0 the method will become: def agg(self):
result = super().agg()
if result is None:
result = self.f(self.obj)
if not self._is_aggregate_value(result):
msg = (
f"using {f} in {type(obj).__name__}.agg cannot aggregate and "
f"has been deprecated. Use {type(obj).__name__}.transform to "
f"keep behavior unchanged."
)
warnings.warn(msg, FutureWarning, stacklevel=find_stack_level())
return result The same can be done similarly with My suspicion is that if the above can be done on |
Indeed, you've made progress where I didn't think it was possible. For agg, I think pandas should not take a stance on what an aggregated value is but rather treat the return from the UDF as if it were a scalar (even when you would typically think it's not). But we can discuss this at a later point.
For UDFs (as opposed to string aliases), these implementations are largely independent of that in Series/DataFrame.
Agreed! Let's move forward with these and see where we get to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, needs tests.
I can see it needs tests for dict_likes, do you have anything else in mind? EDIT: and also |
Yep - was really just thinking by_row |
I've updated the tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests look good!
doc/source/whatsnew/v2.1.0.rst
Outdated
@@ -101,6 +101,7 @@ Other enhancements | |||
- :meth:`DataFrame.unstack` gained the ``sort`` keyword to dictate whether the resulting :class:`MultiIndex` levels are sorted (:issue:`15105`) | |||
- :meth:`SeriesGroupby.agg` and :meth:`DataFrameGroupby.agg` now support passing in multiple functions for ``engine="numba"`` (:issue:`53486`) | |||
- Added ``engine_kwargs`` parameter to :meth:`DataFrame.to_excel` (:issue:`53220`) | |||
- Added a new parameter ``array_ops_only`` to :meth:`Series.apply`. When set to ``True`` the supplied callables will always operate on the whole Series (:issue:`53400`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by_row now; not array_ops_only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, changed.
pandas/core/apply.py
Outdated
if is_groupby: | ||
engine = self.kwargs.get("engine", None) | ||
engine_kwargs = self.kwargs.get("engine_kwargs", None) | ||
kwargs = {"engine": engine, "engine_kwargs": engine_kwargs} | ||
kwds.update({"engine": engine, "engine_kwargs": engine_kwargs}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NBD, but I wonder why the change from kwargs to kwds? In pandas.core we overwhelmingly use kwargs instead of kwds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kwargs would make a line further down excedd 88 lines and be reformatted to fill 3 lines. So a stylistic preference, but not a strong opinion,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think consistency in variable names is more important here.
pandas/core/apply.py
Outdated
@@ -693,8 +717,8 @@ def values(self): | |||
def apply(self) -> DataFrame | Series: | |||
"""compute the results""" | |||
# dispatch to agg | |||
if is_list_like(self.func): | |||
return self.apply_multiple() | |||
if is_list_like(self.func) or is_dict_like(self.func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dicts are considered list-like; no need for the 2nd check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed it. I've changed the comment above instead to explain dictlike go here too.
pandas/core/apply.py
Outdated
@@ -1079,8 +1106,8 @@ def apply(self) -> DataFrame | Series: | |||
return self.apply_empty_result() | |||
|
|||
# dispatch to agg | |||
if is_list_like(self.func): | |||
return self.apply_multiple() | |||
if is_list_like(self.func) or is_dict_like(self.func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests look good!
I've updated the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
yes, pretty weird. |
Thanks @topper-123 |
* BUG: make Series.agg aggregate when possible * fix doc build * deprecate instead of treating as a bug * CLN: Apply.agg_list_like * some cleanups * REF/CLN: func in core.apply (#53437) * REF/CLN: func in core.apply * Remove type-hint * REF: Decouple Series.apply from Series.agg (#53400) * update test * fix issues * fix issues * fix issues --------- Co-authored-by: Richard Shadrach <[email protected]>
if op_name == "apply": | ||
kwargs = {**kwargs, "by_row": False} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@topper-123: shouldn't by_row here be True for backwards compatibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, I'm thinking this should now be self.by_row
when that attribute exists. If a user calls ser.apply(["sum", "mean"], by_row=True)
(or with by_row=False
), shouldn't we be passing the argument down to the next call to apply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right. I'll make a new PR on that.
* BUG: make Series.agg aggregate when possible * fix doc build * deprecate instead of treating as a bug * CLN: Apply.agg_list_like * some cleanups * REF/CLN: func in core.apply (pandas-dev#53437) * REF/CLN: func in core.apply * Remove type-hint * REF: Decouple Series.apply from Series.agg (pandas-dev#53400) * update test * fix issues * fix issues * fix issues --------- Co-authored-by: Richard Shadrach <[email protected]>
This PR makes
Series.apply
not callSeries.agg
, when given a list-like or dict-like, decoupling the two methods (partly) and solves #53325 (comment). This makes the code base clearer by makingSeries.apply
andSeries.agg
going less back and forth between each other, which was very confusing. Merging this PR paves the way for merging #53325 afterwards.To decouple this, I've added a new parameter
by_row
toSeries.apply
. It defaults toTrue
, keeping current behavior for backward compatibility when given a single callable and internally callingapply
withby_row=False
, when given a list-like or dict-like, also for backward compatibility (needed when doing e.g.ser.apply([lambda x: x])
.This parameter is also relevant for #52140, where I proposed adding the parameter to solve a different set of problems with
Series.apply
. If this PR gets accepted, the solution to #52140 will be to changeby_row
fromTrue
toFalse
after a deprecation process (plus removing the parameter altogether even longer term).Also, I've renamed
apply_multiple
toapply_list_or_dict_like
, as I think that name is clearer.EDIT: Originally I called new parameter
array_ops_only
, but I've changed it toby_row