BUG: groupby.agg should always agg #57706
Conversation
I think this is ready for review. Assuming the direction this moves us in is good, we still need to decide whether we are okay with this being a breaking change in 3.0 (my preference) or whether it should be deprecated. If we do go the deprecation route, it will be noisy (many cases where results will be the same, but we can't tell, so we need to warn). The only way I see a deprecation working is if we add an option, e.g.

cc @jorisvandenbossche @MarcoGorelli @Dr-Irv @mroeschke for any thoughts.
I could generally be OK with making this a "breaking bug change" for 3.0. Just 2 points:
I'm not going to review the whole code change - it's beyond what I understand about how this all works - but I think the example I wrote here should be in the tests:
Great question - unfortunately the answer is no. We use
Agreed. I believe that's the case here for a UDF, but not strings (e.g.
Certainly - I added this as
I'm not sure. In the example I created, you had 2 functions, one with 1 argument and the other with 2 arguments, and what was being passed to those 2 functions was different because of the number of arguments. I don't see how the test you created confirms that. So maybe you should have a test like this that addresses the particular issue in #33242:

```python
import pandas as pd


def twoargs(x, y):
    assert isinstance(x, pd.Series)
    return x.sum()


def test_two_args(self):
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                       'b': [1, 1, 0, 1, 1, 0],
                       'c': ['x', 'x', 'x', 'z', 'z', 'z'],
                       'd': ['s', 's', 's', 'd', 'd', 'd']})
    df.groupby('c')[['a', 'b']].agg(twoargs, 0)
```
We don't ever inspect the UDF to see what arguments it can take - our logic branches on whether additional arguments are passed in the call to agg. Still, no opposition to an additional test here. Will add.
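For context, a minimal sketch of the branch being described (names like `report_type` and the frame contents are hypothetical, not from the PR; exact behavior depends on the pandas version). On versions affected by #33242, the extra positional argument routes the call through the frame-wise path, so the UDF sees each group as a DataFrame rather than each column as a Series:

```python
import pandas as pd


def report_type(x, *args):
    # Return the type name of whatever the UDF receives for each group/column.
    return type(x).__name__


df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b"],
        "x": [1, 2, 3, 4],
        "y": [5, 6, 7, 8],
    }
)
gb = df.groupby("key")[["x", "y"]]

# No extra arguments: the UDF is applied column-by-column, receiving a Series.
print(gb.agg(report_type))

# An extra positional argument: on affected versions this takes the
# _aggregate_frame path and the UDF receives the whole group as a DataFrame.
print(gb.agg(report_type, 0))
```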
df.agg call passes different things to a custom function depending on whether an unused kwarg is supplied or not (#39169)

Built on top of #57671; the diff should get better once that's merged. Still plan on splitting part of this up as a precursor (and perhaps multiple).
For the closed issues above, tests here still likely need to be added.
The goal here is to make groupby.agg more consistently handle UDFs. Currently:
My opinion is that we should treat all UDFs as reducers, regardless of what they return. Some alternatives:
For 1, we will sometimes guess wrong, and transforming isn't something we should be doing in a method called agg anyways. For 2, we are restricting what I think are valid use cases for aggregation, e.g. gb.agg(np.array) or gb.agg(list); a rough sketch of that use case is below.
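A small illustration of the use case meant here (not code from the PR; `gb` is just shorthand for a groupby object): treating these UDFs as reducers means one row per group, holding that group's values collected into a list or ndarray.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
gb = df.groupby("key")["val"]

# Collecting each group's values is a reduction even though the UDF does not
# return a scalar: one row per group, holding a list of that group's values.
print(gb.agg(list))

# Depending on the pandas version, an ndarray-returning UDF may instead raise
# (e.g. "Must produce aggregated value"); under the behavior proposed here it
# would be treated as a reduction, like the list case above.
try:
    print(gb.agg(np.array))
except ValueError as err:
    print(f"gb.agg(np.array) raised: {err}")
```

In implementing this, I ran into two issues: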
1. _aggregate_frame fails if non-scalars are returned by the UDF, and also passes all of the selected columns as a DataFrame to the UDF. This is called when there is a single grouping and args or kwargs are provided, or when there is a single grouping and passing the UDF each column individually fails with a ValueError("No objects to concatenate"). This does not seem possible to fix, would be hard to deprecate (could we add a new argument or use a future option?), and is bad enough behavior that it seems to me we should just rip the band-aid off here for 3.0.
2. Resampler.apply is an alias for Resampler.agg, and we do not want to impact Resampler.apply with these changes. For this, I kept the old paths through groupby specifically for resample, and plan to properly deprecate the current method and implement apply (by calling groupby's apply) as part of 3.x development. (Ref: BUG: resample apply is actually aggregate #38463)
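For reference, a small sketch (not part of the diff; data is made up) of the aliasing mentioned in point 2: Resampler.apply currently routes to the same aggregate implementation, which is why the old groupby paths are kept for resample here.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="h")
ser = pd.Series(np.arange(6), index=idx)
r = ser.resample("3h")

# Resampler.apply is currently an alias for Resampler.agg (see #38463),
# so both calls below take the same code path today.
print(r.agg(lambda x: x.sum()))
print(r.apply(lambda x: x.sum()))
```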