Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: groupby(...).agg should only accept reducers #35725

Open
rhshadrach opened this issue Aug 14, 2020 · 7 comments
Open

ENH: groupby(...).agg should only accept reducers #35725

rhshadrach opened this issue Aug 14, 2020 · 7 comments
Labels
API Design Apply Apply, Aggregate, Transform, Map Enhancement Groupby Needs Discussion Requires discussion from core team before further action

Comments

@rhshadrach
Copy link
Member

rhshadrach commented Aug 14, 2020

Issues where agg is used with non-reducing functions:

I think agg should raise if the function(s) provided are not reducers. This can be tested by if the resulting index is equal to self.grouper.result_index. always reduce regardless of the return value. That is, treat the result of the UDF as if it a scalar (even when it's not).

@rhshadrach rhshadrach added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member API Design Needs Discussion Requires discussion from core team before further action Apply Apply, Aggregate, Transform, Map Groupby and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 14, 2020
@WillAyd
Copy link
Member

WillAyd commented Aug 19, 2020

I would be in favor of this - there's a lack of clarity in the API around when agg and apply should or shouldn't be used.

@jreback

@jreback
Copy link
Contributor

jreback commented Aug 19, 2020

yes absolutely agg should error out if the returned shape is incorrect - iow it should be strict (likely can only deprecate for now though)

similarly transform should be strict (i think it is actually)

@rhshadrach
Copy link
Member Author

take

@rhshadrach
Copy link
Member Author

@jreback @WillAyd - Rather than raising, what do you think about instead making it so that agg always aggs regardless of the result type. Some cases:

  • DataFrame.agg: always produces a Series whose index is the column labels. It could be that the values of the Series are Series/DataFrames themselves.
  • Series.agg: Always produces a single result, but can be a Series/DataFrame.
  • groupby(...).agg: with as_index=True, index is always the groups, precisely one row per group. With as_index=False would be the same, but columns instead of index.

This has the benefit of simplifying/clarifying the behavior of agg, and I think makes it more intuitive, while not forbidding a user from getting e.g. a Series full of DataFrames if they so desire.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2021

i think we should do this but would require a depreciation cycle

having stronger guarantees on agg & transform that they always reduce is great

@attack68
Copy link
Contributor

attack68 commented Feb 1, 2022

I observed that the following:

df = pd.DataFrame({
    "A": Series((1000, 2000), dtype=int),
    "B": Series((1000, 2000), dtype=np.int64),
    "C": Series(["a", "b"]),
})

df.agg(["mean", "sum"])
           A       B    C
mean  1500.0  1500.0  NaN
sum   3000.0  3000.0   ab

now produces FutureWarning: ['C'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning. print(df.agg(["mean", "sum"]))

But in this circumstance I don't want to drop column C because it calculates correctly for at least one used metric, and I don't want to drop the op because it calculates correctly for at least one column.

I tried to define my own method:

def mean2(s:Series):
    try:
        ret = s.mean()
    except Exception:
        ret = pd.NA
    return ret

df.agg([mean2, "sum"])

But this resulted in ValueError: cannot combine transform and aggregation operations

The function works fine with DataFrame.apply as described in docs:

df.apply(mean2, axis=0)
A    1500.0
B    1500.0
C      <NA>
dtype: object

Is this a bug or what is the recommended approach for solving this?

@rhshadrach
Copy link
Member Author

This looks to me to be a separate issue. You have a reducer that fails on certain dtypes, whereas this issue is about supplying a non-aggregating function to agg. If I captured this correctly, can you open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Apply Apply, Aggregate, Transform, Map Enhancement Groupby Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants