
DEPR: Change default to observed=True in DataFrame.groupby #43999

Closed
Seon82 opened this issue Oct 12, 2021 · 7 comments · Fixed by #51811
Labels
Categorical Categorical Data Type Deprecate Functionality to remove in pandas Groupby

Comments

@Seon82

Seon82 commented Oct 12, 2021

Is your feature request related to a problem?

The default behaviour of pandas.DataFrame.groupby currently differs depending on the type of the groupers: when one of the groupers is categorical, unobserved categories are added to the groupby result by default. This behaviour can be overridden by setting the observed argument to True.
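
A minimal sketch of the difference described here, assuming a small made-up DataFrame (the column names `key` and `value` are illustrative only). `observed` is passed explicitly where the grouper is categorical, so the result does not depend on which default the installed pandas version uses:

```python
import pandas as pd

# "c" is a declared category that never occurs in the data.
df = pd.DataFrame({
    "key": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
    "value": [1, 2, 3],
})

# observed=False: the unobserved category "c" appears as an empty group.
all_categories = df.groupby("key", observed=False)["value"].sum()

# observed=True: only groups that actually occur in the data are returned.
observed_only = df.groupby("key", observed=True)["value"].sum()

# The same data with a plain string grouper never produces a "c" group.
as_str = df.assign(key=df["key"].astype(str)).groupby("key")["value"].sum()

print(list(all_categories.index))  # ['a', 'b', 'c']
print(list(observed_only.index))   # ['a', 'b']
print(list(as_str.index))          # ['a', 'b']
```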

I feel like making the groupby API consistent by default and regardless of the underlying data type would provide a much better user experience.

Describe the solution you'd like

Default to observed=True in pandas.DataFrame.groupby.

API breaking implications

Would break backwards-compatibility.

Describe alternatives you've considered

So far the only option I can think of is to add observed=True to every groupby I write to make sure it will behave correctly no matter what kind of data gets passed to it.
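
The workaround might look something like this. `total_by_key` is a hypothetical helper written for illustration, not part of pandas; the point is only that passing `observed=True` gives the same groups whether or not the grouper is categorical:

```python
import pandas as pd

def total_by_key(df, key):
    # Hypothetical helper: always pass observed=True so categorical and
    # non-categorical groupers produce the same set of groups.
    return df.groupby(key, observed=True)["value"].sum()

plain = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
categorical = plain.assign(
    key=pd.Categorical(plain["key"], categories=["a", "b", "c"])
)

plain_totals = total_by_key(plain, "key")
categorical_totals = total_by_key(categorical, "key")

# Both results contain only the groups "a" and "b".
print(list(plain_totals.index), list(categorical_totals.index))
```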

@Seon82 Seon82 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 12, 2021
@Seon82 Seon82 changed the title ENH: Default to observed=True in DataFrame.groupby ENH: Default to observed=True in DataFrame.groupby Oct 12, 2021
@jreback
Contributor

jreback commented Oct 12, 2021

pls search the tracker; this is a duplicate request

@Seon82
Author

Seon82 commented Oct 12, 2021

Sorry, I completely missed it! And I still seem unable to find it no matter what synonyms I try, would you mind sending a link if you have one handy?

@jreback
Contributor

jreback commented Oct 13, 2021

see #35967 and linked issues

i guess we don't have an actual issue for this (or maybe one of the linked ones)

cc @jseabold made a really good effort here

@rhshadrach rhshadrach added Groupby Categorical Categorical Data Type labels Oct 16, 2021
@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Oct 16, 2021
@PMLP-novo

An alternative would be to determine observed at runtime by default. If creating the groups in the Cartesian way would produce more than, say, 100,000,000 groups, we would automatically switch to observed=True.
In the code, this would mean a default of observed=None. This solution would be backwards compatible for users who have set observed explicitly.
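
A rough sketch of how such a runtime rule could work, under stated assumptions: `resolve_observed` and its `max_groups` parameter are hypothetical names invented here, not pandas API, and the size estimate counts only categorical groupers, which is a simplification.

```python
import math
import pandas as pd

def resolve_observed(df, keys, observed=None, max_groups=100_000_000):
    """Hypothetical rule: pick a value for `observed` when the caller passes None."""
    # An explicit True/False from the caller always wins (backwards compatible).
    if observed is not None:
        return observed
    # Size of the Cartesian product of all declared categories
    # (assumption: non-categorical groupers are ignored in this estimate).
    sizes = [
        len(df[k].cat.categories)
        for k in keys
        if isinstance(df[k].dtype, pd.CategoricalDtype)
    ]
    # Fall back to observed=True only when the product would be too large.
    return math.prod(sizes) > max_groups

df = pd.DataFrame({
    "a": pd.Categorical(["x", "y"], categories=["x", "y", "z"]),
    "b": pd.Categorical(["x", "y"], categories=["x", "y", "z"]),
    "value": [1, 2],
})

# 3 * 3 = 9 possible groups: above a threshold of 5, below the default.
small_limit = resolve_observed(df, ["a", "b"], max_groups=5)
default_limit = resolve_observed(df, ["a", "b"])
result = df.groupby(["a", "b"], observed=small_limit)["value"].sum()
```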

@rhshadrach rhshadrach added Deprecate Functionality to remove in pandas and removed Enhancement labels Jan 24, 2023
@rhshadrach rhshadrach changed the title ENH: Default to observed=True in DataFrame.groupby DEPR: Change default to observed=True in DataFrame.groupby Jan 24, 2023
@rhshadrach
Member

I think we should pursue this deprecation. By defaulting to observed=True, categorical dtypes will default to behaving the same as all other dtypes. This would allow users to take advantage of the performance benefits of categorical (in particular, memory usage if string values are frequently repeated). The default of observed=True is also safer than observed=False in regards to memory and runtime, especially when there are multiple groupings.
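
To illustrate the point about multiple groupings, here is a small sketch with made-up data: two categorical groupers with 50 declared categories each, of which only two combinations occur. The result size with `observed=False` grows with the product of the category counts, not with the data:

```python
import pandas as pd

n_categories = 50
cats = [f"c{i}" for i in range(n_categories)]

# Only two rows, but each grouper declares 50 categories.
df = pd.DataFrame({
    "a": pd.Categorical(["c0", "c1"], categories=cats),
    "b": pd.Categorical(["c0", "c1"], categories=cats),
    "value": [1.0, 2.0],
})

# observed=False: one row per combination of declared categories (50 * 50).
full = df.groupby(["a", "b"], observed=False)["value"].sum()

# observed=True: one row per combination that actually occurs.
observed_only = df.groupby(["a", "b"], observed=True)["value"].sum()

print(len(full))           # 2500
print(len(observed_only))  # 2
```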

cc @jbrockmendel @jorisvandenbossche @mroeschke @topper-123

@topper-123
Contributor

+1 😄 . In addition, I had some arguments in #43999 on this.

I think this is quite a big ergonomic problem, e.g. beginners who don't know about observed=True will often see their memory use explode when doing groupbys, giving Pandas an unjustly negative reputation performance-wise and/or discouraging them from pursuing Pandas further. Experienced users may also forget to set observed=True (I forget this a lot myself), getting a little annoyed at this API each time.

@jbrockmendel
Member

+1 on deprecating the default, see also #30552


7 participants