From 7f4fd20014e2a32b7f404c11525e1b5531c94520 Mon Sep 17 00:00:00 2001
From: Richard Shadrach
Date: Sat, 24 Feb 2024 07:42:30 -0500
Subject: [PATCH 1/3] DOC: Whatsnew notable bugfix on groupby behavior with unobserved groups

---
 doc/source/whatsnew/v3.0.0.rst | 54 ++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst
index 1f9f5e85a6c4b..e99a30bbe1751 100644
--- a/doc/source/whatsnew/v3.0.0.rst
+++ b/doc/source/whatsnew/v3.0.0.rst
@@ -43,10 +43,58 @@ Notable bug fixes

 These are bug fixes that might have notable behavior changes.

-.. _whatsnew_300.notable_bug_fixes.notable_bug_fix1:
+.. _whatsnew_300.notable_bug_fixes.groupby_unobs_and_na:
+
+Improved behavior in groupby for ``observed=False``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A number of bugs have been fixed due to improved handling of unobserved groups. In previous versions of pandas, a single grouping with :meth:`.SeriesGroupBy.agg` when used with a user-defined function would pass the unobserved groups, resulting in ``0`` below.
+
+.. ipython:: python
+
+    df = pd.DataFrame(
+        {
+            "key1": pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"]),
+            "key2": [1, 1, 1, 2],
+            "values": [1, 2, 3, 4],
+        }
+    )
+    gb = df.groupby("key1", observed=False)
+    gb["values"].apply(lambda x: x.sum())
+
+However this was not the case when using multiple groupings.
+
+.. ipython:: python
+
+    In [1]: gb = df.groupby(["key1", "key2"], observed=False)
+    In [2]: gb["values"].apply(lambda x: x.sum())
+    Out[2]:
+    key1  key2
+    a     1       3.0
+          2       NaN
+    b     1       3.0
+          2       4.0
+    c     1       NaN
+          2       NaN
+    Name: values, dtype: float64
+
+Now using multiple groupings will also passed the unobserved groups to the provided function.
+
+.. ipython:: python
+    gb = df.groupby(["key1", "key2"], observed=False)
+    gb["values"].apply(lambda x: x.sum())
+
+Similarly:
+
+ - In previous versions of pandas the method :meth:`.DataFrameGroupBy.sum` would result in ``0`` for unobserved groups, but :meth:`.DataFrameGroupBy.prod`, :meth:`.DataFrameGroupBy.all`, and :meth:`.DataFrameGroupBy.any` would all result in NA values. Now these methods result in ``1``, ``True``, and ``False`` respectively.
+ - :meth:`.DataFrameGroupBy.groups` did not include unobserved groups and now does.
+
+These improvements also fixed certain bugs in groupby:
+
+ - :meth:`DataFrameGroupBy.nunique` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`52848`)
+ - :meth:`DataFrameGroupBy.agg` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`36698`)
+ - :meth:`DataFrameGroupBy.sum` would have incorrect value when there are multiple groupings, unobserved groups, and non-numeric data (:issue:`43891`)

-notable_bug_fix1
-^^^^^^^^^^^^^^^^

 .. _whatsnew_300.notable_bug_fixes.notable_bug_fix2:


From 3aa18efa3e58da1805220056b1c423afc25e64c6 Mon Sep 17 00:00:00 2001
From: Richard Shadrach
Date: Sat, 24 Feb 2024 07:58:11 -0500
Subject: [PATCH 2/3] Finish up

---
 doc/source/whatsnew/v3.0.0.rst | 42 +++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst
index e99a30bbe1751..78b7177404b9b 100644
--- a/doc/source/whatsnew/v3.0.0.rst
+++ b/doc/source/whatsnew/v3.0.0.rst
@@ -45,10 +45,12 @@ These are bug fixes that might have notable behavior changes.

 .. _whatsnew_300.notable_bug_fixes.groupby_unobs_and_na:

-Improved behavior in groupby for ``observed=False``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Improved behavior in groupby for ``observed=False`` (:issue:`56966`)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-A number of bugs have been fixed due to improved handling of unobserved groups. In previous versions of pandas, a single grouping with :meth:`.SeriesGroupBy.agg` when used with a user-defined function would pass the unobserved groups, resulting in ``0`` below.
+A number of bugs have been fixed due to improved handling of unobserved groups. All remarks in this section equally impact :class:`.SeriesGroupBy`.
+
+In previous versions of pandas, a single grouping with :meth:`.DataFrameGroupBy.apply` or :meth:`.DataFrameGroupBy.agg` would pass the unobserved groups to the provided function, resulting in ``0`` below.

 .. ipython:: python

@@ -60,29 +62,29 @@ A number of bugs have been fixed due to improved handling of unobserved groups.
         }
     )
     gb = df.groupby("key1", observed=False)
-    gb["values"].apply(lambda x: x.sum())
+    gb[["values"]].apply(lambda x: x.sum())

-However this was not the case when using multiple groupings.
+However this was not the case when using multiple groupings, resulting in ``NaN`` below.

 .. ipython:: python

     In [1]: gb = df.groupby(["key1", "key2"], observed=False)
-    In [2]: gb["values"].apply(lambda x: x.sum())
+    In [2]: gb[["values"]].apply(lambda x: x.sum())
     Out[2]:
-    key1  key2
-    a     1       3.0
-          2       NaN
-    b     1       3.0
-          2       4.0
-    c     1       NaN
-          2       NaN
-    Name: values, dtype: float64
+               values
+    key1 key2
+    a    1        3.0
+         2        NaN
+    b    1        3.0
+         2        4.0
+    c    1        NaN
+         2        NaN

 Now using multiple groupings will also passed the unobserved groups to the provided function.

 .. ipython:: python
     gb = df.groupby(["key1", "key2"], observed=False)
-    gb["values"].apply(lambda x: x.sum())
+    gb[["values"]].apply(lambda x: x.sum())

 Similarly:

@@ -93,8 +95,9 @@ These improvements also fixed certain bugs in groupby:

  - :meth:`DataFrameGroupBy.nunique` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`52848`)
  - :meth:`DataFrameGroupBy.agg` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`36698`)
- - :meth:`DataFrameGroupBy.sum` would have incorrect value when there are multiple groupings, unobserved groups, and non-numeric data (:issue:`43891`)
-
+ - :meth:`DataFrameGroupBy.sum` would have incorrect values when there are multiple groupings, unobserved groups, and non-numeric data (:issue:`43891`)
+ - :meth:`DataFrameGroupBy.groups` with ``sort=False`` would sort groups; they now occur in the order they are observed (:issue:`56966`)
+  - :meth:`.DataFrameGroupBy.value_counts` would produce incorrect results when used with some categorical and some non-categorical groupings and ``observed=False`` (:issue:`56016`)

 .. _whatsnew_300.notable_bug_fixes.notable_bug_fix2:

@@ -307,12 +310,9 @@ Plotting

 Groupby/resample/rolling
 ^^^^^^^^^^^^^^^^^^^^^^^^
+- Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupby.groups` that would not respect groupby argument ``dropna`` (:issue:`55919`)
 - Bug in :meth:`.DataFrameGroupBy.quantile` when ``interpolation="nearest"`` is inconsistent with :meth:`DataFrame.quantile` (:issue:`47942`)
 - Bug in :meth:`DataFrame.ewm` and :meth:`Series.ewm` when passed ``times`` and aggregation functions other than mean (:issue:`51695`)
-- Bug in :meth:`.DataFrameGroupBy.groups` and :meth:`.SeriesGroupby.groups` that would not respect groupby arguments ``dropna`` and ``sort`` (:issue:`55919`, :issue:`56966`, :issue:`56851`)
-- Bug in :meth:`.DataFrameGroupBy.nunique` and :meth:`.SeriesGroupBy.nunique` would fail with multiple categorical groupings when ``as_index=False`` (:issue:`52848`)
-- Bug in :meth:`.DataFrameGroupBy.prod`, :meth:`.DataFrameGroupBy.any`, and :meth:`.DataFrameGroupBy.all` would result in NA values on unobserved groups; they now result in ``1``, ``False``, and ``True`` respectively (:issue:`55783`)
-- Bug in :meth:`.DataFrameGroupBy.value_counts` would produce incorrect results when used with some categorical and some non-categorical groupings and ``observed=False`` (:issue:`56016`)
 -

 Reshaping

From 94813c83b290d8363c3fa1dfb60ff51f838ecf40 Mon Sep 17 00:00:00 2001
From: Richard Shadrach
Date: Sun, 25 Feb 2024 07:41:46 -0500
Subject: [PATCH 3/3] refinements and fixes

---
 doc/source/whatsnew/v3.0.0.rst | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/doc/source/whatsnew/v3.0.0.rst b/doc/source/whatsnew/v3.0.0.rst
index 78b7177404b9b..da03c8f50086a 100644
--- a/doc/source/whatsnew/v3.0.0.rst
+++ b/doc/source/whatsnew/v3.0.0.rst
@@ -45,10 +45,10 @@ These are bug fixes that might have notable behavior changes.

 .. _whatsnew_300.notable_bug_fixes.groupby_unobs_and_na:

-Improved behavior in groupby for ``observed=False`` (:issue:`56966`)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Improved behavior in groupby for ``observed=False``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-A number of bugs have been fixed due to improved handling of unobserved groups. All remarks in this section equally impact :class:`.SeriesGroupBy`.
+A number of bugs have been fixed due to improved handling of unobserved groups (:issue:`55738`). All remarks in this section equally impact :class:`.SeriesGroupBy`.

 In previous versions of pandas, a single grouping with :meth:`.DataFrameGroupBy.apply` or :meth:`.DataFrameGroupBy.agg` would pass the unobserved groups to the provided function, resulting in ``0`` below.

@@ -56,17 +56,18 @@ In previous versions of pandas, a single grouping with :meth:`.DataFrameGroupBy.

     df = pd.DataFrame(
         {
-            "key1": pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"]),
+            "key1": pd.Categorical(list("aabb"), categories=list("abc")),
             "key2": [1, 1, 1, 2],
             "values": [1, 2, 3, 4],
         }
     )
+    df
     gb = df.groupby("key1", observed=False)
     gb[["values"]].apply(lambda x: x.sum())

 However this was not the case when using multiple groupings, resulting in ``NaN`` below.

-.. ipython:: python
+.. code-block:: ipython

     In [1]: gb = df.groupby(["key1", "key2"], observed=False)
     In [2]: gb[["values"]].apply(lambda x: x.sum())
@@ -80,23 +81,24 @@ However this was not the case when using multiple groupings, resulting in ``NaN`
     c    1        NaN
          2        NaN

-Now using multiple groupings will also passed the unobserved groups to the provided function.
+Now using multiple groupings will also pass the unobserved groups to the provided function.

 .. ipython:: python
+
     gb = df.groupby(["key1", "key2"], observed=False)
     gb[["values"]].apply(lambda x: x.sum())

 Similarly:

- - In previous versions of pandas the method :meth:`.DataFrameGroupBy.sum` would result in ``0`` for unobserved groups, but :meth:`.DataFrameGroupBy.prod`, :meth:`.DataFrameGroupBy.all`, and :meth:`.DataFrameGroupBy.any` would all result in NA values. Now these methods result in ``1``, ``True``, and ``False`` respectively.
- - :meth:`.DataFrameGroupBy.groups` did not include unobserved groups and now does.
+  - In previous versions of pandas the method :meth:`.DataFrameGroupBy.sum` would result in ``0`` for unobserved groups, but :meth:`.DataFrameGroupBy.prod`, :meth:`.DataFrameGroupBy.all`, and :meth:`.DataFrameGroupBy.any` would all result in NA values. Now these methods result in ``1``, ``True``, and ``False`` respectively.
+  - :meth:`.DataFrameGroupBy.groups` did not include unobserved groups and now does.

 These improvements also fixed certain bugs in groupby:

- - :meth:`DataFrameGroupBy.nunique` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`52848`)
- - :meth:`DataFrameGroupBy.agg` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`36698`)
- - :meth:`DataFrameGroupBy.sum` would have incorrect values when there are multiple groupings, unobserved groups, and non-numeric data (:issue:`43891`)
- - :meth:`DataFrameGroupBy.groups` with ``sort=False`` would sort groups; they now occur in the order they are observed (:issue:`56966`)
+  - :meth:`.DataFrameGroupBy.nunique` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`52848`)
+  - :meth:`.DataFrameGroupBy.agg` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`36698`)
+  - :meth:`.DataFrameGroupBy.sum` would have incorrect values when there are multiple groupings, unobserved groups, and non-numeric data (:issue:`43891`)
+  - :meth:`.DataFrameGroupBy.groups` with ``sort=False`` would sort groups; they now occur in the order they are observed (:issue:`56966`)
   - :meth:`.DataFrameGroupBy.value_counts` would produce incorrect results when used with some categorical and some non-categorical groupings and ``observed=False`` (:issue:`56016`)

 .. _whatsnew_300.notable_bug_fixes.notable_bug_fix2: