Fixed Inconsistent GroupBy Output Shape with Duplicate Column Labels #29124

Merged · 61 commits · Nov 20, 2019
Changes from 50 commits

Commits:
fd53827
Added any_all test
WillAyd Oct 17, 2019
6f60cd0
Added complete test for output shape of elements
WillAyd Oct 17, 2019
0aa1813
Fixed shape issues with bool aggs
WillAyd Oct 17, 2019
4af22f6
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 21, 2019
9756e74
Added test for transformation shape
WillAyd Oct 21, 2019
b675963
Changed transform output
WillAyd Oct 21, 2019
444d542
Updated test with required args
WillAyd Oct 21, 2019
98a9901
Fixed cummin issue
WillAyd Oct 21, 2019
c8648b1
Fixed ohlc one-off handling
WillAyd Oct 21, 2019
1626de1
Fixed output shape of nunique
WillAyd Oct 21, 2019
2a6b8d7
Fixed tshift
WillAyd Oct 21, 2019
12d1ca0
lint and black
WillAyd Oct 21, 2019
5a3fcd7
Added whatsnew
WillAyd Oct 21, 2019
dee597a
Quantile special case hack
WillAyd Oct 21, 2019
a2f1b64
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 22, 2019
fdb36f6
Test fix for np dev
WillAyd Oct 22, 2019
8975009
Used position as labels
WillAyd Oct 22, 2019
9adde1b
Code cleanup
WillAyd Oct 22, 2019
63b35f9
docstrings
WillAyd Oct 22, 2019
9eb7c73
style and more typing
WillAyd Oct 22, 2019
0e49bdb
Fixed parallel test collection
WillAyd Oct 22, 2019
2ad7632
numpy dev warning fix
WillAyd Oct 22, 2019
11fda39
More generic ohlc handling
WillAyd Oct 22, 2019
caf8f11
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 22, 2019
7c4bad9
Converted Dict -> Mapping
WillAyd Oct 22, 2019
b9dca96
doc fixups
WillAyd Oct 22, 2019
a878e67
numpy dev compat
WillAyd Oct 22, 2019
037f9af
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 22, 2019
dd3b1dc
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 23, 2019
a66d37f
Renamed index -> idx
WillAyd Oct 23, 2019
6d50448
Removed rogue TODO
WillAyd Oct 23, 2019
4dd8f5b
Added failing test for multiindex
WillAyd Oct 23, 2019
9d39862
MultiIndex support
WillAyd Oct 23, 2019
d6b197b
More exact typing
WillAyd Oct 23, 2019
3cfd1a2
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 29, 2019
16e9512
jbrockmendel feedback
WillAyd Oct 29, 2019
d297684
Removed failing type
WillAyd Oct 29, 2019
a234bed
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Oct 31, 2019
e3959b0
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 2, 2019
391d106
Aligned annotations and comments
WillAyd Nov 2, 2019
c30ca82
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 6, 2019
3a78051
mypy fix
WillAyd Nov 6, 2019
f4f9e61
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 7, 2019
936591e
asserts and mypy fixes
WillAyd Nov 7, 2019
5dd131e
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 13, 2019
6cc1607
Fix issue in merge resolution
WillAyd Nov 13, 2019
d5ce753
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 17, 2019
ce97eff
Correct merge with 29629
WillAyd Nov 17, 2019
d1a92b4
Fixed issue with dict assignment
WillAyd Nov 17, 2019
23eb803
modernized annotations
WillAyd Nov 17, 2019
4aa9f4c
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 18, 2019
faa08c9
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 18, 2019
b07335b
Changed to assert, removed collections
WillAyd Nov 18, 2019
fb71185
Removed breakpoint and whatsnew space
WillAyd Nov 18, 2019
d7b84a2
Fixed issue with DTA
WillAyd Nov 18, 2019
7934422
typing fixup
WillAyd Nov 18, 2019
c8f0b19
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 19, 2019
acf22d3
packed key into namedtuple
WillAyd Nov 19, 2019
a0aae64
mypy fixes
WillAyd Nov 19, 2019
a9b411a
docstring cleanup
WillAyd Nov 19, 2019
51b8050
Merge remote-tracking branch 'upstream/master' into grpby-dup-cols
WillAyd Nov 19, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -436,6 +436,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrame.groupby` not offering selection by column name when ``axis=1`` (:issue:`27614`)
- Bug in :meth:`DataFrameGroupby.agg` not able to use lambda function with named aggregation (:issue:`27519`)
- Bug in :meth:`DataFrame.groupby` losing column name information when grouping by a categorical column (:issue:`28787`)
- Bug in :meth:`DataFrame.groupby` where ``any``, ``all``, ``nunique`` and transform functions would incorrectly handle duplicate column labels (:issue:`21668`)

Reshaping
^^^^^^^^^
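For context, a minimal sketch of the class of bug this whatsnew entry describes, with hypothetical data based on the issue report (GH 21668):

import pandas as pd

# Two physical columns share the label "a".
df = pd.DataFrame([[True, False], [True, True]], columns=["a", "a"])

# Prior to this fix, boolean reductions could collapse the duplicate
# labels; with it, one output column per input column is preserved.
result = df.groupby(level=0).all()
print(result.shape)  # expected: (2, 2)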
194 changes: 157 additions & 37 deletions pandas/core/groupby/generic.py
@@ -10,7 +10,17 @@
from functools import partial
from textwrap import dedent
import typing
from typing import Any, Callable, FrozenSet, Iterable, Sequence, Type, Union, cast
from typing import (
Any,
Callable,
FrozenSet,
Iterable,
Mapping,
Sequence,
Type,
Union,
cast,
)
import warnings

import numpy as np
@@ -322,29 +332,109 @@ def _aggregate_multiple_funcs(self, arg, _level):

return DataFrame(results, columns=columns)

def _wrap_series_output(self, output, index, names=None):
""" common agg/transform wrapping logic """
output = output[self._selection_name]
def _wrap_series_output(
self,
output: Mapping[int, Union[Series, np.ndarray]],
index: Index,
columns: Index,
) -> Union[Series, DataFrame]:
"""
Wraps the output of a SeriesGroupBy operation into the expected result.

Parameters
----------
output : Mapping[int, Union[Series, np.ndarray]]
Dict where the key represents the columnar-index and the values are
the actual results. Must be ordered from 0..n
index : pd.Index
Index to apply to the output.
columns : pd.Index
Columns to apply to the output.

if names is not None:
return DataFrame(output, index=index, columns=names)
Returns
-------
Series or DataFrame

Notes
-----
In the vast majority of cases output and columns will only contain one
element. The exception is operations that expand dimensions, like ohlc.
"""
assert len(output) == len(columns)
assert list(output.keys()) == sorted(output.keys())
jreback (Contributor):
this is by definition always true, right? As this is a property of a mapping, an assertion here is confusing.

WillAyd (Member Author):
Isn't this what you had asked me to add asserts for previously? If not then I am confused as to what asserts you wanted.

WillAyd (Member Author):
FWIW I don't consider this to be a long term approach anyway. Probably easier to make a named tuple that holds the position and label of each item in the result and construct from there; happy to do as a follow up.

jreback (Contributor, Nov 18, 2019):
> Isn't this what you had asked me to add asserts for previously?

not at all. An assert about a property of a dict is not very useful. I asked about an assert on the sortedness of output. You are assigning .columns = columns, so the keys must be in the same order. There is no guarantee of this at all (except that dicts are sorted in > 3.6 and I think you must be adding them in the same order), but this is non-obvious and non-trivial.

WillAyd (Member Author):
This isn't a dict property though - any chance you are mixing up a dict maintaining insertion order versus being sorted? Sorting the output would add a behavior change that I don't think is desirable either. I do agree this isn't the clearest way of managing it in the long run, but I'm trying to limit the scope of the PR. Any chance we can move forward as is and I can redo the dict insertion in a follow up? I think we should just use a namedtuple or something similar as the dict key that holds the position and label together.

jreback (Contributor, Nov 18, 2019):
you are still missing my point. You are assigning .columns = columns - prove that this is the same order as the dict keys.

WillAyd (Member Author):
Right, there is no easy way to do this in the current iteration. I was planning to do it as a follow up - is that a requirement here?

jreback (Contributor):
> Right, there is no easy way to do this in the current iteration.

Sure there is: you are creating an enumeration of labels (0, 1, ...etc); you just need to assert that the columns are in the same order here.
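To make the distinction under debate concrete - an illustrative snippet, not part of the PR: Python dicts (3.7+) preserve insertion order, which is not the same thing as the keys being sorted.

d = {}
d[1] = "second column"
d[0] = "first column"
print(list(d.keys()))                      # [1, 0] -- insertion order
print(list(d.keys()) == sorted(d.keys()))  # False -- not sorted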


result: Union[Series, DataFrame]
if len(output) > 1:
result = DataFrame(output, index=index)
result.columns = columns
else:
name = self._selection_name
if name is None:
name = self._selected_obj.name
return Series(output, index=index, name=name)
result = Series(output[0], index=index, name=columns[0])

return result
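As a concrete example of the dimension-expanding case the Notes mention (illustrative only, not part of the diff):

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(6.0), index=[0, 0, 0, 1, 1, 1])

# ohlc expands a single series into four columns, so the wrapped
# output becomes a DataFrame rather than a Series.
print(ser.groupby(level=0).ohlc())
#    open  high  low  close
# 0   0.0   2.0  0.0    2.0
# 1   3.0   5.0  3.0    5.0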

def _wrap_aggregated_output(
self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
) -> Union[Series, DataFrame]:
"""
Wraps the output of a SeriesGroupBy aggregation into the expected result.

Parameters
----------
output : Mapping[int, Union[Series, np.ndarray]]
Dict where the key represents the columnar-index and the values are
the actual results.
columns : pd.Index
Columns to apply to the output.

Returns
-------
Series or DataFrame

Notes
-----
In the vast majority of cases output and columns will only contain one
element. The exception is operations that expand dimensions, like ohlc.
"""
assert list(output.keys()) == sorted(output.keys())
Contributor:
why is this not always true?


def _wrap_aggregated_output(self, output, names=None):
result = self._wrap_series_output(
output=output, index=self.grouper.result_index, names=names
output=output, index=self.grouper.result_index, columns=columns
)
return self._reindex_output(result)._convert(datetime=True)
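A sketch of the namedtuple follow-up WillAyd proposes above - the name OutputKey and its fields are illustrative, not part of this PR:

from typing import Hashable, NamedTuple

class OutputKey(NamedTuple):
    # Carries both pieces of information the review thread is worried
    # about: the column label and its position in the result.
    label: Hashable
    position: int

# Keyed this way, duplicate labels stay distinct and column order can
# be recovered from the keys themselves rather than asserted.
output = {OutputKey("a", 0): [1], OutputKey("a", 1): [2]}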

def _wrap_transformed_output(self, output, names=None):
return self._wrap_series_output(
output=output, index=self.obj.index, names=names
def _wrap_transformed_output(
self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
) -> Series:
"""
Wraps the output of a SeriesGroupBy transformation into the expected result.

Parameters
----------
output : dict[int, Union[Series, np.ndarray]]
Dict with a sole key of 0 and a value of the result values.
columns : pd.Index
Columns to apply to the output.

Returns
-------
Series

Notes
-----
output and columns should only contain one element. These are containers
for generic compatibility with the DataFrameGroupBy class.
"""
assert list(output.keys()) == sorted(output.keys())

result = self._wrap_series_output(
output=output, index=self.obj.index, columns=columns
)

# No transformations increase the ndim of the result
# Unfortunately need to cast for mypy to know this
result = cast(Series, result)
return result
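For example (illustrative), a transformation like cumsum keeps the original index and never increases ndim:

import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=["a", "a", "b", "b"])

# The result is aligned to the original (possibly repeated) index and
# is always a Series, never a DataFrame.
print(ser.groupby(level=0).cumsum())
# a    1
# a    3
# b    3
# b    7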

def _wrap_applied_output(self, keys, values, not_indexed_same=False):
if len(keys) == 0:
# GH #6265
@@ -1104,17 +1194,6 @@ def _aggregate_item_by_item(self, func, *args, **kwargs) -> DataFrame:

return DataFrame(result, columns=result_columns)

def _decide_output_index(self, output, labels):
if len(output) == len(labels):
output_keys = labels
else:
output_keys = sorted(output)

if isinstance(labels, MultiIndex):
output_keys = MultiIndex.from_tuples(output_keys, names=labels.names)

return output_keys

def _wrap_applied_output(self, keys, values, not_indexed_same=False):
if len(keys) == 0:
return DataFrame(index=keys)
@@ -1579,27 +1658,66 @@ def _insert_inaxis_grouper_inplace(self, result):
if in_axis:
result.insert(0, name, lev)

def _wrap_aggregated_output(self, output, names=None):
agg_axis = 0 if self.axis == 1 else 1
agg_labels = self._obj_with_exclusions._get_axis(agg_axis)
def _wrap_aggregated_output(
self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
) -> DataFrame:
"""
Wraps the output of DataFrameGroupBy aggregations into the expected result.

Parameters
----------
output : dict[int, Union[Series, np.ndarray]]
Dict where the key represents the columnar-index and the values are
the actual results.
columns : pd.Index
Column names to apply

Returns
-------
DataFrame
"""
assert list(output.keys()) == sorted(output.keys())

output_keys = self._decide_output_index(output, agg_labels)
result = DataFrame(output)
result.columns = columns

if not self.as_index:
result = DataFrame(output, columns=output_keys)
self._insert_inaxis_grouper_inplace(result)
result = result._consolidate()
else:
index = self.grouper.result_index
result = DataFrame(output, index=index, columns=output_keys)
result.index = index

if self.axis == 1:
result = result.T

return self._reindex_output(result)._convert(datetime=True)
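The two branches correspond to the usual as_index behavior - a short illustrative example:

import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# as_index=True (default): group labels become the index.
print(df.groupby("key").sum())
#      val
# key
# a      3
# b      3

# as_index=False: the grouping column is inserted back as a column.
print(df.groupby("key", as_index=False).sum())
#   key  val
# 0   a    3
# 1   b    3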

def _wrap_transformed_output(self, output, names=None) -> DataFrame:
return DataFrame(output, index=self.obj.index)
def _wrap_transformed_output(
self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
Member:
here and elsewhere the Union[Series, np.ndarray] never gets EA?

Member:
I think it does.

I had a go at trying to avoid the cast, and needed to handle ExtensionArray instances in _wrap_series_output to get the tests to pass:

diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index 92378ca91..20cbc9c96 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -20,6 +20,7 @@ from typing import (
     Type,
     Union,
     cast,
+    overload,
 )
 import warnings
 
@@ -50,6 +51,7 @@ from pandas.core.dtypes.missing import _isna_ndarraylike, isna, notna
 
 from pandas._typing import FrameOrSeries
 import pandas.core.algorithms as algorithms
+from pandas.core.arrays import ExtensionArray
 from pandas.core.base import DataError, SpecificationError
 import pandas.core.common as com
 from pandas.core.frame import DataFrame
@@ -332,9 +334,29 @@ class SeriesGroupBy(GroupBy):
 
         return DataFrame(results, columns=columns)
 
+    # TODO: _OUTPUT should be Union[Series, np.ndarray, ExtensionArray] but since
+    # np.ndarray resolves to Any, can't be used with overloads without producing
+    # error: Overloaded function signatures 1 and 2 overlap with incompatible return
+    #  types
+    # while np.ndarray resolves to Any, it will be compatible with Series and
+    #  ExtensionArray
+    _OUTPUT = Union[Series, ExtensionArray]
+
+    @overload
+    def _wrap_series_output(
+        self, output: Mapping[int, _OUTPUT], index: Index, columns: Index
+    ) -> DataFrame:
+        ...
+
+    @overload  # noqa: F811
     def _wrap_series_output(
+        self, output: _OUTPUT, index: Index, columns: Index
+    ) -> Series:
+        ...
+
+    def _wrap_series_output(  # noqa: F811
         self,
-        output: Mapping[int, Union[Series, np.ndarray]],
+        output: Union[_OUTPUT, Mapping[int, _OUTPUT]],
         index: Index,
         columns: Index,
     ) -> Union[Series, DataFrame]:
@@ -360,17 +382,18 @@ class SeriesGroupBy(GroupBy):
         In the vast majority of cases output and columns will only contain one
         element. The exception is operations that expand dimensions, like ohlc.
         """
-        assert len(output) == len(columns)
-        assert list(output.keys()) == sorted(output.keys())
 
-        result: Union[Series, DataFrame]
-        if len(output) > 1:
+        if (
+            not isinstance(output, (Series, np.ndarray, ExtensionArray))
+            and len(output) > 1
+        ):
+            assert len(output) == len(columns)
+            assert list(output.keys()) == sorted(output.keys())
             result = DataFrame(output, index=index)
             result.columns = columns
+            return result
         else:
-            result = Series(output[0], index=index, name=columns[0])
-
-        return result
+            return Series(output, index=index, name=columns[0])
 
     def _wrap_aggregated_output(
         self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
@@ -397,8 +420,13 @@ class SeriesGroupBy(GroupBy):
         """
         assert list(output.keys()) == sorted(output.keys())
 
+        if len(output) == 1:
+            _output = output[0]
+        else:
+            _output = output
+
         result = self._wrap_series_output(
-            output=output, index=self.grouper.result_index, columns=columns
+            output=_output, index=self.grouper.result_index, columns=columns
         )
         return self._reindex_output(result)._convert(datetime=True)
 
@@ -426,13 +454,12 @@ class SeriesGroupBy(GroupBy):
         """
         assert list(output.keys()) == sorted(output.keys())
 
+        assert len(output) == 1
+
         result = self._wrap_series_output(
-            output=output, index=self.obj.index, columns=columns
+            output=output[0], index=self.obj.index, columns=columns
         )
 
-        # No transformations increase the ndim of the result
-        # Unfortunately need to cast for mypy to know this
-        result = cast(Series, result)
         return result
 
     def _wrap_applied_output(self, keys, values, not_indexed_same=False):

Member:
p.s. I don't recommend the above, since len(output) > 1 was still needed in the condition, so the return type could not be established from the argument type alone - overloading is not the answer here.

) -> DataFrame:
"""
Wraps the output of DataFrameGroupBy transformations into the expected result.

Parameters
----------
output : dict[int, Union[Series, np.ndarray]]
Dict where the key represents the columnar-index and the values are
the actual results.
columns : pd.Index
Column names to apply.

Returns
-------
DataFrame
"""
assert list(output.keys()) == sorted(output.keys())

Contributor:
same comment as above, I think you need to sort the columns (index is ok).

result = DataFrame(output)
result.columns = columns
result.index = self.obj.index

return result
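With this wrapping in place, transformations preserve duplicate column labels end to end - an illustrative check of the post-fix behavior:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# Both "a" columns survive, and the result keeps the original row index.
print(df.groupby(level=0).cumsum())
#    a  a
# 0  1  2
# 1  3  4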

def _wrap_agged_blocks(self, items, blocks):
if not self.as_index:
@@ -1719,9 +1837,11 @@ def groupby_series(obj, col=None):
if isinstance(obj, Series):
results = groupby_series(obj)
else:
# TODO: this is duplicative of how GroupBy naturally works
# Try to consolidate with normal wrapping functions
from pandas.core.reshape.concat import concat

results = [groupby_series(obj[col], col) for col in obj.columns]
results = [groupby_series(content, label) for label, content in obj.items()]
results = concat(results, axis=1)
results.columns.names = obj.columns.names

@@ -1763,7 +1883,7 @@ def _normalize_keyword_aggregation(kwargs):
"""
Normalize user-provided "named aggregation" kwargs.

Transforms from the new ``Dict[str, NamedAgg]`` style kwargs
Transforms from the new ``Mapping[str, NamedAgg]`` style kwargs
to the old ``OrderedDict[str, List[scalar]]``.

Parameters
@@ -1784,7 +1904,7 @@ def _normalize_keyword_aggregation(kwargs):
>>> _normalize_keyword_aggregation({'output': ('input', 'sum')})
(OrderedDict([('input', ['sum'])]), ('output',), [('input', 'sum')])
"""
# Normalize the aggregation functions as Dict[column, List[func]],
# Normalize the aggregation functions as Mapping[column, List[func]],
# process normally, then fixup the names.
# TODO(Py35): When we drop python 3.5, change this to
# defaultdict(list)