
BUG: "unique" and pd.Series.unique produce different results with aggregation #39920

Open · 3 tasks done · May be fixed by #57706

iakremnev opened this issue Feb 19, 2021 · 4 comments
Labels: Apply (Apply, Aggregate, Transform, Map) · Bug · Groupby · Needs Discussion (Requires discussion from core team before further action)

Comments


iakremnev commented Feb 19, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})
g = df.groupby("a")

for aggfunc in ["unique", pd.unique, pd.Series.unique]:
    try:
        print(g.agg({"b": aggfunc}))
    except ValueError as e:
        print(f"{aggfunc!r} failed with {e!r}")

Running this script produces

        b
a        
1  [2, 1]
2  [1, 2]
<function unique at 0x7f6885cc7310> failed with ValueError('Must produce aggregated value')
<function Series.unique at 0x7f688369f8b0> failed with ValueError('Must produce aggregated value')

Problem description

When the string "unique" is passed as the aggfunc parameter, it is internally resolved to pd.Series.unique. I would expect passing "unique", pd.Series.unique, or pd.unique to produce identical results, but they do not.
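For context, a minimal check (not part of the original report): outside of groupby, the three spellings agree on a plain Series, which localizes the discrepancy to how groupby's agg dispatches the function rather than to the functions themselves:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 2])

# All three spellings of "unique" agree on a plain Series: each returns
# the unique values in order of appearance.
r1 = s.agg("unique")
r2 = pd.unique(s)
r3 = pd.Series.unique(s)
assert np.array_equal(r1, r2) and np.array_equal(r2, r3)
```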

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f904213
python : 3.9.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.11-arch1-1
Version : #1 SMP PREEMPT Wed, 27 Jan 2021 13:53:16 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0.dev0+772.gf904213ffc
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

iakremnev added labels: Bug, Needs Triage (Issue that has not been reviewed by a pandas team member) on Feb 19, 2021
mroeschke added labels: Apply (Apply, Aggregate, Transform, Map), Groupby, Needs Discussion (Requires discussion from core team before further action); removed: Needs Triage on Aug 15, 2021
@topper-123 (Contributor)

Yes, I think the groupby with "unique" should also fail, because it doesn't aggregate.

@rhshadrach (Member)

I think this can be a useful op. I am in favor of treating the result of any provided function as if it were an aggregation - i.e. treat the result as if it were a scalar rather than try to infer if it is a scalar or not.

Related: #51445, #35725 (and others)
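Along these lines, a hedged workaround sketch (not from the thread itself): wrapping the unique values in a list makes agg treat the result as a single scalar-like value per group, sidestepping the "Must produce aggregated value" check:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# agg with a list-returning callable stores one object value per group,
# so the "Must produce aggregated value" error is not triggered.
out = df.groupby("a")["b"].agg(lambda s: list(pd.unique(s)))
```

This mirrors the common `agg(list)` idiom; the result is a Series of Python lists indexed by the groups.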

@topper-123 (Contributor)

I expected it to give the grouped version of DataFrame.agg("unique"), i.e. in this case I'd expect the result to be:

   b
a
1  2
1  1
2  1
2  2

rather than:

a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object

I.e., is the grouped op meant to be different?
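For what it's worth, the exploded shape sketched above can be produced today (a sketch, combining SeriesGroupBy.unique with Series.explode):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# One row per (group, unique value): the index "a" repeats once for
# each unique "b" within that group.
exploded = df.groupby("a")["b"].unique().explode()
```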

@rhshadrach (Member) commented Apr 13, 2023

I think first and foremost, the result of agg with as_index=True should always be uniquely indexed by the groups. I think that's what it means for an operation to be an aggregation.

If we are to assume that (I'm still open to arguments to the contrary), then it seems to me that

a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object

is a reasonable and useful result.

An interesting complication is that df.agg("unique") doesn't act on a column-by-column basis. Under the hood, we're calling np.unique with the default axis=None. pandas typically defaults to acting column-by-column whereas NumPy defaults to acting across all axes. So to get the "expected pandas default" (or, at least what I think it should be), you'd need to pass axis=1. However you can't currently pass axis=1 to the NumPy function because agg itself has an axis argument (xref: #40112 (comment)). If you could, you'd get back a 2d NumPy array instead. All of this is making me wonder if we really should be passing things like this through to NumPy.
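To illustrate the column-by-column vs. flattened distinction described above (a sketch of the two conventions, not the actual internal code path):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# Column-by-column, the usual pandas convention: each column gets its
# own array of uniques, in order of appearance.
per_column = {col: pd.unique(df[col]) for col in df.columns}

# Flattened across all axes, the NumPy default (axis=None): one sorted
# array of uniques over every value in the frame.
flattened = np.unique(df.to_numpy())
```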

rhshadrach linked a pull request Mar 2, 2024 that will close this issue