
BUG: "unique" and pd.Series.unique produce different results with aggregation #39920

Open · 3 tasks done · May be fixed by #57706

iakremnev opened this issue Feb 19, 2021 · 4 comments
Labels: Apply (Apply, Aggregate, Transform, Map) · Bug · Groupby · Needs Discussion (Requires discussion from core team before further action)

Comments


iakremnev commented Feb 19, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})
g = df.groupby("a")

for aggfunc in ["unique", pd.unique, pd.Series.unique]:
    try:
        print(g.agg({"b": aggfunc}))
    except ValueError as e:
        print(f"{aggfunc!r} failed with {e!r}")

Running this script produces

        b
a        
1  [2, 1]
2  [1, 2]
<function unique at 0x7f6885cc7310> failed with ValueError('Must produce aggregated value')
<function Series.unique at 0x7f688369f8b0> failed with ValueError('Must produce aggregated value')

Problem description

When the string "unique" is passed as the aggfunc parameter, it is internally resolved to pd.Series.unique. I would expect passing "unique", pd.Series.unique, or pd.unique to produce identical results, but they do not.
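For context, a minimal check (not part of the original report): outside of groupby, the three spellings agree on a plain Series, which localizes the discrepancy to how groupby's agg dispatches the function rather than to the functions themselves:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 2])

# All three spellings of "unique" agree on a plain Series: each returns
# the unique values in order of appearance.
r1 = s.agg("unique")
r2 = pd.unique(s)
r3 = pd.Series.unique(s)
assert np.array_equal(r1, r2) and np.array_equal(r2, r3)
```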

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f904213
python : 3.9.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.11-arch1-1
Version : #1 SMP PREEMPT Wed, 27 Jan 2021 13:53:16 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0.dev0+772.gf904213ffc
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

iakremnev added labels: Bug, Needs Triage (Issue that has not been reviewed by a pandas team member) on Feb 19, 2021
mroeschke added labels: Apply (Apply, Aggregate, Transform, Map), Groupby, Needs Discussion (Requires discussion from core team before further action); removed: Needs Triage on Aug 15, 2021
@topper-123 (Contributor)

Yes, I think the groupby with "unique" should also fail, because it doesn't aggregate.

@rhshadrach (Member)

I think this can be a useful op. I am in favor of treating the result of any provided function as if it were an aggregation - i.e. treat the result as if it were a scalar rather than try to infer if it is a scalar or not.

Related: #51445, #35725 (and others)
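Along these lines, a hedged workaround sketch (not from the thread itself): wrapping the unique values in a list makes agg treat the result as a single scalar-like value per group, sidestepping the "Must produce aggregated value" check:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# agg with a list-returning callable stores one object value per group,
# so the "Must produce aggregated value" error is not triggered.
out = df.groupby("a")["b"].agg(lambda s: list(pd.unique(s)))
```

This mirrors the common `agg(list)` idiom; the result is a Series of Python lists indexed by the groups.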

@topper-123 (Contributor)

I expected it to give the grouped version of DataFrame.agg("unique"), i.e. in this case I'd expect the result to be:

   b
a
1  2
1  1
2  1
2  2

rather than:

a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object

I.e., is the grouped op meant to be different?
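For what it's worth, the exploded shape sketched above can be produced today (a sketch, combining SeriesGroupBy.unique with Series.explode):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# One row per (group, unique value): the index "a" repeats once for
# each unique "b" within that group.
exploded = df.groupby("a")["b"].unique().explode()
```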

@rhshadrach (Member) commented Apr 13, 2023

I think first and foremost, the result of agg with as_index=True should always be uniquely indexed by the groups. I think that's what it means for an operation to be an aggregation.

If we are to assume that (I'm still open to arguments to the contrary), then it seems to me that

a
1    [2, 1]
2    [1, 2]
Name: b, dtype: object

is a reasonable and useful result.

An interesting complication is that df.agg("unique") doesn't act on a column-by-column basis. Under the hood, we're calling np.unique with the default axis=None. pandas typically defaults to acting column-by-column whereas NumPy defaults to acting across all axes. So to get the "expected pandas default" (or, at least what I think it should be), you'd need to pass axis=1. However you can't currently pass axis=1 to the NumPy function because agg itself has an axis argument (xref: #40112 (comment)). If you could, you'd get back a 2d NumPy array instead. All of this is making me wonder if we really should be passing things like this through to NumPy.
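To illustrate the column-by-column vs. flattened distinction described above (a sketch of the two conventions, not the actual internal code path):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1, 2], "b": [2, 1, 1, 2]})

# Column-by-column, the usual pandas convention: each column gets its
# own array of uniques, in order of appearance.
per_column = {col: pd.unique(df[col]) for col in df.columns}

# Flattened across all axes, the NumPy default (axis=None): one sorted
# array of uniques over every value in the frame.
flattened = np.unique(df.to_numpy())
```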

rhshadrach linked a pull request Mar 2, 2024 that will close this issue