Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame reductions losing EA dtypes #52261

Closed
wants to merge 4 commits into from

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Mar 28, 2023

In the absence of 2D EAs, we need reductions to sometimes pretend the EA is 2D. Enter "keepdims" adapted from numpy reductions.

This is still pretty ugly. I'm open to ideas to clean it up.

cc @rhshadrach any other particular cases need testing?

@jbrockmendel
Copy link
Member Author

gentle ping @rhshadrach not looking to merge anytime soon but for thoughts on the approach

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the ping - sorry this fell off my radar. I really like the approach here - It looks like this is close to impacting other ops like min, e.g.

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}, dtype='Int64')
result = df.min()
print(result)  # currently is int64, should be Int64 I think

However, currently after making the values 2D, we do values[~mask] which brings it back to 1D again. I think we can instead do np.where(mask, fill_value, values). This is similar to how nanops works.

any other particular cases need testing?

Because of the above - it's not exactly clear what ops this is impacting. Could maybe add a basic test across the reduction ops and xfail any that lose EA dtype?

Comment on lines +10922 to +10927
try:
return values._reduce(name, skipna=skipna, keepdims=True, **kwds)
except (TypeError, ValueError):
# no keepdims keyword yet; ValueError gets raised by
# util validator functions
return values._reduce(name, skipna=skipna, **kwds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just be the try portion when fully implemented, yea?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. there would need to be a deprecation cycle to allow 3rd party EAs to catch up

Comment on lines +983 to +991
if isinstance(res, (np.ndarray, ExtensionArray)):
# keepdims worked!
result_arrays.append(res)
else:
# TODO NaT doesn't preserve dtype, so we need to ensure to create
# a timedelta result array if original was timedelta
# what if datetime results in timedelta? (eg std)
dtype = arr.dtype if res is NaT else None
result_arrays.append(sanitize_array([res], None, dtype=dtype))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar - is the plan to be able to remove this else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@topper-123
Copy link
Contributor

I've looked into this PR after you pointed it out. I had missed this PR unfortunately...

I like the keepdims on the arrays approach, it's a natural way to way to keep the dtype information that is being lost in reductions to scalars, rather than doing it on the frame as in my approach. I'm not a super fan of picking "keepdims" from the kwargs everywhere though, it seems brittle that it's necessary to add that to so many places.

I think adding a keepdims=False parameter to the signature of ExtensionArray._reduce and wrapping the returned scalar value there when keepdims is True before passing it on could be simpler? Then you wouldn't have to consider keepdims in all the reduction methods. That approach will also take care of wrapping the NA values that doesn't get wrapped currently.

Also, because I did #52707 from the point of solving #40669, I had a focus on frames with min_count=1. Many of the tests in #52707 fail in this PR currently (because NA result aren't wrapped in this current implementation), so those should be considered here also IMO.

I can take a closer look if the above approach works, if that's ok with you?

@jbrockmendel
Copy link
Member Author

I can take a closer look if the above approach works, if that's ok with you?

sounds good

@jbrockmendel
Copy link
Member Author

Closing in favor of #52788

@jbrockmendel jbrockmendel deleted the enh-keepdims branch April 21, 2023 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: std with array_manager and NaT results BUG: DataFrame[Int64].mean().dtype is object, should be Float64
3 participants