BUG: DataFrame reductions losing EA dtypes #52261

jbrockmendel · 2023-03-28T20:11:42Z

closes BUG: std with array_manager and NaT results #51446
closes BUG: DataFrame[Int64].mean().dtype is object, should be Float64 #42895
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

In the absence of 2D EAs, we need reductions to sometimes pretend the EA is 2D. Enter "keepdims" adapted from numpy reductions.
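For readers unfamiliar with the numpy behaviour being borrowed, here is a minimal sketch of what `keepdims` does in a plain numpy reduction. It only illustrates the numpy side; how this PR threads the keyword through the EA reductions is not shown, and the array contents are made up for illustration.

```python
import numpy as np

# Illustrative 2D data: 2 rows, 3 columns.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# A default reduction drops the reduced axis: shape (2, 3) -> (3,)
collapsed = values.sum(axis=0)

# keepdims=True keeps the reduced axis with length 1: shape (2, 3) -> (1, 3)
kept = values.sum(axis=0, keepdims=True)

print(collapsed.shape)  # (3,)
print(kept.shape)       # (1, 3)
```

Because the result is still an array rather than a scalar, it carries its dtype with it, which is the property the reductions here need.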

This is still pretty ugly. I'm open to ideas to clean it up.

cc @rhshadrach any other particular cases need testing?

jbrockmendel · 2023-04-13T20:12:37Z

gentle ping @rhshadrach not looking to merge anytime soon but for thoughts on the approach

rhshadrach

Thanks for the ping; sorry this fell off my radar. I really like the approach here. It looks like this is close to impacting other ops like min, e.g.

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}, dtype='Int64')
result = df.min()
print(result)  # dtype is currently int64; should be Int64, I think

However, after making the values 2D we currently do values[~mask], which flattens them back to 1D. I think we can instead do np.where(mask, fill_value, values), which preserves the shape. This is similar to how nanops works.
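A small numpy-only sketch of the difference, assuming a 2D integer array and a boolean mask where True marks a missing entry. The choice of `fill_value` (the largest value of the dtype, so it never wins a min) is an assumption for illustration and is not taken from the PR.

```python
import numpy as np

values = np.array([[1, 1, 2],
                   [3, 4, 5]])
# True marks a missing entry (illustrative mask).
mask = np.array([[False, True, False],
                 [False, False, True]])

# Boolean indexing returns a flat array, so the 2D layout is lost.
flattened = values[~mask]
print(flattened.shape)  # (4,)

# np.where keeps the shape; masked slots are replaced by a value that
# cannot affect a min reduction (assumed fill, see the note above).
fill_value = np.iinfo(values.dtype).max
filled = np.where(mask, fill_value, values)
row_min = filled.min(axis=1, keepdims=True)
print(row_min.shape)  # (2, 1)
```

A row that is entirely masked would come back as the fill value in this sketch, so the result mask would still need to be computed separately.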

> any other particular cases need testing?

Because of the above, it's not exactly clear which ops this impacts. Could we maybe add a basic test across the reduction ops and xfail any that lose the EA dtype?
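As a rough starting point for that kind of coverage, the sketch below surveys the result dtype of several reductions on an Int64 frame. It is a survey rather than the parametrized test with xfail suggested above; the helper name `survey_reduction_dtypes` and the list of ops are hypothetical, and no expected output is given because the dtypes depend on the pandas version.

```python
import pandas as pd

# Hypothetical list of reductions to check; extend as needed.
REDUCTIONS = ["sum", "prod", "min", "max", "mean", "median", "std", "var"]


def survey_reduction_dtypes(df):
    """Map each reduction name to the dtype (as a string) of its result."""
    return {op: str(getattr(df, op)().dtype) for op in REDUCTIONS}


df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]}, dtype="Int64")
report = survey_reduction_dtypes(df)
for op, dtype in report.items():
    # A numpy dtype name here (e.g. lowercase int64 or object) means
    # the EA dtype was lost for that op.
    print(f"{op}: {dtype}")
```

Each entry that shows a numpy dtype would be a candidate for an xfail in a real test.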

rhshadrach · 2023-04-14T02:16:29Z

pandas/core/frame.py

+                try:
+                    return values._reduce(name, skipna=skipna, keepdims=True, **kwds)
+                except (TypeError, ValueError):
+                    # no keepdims keyword yet; ValueError gets raised by
+                    #  util validator functions
+                    return values._reduce(name, skipna=skipna, **kwds)


This will just be the try portion when fully implemented, yea?

Right. There would need to be a deprecation cycle to allow 3rd-party EAs to catch up.

rhshadrach · 2023-04-14T02:31:00Z

pandas/core/internals/array_manager.py

+            if isinstance(res, (np.ndarray, ExtensionArray)):
+                # keepdims worked!
+                result_arrays.append(res)
+            else:
+                # TODO NaT doesn't preserve dtype, so we need to ensure to create
+                # a timedelta result array if original was timedelta
+                # what if datetime results in timedelta? (eg std)
+                dtype = arr.dtype if res is NaT else None
+                result_arrays.append(sanitize_array([res], None, dtype=dtype))


Similar - is the plan to be able to remove this else?

topper-123 · 2023-04-18T18:35:22Z

I've looked into this PR after you pointed it out. I had missed this PR unfortunately...

I like the keepdims-on-the-arrays approach; it's a natural way to keep the dtype information that is being lost in reductions to scalars, rather than doing it on the frame as in my approach. I'm not a big fan of picking "keepdims" out of the kwargs everywhere, though; it seems brittle that it has to be added in so many places.

I think adding a keepdims=False parameter to the signature of ExtensionArray._reduce, and wrapping the returned scalar value there when keepdims is True before passing it on, could be simpler. Then you wouldn't have to consider keepdims in all the reduction methods. That approach would also take care of wrapping the NA values that don't get wrapped currently.
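To make the suggestion concrete, here is a toy sketch of wrapping in a single place. `ToyMaskedArray` and `_reduce_scalar` are hypothetical stand-ins written for this illustration; they are not pandas classes or methods, and `None` plays the role of NA.

```python
class ToyMaskedArray:
    """Hypothetical stand-in for an extension array; not pandas code."""

    def __init__(self, data):
        self.data = list(data)  # None marks a missing value

    def _reduce_scalar(self, name):
        # The individual reductions only ever produce scalars.
        present = [x for x in self.data if x is not None]
        if not present:
            return None  # NA result
        if name == "min":
            return min(present)
        if name == "max":
            return max(present)
        raise TypeError(f"unsupported reduction: {name}")

    def _reduce(self, name, *, keepdims=False):
        result = self._reduce_scalar(name)
        if keepdims:
            # Wrap once here, so no reduction method needs to know
            # about keepdims; NA results are wrapped as well.
            return type(self)([result])
        return result


arr = ToyMaskedArray([3, None, 1])
scalar_min = arr._reduce("min")
wrapped_min = arr._reduce("min", keepdims=True)
wrapped_na = ToyMaskedArray([None])._reduce("min", keepdims=True)
print(scalar_min, wrapped_min.data, wrapped_na.data)  # 1 [1] [None]
```

The point of the design is that the scalar path stays untouched and the array type is reattached in one spot.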

Also, because I did #52707 from the point of view of solving #40669, I had a focus on frames with min_count=1. Many of the tests in #52707 currently fail in this PR (because NA results aren't wrapped in the current implementation), so those should be considered here too, IMO.

I can take a closer look if the above approach works, if that's ok with you?

jbrockmendel · 2023-04-18T19:14:07Z

> I can take a closer look if the above approach works, if that's ok with you?

sounds good

jbrockmendel · 2023-04-21T20:19:36Z

Closing in favor of #52788

jbrockmendel added 4 commits March 28, 2023 13:09

BUG: DataFrame reductions losing EA dtypes

5f7fa27

Merge branch 'main' into enh-keepdims

de2b5e3

mypy fixup

170992d

mypy fixup

182d0f8

rhshadrach reviewed Apr 14, 2023


This was referenced Apr 16, 2023

REGR: Performance of DataFrame reduction ops with axis=1 #52689

Closed

ENH: Better dtype inference for reductions on dataframes #52707

Closed

topper-123 mentioned this pull request Apr 19, 2023

ENH: better dtype inference when doing DataFrame reductions #52788

Merged

1 task

jbrockmendel closed this Apr 21, 2023

jbrockmendel deleted the enh-keepdims branch April 21, 2023 20:19
