-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: DataFrame reductions losing EA dtypes #52261
Conversation
gentle ping @rhshadrach not looking to merge anytime soon but for thoughts on the approach |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the ping - sorry this fell off my radar. I really like the approach here - It looks like this is close to impacting other ops like min, e.g.
df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]}, dtype='Int64')
result = df.min()
print(result) # currently is int64, should be Int64 I think
However, currently after making the values 2D, we do values[~mask]
which brings it back to 1D again. I think we can instead do np.where(mask, fill_value, values)
. This is similar to how nanops works.
any other particular cases need testing?
Because of the above - it's not exactly clear what ops this is impacting. Could maybe add a basic test across the reduction ops and xfail any that lose EA dtype?
try: | ||
return values._reduce(name, skipna=skipna, keepdims=True, **kwds) | ||
except (TypeError, ValueError): | ||
# no keepdims keyword yet; ValueError gets raised by | ||
# util validator functions | ||
return values._reduce(name, skipna=skipna, **kwds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will just be the try portion when fully implemented, yea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. there would need to be a deprecation cycle to allow 3rd party EAs to catch up
if isinstance(res, (np.ndarray, ExtensionArray)): | ||
# keepdims worked! | ||
result_arrays.append(res) | ||
else: | ||
# TODO NaT doesn't preserve dtype, so we need to ensure to create | ||
# a timedelta result array if original was timedelta | ||
# what if datetime results in timedelta? (eg std) | ||
dtype = arr.dtype if res is NaT else None | ||
result_arrays.append(sanitize_array([res], None, dtype=dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar - is the plan to be able to remove this else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
I've looked into this PR after you pointed it out. I had missed this PR unfortunately... I like the keepdims on the arrays approach, it's a natural way to way to keep the dtype information that is being lost in reductions to scalars, rather than doing it on the frame as in my approach. I'm not a super fan of picking I think adding a Also, because I did #52707 from the point of solving #40669, I had a focus on frames with I can take a closer look if the above approach works, if that's ok with you? |
sounds good |
Closing in favor of #52788 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.In the absence of 2D EAs, we need reductions to sometimes pretend the EA is 2D. Enter "keepdims" adapted from numpy reductions.
This is still pretty ugly. I'm open to ideas to clean it up.
cc @rhshadrach any other particular cases need testing?