BUG: DataFrame reductions dtypes on object input #51335
doc/source/whatsnew/v2.0.0.rst
Outdated
- Bug in :meth:`DataFrame.sem` and :meth:`Series.sem` where an erroneous ``TypeError`` would always raise when using data backed by an :class:`ArrowDtype` (:issue:`49759`)
- Bug in :meth:`Series.__add__` casting to object for list and masked :class:`Series` (:issue:`22962`)
- Bug in :meth:`DataFrame.query` with ``engine="numexpr"`` and column names are ``min`` or ``max`` would raise a ``TypeError`` (:issue:`50937`)
- Bug in :meth:`DataFrame.min` and :meth:`DataFrame.max` with tz-aware data containing NA and ``axis=1`` would return incorrect results (:issue:`51242`)
does NA here refer to pd.NaT?
Thanks - will fix.
if axis is None:
    return result
return func(df.values)
hmm .values can be expensive, might be better to reduce twice? (... which can also be expensive. darn). Is punting on this viable?
Using .values is the current behavior on main if numeric_only is False and axis is not 0.
Lines 10494 to 10496 in c7fa611
data = self
values = data.values
result = func(values)
In fact, main is currently broken when numeric_only is True and axis is None:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz')})
result = df.mean(axis=None, numeric_only=True)
print(result)
# a 1.333333
# b 4.000000
# dtype: float64
result2 = df[['a', 'b']].mean(axis=None)
print(result2)
# 2.6666666666666665
Assuming this is okay for now, I can add a test for when numeric_only is False and axis is None.
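To make the trade-off in this thread concrete, here is a minimal sketch (not the code in this PR) comparing a whole-frame reduction through the 2D array with the "reduce twice" alternative. The frame contents and variable names are assumptions chosen for illustration. For sum the two routes agree, but for mean they can differ once missing values leave the columns with unequal counts.

```python
import numpy as np
import pandas as pd

# Illustrative frame; the NaN leaves column "a" with fewer valid values than "b".
df = pd.DataFrame({"a": [1.0, np.nan, 2.0], "b": [3.0, 4.0, 5.0]})

# Option 1: materialize the 2D array once and reduce it, skipping NaN.
values = df.to_numpy()
sum_via_values = np.nansum(values)
mean_via_values = np.nanmean(values)

# Option 2: reduce along axis=0, then reduce the resulting Series.
sum_twice = df.sum(axis=0).sum()
mean_twice = df.mean(axis=0).mean()

print(sum_via_values, sum_twice)    # the sums agree
print(mean_via_values, mean_twice)  # the means differ: 3.0 vs 2.75
```

So reducing twice is only a drop-in replacement for ops where the order of reduction does not matter; a mean of column means weights each column equally rather than each value.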
@jbrockmendel - friendly ping
pandas/core/nanops.py
Outdated
if is_float_dtype(result_dtype):
    # Preserve dtype when possible
    # mypy doesn't infer result_dtype is not None
    result = getattr(
result_dtype.type("nan")?
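For readers unfamiliar with the suggestion, a small sketch of what it does with a NumPy float dtype. The choice of float32 here is an assumption for illustration; the surrounding nanops code is not reproduced.

```python
import numpy as np

result_dtype = np.dtype("float32")  # assumed example dtype

# dtype.type is the scalar class (np.float32 here); calling it on "nan"
# builds a NaN of that exact precision rather than a Python float.
nan_value = result_dtype.type("nan")

print(type(nan_value).__name__, np.isnan(nan_value))
```

This keeps the result in the input's float precision without needing a `getattr` lookup.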
except for the ArrayManager thing, this LGTM
LGTM
Thanks @rhshadrach
@rhshadrach looks like this caused a regression in https://asv-runner.github.io/asv-collection/pandas/#stat_ops.FrameOps.time_op?p-op='mean'&p-dtype='Int64'&p-axis=1&commits=b836a88f for nullable cases with axis=1. Can you take a look?
1.5.3 drops down to the underlying NumPy array and calls the op with … I'm still looking for a better alternative, but I expect to need to largely restore the previous implementation and defer a better solution to 2.1.
Is it a correctness vs perf tradeoff? If so correctness wins every time. Might be able to do something similar to _reduce_axis1 to avoid .values/transpose
I think I can get every test to pass by using a combination of 1.5.x's … Longer term (for 2.1), it does seem to me that doing something like _reduce_axis1 is the way to go. For example, here is the ASV that identified this regression (using 1.5.x, so before the perf regression occurred; I'm leaving out the setup here) against a direct implementation of …
Any interest in trying to get this improved for 2.0?
For the case of masked arrays (the specific case from the asv regression), the transpose spends most of its time reconstructing an EA for each row, which in turn spends a lot of time in … Improving the constructor might be worthwhile anyway, but I agree that for the specific case of the numeric reduction with axis=1, doing a special case like …
I looked into this - I think there are some common ops / dtypes we can do this for, but we can't do it for all ops (e.g. median) and all dtypes (e.g. 3rd party EAs since I think the implementation will depend on knowing the value of the neutral element for an op like sum or prod). A straightforward way to fix this regression is to mostly revert this PR, where we still return object dtype on object input but otherwise the results are unchanged. I can put up a PR for this approach if #51923 isn't the right way forward or is too much to try to get it into 2.0.
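As a rough illustration of the neutral-element idea mentioned above, the sketch below sums masked data row-wise without transposing. It is not pandas' implementation: the helper name `masked_sum_axis1` and the separate `values`/`mask` arrays are assumptions for illustration, and it only models skipping masked entries.

```python
import numpy as np


def masked_sum_axis1(values: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise sum that skips masked entries (hypothetical helper)."""
    # Replace masked entries with 0, the neutral element of addition,
    # so they do not contribute to the row totals.
    filled = np.where(mask, 0, values)
    return filled.sum(axis=1)


values = np.array([[1, 2, 3], [4, 5, 6]])
mask = np.array([[False, True, False], [False, False, True]])
row_sums = masked_sum_axis1(values, mask)
print(row_sums)
```

The same approach would work for prod with 1 as the fill value, but an op such as median has no neutral element, which matches the limitation described above.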
DataFrame.min(axis=1) raises FutureWarning for timezone aware datetimes and returns wrong results #51242