
BUG: DataFrame reductions dtypes on object input #51335

Merged

Conversation

rhshadrach (Member)

@rhshadrach rhshadrach changed the title BUG: DataFrame reductions dtypes BUG: DataFrame reductions dtypes on object input Feb 12, 2023
@rhshadrach rhshadrach mentioned this pull request Feb 13, 2023
- Bug in :meth:`DataFrame.sem` and :meth:`Series.sem` where an erroneous ``TypeError`` would always raise when using data backed by an :class:`ArrowDtype` (:issue:`49759`)
- Bug in :meth:`Series.__add__` casting to object for list and masked :class:`Series` (:issue:`22962`)
- Bug in :meth:`DataFrame.query` with ``engine="numexpr"`` and column names are ``min`` or ``max`` would raise a ``TypeError`` (:issue:`50937`)
- Bug in :meth:`DataFrame.min` and :meth:`DataFrame.max` with tz-aware data containing NA and ``axis=1`` would return incorrect results (:issue:`51242`)
Member:
does NA here refer to pd.NaT?

rhshadrach (Member, Author):
Thanks - will fix.

    if axis is None:
        return result
    return func(df.values)
Member:
hmm .values can be expensive, might be better to reduce twice? (... which can also be expensive. darn). Is punting on this viable?
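For context, a minimal sketch (not from the thread) of why `.values` can be expensive: on a mixed-dtype frame it must find a common dtype for the whole block of data, which typically means upcasting to object and boxing every value.

```python
import numpy as np
import pandas as pd

# Mixed dtypes: .values must find a common dtype, here object,
# which copies the data and boxes every element.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df.values.dtype)  # object

# Homogeneous dtypes keep a native NumPy dtype.
df2 = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
print(df2.values.dtype)  # float64
```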

rhshadrach (Member, Author), Feb 13, 2023:

Using .values is the current behavior on main if numeric_only is False and axis is not 0.

pandas/pandas/core/frame.py, lines 10494 to 10496 at c7fa611:

    data = self
    values = data.values
    result = func(values)

In fact, main is currently broken when numeric_only is True and axis is None:

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz')})
result = df.mean(axis=None, numeric_only=True)
print(result)
# a    1.333333
# b    4.000000
# dtype: float64

result2 = df[['a', 'b']].mean(axis=None)
print(result2)
# 2.6666666666666665

Assuming this is okay for now, I can add a test for when numeric_only is False and axis is None.
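For reference, a small illustration (not part of the PR) of the semantics the example above expects: with `axis=None` the reduction should match flattening the numeric values into one array, whereas `axis=0` reduces per column.

```python
import pandas as pd

# Illustration of the expected axis=None semantics on the thread's example.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": list("xyz")})
numeric = df.select_dtypes("number")

# Reducing over both axes should match flattening the numeric values.
flat_mean = numeric.to_numpy().mean()
print(flat_mean)  # 2.6666666666666665

# Per-column reduction (axis=0) gives one value per column instead.
col_means = numeric.mean(axis=0)
print(col_means.tolist())
```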

rhshadrach (Member, Author):

@jbrockmendel - friendly ping

@mroeschke mroeschke added the Reduction Operations sum, mean, min, max, etc. label Feb 13, 2023
    if is_float_dtype(result_dtype):
        # Preserve dtype when possible
        # mypy doesn't infer result_dtype is not None
        result = getattr(
Member:
result_dtype.type("nan")?

jbrockmendel (Member):
except for the ArrayManager thing, this LGTM

jbrockmendel (Member):

LGTM

@mroeschke mroeschke added this to the 2.0 milestone Feb 18, 2023
@mroeschke mroeschke merged commit b836a88 into pandas-dev:main Feb 18, 2023
mroeschke (Member):
Thanks @rhshadrach

@rhshadrach rhshadrach deleted the object_reduction_axis_1_attempt_3 branch February 18, 2023 01:33
jbrockmendel (Member):

@rhshadrach looks like this caused a regression in https://asv-runner.github.io/asv-collection/pandas/#stat_ops.FrameOps.time_op?p-op='mean'&p-dtype='Int64'&p-axis=1&commits=b836a88f for nullable cases with axis=1. Can you take a look?

rhshadrach (Member, Author):

1.5.3 drops down to the underlying NumPy array and calls the op with axis=1. main computes the transpose of the DataFrame (which is inefficient for EAs), then computes the sum via the BlockManager (which I think is inefficient for EAs with wide data).

I'm still looking for a better alternative, but I expect we'll need to largely restore the previous implementation and defer a better solution to 2.1.

jbrockmendel (Member):

Is it a correctness vs perf tradeoff? If so correctness wins every time.

Might be able to do something similar to _reduce_axis1 to avoid .values/transpose

rhshadrach (Member, Author), Mar 2, 2023:

> Is it a correctness vs perf tradeoff? If so correctness wins every time.

I think I can get every test to pass by using a combination of 1.5.x's _reduce and the one in this PR, except for test_minmax_tzaware_skipna_axis_1. For tz-aware data with NaT, dropping down to NumPy arrays is insufficient to compute the op. I don't see any easy way to fix this while avoiding the perf regression. I think (but haven't confirmed yet) that this was also broken in 1.5.x; is it worth the perf hit?

Longer term (for 2.1), it does seem to me doing something like _reduce_axis1 is the way to go. For example, here is the ASV that identified this regression (using 1.5.x, so before the perf regression occurred; I'm leaving out the setup here) against a direct implementation of mean for axis=1:

%timeit df.mean(axis=1)
18.5 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit sum(df[c] * (1/df.shape[1]) for c in df)
778 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
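A hedged sketch of what a column-wise axis=1 mean in the spirit of _reduce_axis1 could look like. `mean_axis1_columnwise` is a hypothetical helper, not the PR's code; it assumes all-numeric columns and at least one non-NA value per row.

```python
import numpy as np
import pandas as pd

def mean_axis1_columnwise(df: pd.DataFrame) -> pd.Series:
    # Hypothetical sketch: accumulate per-row sums and non-NA counts one
    # column at a time, avoiding df.T / .values (expensive for EA columns).
    total = np.zeros(len(df))
    count = np.zeros(len(df), dtype="int64")
    for col in df:
        vals = df[col]
        # Track which entries are present, then fill NA with 0 for the sum.
        count += vals.notna().to_numpy().astype("int64")
        total += vals.fillna(0).to_numpy(dtype="float64")
    return pd.Series(total / count, index=df.index)

df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64"),
                   "b": [4.0, 5.0, 6.0]})
print(mean_axis1_columnwise(df).tolist())  # [2.5, 5.0, 4.5]
```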

jbrockmendel (Member):

Any interest in trying to get this improved for 2.0?

jorisvandenbossche (Member):

For the case of masked arrays (the specific case from the ASV regression), the transpose spends most of its time reconstructing an EA for each row, much of which is spent in _coerce_to_data_and_mask. There is a lot of room for improvement in that function if we wanted to pursue it (e.g. it does many dtype checks, which we know are slow when repeated many times; in some other critical paths we switched to dtype.kind == .. checks, IIRC).

Improving the constructor might be worthwhile anyway, but I agree that for the specific case of a numeric reduction with axis=1, adding a special case like _reduce_axis1 already does for any/all seems easy to do without much complexity. However much we improve the constructor, constructing a small EA 10,000 times for a 10,000-row DataFrame is always going to be slower.

rhshadrach (Member, Author):

> Might be able to do something similar to _reduce_axis1 to avoid .values/transpose

I looked into this. I think there are some common ops/dtypes we can do this for, but not all ops (e.g. median) and not all dtypes (e.g. third-party EAs), since I think the implementation depends on knowing the neutral element for an op like sum or prod.
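To illustrate the neutral-element point: for sum, missing entries can be filled with 0 (and with 1 for prod) without affecting the result, but median has no such fill value, so the column-wise trick doesn't generalize. A hypothetical sketch, not pandas code:

```python
import numpy as np
import pandas as pd

# Neutral elements for ops that admit a column-wise axis=1 reduction;
# median has no entry here, which is why the trick doesn't generalize.
NEUTRAL = {"sum": 0, "prod": 1}
UFUNC = {"sum": np.add, "prod": np.multiply}

def reduce_axis1_columnwise(df: pd.DataFrame, op: str) -> pd.Series:
    # Hypothetical helper: accumulate row-wise, filling NA with the op's
    # neutral element so missing values leave the result unchanged.
    acc = np.full(len(df), float(NEUTRAL[op]))
    for col in df:
        vals = df[col].fillna(NEUTRAL[op]).to_numpy(dtype="float64")
        acc = UFUNC[op](acc, vals)
    return pd.Series(acc, index=df.index)

df = pd.DataFrame({"a": pd.array([1, None], dtype="Int64"), "b": [4.0, 5.0]})
print(reduce_axis1_columnwise(df, "sum").tolist())   # [5.0, 5.0]
print(reduce_axis1_columnwise(df, "prod").tolist())  # [4.0, 5.0]
```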

A straightforward way to fix this regression is to mostly revert this PR, where we still return object dtype on object input but otherwise the results are unchanged. I can put up a PR for this approach if #51923 isn't the right way forward or is too much to try to get it into 2.0.
