
BUG: DataFrame reductions dtypes on object input #51335

Merged

Conversation

rhshadrach (Member)

@rhshadrach rhshadrach changed the title BUG: DataFrame reductions dtypes BUG: DataFrame reductions dtypes on object input Feb 12, 2023
@rhshadrach rhshadrach mentioned this pull request Feb 13, 2023
- Bug in :meth:`DataFrame.sem` and :meth:`Series.sem` where an erroneous ``TypeError`` would always raise when using data backed by an :class:`ArrowDtype` (:issue:`49759`)
- Bug in :meth:`Series.__add__` casting to object for list and masked :class:`Series` (:issue:`22962`)
- Bug in :meth:`DataFrame.query` with ``engine="numexpr"`` and column names are ``min`` or ``max`` would raise a ``TypeError`` (:issue:`50937`)
- Bug in :meth:`DataFrame.min` and :meth:`DataFrame.max` with tz-aware data containing NA and ``axis=1`` would return incorrect results (:issue:`51242`)
Member:
does NA here refer to pd.NaT?

rhshadrach (Member, Author):
Thanks - will fix.

    if axis is None:
        return result
    return func(df.values)
Member:
hmm .values can be expensive, might be better to reduce twice? (... which can also be expensive. darn). Is punting on this viable?
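For context, a minimal sketch (not from the thread) of why `.values` can be expensive: on a mixed-dtype frame it must find a common dtype for the whole block of data, which typically means upcasting to object and boxing every value.

```python
import numpy as np
import pandas as pd

# Mixed dtypes: .values must find a common dtype, here object,
# which copies the data and boxes every element.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(df.values.dtype)  # object

# Homogeneous dtypes keep a native NumPy dtype.
df2 = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
print(df2.values.dtype)  # float64
```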

rhshadrach (Member, Author), Feb 13, 2023:

Using .values is the current behavior on main if numeric_only is False and axis is not 0.

pandas/pandas/core/frame.py, lines 10494 to 10496 at c7fa611:

    data = self
    values = data.values
    result = func(values)

In fact, main is currently broken when numeric_only is True and axis is None:

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': list('xyz')})
result = df.mean(axis=None, numeric_only=True)
print(result)
# a    1.333333
# b    4.000000
# dtype: float64

result2 = df[['a', 'b']].mean(axis=None)
print(result2)
# 2.6666666666666665

Assuming this is okay for now, I can add a test for when numeric_only is False and axis is None.
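For reference, a small illustration (not part of the PR) of the semantics the example above expects: with `axis=None` the reduction should match flattening the numeric values into one array, whereas `axis=0` reduces per column.

```python
import pandas as pd

# Illustration of the expected axis=None semantics on the thread's example.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": list("xyz")})
numeric = df.select_dtypes("number")

# Reducing over both axes should match flattening the numeric values.
flat_mean = numeric.to_numpy().mean()
print(flat_mean)  # 2.6666666666666665

# Per-column reduction (axis=0) gives one value per column instead.
col_means = numeric.mean(axis=0)
print(col_means.tolist())
```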

rhshadrach (Member, Author):

@jbrockmendel - friendly ping

@mroeschke mroeschke added the Reduction Operations sum, mean, min, max, etc. label Feb 13, 2023
    if is_float_dtype(result_dtype):
        # Preserve dtype when possible
        # mypy doesn't infer result_dtype is not None
        result = getattr(
Member:
result_dtype.type("nan")?

jbrockmendel (Member):
except for the ArrayManager thing, this LGTM

jbrockmendel (Member):

LGTM

@mroeschke mroeschke added this to the 2.0 milestone Feb 18, 2023
@mroeschke mroeschke merged commit b836a88 into pandas-dev:main Feb 18, 2023
mroeschke (Member):
Thanks @rhshadrach

@rhshadrach rhshadrach deleted the object_reduction_axis_1_attempt_3 branch February 18, 2023 01:33
jbrockmendel (Member):

@rhshadrach looks like this caused a regression in https://asv-runner.github.io/asv-collection/pandas/#stat_ops.FrameOps.time_op?p-op='mean'&p-dtype='Int64'&p-axis=1&commits=b836a88f for nullable cases with axis=1. Can you take a look?

rhshadrach (Member, Author):

1.5.3 drops down to the underlying NumPy array and calls the op with axis=1. main computes the transpose of the DataFrame (which is inefficient for EAs), then computes the sum via the BlockManager (which I think is inefficient for EAs with wide data).

I'm still looking for a better alternative, but I expect we'll need to largely restore the previous implementation and defer a better solution to 2.1.

jbrockmendel (Member):

Is it a correctness vs perf tradeoff? If so correctness wins every time.

Might be able to do something similar to _reduce_axis1 to avoid .values/transpose

rhshadrach (Member, Author), Mar 2, 2023:

> Is it a correctness vs perf tradeoff? If so correctness wins every time.

I think I can get every test to pass by using a combination of 1.5.x's _reduce and the one in this PR, except for test_minmax_tzaware_skipna_axis_1. For tz-aware data with NaT, dropping down to NumPy arrays is insufficient to compute the op. I don't see any easy way to fix this while avoiding the perf regression. I think (but haven't confirmed yet) that this was also broken in 1.5.x; is it worth the perf hit?

Longer term (for 2.1), it does seem to me doing something like _reduce_axis1 is the way to go. For example, here is the ASV that identified this regression (using 1.5.x, so before the perf regression occurred; I'm leaving out the setup here) against a direct implementation of mean for axis=1:

%timeit df.mean(axis=1)
18.5 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit sum(df[c] * (1/df.shape[1]) for c in df)
778 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
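A hedged sketch of what a column-wise axis=1 mean in the spirit of _reduce_axis1 could look like. `mean_axis1_columnwise` is a hypothetical helper, not the PR's code; it assumes all-numeric columns and at least one non-NA value per row.

```python
import numpy as np
import pandas as pd

def mean_axis1_columnwise(df: pd.DataFrame) -> pd.Series:
    # Hypothetical sketch: accumulate per-row sums and non-NA counts one
    # column at a time, avoiding df.T / .values (expensive for EA columns).
    total = np.zeros(len(df))
    count = np.zeros(len(df), dtype="int64")
    for col in df:
        vals = df[col]
        # Track which entries are present, then fill NA with 0 for the sum.
        count += vals.notna().to_numpy().astype("int64")
        total += vals.fillna(0).to_numpy(dtype="float64")
    return pd.Series(total / count, index=df.index)

df = pd.DataFrame({"a": pd.array([1, None, 3], dtype="Int64"),
                   "b": [4.0, 5.0, 6.0]})
print(mean_axis1_columnwise(df).tolist())  # [2.5, 5.0, 4.5]
```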

jbrockmendel (Member):

Any interest in trying to get this improved for 2.0?

jorisvandenbossche (Member):

For the case of masked arrays (the specific case from the ASV regression), the transpose spends most of its time reconstructing an EA for each row, much of which is spent in _coerce_to_data_and_mask. There is a lot of room for improvement in that function if we wanted to pursue it (e.g. it does many dtype checks, which we know are slow when repeated many times; in some other critical paths we switched to dtype.kind == .. checks, IIRC).

Improving the constructor might be worthwhile anyway, but I agree that for the specific case of a numeric reduction with axis=1, adding a special case like _reduce_axis1 already does for any/all seems easy to do without much complexity. However much we improve the constructor, constructing a small EA 10,000 times for a 10,000-row DataFrame is always going to be slower.

rhshadrach (Member, Author):

> Might be able to do something similar to _reduce_axis1 to avoid .values/transpose

I looked into this. I think there are some common ops/dtypes we can do this for, but not all ops (e.g. median) and not all dtypes (e.g. third-party EAs), since I think the implementation depends on knowing the neutral element for an op like sum or prod.
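To illustrate the neutral-element point: for sum, missing entries can be filled with 0 (and with 1 for prod) without affecting the result, but median has no such fill value, so the column-wise trick doesn't generalize. A hypothetical sketch, not pandas code:

```python
import numpy as np
import pandas as pd

# Neutral elements for ops that admit a column-wise axis=1 reduction;
# median has no entry here, which is why the trick doesn't generalize.
NEUTRAL = {"sum": 0, "prod": 1}
UFUNC = {"sum": np.add, "prod": np.multiply}

def reduce_axis1_columnwise(df: pd.DataFrame, op: str) -> pd.Series:
    # Hypothetical helper: accumulate row-wise, filling NA with the op's
    # neutral element so missing values leave the result unchanged.
    acc = np.full(len(df), float(NEUTRAL[op]))
    for col in df:
        vals = df[col].fillna(NEUTRAL[op]).to_numpy(dtype="float64")
        acc = UFUNC[op](acc, vals)
    return pd.Series(acc, index=df.index)

df = pd.DataFrame({"a": pd.array([1, None], dtype="Int64"), "b": [4.0, 5.0]})
print(reduce_axis1_columnwise(df, "sum").tolist())   # [5.0, 5.0]
print(reduce_axis1_columnwise(df, "prod").tolist())  # [4.0, 5.0]
```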

A straightforward way to fix this regression is to mostly revert this PR, where we still return object dtype on object input but otherwise the results are unchanged. I can put up a PR for this approach if #51923 isn't the right way forward or is too much to try to get it into 2.0.
