[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

jorisvandenbossche · 2021-02-01T18:44:18Z

xref indexing work item of #39146

This covers setitem with a boolean dataframe (like df[df < 0] = 2), but specifically fixes the case where the dataframe is not aligned (eg df[df[:-1] < 0] = 2 in the tests).

DataFrame.fillna infers object dtype columns getting filled for a proper dtype, but with ArrayManager internals I propose that fillna always preserves the dtype.

jbrockmendel · 2021-02-01T19:20:04Z

pandas/core/generic.py

+        if isinstance(cond, ABCDataFrame) and cond._has_array_manager:
+            if not all(is_bool_dtype(dt) for dt in cond.dtypes) and all_bool_columns:
+                cond = cond.astype(bool)
+


is there a viable way to put these patches in with the ArrayManager code?

I don't directly see one, as we don't want to change (IMO) the fillna behaviour.
Unless we we would add a keyword for this (only for internal use) to the manager fillna method, and call cond._mgr.fillna here instead of cond.fillna to be able to pass it through, but not sure that would be any cleaner (as long as it is only used once, at least, there might come up other places where this is needed later)

left same comment above. I think we are going to want to push more code to the manager (kind of the opposite that we have been doing of late though).

jbrockmendel · 2021-02-01T19:21:52Z

Might be a coincidence, but in the ArrayManager code def setitem is commented out with a # TODO what is this used for?. The answer is it is used for setting via iloc

jorisvandenbossche · 2021-02-02T18:35:02Z

Might be a coincidence, but in the ArrayManager code def setitem is commented out with a # TODO what is this used for?. The answer is it is used for setting via iloc

I was actually also implementing that while trying to get tests in frame/indexing/test_indexing.py passing, but since this where related fix and the setitem implementation are not actually related, split it off in a separate branch (but will be opening a PR for it)

jorisvandenbossche · 2021-02-08T15:41:05Z

@jbrockmendel more comments here? (suggestions for a cleaner solution / where to put this are certainly welcome)

jbrockmendel · 2021-02-08T16:36:53Z

will take another look this afternoon

pandas/tests/frame/indexing/test_where.py

jbrockmendel · 2021-02-08T22:03:38Z

Big picture: what was your takeaway from last week's call re behavior differences between ArrayManager vs BlockManager. My takeaway was that wherever feasible we should keep them in sync and do a deprecation on the existing method.

Is there a technical reason why keeping the behaviors in sync is not feasible?

jorisvandenbossche · 2021-02-08T22:55:38Z

My takeaway was that wherever feasible we should keep them in sync and do a deprecation on the existing method.

My interpretation was different (but it was not fully clear to me what the general sense was about this / what the conclusion was, so maybe this is rather stating what I would argue for myself): have new behaviour in ArrayManager, deprecate in BlockManager.

If we agree we want to deprecate a certain behaviour (and thus in practice deprecate it for the current default internals, i.e. using BlockManager), IMO it doesn't make much sense to implement the old deprecated behaviour for ArrayManager: 1) this will not always be easy (for this case of fillna downcasting that's not a problem, but eg I don't want to try to replicate the copy/view behaviours of the BlockManager in ArrayManager), 2) having the new behaviour in ArrayManager actually makes it possible to experiment with this (which I think will be useful).
Implementing the future behaviour now also ensures we can have a good overview and implementation of changing behaviour for a potential 2.0 (or 3.0) release, instead of having to do that just before that release.

Of course, specifically for the fillna downcasting behaviour, that is quite self contained and thus rather easy to add in ArrayManager as well (there is certainly no technical reason that it's not possible here). #39578 is another example of changing behaviour, and one where I think it is much less straightforward (or desired) to mimic behaviour of BlockManager.

It also might not need to be all one way or the other. We could also decide on a case by case, and eg prefer preserving behaviour for fillna while implementing new behaviour for eg indexing operations.

jreback · 2021-02-08T22:55:05Z

pandas/core/generic.py

@@ -8812,9 +8819,32 @@ def _where(
                raise ValueError("Array conditional must be same shape as self")
            cond = self._constructor(cond, **self._construct_axes_dict())

+        # Needed for DataFrames with ArrayManager, see below for details


is it possible to just do this inside the manager itself? I know this might mean a slight refactoring here, but it would be better if we could do that

So basically this code does two steps for cond: 1) align and 2) fillna.
The way that I wrote this now, those are a bit intertwined (I check something about the dataframe that is need for fillna before the alignment step), but I don't think we want to push the alignment inside the manager? (that's something that now is always done outside I think, the manager doesn't need to care about that)

As mentioned below, I could call cond._mgr.fillna(..) instead of cond.fillna(), and then add a keyword whether to force it to be boolean or not (what is now encoded in all_bool_columns).
But, in that idea, the code to determine all_bool_columns would still live here, and thus would only move part of the added complexity into the internals.

jreback · 2021-02-08T22:55:59Z

pandas/core/generic.py

+        if isinstance(cond, ABCDataFrame) and cond._has_array_manager:
+            if not all(is_bool_dtype(dt) for dt in cond.dtypes) and all_bool_columns:
+                cond = cond.astype(bool)
+


left same comment above. I think we are going to want to push more code to the manager (kind of the opposite that we have been doing of late though).

jbrockmendel · 2021-02-08T23:15:45Z

My interpretation was different (but it was not fully clear to me what the general sense was about this / what the conclusion was, so maybe this is rather stating what I would argue for myself):

yah i checked the notes and it doesnt look like we wrote anything down. easy for it to become a rorscach test

If we agree we want to deprecate a certain behaviour (and thus in practice deprecate it for the current default internals, i.e. using BlockManager)

The "if" part is important. That discussion should be orthogonal to the "getting ArrayManager fully functional" task.

this will not always be easy (for this case of fillna downcasting that's not a problem, but eg I don't want to try to replicate the copy/view behaviours of the BlockManager in ArrayManager)

No objection on the copy/view-indexing bit. But for other methods where it is easy i think the default should be to keep behavior in sync.

This will certainly reduce complexity in the test suite, and I'm guessing in the code too. e.g. am I right in thinking that the extra code in NDFrame._where (that both Jeff and I have independently commented on asking if it can go in ArrayManager) would be easier to keep in ArrayManager if it weren't for the changed behavior?

having the new behaviour in ArrayManager actually makes it possible to experiment with this (which I think will be useful).

When evaluating ArrayManager, it is important that we evaluate it in isolation. e.g. if I see a perf change I don't want to have to wonder/check if it is due to a type inference being skipped or something.

jorisvandenbossche · 2021-02-08T23:34:53Z

yah i checked the notes and it doesnt look like we wrote anything down. easy for it to become a rorscach test

Yeah, it was the first thing I thought after our meeting "we should have forced ourselves to write down some conclusions, because I already don't know anymore what those were" ..

The "if" part is important. That discussion should be orthogonal to the "getting ArrayManager fully functional" task.

Yes, I know. That's the reason I opened #39584, and I know I need to open one for the downcasting behaviour as well.

e.g. am I right in thinking that the extra code in NDFrame._where (that both Jeff and I have independently commented on asking if it can go in ArrayManager) would be easier to keep in ArrayManager if it weren't for the changed behavior?

If we don't change behaviour, there is nothing to change at all here. It's only if we change behaviour, that some part of the changes need to be handled in DataFrame._where. If fillna doesn't change behaviour, no change is needed here.
When I am saying "I don't directly see an easy way to handle this inside ArrayManager", that is only for the case of a changed fillna behaviour (which was initially my starting point for this PR).

jorisvandenbossche · 2021-02-09T12:49:32Z

When evaluating ArrayManager, it is important that we evaluate it in isolation. e.g. if I see a perf change I don't want to have to wonder/check if it is due to a type inference being skipped or something.

I am OK with keeping the fillna behaviour as is for now. So we can keep the discussion about keeping ArrayManager consistent or not for the harder changes such as indexing behaviour.
(but note that the above won't be possible in general, because also eg copy/view differences between both will have some impact)

So closing this PR, will open a new one making fillna consistent for ArrayManager.

jorisvandenbossche · 2021-02-09T13:15:17Z

-> new PR at #39697

[ArrayManager] BUG: fix indexing with non-aligned boolean dataframe

7eee352

jbrockmendel reviewed Feb 1, 2021

View reviewed changes

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation labels Feb 2, 2021

jorisvandenbossche added this to the 1.3 milestone Feb 2, 2021

jorisvandenbossche added 2 commits February 3, 2021 14:50

Merge remote-tracking branch 'upstream/master' into am-indexing-where

c4da68c

additional fixes for test_where.py

2476591

jorisvandenbossche added 2 commits February 8, 2021 16:42

Merge remote-tracking branch 'upstream/master' into am-indexing-where

f4008e4

fix mixed numeric case

024d5b8

jorisvandenbossche commented Feb 8, 2021

View reviewed changes

pandas/tests/frame/indexing/test_where.py Outdated Show resolved Hide resolved

jreback requested changes Feb 8, 2021

View reviewed changes

jorisvandenbossche closed this Feb 9, 2021

jorisvandenbossche mentioned this pull request Feb 9, 2021

[ArrayManager] Enable downcast in fillna / enable where indexing tests #39697

Merged

jorisvandenbossche deleted the am-indexing-where branch December 7, 2021 07:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

jorisvandenbossche commented Feb 1, 2021

jbrockmendel Feb 1, 2021

jorisvandenbossche Feb 2, 2021

jreback Feb 8, 2021

jbrockmendel commented Feb 1, 2021

jorisvandenbossche commented Feb 2, 2021

jorisvandenbossche commented Feb 8, 2021 •

edited

Loading

jbrockmendel commented Feb 8, 2021

jbrockmendel commented Feb 8, 2021

jorisvandenbossche commented Feb 8, 2021

jreback Feb 8, 2021

jorisvandenbossche Feb 8, 2021

jreback Feb 8, 2021

jbrockmendel commented Feb 8, 2021

jorisvandenbossche commented Feb 8, 2021

jorisvandenbossche commented Feb 9, 2021

jorisvandenbossche commented Feb 9, 2021

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

Conversation

jorisvandenbossche commented Feb 1, 2021

jbrockmendel Feb 1, 2021

Choose a reason for hiding this comment

jorisvandenbossche Feb 2, 2021

Choose a reason for hiding this comment

jreback Feb 8, 2021

Choose a reason for hiding this comment

jbrockmendel commented Feb 1, 2021

jorisvandenbossche commented Feb 2, 2021

jorisvandenbossche commented Feb 8, 2021 • edited Loading

jbrockmendel commented Feb 8, 2021

jbrockmendel commented Feb 8, 2021

jorisvandenbossche commented Feb 8, 2021

jreback Feb 8, 2021

Choose a reason for hiding this comment

jorisvandenbossche Feb 8, 2021

Choose a reason for hiding this comment

jreback Feb 8, 2021

Choose a reason for hiding this comment

jbrockmendel commented Feb 8, 2021

jorisvandenbossche commented Feb 8, 2021

jorisvandenbossche commented Feb 9, 2021

jorisvandenbossche commented Feb 9, 2021

jorisvandenbossche commented Feb 8, 2021 •

edited

Loading