Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539

Conversation

jorisvandenbossche
Copy link
Member

xref indexing work item of #39146

This covers setitem with a boolean dataframe (like df[df < 0] = 2), but specifically fixes the case where the dataframe is not aligned (eg df[df[:-1] < 0] = 2 in the tests).

DataFrame.fillna infers object dtype columns getting filled for a proper dtype, but with ArrayManager internals I propose that fillna always preserves the dtype.

if isinstance(cond, ABCDataFrame) and cond._has_array_manager:
if not all(is_bool_dtype(dt) for dt in cond.dtypes) and all_bool_columns:
cond = cond.astype(bool)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a viable way to put these patches in with the ArrayManager code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't directly see one, as we don't want to change (IMO) the fillna behaviour.
Unless we we would add a keyword for this (only for internal use) to the manager fillna method, and call cond._mgr.fillna here instead of cond.fillna to be able to pass it through, but not sure that would be any cleaner (as long as it is only used once, at least, there might come up other places where this is needed later)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left same comment above. I think we are going to want to push more code to the manager (kind of the opposite that we have been doing of late though).

@jbrockmendel
Copy link
Member

Might be a coincidence, but in the ArrayManager code def setitem is commented out with a # TODO what is this used for?. The answer is it is used for setting via iloc

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation labels Feb 2, 2021
@jorisvandenbossche
Copy link
Member Author

Might be a coincidence, but in the ArrayManager code def setitem is commented out with a # TODO what is this used for?. The answer is it is used for setting via iloc

I was actually also implementing that while trying to get tests in frame/indexing/test_indexing.py passing, but since this where related fix and the setitem implementation are not actually related, split it off in a separate branch (but will be opening a PR for it)

@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 2, 2021
@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Feb 8, 2021

@jbrockmendel more comments here? (suggestions for a cleaner solution / where to put this are certainly welcome)

@jbrockmendel
Copy link
Member

will take another look this afternoon

@jbrockmendel
Copy link
Member

Big picture: what was your takeaway from last week's call re behavior differences between ArrayManager vs BlockManager. My takeaway was that wherever feasible we should keep them in sync and do a deprecation on the existing method.

Is there a technical reason why keeping the behaviors in sync is not feasible?

@jorisvandenbossche
Copy link
Member Author

My takeaway was that wherever feasible we should keep them in sync and do a deprecation on the existing method.

My interpretation was different (but it was not fully clear to me what the general sense was about this / what the conclusion was, so maybe this is rather stating what I would argue for myself): have new behaviour in ArrayManager, deprecate in BlockManager.

If we agree we want to deprecate a certain behaviour (and thus in practice deprecate it for the current default internals, i.e. using BlockManager), IMO it doesn't make much sense to implement the old deprecated behaviour for ArrayManager: 1) this will not always be easy (for this case of fillna downcasting that's not a problem, but eg I don't want to try to replicate the copy/view behaviours of the BlockManager in ArrayManager), 2) having the new behaviour in ArrayManager actually makes it possible to experiment with this (which I think will be useful).
Implementing the future behaviour now also ensures we can have a good overview and implementation of changing behaviour for a potential 2.0 (or 3.0) release, instead of having to do that just before that release.

Of course, specifically for the fillna downcasting behaviour, that is quite self contained and thus rather easy to add in ArrayManager as well (there is certainly no technical reason that it's not possible here). #39578 is another example of changing behaviour, and one where I think it is much less straightforward (or desired) to mimic behaviour of BlockManager.

It also might not need to be all one way or the other. We could also decide on a case by case, and eg prefer preserving behaviour for fillna while implementing new behaviour for eg indexing operations.

@@ -8812,9 +8819,32 @@ def _where(
raise ValueError("Array conditional must be same shape as self")
cond = self._constructor(cond, **self._construct_axes_dict())

# Needed for DataFrames with ArrayManager, see below for details
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to just do this inside the manager itself? I know this might mean a slight refactoring here, but it would be better if we could do that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically this code does two steps for cond: 1) align and 2) fillna.
The way that I wrote this now, those are a bit intertwined (I check something about the dataframe that is need for fillna before the alignment step), but I don't think we want to push the alignment inside the manager? (that's something that now is always done outside I think, the manager doesn't need to care about that)

As mentioned below, I could call cond._mgr.fillna(..) instead of cond.fillna(), and then add a keyword whether to force it to be boolean or not (what is now encoded in all_bool_columns).
But, in that idea, the code to determine all_bool_columns would still live here, and thus would only move part of the added complexity into the internals.

if isinstance(cond, ABCDataFrame) and cond._has_array_manager:
if not all(is_bool_dtype(dt) for dt in cond.dtypes) and all_bool_columns:
cond = cond.astype(bool)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left same comment above. I think we are going to want to push more code to the manager (kind of the opposite that we have been doing of late though).

@jbrockmendel
Copy link
Member

My interpretation was different (but it was not fully clear to me what the general sense was about this / what the conclusion was, so maybe this is rather stating what I would argue for myself):

yah i checked the notes and it doesnt look like we wrote anything down. easy for it to become a rorscach test

If we agree we want to deprecate a certain behaviour (and thus in practice deprecate it for the current default internals, i.e. using BlockManager)

The "if" part is important. That discussion should be orthogonal to the "getting ArrayManager fully functional" task.

this will not always be easy (for this case of fillna downcasting that's not a problem, but eg I don't want to try to replicate the copy/view behaviours of the BlockManager in ArrayManager)

No objection on the copy/view-indexing bit. But for other methods where it is easy i think the default should be to keep behavior in sync.

This will certainly reduce complexity in the test suite, and I'm guessing in the code too. e.g. am I right in thinking that the extra code in NDFrame._where (that both Jeff and I have independently commented on asking if it can go in ArrayManager) would be easier to keep in ArrayManager if it weren't for the changed behavior?

having the new behaviour in ArrayManager actually makes it possible to experiment with this (which I think will be useful).

When evaluating ArrayManager, it is important that we evaluate it in isolation. e.g. if I see a perf change I don't want to have to wonder/check if it is due to a type inference being skipped or something.

@jorisvandenbossche
Copy link
Member Author

yah i checked the notes and it doesnt look like we wrote anything down. easy for it to become a rorscach test

Yeah, it was the first thing I thought after our meeting "we should have forced ourselves to write down some conclusions, because I already don't know anymore what those were" ..

The "if" part is important. That discussion should be orthogonal to the "getting ArrayManager fully functional" task.

Yes, I know. That's the reason I opened #39584, and I know I need to open one for the downcasting behaviour as well.

e.g. am I right in thinking that the extra code in NDFrame._where (that both Jeff and I have independently commented on asking if it can go in ArrayManager) would be easier to keep in ArrayManager if it weren't for the changed behavior?

If we don't change behaviour, there is nothing to change at all here. It's only if we change behaviour, that some part of the changes need to be handled in DataFrame._where. If fillna doesn't change behaviour, no change is needed here.
When I am saying "I don't directly see an easy way to handle this inside ArrayManager", that is only for the case of a changed fillna behaviour (which was initially my starting point for this PR).

@jorisvandenbossche
Copy link
Member Author

When evaluating ArrayManager, it is important that we evaluate it in isolation. e.g. if I see a perf change I don't want to have to wonder/check if it is due to a type inference being skipped or something.

I am OK with keeping the fillna behaviour as is for now. So we can keep the discussion about keeping ArrayManager consistent or not for the harder changes such as indexing behaviour.
(but note that the above won't be possible in general, because also eg copy/view differences between both will have some impact)

So closing this PR, will open a new one making fillna consistent for ArrayManager.

@jorisvandenbossche
Copy link
Member Author

-> new PR at #39697

@jorisvandenbossche jorisvandenbossche deleted the am-indexing-where branch December 7, 2021 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants