-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539
[ArrayManager] BUG: fix setitem with non-aligned boolean dataframe #39539
Conversation
if isinstance(cond, ABCDataFrame) and cond._has_array_manager: | ||
if not all(is_bool_dtype(dt) for dt in cond.dtypes) and all_bool_columns: | ||
cond = cond.astype(bool) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a viable way to put these patches in with the ArrayManager code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't directly see one, as we don't want to change (IMO) the fillna
behaviour.
Unless we we would add a keyword for this (only for internal use) to the manager fillna
method, and call cond._mgr.fillna
here instead of cond.fillna
to be able to pass it through, but not sure that would be any cleaner (as long as it is only used once, at least, there might come up other places where this is needed later)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left same comment above. I think we are going to want to push more code to the manager (kind of the opposite that we have been doing of late though).
Might be a coincidence, but in the ArrayManager code |
I was actually also implementing that while trying to get tests in |
@jbrockmendel more comments here? (suggestions for a cleaner solution / where to put this are certainly welcome) |
will take another look this afternoon |
Big picture: what was your takeaway from last week's call re behavior differences between ArrayManager vs BlockManager. My takeaway was that wherever feasible we should keep them in sync and do a deprecation on the existing method. Is there a technical reason why keeping the behaviors in sync is not feasible? |
My interpretation was different (but it was not fully clear to me what the general sense was about this / what the conclusion was, so maybe this is rather stating what I would argue for myself): have new behaviour in ArrayManager, deprecate in BlockManager. If we agree we want to deprecate a certain behaviour (and thus in practice deprecate it for the current default internals, i.e. using BlockManager), IMO it doesn't make much sense to implement the old deprecated behaviour for ArrayManager: 1) this will not always be easy (for this case of fillna downcasting that's not a problem, but eg I don't want to try to replicate the copy/view behaviours of the BlockManager in ArrayManager), 2) having the new behaviour in ArrayManager actually makes it possible to experiment with this (which I think will be useful). Of course, specifically for the It also might not need to be all one way or the other. We could also decide on a case by case, and eg prefer preserving behaviour for fillna while implementing new behaviour for eg indexing operations. |
@@ -8812,9 +8819,32 @@ def _where( | |||
raise ValueError("Array conditional must be same shape as self") | |||
cond = self._constructor(cond, **self._construct_axes_dict()) | |||
|
|||
# Needed for DataFrames with ArrayManager, see below for details |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to just do this inside the manager itself? I know this might mean a slight refactoring here, but it would be better if we could do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically this code does two steps for cond
: 1) align and 2) fillna.
The way that I wrote this now, those are a bit intertwined (I check something about the dataframe that is need for fillna before the alignment step), but I don't think we want to push the alignment inside the manager? (that's something that now is always done outside I think, the manager doesn't need to care about that)
As mentioned below, I could call cond._mgr.fillna(..)
instead of cond.fillna()
, and then add a keyword whether to force it to be boolean or not (what is now encoded in all_bool_columns
).
But, in that idea, the code to determine all_bool_columns
would still live here, and thus would only move part of the added complexity into the internals.
if isinstance(cond, ABCDataFrame) and cond._has_array_manager: | ||
if not all(is_bool_dtype(dt) for dt in cond.dtypes) and all_bool_columns: | ||
cond = cond.astype(bool) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left same comment above. I think we are going to want to push more code to the manager (kind of the opposite that we have been doing of late though).
yah i checked the notes and it doesnt look like we wrote anything down. easy for it to become a rorscach test
The "if" part is important. That discussion should be orthogonal to the "getting ArrayManager fully functional" task.
No objection on the copy/view-indexing bit. But for other methods where it is easy i think the default should be to keep behavior in sync. This will certainly reduce complexity in the test suite, and I'm guessing in the code too. e.g. am I right in thinking that the extra code in
When evaluating ArrayManager, it is important that we evaluate it in isolation. e.g. if I see a perf change I don't want to have to wonder/check if it is due to a type inference being skipped or something. |
Yeah, it was the first thing I thought after our meeting "we should have forced ourselves to write down some conclusions, because I already don't know anymore what those were" ..
Yes, I know. That's the reason I opened #39584, and I know I need to open one for the downcasting behaviour as well.
If we don't change behaviour, there is nothing to change at all here. It's only if we change behaviour, that some part of the changes need to be handled in |
I am OK with keeping the fillna behaviour as is for now. So we can keep the discussion about keeping ArrayManager consistent or not for the harder changes such as indexing behaviour. So closing this PR, will open a new one making |
-> new PR at #39697 |
xref indexing work item of #39146
This covers setitem with a boolean dataframe (like
df[df < 0] = 2
), but specifically fixes the case where the dataframe is not aligned (egdf[df[:-1] < 0] = 2
in the tests).DataFrame.fillna
infers object dtype columns getting filled for a proper dtype, but with ArrayManager internals I propose thatfillna
always preserves the dtype.