REF: ignore_failures in BlockManager.reduce #35881

Conversation
pandas/tests/frame/test_analytics.py
@@ -1108,10 +1108,10 @@ def test_any_all_bool_only(self):
             True,
             marks=[td.skip_if_np_lt("1.15")],
         ),
-        (np.all, {"A": pd.Series([0, 1], dtype="category")}, False),
+        (np.all, {"A": pd.Series([0, 1], dtype="category")}, True),
         (np.any, {"A": pd.Series([0, 1], dtype="category")}, True),
so this is the bug fix? can you add a whatsnew note
Not a bugfix per se; this is the behavior that changes if we declare that ser.to_frame().all() should be consistent with ser.all(), xref #36076
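The inconsistency being discussed can be sketched roughly as follows (a hypothetical illustration, not the pandas test itself): the old whole-frame route effectively reduced over the plain ndarray, where the 0 makes all() False, while a path that skips the unsupported categorical column is left reducing over an empty selection, which is vacuously True.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1], dtype="category")

# reducing over the raw ndarray (the old whole-frame route): 0 is falsy
assert not np.asarray(ser).all()

# all() over an empty selection (the column having been skipped) is
# vacuously True, matching the updated test expectation
assert bool(pd.DataFrame().all().all())
```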
ok i think we need to deprecate this right (that was consensus?)
also i suppose ok to just change. this is not a very large case. cc @jorisvandenbossche
updated with whatsnew
You have another PR actually trying to deprecate this right?
Also in #36076, you commented a few days ago "the consensus seems to be that we should deprecate the current behavior in favor of matching the Series behavior". But so this PR is not doing that? Or is this PR not handling that case?
Closed that one. This is a giant PITA and we should just rip the bandaid off.
Can you try to summarize the issue a bit more?

So with this PR, a categorical column will be skipped for any/all (does this also impact other reductions? Or other dtypes? Are there tests for this?). This is because with this PR it now takes the route of BlockManager.reduce -> Block.reduce, where Categorical._reduce is called, which raises an error, and BlockManager.reduce catches this error and skips the column. Is that correct?
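The catch-and-skip routing described above can be sketched in isolation (a simplified stand-in for BlockManager.reduce with made-up names; the real implementation operates on Block objects, not (label, array) pairs):

```python
import numpy as np

def reduce_with_ignore_failures(columns, op, ignore_failures=False):
    """Apply `op` to each (label, values) pair; optionally skip failures.

    Mimics the behavior described for BlockManager.reduce: a column whose
    dtype does not support the reduction raises TypeError and is silently
    dropped when ignore_failures=True.
    """
    results = {}
    for label, values in columns:
        try:
            results[label] = op(values)
        except TypeError:
            if not ignore_failures:
                raise
            # column dropped, like the categorical column in this PR
    return results

cols = [
    ("a", np.array([1, 2, 3])),
    # summing dicts raises TypeError, standing in for a dtype that
    # does not support the reduction
    ("b", np.array([{"k": 1}, {"k": 2}], dtype=object)),
]
summed = reduce_with_ignore_failures(cols, lambda v: v.sum(), ignore_failures=True)
```

With ignore_failures=False the same call would propagate the TypeError instead of dropping the column.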
In master, on the other hand, for any/all (filter reductions) we convert the full dataframe to an array (at least with the default numeric_only=None), and then the any/all operation works on the numpy array. But so, that also means that the behaviour with this PR depends on the presence of (another) object dtype column or not?

If we want to deprecate this, would a relatively clean option be: pass through the name of the op to Block.reduce, and let CategoricalBlock.reduce have a special case checking for the any/all op (in which case we can raise a warning and perform the op on the ndarray), and otherwise use the normal Block.reduce implementation?
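A minimal sketch of that proposal (hypothetical function and message; CategoricalBlock.reduce itself is not shown): dispatch on the op name, warn, and fall back to the plain ndarray for any/all only, while every other reduction keeps the raising path.

```python
import warnings
import numpy as np

def categorical_reduce(values, name):
    """Hypothetical special case for a CategoricalBlock-style reduce."""
    if name in ("any", "all"):
        # deprecation path: warn, then reduce over the plain ndarray
        warnings.warn(
            "any/all on a categorical column is deprecated", FutureWarning
        )
        return getattr(np.asarray(values), name)()
    # normal path: unsupported reductions raise, so the manager can
    # skip (or propagate) the failure
    raise TypeError(f"categorical does not support reduction {name!r}")

with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    result = categorical_reduce(np.array([0, 1]), "all")
```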
doc/source/whatsnew/v1.2.0.rst
@@ -244,7 +244,7 @@ Timezones

 Numeric
 ^^^^^^^
--
+- Bug in :class:`DataFrame` reductions incorrectly ignoring ``ExtensionArray`` behaviors (:issue:`35881`)
This note is not very helpful for users I think. Can you list the cases we are aware of that will change? (which will also help reviewing this PR)
Mostly for single-column EA-dtypes where the reduction on the array would raise. Tough to put into a succinct note bc of the dropping-failures behavior. suggestions welcome

I think it means the behavior in master depends on the presence of (another) object dtype column. Which is part of the cluster-frack that is the reason we should rip off the bandaid.
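In other words, the master-branch dispatch being described looks roughly like this (a deliberately stripped-down sketch with invented names, not pandas code): under the default numeric_only=None, any object-dtype column forces the whole-frame ndarray route, otherwise the reduction runs blockwise and can drop failing columns.

```python
import numpy as np

def choose_reduction_path(dtypes):
    """Sketch of the numeric_only=None dispatch described above."""
    if any(dt == object for dt in dtypes):
        # an object column forces reducing over frame.values as one ndarray
        return "whole-frame ndarray"
    # homogeneous-enough frames reduce block by block (skipping failures)
    return "blockwise"

assert choose_reduction_path([np.dtype("int64"), np.dtype("O")]) == "whole-frame ndarray"
assert choose_reduction_path([np.dtype("int64"), np.dtype("float64")]) == "blockwise"
```

This is why the result for a categorical column could depend on whether a second, unrelated object column happened to be present.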
Reverted behavior-changing component so that bug can be fixed separately

lgtm. i know you are trying to cleanup this whole area, so good to go. @jorisvandenbossche any comments (I am sure this is going to be simplified over time)

yah the real simplifications dont come until we do the actual bugfixes, but this is a step in the right direction

great. glad finally to get this in.
Doing some profiling on the example from #37081, and reporting a few findings here.
@@ -8595,6 +8595,7 @@ def _reduce(
         cols = self.columns[~dtype_is_dt]
         self = self[cols]

+        any_object = self.dtypes.apply(is_object_dtype).any()
I think this is partly the culprit of the slowdown. See also the top post of #33252, which shows that self.dtypes.apply(..) is slower than the method that is used a few lines above for dtype_is_dt.
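To illustrate the point (an equivalence check, not a benchmark of the actual pandas internals): the Series.apply route builds an intermediate boolean Series, while a plain generator over df.dtypes gives the same answer with less overhead and can short-circuit.

```python
import pandas as pd
from pandas.api.types import is_object_dtype

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [1.0, 2.0]})

# the route used in the diff above: allocates an intermediate boolean Series
via_apply = df.dtypes.apply(is_object_dtype).any()

# a cheaper equivalent: short-circuiting generator over the dtypes
via_generator = any(is_object_dtype(dt) for dt in df.dtypes)

assert bool(via_apply) == via_generator
```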
-        res = df._mgr.reduce(blk_func)
-        out = df._constructor(res).iloc[0].rename(None)
+        res, indexer = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
+        out = df._constructor(res).iloc[0]
Based on my profiling, this getitem also seems to take a significant amount of the total time, although this cannot explain the recent perf degradation (but I am comparing my profile on master vs 1.1, where the iloc was not yet present)
The getitem being iloc[0]?
Indeed
Moving towards collecting all of the ignore_failures code in one place.
The case where we have object dtypes is kept separate in this PR, will be handled in the next pass.