-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/PERF: Avoid listifying in dispatch_to_extension_op #23155
BUG/PERF: Avoid listifying in dispatch_to_extension_op #23155
Conversation
This simplifies dispatch_to_extension_op. The remaining logic is simply unboxing Series / Indexes in favor of their underlying arrays. This forced two additional changes 1. Move some logic that IntegerArray relied on down to the IntegerArray ops. Things like handling of 0-dim ndarrays was previously broken on IntegerArray ops, but work with Serires[IntegerArray] 2. Fix pandas handling of 1 ** NA.
Hello @TomAugspurger! Thanks for submitting the PR.
|
raise NotImplementedError( | ||
"can only perform ops with 1-d structures") | ||
|
||
elif getattr(other, 'ndim', None) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moved from dispatch_to_extension_array
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had to keep this one in an elif, so that we avoid the else
block raising a TypeError.
pandas/core/arrays/integer.py
Outdated
@@ -280,6 +280,8 @@ def _coerce_to_ndarray(self): | |||
data[self._mask] = self._na_value | |||
return data | |||
|
|||
__array_priority__ = 1 # higher than ndarray so ops dispatch to us |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Index, Series, and now DataFrame all set this to 1000. Does differing from those matter?
Probably not, since we don’t use it. But I’ll set it to match.
________________________________
From: jbrockmendel <[email protected]>
Sent: Sunday, October 14, 2018 3:18:59 PM
To: pandas-dev/pandas
Cc: Tom Augspurger; Mention
Subject: Re: [pandas-dev/pandas] BUG/PERF: Avoid listifying in dispatch_to_extension_op (#23155)
@jbrockmendel commented on this pull request.
________________________________
In pandas/core/arrays/integer.py<#23155 (comment)>:
@@ -280,6 +280,8 @@ def _coerce_to_ndarray(self):
data[self._mask] = self._na_value
return data
+ __array_priority__ = 1 # higher than ndarray so ops dispatch to us
Index, Series, and now DataFrame all set this to 1000. Does differing from those matter?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#23155 (review)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ABQHIg8fa88jVOc1Pkk7EGLw88WtGbmiks5uk5wzgaJpZM4XbRVy>.
|
@@ -280,7 +280,7 @@ def _coerce_to_ndarray(self): | |||
data[self._mask] = self._na_value | |||
return data | |||
|
|||
__array_priority__ = 1 # higher than ndarray so ops dispatch to us | |||
__array_priority__ = 1000 # higher than ndarray so ops dispatch to us |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just put this in the base class? (for the ops mixin)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems a little too invasive for a base class. I’d rather leave that up to the subclasser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so what arithmetic subclass would not want this set?
is there an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, I'm not sure if there's a way to unset it, if you don't want to set it in a subclass (you don't want to opt into numpy's array stuff at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just find this a detail which would likely be forgotten in any subclass, I don't see a harm and much upset in setting it onthe base class (you can always unset if you really really think you need to).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you unset it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really know if setting __array_priority__ = 0
is enough to "unset" it, and I don't know what all setting __array_priority__
in the first place opts you into.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you document this in the Mixin itself though (if you are not going to set it by defaulrt). It is so non-obvious that you need to do this.
Codecov Report
@@ Coverage Diff @@
## master #23155 +/- ##
==========================================
- Coverage 92.19% 92.18% -0.02%
==========================================
Files 169 169
Lines 50956 50967 +11
==========================================
+ Hits 46978 46983 +5
- Misses 3978 3984 +6
Continue to review full report at Codecov.
|
doc/source/extending.rst
Outdated
2. call ``result = op(values, ExtensionArray)`` | ||
3. re-box the result in a ``Series`` | ||
|
||
Similar for DataFrame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above seems the good logic to me. But, shouldn't then the _create_comparison_method
be updated to actually do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to delete the DataFrame comment. I think it's not so relevant since DataFrames are 2D. I'm assuming most arrays will want to match NumPy's broadcasting behavior.
Which _create_comparison_method
do you mean? The one used in
ExtensionScalarOpsMixin`?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(my comment was about the full block, not especially DataFrame)
Which _create_comparison_method do you mean? The one used inExtensionScalarOpsMixin`?
The one I was looking at in the diff, is the IntegerArray one I think. But I assume for the base class mixin, the same is true.
@@ -280,7 +280,7 @@ def _coerce_to_ndarray(self): | |||
data[self._mask] = self._na_value | |||
return data | |||
|
|||
__array_priority__ = 1 # higher than ndarray so ops dispatch to us | |||
__array_priority__ = 1000 # higher than ndarray so ops dispatch to us |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just find this a detail which would likely be forgotten in any subclass, I don't see a harm and much upset in setting it onthe base class (you can always unset if you really really think you need to).
pandas/core/arrays/integer.py
Outdated
if isinstance(other, IntegerArray): | ||
other, mask = other._data, other._mask | ||
|
||
elif getattr(other, 'ndim', None) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we usually use item_from_zerodim
for this
raise NotImplementedError( | ||
"can only perform ops with 1-d structures") | ||
|
||
elif getattr(other, 'ndim', None) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/core/arrays/integer.py
Outdated
@@ -612,6 +619,10 @@ def integer_arithmetic_method(self, other): | |||
else: | |||
mask = self._mask | mask | |||
|
|||
if op_name == 'rpow': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does pow just work?
pandas/tests/arrays/test_integer.py
Outdated
s = pd.Series(data) | ||
# 1^x is 1.0 for all x, so test separately | ||
result = 1 ** s | ||
expected = pd.Series(1, index=s.index, dtype=data.dtype.numpy_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is adding a lot of duplicated code to test, can you use _check_op here for the 1 case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started down that route, but couldn't make it work. I found it hard to follow all the indirection.
I'm ok with some duplicated code in these tests, to make it clearer what's actually being tested.
Updated SparseArray, IntegerArray, and ExtensionScalarOpsMixin to do this. Question: should we add a test that asserts something like def test_op_with_series_is_not_implemented(self, data):
other = pd.Series(data)
assert data.__add__(other) is NotImplemented or is that being too opinionated in our base tests? |
I think that is a good idea. People can always still override that test, if they don't want to follow it. |
Deduplicated some tests and added the base test for asserting that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments
op_name = ufunc.__name__ | ||
op_name = aliases.get(op_name, op_name) | ||
|
||
if op_name in special and kwargs.get('out') is None: | ||
if isinstance(inputs[0], type(self)): | ||
return getattr(self, '__{}__'.format(op_name))(inputs[1]) | ||
else: | ||
return getattr(self, '__r{}__'.format(op_name))(inputs[0]) | ||
name = flipped.get(op_name, '__r{}__'.format(op_name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note to @jbrockmendel since we do this a couple of times iIRC, we should have a more generic way of doing this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if we implement __array_ufunc__
on more arrays, we'll need to do it in those places too.
I think pandas.core.ops._op_descriptions
may have enough info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't quite work since the comparison ops don't define reversed
, which may be sensible (haven't really thought it through).
pandas/core/ops.py
Outdated
if mask.any(): | ||
with np.errstate(all='ignore'): | ||
result[mask] = op(xrav[mask], y) | ||
|
||
if op == pow: | ||
result = np.where(~mask, x, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the same comments as you have above here (e.g. 1 ** np.nan...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked this to update the mask for both pow
and rpow
, rather than adjusting the result.
pandas/tests/arrays/test_integer.py
Outdated
@@ -128,6 +128,9 @@ def _check_op(self, s, op_name, other, exc=None): | |||
if omask is not None: | |||
mask |= omask | |||
|
|||
if op_name == '__rpow__': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we not need the same for pow?
pandas/tests/arrays/test_integer.py
Outdated
@@ -285,6 +294,21 @@ def test_error(self, data, all_arithmetic_operators): | |||
with pytest.raises(NotImplementedError): | |||
opa(np.arange(len(s)).reshape(-1, len(s))) | |||
|
|||
def test_pow(self): | |||
a = pd.core.arrays.integer_array([1, None, None, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you import integer_array at the top (is it already)?
def test_rpow_one_to_na(self): | ||
# https://github.com/pandas-dev/pandas/issues/22022 | ||
# NumPy says 1 ** nan is 1. | ||
arr = integer_array([np.nan, np.nan]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm it must be as you are using it here
def test_pow(self): | ||
a = pd.core.arrays.integer_array([1, None, None, 1]) | ||
b = pd.core.arrays.integer_array([1, None, 1, None]) | ||
result = a ** b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note I don't think we have a test that uses None / np.nan in integer_array construction (e.g. that they are both actually interpreted the same) if you want to add one (this test of course implicitiy asserts this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly works, just a failing case on integer_array([None])
in #23224
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a minor documentation request. otherwise lgtm.
@@ -280,7 +280,7 @@ def _coerce_to_ndarray(self): | |||
data[self._mask] = self._na_value | |||
return data | |||
|
|||
__array_priority__ = 1 # higher than ndarray so ops dispatch to us | |||
__array_priority__ = 1000 # higher than ndarray so ops dispatch to us |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you document this in the Mixin itself though (if you are not going to set it by defaulrt). It is so non-obvious that you need to do this.
[ci skip]
Added docs to the MixIns and in extending.rst.
…On Thu, Oct 18, 2018 at 12:34 PM Jeff Reback ***@***.***> wrote:
***@***.**** approved this pull request.
just a minor documentation request. otherwise lgtm.
------------------------------
In pandas/core/arrays/integer.py
<#23155 (comment)>:
> @@ -280,7 +280,7 @@ def _coerce_to_ndarray(self):
data[self._mask] = self._na_value
return data
- __array_priority__ = 1 # higher than ndarray so ops dispatch to us
+ __array_priority__ = 1000 # higher than ndarray so ops dispatch to us
can you document this in the Mixin itself though (if you are not going to
set it by defaulrt). It is so non-obvious that you need to do this.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23155 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIvr4jX023yfToIfD-E47Tid1W9rEks5umLufgaJpZM4XbRVy>
.
|
doc/source/extending.rst
Outdated
|
||
Regardless of the approach, you may want to implement ``__array_ufunc__`` | ||
or set ``__array_priority__`` if you want your implementation | ||
to be called when involved in binary operations with NumPy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is __array_ufunc__
directly used for binary operations?
If you use the ExtensionOpsMixin to set all the dunder methods, I don't think you need __array_ufunc__
for that? As I understood it, you can implement all the operators with __array_ufunc__
using the NDArrayOperatorsMixin (https://github.com/numpy/numpy/blob/v1.15.1/numpy/lib/mixins.py#L63-L183) (but of course it is still recommendable to implement it for ufuncs?)
Just to say that I find the note not fully clear.
Also, do you know if __array_priority__
is still used for other things? (or is it only used for ufuncs? In which case it can be clearer that this is for older numpy versions?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do you know if array_priority is still used for other things?
I don't know. I only mentioned __array_ufunc__
, because it disables __array_priority__
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understood it, you can implement all the operators with array_ufunc using the NDArrayOperatorsMixin
You may be saying this, but NDAraryOperatorsMixin
implements all the special methods by calling __array_ufunc__
. So if your array implements __array_ufunc__
, you can get all the operators for free.
But implementing __array_ufunc__
doesn't necessitate using NDArrayOperatorsMixin
. I'm not sure what happens on a class defining __add__
, __array_ufunc__
, and __array_priority__
,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__add__
will be used and not __array_ufunc__
I would think.
But therefore, I found your note reading a bit confusing: I read it as "you may want to implement __array_ufunc__
[...] if you want your implementation to be called when involved in binary operations", but implementing __array_ufunc__
itself does not involve in binary operations (only if using it to implement the binary ops duncer methods)
I've removed references to |
OK with this docs now Joris? I feel like if someone is using |
Yep, looks good! (and when looking further into |
Actually, Tom, you were right that
https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.classes.html#numpy.class.__array_ufunc__ |
) This simplifies dispatch_to_extension_op. The remaining logic is simply unboxing Series / Indexes in favor of their underlying arrays. This forced two additional changes 1. Move some logic that IntegerArray relied on down to the IntegerArray ops. Things like handling of 0-dim ndarrays was previously broken on IntegerArray ops, but work with Serires[IntegerArray] 2. Fix pandas handling of 1 ** NA.
This simplifies dispatch_to_extension_op. The remaining logic is simply
unboxing Series / Indexes in favor of their underlying arrays. This forced
two additional changes
Things like handling of 0-dim ndarrays was previously broken on IntegerArray
ops, but work with Serires[IntegerArray]
closes #22922
closes #22022