
BUG/PERF: Avoid listifying in dispatch_to_extension_op #23155

Merged
merged 16 commits into pandas-dev:master from TomAugspurger:ea-no-list on Oct 19, 2018

Conversation

@TomAugspurger (Contributor) commented Oct 14, 2018

This simplifies dispatch_to_extension_op. The remaining logic is simply
unboxing Series / Indexes in favor of their underlying arrays (a rough sketch of
the resulting dispatch appears after this description). This forced two
additional changes:

1. Move some logic that IntegerArray relied on down to the IntegerArray ops.
   Things like handling of 0-dim ndarrays were previously broken on IntegerArray
   ops, but worked with Series[IntegerArray].
2. Fix pandas handling of 1 ** NA for object dtype (used to construct expected).

closes #22922
closes #22022
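
A rough sketch of what the simplified dispatch amounts to (assumed shapes only, not the actual pandas implementation): unbox pandas containers down to their arrays, apply the op, and let the caller re-box the result into a Series.

```python
import pandas as pd

def dispatch_to_extension_op(op, left, right):
    # Sketch only: unbox Series / Index down to the underlying array;
    # the ExtensionArray ops themselves handle scalars, 0-dim ndarrays, etc.
    new_left = left.values if isinstance(left, (pd.Series, pd.Index)) else left
    new_right = right.values if isinstance(right, (pd.Series, pd.Index)) else right
    return op(new_left, new_right)
```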

@TomAugspurger added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Oct 14, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 14, 2018
@pep8speaks
Hello @TomAugspurger! Thanks for submitting the PR.

raise NotImplementedError(
"can only perform ops with 1-d structures")

elif getattr(other, 'ndim', None) == 0:
Contributor Author:

This is moved from dispatch_to_extension_array.

Contributor:

same

Contributor Author:

Had to keep this one in an elif, so that we avoid the else block raising a TypeError.

@@ -280,6 +280,8 @@ def _coerce_to_ndarray(self):
data[self._mask] = self._na_value
return data

__array_priority__ = 1 # higher than ndarray so ops dispatch to us
Member:

Index, Series, and now DataFrame all set this to 1000. Does differing from those matter?

@TomAugspurger (Contributor Author) commented Oct 14, 2018 via email

@@ -280,7 +280,7 @@ def _coerce_to_ndarray(self):
data[self._mask] = self._na_value
return data

__array_priority__ = 1 # higher than ndarray so ops dispatch to us
__array_priority__ = 1000 # higher than ndarray so ops dispatch to us
Contributor:

should we just put this in the base class? (for the ops mixin)

Contributor Author:

That seems a little too invasive for a base class. I’d rather leave that up to the subclasser.

Contributor:

so what arithmetic subclass would not want this set?

is there an example?

Contributor Author:

To clarify, I'm not sure if there's a way to unset it, if you don't want to set it in a subclass (you don't want to opt into numpy's array stuff at all).

Contributor:

I just find this a detail which would likely be forgotten in any subclass. I don't see any harm, and much upside, in setting it on the base class (you can always unset it if you really, really think you need to).

Contributor Author:

Can you unset it?

Contributor Author:

I don't really know if setting __array_priority__ = 0 is enough to "unset" it, and I don't know what all setting __array_priority__ in the first place opts you into.

Contributor:

Can you document this in the Mixin itself, though (if you are not going to set it by default)? It is so non-obvious that you need to do this.
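
For illustration, a minimal hypothetical example of the mechanism being relied on here (MyArray is not pandas code): when the other operand of an ndarray binary op has a higher __array_priority__ than ndarray's (0.0) and defines the reflected method, ndarray returns NotImplemented and Python falls back to that reflected method.

```python
import numpy as np

class MyArray:
    # Hypothetical wrapper used only to illustrate __array_priority__.
    __array_priority__ = 1000  # higher than ndarray, so ndarray ops defer to us

    def __init__(self, values):
        self.values = np.asarray(values)

    def __radd__(self, other):
        return MyArray(np.asarray(other) + self.values)

    def __repr__(self):
        return 'MyArray({})'.format(self.values.tolist())

print(np.array([1, 2, 3]) + MyArray([10, 20, 30]))  # MyArray([11, 22, 33])
```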

@codecov (bot) commented Oct 15, 2018

Codecov Report

Merging #23155 into master will decrease coverage by 0.01%.
The diff coverage is 90.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23155      +/-   ##
==========================================
- Coverage   92.19%   92.18%   -0.02%     
==========================================
  Files         169      169              
  Lines       50956    50967      +11     
==========================================
+ Hits        46978    46983       +5     
- Misses       3978     3984       +6
Flag Coverage Δ
#multiple 90.6% <90.9%> (-0.01%) ⬇️
#single 42.26% <9.09%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/base.py 96.68% <100%> (+0.74%) ⬆️
pandas/core/arrays/integer.py 95.08% <100%> (+0.17%) ⬆️
pandas/core/arrays/sparse.py 92.17% <100%> (-0.38%) ⬇️
pandas/core/ops.py 94.22% <66.66%> (-0.42%) ⬇️
pandas/compat/numpy/function.py 86.66% <0%> (-0.47%) ⬇️
pandas/core/dtypes/concat.py 97.69% <0%> (-0.45%) ⬇️
pandas/core/indexes/category.py 97.26% <0%> (-0.28%) ⬇️
pandas/core/sorting.py 98.2% <0%> (-0.07%) ⬇️
pandas/core/indexes/datetimelike.py 98.23% <0%> (-0.03%) ⬇️
... and 8 more


2. call ``result = op(values, ExtensionArray)``
3. re-box the result in a ``Series``

Similar for DataFrame.
Member:

The above seems the good logic to me. But, shouldn't then the _create_comparison_method be updated to actually do this?

Contributor Author:

I meant to delete the DataFrame comment. I think it's not so relevant since DataFrames are 2D. I'm assuming most arrays will want to match NumPy's broadcasting behavior.

Which _create_comparison_method do you mean? The one used in `ExtensionScalarOpsMixin`?

Member:

(My comment was about the full block, not especially DataFrame.)

Which _create_comparison_method do you mean? The one used in `ExtensionScalarOpsMixin`?

The one I was looking at in the diff is the IntegerArray one, I think. But I assume for the base class mixin, the same is true.


if isinstance(other, IntegerArray):
other, mask = other._data, other._mask

elif getattr(other, 'ndim', None) == 0:
Contributor:

we usually use item_from_zerodim for this
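
For reference, a small illustration of that helper (the import path follows pandas internals of this era and may differ between versions):

```python
import numpy as np
from pandas._libs.lib import item_from_zerodim

item_from_zerodim(np.array(2))    # 0-dim ndarray -> unwrapped to a scalar
item_from_zerodim(np.array([2]))  # 1-d arrays pass through unchanged
item_from_zerodim(2)              # plain scalars also pass through unchanged
```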

raise NotImplementedError(
"can only perform ops with 1-d structures")

elif getattr(other, 'ndim', None) == 0:
Contributor:

same

@@ -612,6 +619,10 @@ def integer_arithmetic_method(self, other):
else:
mask = self._mask | mask

if op_name == 'rpow':
Contributor:

does pow just work?

s = pd.Series(data)
# 1^x is 1.0 for all x, so test separately
result = 1 ** s
expected = pd.Series(1, index=s.index, dtype=data.dtype.numpy_dtype)
Contributor:

this is adding a lot of duplicated code to test, can you use _check_op here for the 1 case

Contributor Author:

I started down that route, but couldn't make it work. I found it hard to follow all the indirection.

I'm ok with some duplicated code in these tests, to make it clearer what's actually being tested.

@TomAugspurger (Contributor Author) commented Oct 15, 2018

The above seems the good logic to me. But, shouldn't then the _create_comparison_method be updated to actually do this?

Updated SparseArray, IntegerArray, and ExtensionScalarOpsMixin to do this.

Question: should we add a test that asserts something like

def test_op_with_series_is_not_implemented(self, data):
    other = pd.Series(data)
    assert data.__add__(other) is NotImplemented

or is that being too opinionated in our base tests?

@jorisvandenbossche (Member):
Question: should we add a test that asserts something like ... or is that being too opinionated in our base tests?

I think that is a good idea. People can always still override that test, if they don't want to follow it.

@TomAugspurger (Contributor Author):
Deduplicated some tests and added the base test for asserting that ExtensionArray.__add__(Series) returns NotImplemented.

@jreback (Contributor) left a review:

minor comments

op_name = ufunc.__name__
op_name = aliases.get(op_name, op_name)

if op_name in special and kwargs.get('out') is None:
if isinstance(inputs[0], type(self)):
return getattr(self, '__{}__'.format(op_name))(inputs[1])
else:
return getattr(self, '__r{}__'.format(op_name))(inputs[0])
name = flipped.get(op_name, '__r{}__'.format(op_name))
Contributor:

Note to @jbrockmendel: since we do this a couple of times IIRC, we should have a more generic way of doing this.

Contributor Author:

And if we implement __array_ufunc__ on more arrays, we'll need to do it in those places too.

I think pandas.core.ops._op_descriptions may have enough info.

Contributor Author:

That doesn't quite work since the comparison ops don't define reversed, which may be sensible (haven't really thought it through).
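
A hypothetical sketch of the kind of generic mapping being discussed (not pandas code): comparison ops flip to their mirror image rather than to an "r"-prefixed name, which is the wrinkle mentioned above.

```python
# Hypothetical helper: map an op name to the dunder to call when operands flip.
_FLIPPED_COMPARISONS = {'lt': 'gt', 'le': 'ge', 'gt': 'lt', 'ge': 'le',
                        'eq': 'eq', 'ne': 'ne'}

def flipped_dunder(op_name):
    if op_name in _FLIPPED_COMPARISONS:
        return '__{}__'.format(_FLIPPED_COMPARISONS[op_name])
    return '__r{}__'.format(op_name)
```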

if mask.any():
with np.errstate(all='ignore'):
result[mask] = op(xrav[mask], y)

if op == pow:
result = np.where(~mask, x, result)
Contributor:

can you add the same comments as you have above here (e.g. 1 ** np.nan...)

Contributor Author:

I reworked this to update the mask for both pow and rpow, rather than adjusting the result.
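
The idea, illustrated with plain NumPy rather than the actual diff: start from the union of the operands' NA masks, then clear the positions whose result is known even when one operand is NA (a base of 1, or an exponent of 0).

```python
import numpy as np

base = np.array([1, 2, 1, 5], dtype='int64')
base_na = np.array([False, False, True, False])
exp = np.array([7, 3, 2, 4], dtype='int64')
exp_na = np.array([True, False, False, True])

mask = base_na | exp_na
mask &= ~((base == 1) & ~base_na)  # 1 ** anything (even NA) is 1
mask &= ~((exp == 0) & ~exp_na)    # anything ** 0 (even NA) is 1

result = base ** exp
print(result, mask)  # mask stays True only where the outcome really is NA
```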

@@ -128,6 +128,9 @@ def _check_op(self, s, op_name, other, exc=None):
if omask is not None:
mask |= omask

if op_name == '__rpow__':
Contributor:

do we not need the same for pow?

@@ -285,6 +294,21 @@ def test_error(self, data, all_arithmetic_operators):
with pytest.raises(NotImplementedError):
opa(np.arange(len(s)).reshape(-1, len(s)))

def test_pow(self):
a = pd.core.arrays.integer_array([1, None, None, 1])
Contributor:

can you import integer_array at the top (is it already)?

def test_rpow_one_to_na(self):
# https://github.com/pandas-dev/pandas/issues/22022
# NumPy says 1 ** nan is 1.
arr = integer_array([np.nan, np.nan])
Contributor:

Hmm, it must be, since you are using it here.

def test_pow(self):
a = pd.core.arrays.integer_array([1, None, None, 1])
b = pd.core.arrays.integer_array([1, None, 1, None])
result = a ** b
Contributor:

Note: I don't think we have a test that checks that None and np.nan in integer_array construction are both actually interpreted the same, if you want to add one (this test of course implicitly asserts this).

Contributor Author:

Mostly works, just a failing case on integer_array([None]) in #23224
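
A rough sketch of such a test, assuming the IntegerArray internals (_data / _mask) shown in the diffs above:

```python
import numpy as np
import pandas as pd

# None and np.nan should both be interpreted as NA by integer_array
a = pd.core.arrays.integer_array([1, None, 3])
b = pd.core.arrays.integer_array([1, np.nan, 3])

assert (a._mask == b._mask).all()                      # same NA positions
assert (a._data[~a._mask] == b._data[~b._mask]).all()  # same valid values
```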

@jreback (Contributor) left a review:

just a minor documentation request. otherwise lgtm.


@TomAugspurger (Contributor Author) commented Oct 18, 2018 via email


Regardless of the approach, you may want to implement ``__array_ufunc__``
or set ``__array_priority__`` if you want your implementation
to be called when involved in binary operations with NumPy
Member:

Is __array_ufunc__ directly used for binary operations?

If you use the ExtensionOpsMixin to set all the dunder methods, I don't think you need __array_ufunc__ for that? As I understood it, you can implement all the operators with __array_ufunc__ using the NDArrayOperatorsMixin (https://github.com/numpy/numpy/blob/v1.15.1/numpy/lib/mixins.py#L63-L183) (but of course it is still recommendable to implement it for ufuncs?)

Just to say that I find the note not fully clear.

Also, do you know if __array_priority__ is still used for other things? (or is it only used for ufuncs? In which case it can be clearer that this is for older numpy versions?)

Contributor Author:

Also, do you know if __array_priority__ is still used for other things?

I don't know. I only mentioned __array_ufunc__, because it disables __array_priority__.

Contributor Author:

As I understood it, you can implement all the operators with __array_ufunc__ using the NDArrayOperatorsMixin

You may be saying this, but NDArrayOperatorsMixin implements all the special methods by calling __array_ufunc__. So if your array implements __array_ufunc__, you can get all the operators for free.

But implementing __array_ufunc__ doesn't necessitate using NDArrayOperatorsMixin. I'm not sure what happens on a class defining __add__, __array_ufunc__, and __array_priority__.
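
A minimal, hypothetical example of that "for free" path (WrappedArray is not pandas code): the mixin turns the arithmetic dunders into ufunc calls, which NumPy then routes through __array_ufunc__.

```python
import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin

class WrappedArray(NDArrayOperatorsMixin):
    def __init__(self, values):
        self.values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # unwrap any WrappedArray inputs, apply the ufunc, re-wrap the result
        unwrapped = [x.values if isinstance(x, WrappedArray) else x for x in inputs]
        return WrappedArray(getattr(ufunc, method)(*unwrapped, **kwargs))

a = WrappedArray([1, 2, 3])
print((a + np.array([10, 20, 30])).values)  # the mixin turns `+` into np.add
```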

Member:

__add__ will be used and not __array_ufunc__, I would think.

But that's why I found your note a bit confusing: I read it as "you may want to implement __array_ufunc__ [...] if you want your implementation to be called when involved in binary operations", but implementing __array_ufunc__ by itself does not get it involved in binary operations (only if you use it to implement the binary-op dunder methods).

@TomAugspurger (Contributor Author):
I've removed references to __array_ufunc__.

@TomAugspurger (Contributor Author):
OK with the docs now, Joris? I feel like if someone is using __array_ufunc__ and NDArrayOperatorsMixin, they probably understand it better than I do :)

@jorisvandenbossche jorisvandenbossche merged commit 29e586c into pandas-dev:master Oct 19, 2018
@jorisvandenbossche (Member):
Yep, looks good! (and when looking further into __array_ufunc__ for EAs, we can always still add some notes about that)

@TomAugspurger TomAugspurger deleted the ea-no-list branch October 19, 2018 12:34
@jorisvandenbossche (Member):
Actually, Tom, you were right that __array_ufunc__ is influencing the binary operations. From the docs:

The presence of __array_ufunc__ also influences how ndarray handles binary operations like arr + obj and arr < obj when arr is an ndarray and obj is an instance of a custom class.

https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.classes.html#numpy.class.__array_ufunc__

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Labels: ExtensionArray (Extending pandas with custom dtypes or arrays), Numeric Operations (Arithmetic, Comparison, and Logical operations)
5 participants