
BUG/PERF: Avoid listifying in dispatch_to_extension_op #23155

Merged
merged 16 commits into pandas-dev:master from TomAugspurger:ea-no-list on Oct 19, 2018

Conversation

@TomAugspurger (Contributor) commented Oct 14, 2018

This simplifies dispatch_to_extension_op. The remaining logic is simply
unboxing Series / Indexes in favor of their underlying arrays (a rough sketch of
the resulting dispatch appears after this description). This forced two
additional changes:

1. Move some logic that IntegerArray relied on down to the IntegerArray ops.
   Things like handling of 0-dim ndarrays were previously broken on IntegerArray
   ops, but worked with Series[IntegerArray].
2. Fix pandas handling of 1 ** NA for object dtype (used to construct expected).

closes #22922
closes #22022
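
A rough sketch of what the simplified dispatch amounts to (assumed shapes only, not the actual pandas implementation): unbox pandas containers down to their arrays, apply the op, and let the caller re-box the result into a Series.

```python
import pandas as pd

def dispatch_to_extension_op(op, left, right):
    # Sketch only: unbox Series / Index down to the underlying array;
    # the ExtensionArray ops themselves handle scalars, 0-dim ndarrays, etc.
    new_left = left.values if isinstance(left, (pd.Series, pd.Index)) else left
    new_right = right.values if isinstance(right, (pd.Series, pd.Index)) else right
    return op(new_left, new_right)
```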

@TomAugspurger added the Numeric Operations (Arithmetic, Comparison, and Logical operations) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Oct 14, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 14, 2018
@pep8speaks
Hello @TomAugspurger! Thanks for submitting the PR.

raise NotImplementedError(
"can only perform ops with 1-d structures")

elif getattr(other, 'ndim', None) == 0:
Contributor Author:

This is moved from dispatch_to_extension_array.

Contributor:

same

Contributor Author:

Had to keep this one in an elif, so that we avoid the else block raising a TypeError.

@@ -280,6 +280,8 @@ def _coerce_to_ndarray(self):
data[self._mask] = self._na_value
return data

__array_priority__ = 1 # higher than ndarray so ops dispatch to us
Member:

Index, Series, and now DataFrame all set this to 1000. Does differing from those matter?

@TomAugspurger (Contributor Author) commented Oct 14, 2018 via email

@@ -280,7 +280,7 @@ def _coerce_to_ndarray(self):
data[self._mask] = self._na_value
return data

__array_priority__ = 1 # higher than ndarray so ops dispatch to us
__array_priority__ = 1000 # higher than ndarray so ops dispatch to us
Contributor:

should we just put this in the base class? (for the ops mixin)

Contributor Author:

That seems a little too invasive for a base class. I’d rather leave that up to the subclasser.

Contributor:

so what arithmetic subclass would not want this set?

is there an example?

Contributor Author:

To clarify, I'm not sure if there's a way to unset it, if you don't want to set it in a subclass (you don't want to opt into numpy's array stuff at all).

Contributor:

I just find this a detail which would likely be forgotten in any subclass. I don't see any harm, and much upside, in setting it on the base class (you can always unset it if you really, really think you need to).

Contributor Author:

Can you unset it?

Contributor Author:

I don't really know if setting __array_priority__ = 0 is enough to "unset" it, and I don't know what all setting __array_priority__ in the first place opts you into.

Contributor:

Can you document this in the Mixin itself, though (if you are not going to set it by default)? It is so non-obvious that you need to do this.
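
For illustration, a minimal hypothetical example of the mechanism being relied on here (MyArray is not pandas code): when the other operand of an ndarray binary op has a higher __array_priority__ than ndarray's (0.0) and defines the reflected method, ndarray returns NotImplemented and Python falls back to that reflected method.

```python
import numpy as np

class MyArray:
    # Hypothetical wrapper used only to illustrate __array_priority__.
    __array_priority__ = 1000  # higher than ndarray, so ndarray ops defer to us

    def __init__(self, values):
        self.values = np.asarray(values)

    def __radd__(self, other):
        return MyArray(np.asarray(other) + self.values)

    def __repr__(self):
        return 'MyArray({})'.format(self.values.tolist())

print(np.array([1, 2, 3]) + MyArray([10, 20, 30]))  # MyArray([11, 22, 33])
```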

@codecov (bot) commented Oct 15, 2018

Codecov Report

Merging #23155 into master will decrease coverage by 0.01%.
The diff coverage is 90.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23155      +/-   ##
==========================================
- Coverage   92.19%   92.18%   -0.02%     
==========================================
  Files         169      169              
  Lines       50956    50967      +11     
==========================================
+ Hits        46978    46983       +5     
- Misses       3978     3984       +6
Flag Coverage Δ
#multiple 90.6% <90.9%> (-0.01%) ⬇️
#single 42.26% <9.09%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/base.py 96.68% <100%> (+0.74%) ⬆️
pandas/core/arrays/integer.py 95.08% <100%> (+0.17%) ⬆️
pandas/core/arrays/sparse.py 92.17% <100%> (-0.38%) ⬇️
pandas/core/ops.py 94.22% <66.66%> (-0.42%) ⬇️
pandas/compat/numpy/function.py 86.66% <0%> (-0.47%) ⬇️
pandas/core/dtypes/concat.py 97.69% <0%> (-0.45%) ⬇️
pandas/core/indexes/category.py 97.26% <0%> (-0.28%) ⬇️
pandas/core/sorting.py 98.2% <0%> (-0.07%) ⬇️
pandas/core/indexes/datetimelike.py 98.23% <0%> (-0.03%) ⬇️
... and 8 more


2. call ``result = op(values, ExtensionArray)``
3. re-box the result in a ``Series``

Similar for DataFrame.
Member:

The above seems the good logic to me. But, shouldn't then the _create_comparison_method be updated to actually do this?

Contributor Author:

I meant to delete the DataFrame comment. I think it's not so relevant since DataFrames are 2D. I'm assuming most arrays will want to match NumPy's broadcasting behavior.

Which _create_comparison_method do you mean? The one used in `ExtensionScalarOpsMixin`?

Member:

(My comment was about the full block, not especially DataFrame.)

Which _create_comparison_method do you mean? The one used in `ExtensionScalarOpsMixin`?

The one I was looking at in the diff is the IntegerArray one, I think. But I assume for the base class mixin, the same is true.


if isinstance(other, IntegerArray):
other, mask = other._data, other._mask

elif getattr(other, 'ndim', None) == 0:
Contributor:

we usually use item_from_zerodim for this
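
For reference, a small illustration of that helper (the import path follows pandas internals of this era and may differ between versions):

```python
import numpy as np
from pandas._libs.lib import item_from_zerodim

item_from_zerodim(np.array(2))    # 0-dim ndarray -> unwrapped to a scalar
item_from_zerodim(np.array([2]))  # 1-d arrays pass through unchanged
item_from_zerodim(2)              # plain scalars also pass through unchanged
```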

raise NotImplementedError(
"can only perform ops with 1-d structures")

elif getattr(other, 'ndim', None) == 0:
Contributor:

same

@@ -612,6 +619,10 @@ def integer_arithmetic_method(self, other):
else:
mask = self._mask | mask

if op_name == 'rpow':
Contributor:

does pow just work?

s = pd.Series(data)
# 1^x is 1.0 for all x, so test separately
result = 1 ** s
expected = pd.Series(1, index=s.index, dtype=data.dtype.numpy_dtype)
Contributor:

this is adding a lot of duplicated code to test, can you use _check_op here for the 1 case

Contributor Author:

I started down that route, but couldn't make it work. I found it hard to follow all the indirection.

I'm ok with some duplicated code in these tests, to make it clearer what's actually being tested.

@TomAugspurger (Contributor Author) commented Oct 15, 2018

The above seems the good logic to me. But, shouldn't then the _create_comparison_method be updated to actually do this?

Updated SparseArray, IntegerArray, and ExtensionScalarOpsMixin to do this.

Question: should we add a test that asserts something like

def test_op_with_series_is_not_implemented(self, data):
    other = pd.Series(data)
    assert data.__add__(other) is NotImplemented

or is that being too opinionated in our base tests?

@jorisvandenbossche (Member):
Question: should we add a test that asserts something like ... or is that being too opinionated in our base tests?

I think that is a good idea. People can always still override that test, if they don't want to follow it.

@TomAugspurger (Contributor Author):
Deduplicated some tests and added the base test for asserting that ExtensionArray.__add__(Series) returns NotImplemented.

@jreback (Contributor) left a review:

minor comments

op_name = ufunc.__name__
op_name = aliases.get(op_name, op_name)

if op_name in special and kwargs.get('out') is None:
if isinstance(inputs[0], type(self)):
return getattr(self, '__{}__'.format(op_name))(inputs[1])
else:
return getattr(self, '__r{}__'.format(op_name))(inputs[0])
name = flipped.get(op_name, '__r{}__'.format(op_name))
Contributor:

Note to @jbrockmendel: since we do this a couple of times IIRC, we should have a more generic way of doing this.

Contributor Author:

And if we implement __array_ufunc__ on more arrays, we'll need to do it in those places too.

I think pandas.core.ops._op_descriptions may have enough info.

Contributor Author:

That doesn't quite work since the comparison ops don't define reversed, which may be sensible (haven't really thought it through).
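
A hypothetical sketch of the kind of generic mapping being discussed (not pandas code): comparison ops flip to their mirror image rather than to an "r"-prefixed name, which is the wrinkle mentioned above.

```python
# Hypothetical helper: map an op name to the dunder to call when operands flip.
_FLIPPED_COMPARISONS = {'lt': 'gt', 'le': 'ge', 'gt': 'lt', 'ge': 'le',
                        'eq': 'eq', 'ne': 'ne'}

def flipped_dunder(op_name):
    if op_name in _FLIPPED_COMPARISONS:
        return '__{}__'.format(_FLIPPED_COMPARISONS[op_name])
    return '__r{}__'.format(op_name)
```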

if mask.any():
with np.errstate(all='ignore'):
result[mask] = op(xrav[mask], y)

if op == pow:
result = np.where(~mask, x, result)
Contributor:

can you add the same comments as you have above here (e.g. 1 ** np.nan...)

Contributor Author:

I reworked this to update the mask for both pow and rpow, rather than adjusting the result.
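
The idea, illustrated with plain NumPy rather than the actual diff: start from the union of the operands' NA masks, then clear the positions whose result is known even when one operand is NA (a base of 1, or an exponent of 0).

```python
import numpy as np

base = np.array([1, 2, 1, 5], dtype='int64')
base_na = np.array([False, False, True, False])
exp = np.array([7, 3, 2, 4], dtype='int64')
exp_na = np.array([True, False, False, True])

mask = base_na | exp_na
mask &= ~((base == 1) & ~base_na)  # 1 ** anything (even NA) is 1
mask &= ~((exp == 0) & ~exp_na)    # anything ** 0 (even NA) is 1

result = base ** exp
print(result, mask)  # mask stays True only where the outcome really is NA
```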

@@ -128,6 +128,9 @@ def _check_op(self, s, op_name, other, exc=None):
if omask is not None:
mask |= omask

if op_name == '__rpow__':
Contributor:

do we not need the same for pow?

@@ -285,6 +294,21 @@ def test_error(self, data, all_arithmetic_operators):
with pytest.raises(NotImplementedError):
opa(np.arange(len(s)).reshape(-1, len(s)))

def test_pow(self):
a = pd.core.arrays.integer_array([1, None, None, 1])
Contributor:

can you import integer_array at the top (is it already)?

def test_rpow_one_to_na(self):
# https://github.com/pandas-dev/pandas/issues/22022
# NumPy says 1 ** nan is 1.
arr = integer_array([np.nan, np.nan])
Contributor:

Hmm, it must be, since you are using it here.

def test_pow(self):
a = pd.core.arrays.integer_array([1, None, None, 1])
b = pd.core.arrays.integer_array([1, None, 1, None])
result = a ** b
Contributor:

Note: I don't think we have a test that checks that None and np.nan in integer_array construction are both actually interpreted the same, if you want to add one (this test of course implicitly asserts this).

Contributor Author:

Mostly works, just a failing case on integer_array([None]) in #23224
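
A rough sketch of such a test, assuming the IntegerArray internals (_data / _mask) shown in the diffs above:

```python
import numpy as np
import pandas as pd

# None and np.nan should both be interpreted as NA by integer_array
a = pd.core.arrays.integer_array([1, None, 3])
b = pd.core.arrays.integer_array([1, np.nan, 3])

assert (a._mask == b._mask).all()                      # same NA positions
assert (a._data[~a._mask] == b._data[~b._mask]).all()  # same valid values
```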

@jreback (Contributor) left a review:

just a minor documentation request. otherwise lgtm.


@TomAugspurger (Contributor Author) commented Oct 18, 2018 via email


Regardless of the approach, you may want to implement ``__array_ufunc__``
or set ``__array_priority__`` if you want your implementation
to be called when involved in binary operations with NumPy
Member:

Is __array_ufunc__ directly used for binary operations?

If you use the ExtensionOpsMixin to set all the dunder methods, I don't think you need __array_ufunc__ for that? As I understood it, you can implement all the operators with __array_ufunc__ using the NDArrayOperatorsMixin (https://github.com/numpy/numpy/blob/v1.15.1/numpy/lib/mixins.py#L63-L183) (but of course it is still recommendable to implement it for ufuncs?)

Just to say that I find the note not fully clear.

Also, do you know if __array_priority__ is still used for other things? (or is it only used for ufuncs? In which case it can be clearer that this is for older numpy versions?)

Contributor Author:

Also, do you know if __array_priority__ is still used for other things?

I don't know. I only mentioned __array_ufunc__, because it disables __array_priority__.

Contributor Author:

As I understood it, you can implement all the operators with __array_ufunc__ using the NDArrayOperatorsMixin

You may be saying this, but NDArrayOperatorsMixin implements all the special methods by calling __array_ufunc__. So if your array implements __array_ufunc__, you can get all the operators for free.

But implementing __array_ufunc__ doesn't necessitate using NDArrayOperatorsMixin. I'm not sure what happens on a class defining __add__, __array_ufunc__, and __array_priority__.
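
A minimal, hypothetical example of that "for free" path (WrappedArray is not pandas code): the mixin turns the arithmetic dunders into ufunc calls, which NumPy then routes through __array_ufunc__.

```python
import numpy as np
from numpy.lib.mixins import NDArrayOperatorsMixin

class WrappedArray(NDArrayOperatorsMixin):
    def __init__(self, values):
        self.values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # unwrap any WrappedArray inputs, apply the ufunc, re-wrap the result
        unwrapped = [x.values if isinstance(x, WrappedArray) else x for x in inputs]
        return WrappedArray(getattr(ufunc, method)(*unwrapped, **kwargs))

a = WrappedArray([1, 2, 3])
print((a + np.array([10, 20, 30])).values)  # the mixin turns `+` into np.add
```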

Member:

__add__ will be used and not __array_ufunc__, I would think.

But that's why I found your note a bit confusing: I read it as "you may want to implement __array_ufunc__ [...] if you want your implementation to be called when involved in binary operations", but implementing __array_ufunc__ by itself does not get it involved in binary operations (only if you use it to implement the binary-op dunder methods).

@TomAugspurger (Contributor Author):
I've removed references to __array_ufunc__.

@TomAugspurger (Contributor Author):
OK with the docs now, Joris? I feel like if someone is using __array_ufunc__ and NDArrayOperatorsMixin, they probably understand it better than I do :)

@jorisvandenbossche jorisvandenbossche merged commit 29e586c into pandas-dev:master Oct 19, 2018
@jorisvandenbossche (Member):
Yep, looks good! (and when looking further into __array_ufunc__ for EAs, we can always still add some notes about that)

@TomAugspurger TomAugspurger deleted the ea-no-list branch October 19, 2018 12:34
@jorisvandenbossche (Member):
Actually, Tom, you were right that __array_ufunc__ is influencing the binary operations. From the docs:

The presence of __array_ufunc__ also influences how ndarray handles binary operations like arr + obj and arr < obj when arr is an ndarray and obj is an instance of a custom class.

https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.classes.html#numpy.class.__array_ufunc__

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Labels: ExtensionArray (Extending pandas with custom dtypes or arrays), Numeric Operations (Arithmetic, Comparison, and Logical operations)
5 participants