Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support ExtensionArray operators via a mixin #21261

Merged
merged 19 commits into from
Jun 29, 2018
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ Other Enhancements
^^^^^^^^^^^^^^^^^^
- :func:`to_datetime` now supports the ``%Z`` and ``%z`` directive when passed into ``format`` (:issue:`13486`)
- :func:`to_csv` now supports ``compression`` keyword when a file handle is passed. (:issue:`21227`)
- ``ExtensionArray`` has a ``ExtensionOpsMixin`` factory that allows default operators to be defined (:issue:`20659`, :issue:`19577`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a separate sub-section. pls expand this a bit, I know what you mean, but I doubt the average reader does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the note here still accurate or has the factory been changed to just mixins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've written a section for whatsnew

-

.. _whatsnew_0240.api_breaking:
Expand Down
2 changes: 1 addition & 1 deletion pandas/api/extensions/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,5 +3,5 @@
register_index_accessor,
register_series_accessor)
from pandas.core.algorithms import take # noqa
from pandas.core.arrays.base import ExtensionArray # noqa
from pandas.core.arrays.base import ExtensionArray, ExtensionOpsMixin # noqa
from pandas.core.dtypes.dtypes import ExtensionDtype # noqa
2 changes: 1 addition & 1 deletion pandas/core/arrays/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
from .base import ExtensionArray # noqa
from .base import ExtensionArray, ExtensionOpsMixin # noqa
from .categorical import Categorical # noqa
93 changes: 93 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,11 @@

from pandas.errors import AbstractMethodError
from pandas.compat.numpy import function as nv
from pandas.compat import set_function_name, PY3
import pandas.core.common as com
from pandas.core.dtypes.common import (
is_extension_array_dtype,
is_list_like)

_not_implemented_message = "{} does not implement {}."

Expand Down Expand Up @@ -610,3 +615,91 @@ def _ndarray_values(self):
used for interacting with our indexers.
"""
return np.array(self)


def ExtensionOpsMixin(include_arith_ops, include_logic_ops):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a compelling reason to make a factory for this instead of just making two mixin classes?

class ExtensionArithmeticMixin(...):

class ExtensionLogicMixin(...):

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel I am sharing the code of the operator between arithmetic and comparison operators. Alternatively, I could just create a base mixin class, and have ExtensionArithmeticMixin and ExtensionLogicMixin inherit from the base.

If you guys want me to make that change, let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel I implemented your suggestion. It does look cleaner. It's in the latest commit

"""A mixin factory for creating default arithmetic and logical operators,
which are based on the underlying dtype backing the ExtensionArray

Parameters
----------
include_arith_ops : boolean indicating whether arithmetic ops should be
created
include_logic_ops : boolean indicating whether logical ops should be
created

Returns
-------
A mixin class that has the associated operators defined.

Usage
------
If you have defined a subclass MyClass(ExtensionArray), then
use MyClass(ExtensionArray, ExtensionOpsMixin(True, True)) to
get both the arithmetic and logical operators
"""
class _ExtensionOpsMixin(object):
pass

def create_method(op_name):
def _binop(self, other):
def convert_values(parm):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor detail, but I find 'values' (or 'param') is clearer than 'parm'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

if isinstance(parm, ExtensionArray):
ovalues = list(parm)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it needed to convert to a list? ExtensionArrays support iterating through them (which returns the scalars)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

elif is_extension_array_dtype(parm):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Series is also 'list-like', so I think this elif will not be reached.
But given that you simply iterate through it (which will be the same for series or extensionarray), I think you can simply remove that part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

ovalues = parm.values
elif is_list_like(parm):
ovalues = parm
else: # Assume its an object
ovalues = [parm] * len(self)
return ovalues
lvalues = convert_values(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be needed as the calling object will always be already an ExtensionArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fixed

rvalues = convert_values(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably do alignment as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't need to, for a few reasons:

  • The _binop method from here is a method on ExtensionArray (i.e., isinstance(self, ExtensionArray) == True
  • This method is called by dispatch_to_extension_ops in ops.py, which is called after alignment has already been done in _arith_method_SERIES and alignment is ensured in _comp_method_SERIES

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, this is also tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Yes, this is tested because I am using the same tests that are used for general operators on Series, which tests things that are misaligned, etc.

See pandas/tests/extension/tests/decimal/test_decimal.py that leverages the TestSeriesOperators.test_operators method from pandas\tests\series\test_operators .

As an aside, doing it this way uncovered the issues with .get and .combine that I submitted.


# Get the method for each object.
def callfunc(a, b):
f = getattr(a, op_name, None)
if f is not None:
return f(b)
else:
return NotImplemented
res = [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we would do op(a, b) directly (so in practice doing a + b instead of a.__add__(b)), you would not need the checking for NotImplemented below?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar as

def _make_compare(op):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for NotImplemented is needed because the underlying ExtensionDtype class (which are the objects in lvalues and rvalues above) may not implement the specific operator.

Not sure what you are referring to when you wrote "similar as" the code in `pandas/core/indexes/category.py". That implementation is not doing things element by element, which is what I'm doing here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing it this way, dynamically generating this? there is a limited set of ops, just write them out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check for NotImplemented is needed because the underlying ExtensionDtype class (which are the objects in lvalues and rvalues above) may not implement the specific operator.

Yes, but if you do op(a, b), then this will always raise a TypeError, and not return NotImplemented, and then you don't need to check manually whether any NotImplemented object was returned (and the generated error message will already be fine).

Not sure what you are referring to when you wrote "similar as" the code in `pandas/core/indexes/category.py". That implementation is not doing things element by element, which is what I'm doing here.

I simply referenced it because it does something very similar (generating the comparison methods), and does this by using the operator. methods (which is what I mean with op(a, b) above).
Whether it then does not do it element-wise does is a detail.

So what I mean is the following. You basically do

op_name = '__add__'
f = getattr(a, op_name, None)
f(a, b)

which translates to doing a.__add__(b) and might return NotImplemented.
What I meant with my comment was doing instead:

f = operator.add
f = f(a, b)

which translates to a + b and will directly raise the appropriate error.
And the code I referenced is an example how the the __add__ and operator.add are linked in generating the methods.

Copy link
Member

@jorisvandenbossche jorisvandenbossche May 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you doing it this way, dynamically generating this? there is a limited set of ops, just write them out.

@jreback are you speaking about this create_method function that you commented on? Or the fact below the create_method function is used in a loop instead of manually repeated?
Because if it is the first: that is how we do it almost everywhere in the pandas codebase. If it is the second: yes that is true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

21d76b3

is what I mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback This is a bit of the "what comes first" discussion held at #20889 (comment) with @jorisvandenbossche . My goal with the mixin is that if you already have the ExtensionDtype subclass (e.g. Decimal) that has the operators defined, you just pull in the mixin, and you're all set with the operators.

In 21d76b3 you are creating the basis for a more general operator support. I could make the mixin use the more general framework you've proposed, but I guess that should be merged in before I do the mixin. (That's the "what comes first" issue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche My latest commit does as you suggest. This removes the tests for NotImplemented

@jreback My latest commit uses a similar pattern to what you had in the referenced commit, just creating an operator one by one instead of using a loop

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orthogonal to @jorisvandenbossche 's comment, instead of [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)] consider libops.vec_binop(lvalues, rvalues, callfunc). Should be equivalent but more performant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't seem to work properly?

In [64]: a = np.random.randn(100000).astype(object)

In [65]: b = np.random.randn(100000).astype(object)

In [66]: def func(a, b):
    ...:     return a + b
    ...: 
    ...: 

In [67]: a + b
Out[67]: 
array([-1.7373671804522872, -0.44608341233368065, -3.5963900176223706, ...,
       -1.0538754426986237, -2.0261407625013654, -0.6206810372160029], dtype=object)

In [68]: pd._libs.ops.vec_binop(a, b, func)
Out[68]: array([nan, nan, nan, ..., nan, nan, nan], dtype=object)

In [69]: np.array([func(aa, bb) for aa, bb in zip(a, b)], dtype=object)
Out[69]: 
array([-1.7373671804522872, -0.44608341233368065, -3.5963900176223706, ...,
       -1.0538754426986237, -2.0261407625013654, -0.6206810372160029], dtype=object)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's because it does a maybe_convert_bool at then end, I suppose. Is there a version that does not do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I guess that explains the "bin" part of "vec_binop". So it looks like a) the original suggestion is incorrect and b) it would probably be easy to implement and may improve perf.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought "binary" meant that it is an op between 2 values, not that the result in boolean. So I think the naming is just confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on both counts.


# We can't use (NotImplemented in res) because the
# results might be objects that have overridden __eq__
if any(isinstance(r, type(NotImplemented)) for r in res):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment above. Is there a scenario in which isinstance(r, type(NotImplemented)) and not (r is NotImplemented)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was about using in. Having said that, I think I can use (r is NotImplemented) vs. isinstance(r, type(NotImplemented))

But now I have removed that in my working version based on @jorisvandenbossche suggestion to use the op rather than the op_name

msg = "invalid operation {opn} between {one} and {two}"
raise TypeError(msg.format(opn=op_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would not care catching the error.
Eg for the decimal test case:

In [8]: import decimal

In [9]: d = decimal.Decimal('1.1')

In [10]: d + 'a'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-c0db2fc01a1f> in <module>()
----> 1 d + 'a'

TypeError: unsupported operand type(s) for +: 'decimal.Decimal' and 'str'

which is already a nice error (and IMO even clearer, because it does not contain references to "Extension" which normal users don't necessarily are familiar with)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed.

one=type(lvalues),
two=type(rvalues)))

res_values = com._values_from_object(res)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this does anything for a list (it just passes it through), so this can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I removed it and things still work. Had copied this from elsewhere.


try:
res_values = self._from_sequence(res_values)
except TypeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it makes sense to have a separate _make_comparision_op and _make_aithmetics_op ?
The the comparison one should not need to try the conversion to ExtensionArray (if you want this to happen like in your case, you can still use the arithmetic one to add the comparison ops manually on your class)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. But if someone else had the same use case as me, where the comparison ops need to return an object rather than a boolean, then they'd have to know the workaround you suggest, and so then the question is whether we document it.

I'm leaving this as is for now, but will accept if you really want me to change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One reason to split it up would be to avoid the extra overhead to try to convert it to an ExtensionArray in case of the boolean ops (although for a good implementation this probably should not be too much overhead ..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've done the split for the next commit. I added a parameter to create_method() to specify whether to try the coercion to the underlying dtype, and said to not do it for the comparison operators.

pass

return res_values

name = '__{name}__'.format(name=op_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you defining create_method as anything other than NotImplementedError

this is way too opionated - if u want to define a function to do this and an author can use this ok
but it should not be the default

maybe if u want to create an ExtensionOpsMixinForScalars (long name though)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a closer look at the diff if you give such an opinionated (pun intended :-)) comment

  • this is not the default, it's an optional mixin extension authors can reuse (so what you suggest?)
  • so it is exactly the point to be opinionated and fallback to the scalars (we might try to think of a better name that reflects this, as you suggested)
  • it is what we discussed in the other PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche maybe I wasn't clear.

ExtensionOpsBase should simply have NotImplementedError for the actual implementation of the arithmethic and comaprison ops. Scalar implementations of this will IMHO rarely be used and simply cluttering things up. This can be put as a separate MixIn class if you wish for inclusion by an author, including it here is forcing a pretty bad idiom: coupling the interface and implementation, and is quite error prone because its not opt in. If an author extends and adds ops, they immediate get an implementation, which could very easiy be buggy / not work / work in unexpected ways.

I think the ops implementation is so important that this must be an explicit choice.

The point of the mixin is to avoid the boilerplate of writing __div__ - .... and so on. NOT the implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be put as a separate MixIn class if you wish for inclusion by an author,

Again, this is what this PR is doing.
To repeat from above, is might be it is just the name of ExtensionOpsBase which is not clear? (with which I agree)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @jorisvandenbossche Maybe the naming is misleading. So let me clarify the intent of the PR.

The goal here is that if someone extends ExtensionArray (let's call it MyArray for sake of discussion) where they are storing objects (let's call them MyClass) that already have the arithmetic and/or comparison operators defined for the class MyClass, they can use these mixins to get those operators defined on the MyArray subclass by simply adding the mixin as one of the base classes of MyArray.

An example is Decimal. The operators are defined there, so the mixin gives you the operators on DecimalArray "for free". I also have my own application that extends ExtensionArray and it uses the same principal.

The use of these 3 classes (ExtensionOpsBase, ExtensionArithmeticMixin and ExtensionComparisonMixin) is optional for people who are writing their own subclass of ExtensionArray. The ExtensionOpsBase class is just a base class for the 2 mixin classes. It contains the create_method() method that allows one to create an individual operator, under the assumption that MyClass has the operator defined. Aside from the use of the two mixin classes that define all of the operators, ExtensionOpsBase.createmethod() can be used to define individual methods. So, for example, if the MyClass class only supports addition, you can choose to not use the mixin, and just do something like the following in MyArray to create the operator.

    __add__ = ExtensionOpsBase.create_method(operator.add)

IMHO, it seems that @jorisvandenbossche understands what I'm trying to do, but @jreback may not fully understand it.

If you think there is a better name for ExtensionOpsBase, I'm open to suggestions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche I've tried a variety of implementations without the base class, and couldn't get it to work. I'll rename as you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renaming looks ok to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next commit will use the renaming

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on the naming conventions. Maybe the NotImplemented versions of the methods belong on the base ExtensionArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel The NotImplemented versions are there by default for any class, so I don't think we need to do anything. In other words, if you write a class MyClass that extends ExtensionArray and you don't define the methods for operators, they are by default NotImplemented.

return set_function_name(_binop, name, _ExtensionOpsMixin)

if include_arith_ops:
arithops = ['__add__', '__radd__', '__sub__', '__rsub__', '__mul__',
'__rmul__', '__pow__', '__rpow__', '__mod__', '__rmod__',
'__floordiv__', '__rfloordiv__', '__truediv__',
'__rtruediv__', '__divmod__', '__rdivmod__']
if not PY3:
arithops.extend(['__div__', '__rdiv__'])

for op_name in arithops:
setattr(_ExtensionOpsMixin, op_name, create_method(op_name))

if include_logic_ops:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these be called comparison_ops instead of logic_ops? I tend to think of the latter as meaning `&|^'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

logicops = ['__eq__', '__ne__', '__lt__', '__gt__',
'__le__', '__ge__']
for op_name in logicops:
setattr(_ExtensionOpsMixin, op_name, create_method(op_name))

return _ExtensionOpsMixin
10 changes: 7 additions & 3 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2972,16 +2972,20 @@ def get_value(self, series, key):
# use this, e.g. DatetimeIndex
s = getattr(series, '_values', None)
if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
# GH 20825
# GH 20882, 21257
# Unify Index and ExtensionArray treatment
# First try to convert the key to a location
# If that fails, see if key is an integer, and
# If that fails, raise a KeyError if an integer
# index, otherwise, see if key is an integer, and
# try that
try:
iloc = self.get_loc(key)
return s[iloc]
except KeyError:
if is_integer(key):
if (len(self) > 0 and
self.inferred_type in ['integer', 'boolean']):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes this necessary? I'm not clear on how it relates to the Mixin classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this you can comment on the relevant other PR (#21260)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel the other PR (#21260) was needed to make the tests here work right.

raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls move all non-ops code to another PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same answer as above: this is needed for the test to pass untill the other PRs are merged (to say: it is already moved to separate PRs)

elif is_integer(key):
return s[key]

s = com._values_from_object(series)
Expand Down
21 changes: 21 additions & 0 deletions pandas/core/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@
is_bool_dtype,
is_list_like,
is_scalar,
is_extension_array_dtype,
_ensure_object)
from pandas.core.dtypes.cast import (
maybe_upcast_putmask, find_common_type,
Expand Down Expand Up @@ -990,6 +991,20 @@ def _construct_divmod_result(left, result, index, name, dtype):
)


def dispatch_to_extension_op(left, right, op_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be accomplished with:

return dispatch_to_index_op(op, left, right, left.values.__class__)

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless we require that the ExtensionArray constructor handles objects of itself (and does not copy them), we cannot do the your example code (and currently we don't require this from extension array implementations I think.
It is also don't think it necessarily improves readability of the code by reusing that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Eventually dispatch_to_index_op is intended to give way to an EA-based dispatch.

"""
Assume that left is a Series backed by an ExtensionArray,
apply the operator defined by op_name.
"""

method = getattr(left.values, op_name, None)
res_values = method(right)

res_name = get_op_result_name(left, right)
return left._constructor(res_values, index=left.index,
name=res_name)


def _arith_method_SERIES(cls, op, special):
"""
Wrapper function for Series arithmetic operations, to avoid
Expand Down Expand Up @@ -1058,6 +1073,9 @@ def wrapper(left, right):
raise TypeError("{typ} cannot perform the operation "
"{op}".format(typ=type(left).__name__, op=str_rep))

elif is_extension_array_dtype(left):
return dispatch_to_extension_op(left, right, op_name)

lvalues = left.values
rvalues = right
if isinstance(rvalues, ABCSeries):
Expand Down Expand Up @@ -1208,6 +1226,9 @@ def wrapper(self, other, axis=None):
return self._constructor(res_values, index=self.index,
name=res_name)

elif is_extension_array_dtype(self):
return dispatch_to_extension_op(self, other, op_name)

elif isinstance(other, ABCSeries):
# By this point we have checked that self._indexed_same(other)
res_values = na_op(self.values, other.values)
Expand Down
27 changes: 18 additions & 9 deletions pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -2196,23 +2196,22 @@ def _binop(self, other, func, level=None, fill_value=None):
result.name = None
return result

def combine(self, other, func, fill_value=np.nan):
def combine(self, other, func, fill_value=None):
"""
Perform elementwise binary operation on two Series using given function
with optional fill value when an index is missing from one Series or
the other

Parameters
----------
other : Series or scalar value
func : function
Function that takes two scalars as inputs and return a scalar
fill_value : scalar value

The default specifies to use the appropriate NaN value for
the underlying dtype of the Series
Returns
-------
result : Series

Examples
--------
>>> s1 = Series([1, 2])
Expand All @@ -2221,26 +2220,36 @@ def combine(self, other, func, fill_value=np.nan):
0 0
1 2
dtype: int64

See Also
--------
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move all of this (combine stuff) to another PR on top of this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if fill_value is None:
fill_value = na_value_for_dtype(self.dtype, False)

if isinstance(other, Series):
new_index = self.index.union(other.index)
new_name = ops.get_op_result_name(self, other)
new_values = np.empty(len(new_index), dtype=self.dtype)
for i, idx in enumerate(new_index):
new_values = []
for idx in new_index:
lv = self.get(idx, fill_value)
rv = other.get(idx, fill_value)
with np.errstate(all='ignore'):
new_values[i] = func(lv, rv)
new_values.append(func(lv, rv))
else:
new_index = self.index
with np.errstate(all='ignore'):
new_values = func(self._values, other)
new_values = [func(lv, other) for lv in self._values]
new_name = self.name

if self_is_ext and not is_categorical_dtype(self.values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use shorthands like this

make this simpler, something like

if is_categorical_dtype(...):
    pass
elif is_extension_dtype(...):
   ....

much more readable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment that in #21183

try:
new_values = self._values._from_sequence(new_values)
except TypeError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you catching a TypeError?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is also for the other PR, but: because the result might not necessarily be an ExtensionArray.

This is of course a bit an unclear area of combine, but currently there is no restriction on the function that the dtype needs to be preserved in the result of the function.

pass

return self._constructor(new_values, index=new_index, name=new_name)

def combine_first(self, other):
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/extension/base/getitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -130,7 +130,7 @@ def test_get(self, data):
expected = s.iloc[[0, 1]]
self.assert_series_equal(result, expected)

assert s.get(-1) == s.iloc[-1]
assert s.get(-1) is None
assert s.get(s.index.max() + 1) is None

s = pd.Series(data[:6], index=list('abcdef'))
Expand Down
8 changes: 5 additions & 3 deletions pandas/tests/extension/decimal/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
import numpy as np

import pandas as pd
from pandas.core.arrays import ExtensionArray
from pandas.core.arrays import ExtensionArray, ExtensionOpsMixin
from pandas.core.dtypes.base import ExtensionDtype


Expand All @@ -24,11 +24,13 @@ def construct_from_string(cls, string):
"'{}'".format(cls, string))


class DecimalArray(ExtensionArray):
class DecimalArray(ExtensionArray, ExtensionOpsMixin(True, True)):
dtype = DecimalDtype()

def __init__(self, values):
assert all(isinstance(v, decimal.Decimal) for v in values)
for val in values:
if not isinstance(val, self.dtype.type):
raise TypeError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

erorr message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed (and also in pandas/tests/extension/json/array.py)

values = np.asarray(values, dtype=object)

self._data = values
Expand Down
36 changes: 36 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,9 @@

from pandas.tests.extension import base

from pandas.tests.series.test_operators import TestSeriesOperators
from pandas.util._decorators import cache_readonly

from .array import DecimalDtype, DecimalArray, make_data


Expand Down Expand Up @@ -183,3 +186,36 @@ def test_dataframe_constructor_with_different_dtype_raises():
xpr = "Cannot coerce extension array to dtype 'int64'. "
with tm.assert_raises_regex(ValueError, xpr):
pd.DataFrame({"A": arr}, dtype='int64')


_ts = pd.Series(DecimalArray(make_data()))


class TestOperator(BaseDecimal, TestSeriesOperators):
@cache_readonly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never do this. use a fixture

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unavoidable, because I am subclassing the tests.series.tests_operators.TestSeriesOperators class. That class inherits from tests.series.common.TestData, which defines ts as a @cache_readonly property. The method I use for testing (TestSeriesOperators.test_operators()) has references to self.ts, i.e., it doesn't use a fixture, and I can't see how to override that definition so we get a DecimalArray series and reuse all that testing code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, I redid the tests as you suggested elsewhere

def ts(self):
ts = _ts.copy()
ts.name = 'ts'
return ts

def test_operators(self):
def absfunc(v):
if isinstance(v, pd.Series):
vals = v.values
return pd.Series(vals._from_sequence([abs(i) for i in vals]))
else:
return abs(v)
context = decimal.getcontext()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the heck is all this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reusing the tests that are in the class pandas/tests/series/test_operators/TestSeriesOperators. For decimal, the default when dividing by zero is to raise an error, as opposed to in numpy, where inf is returned. So by changing the defaults for Decimal, we get the same behavior as numpy.

There is a related issue with respect to abs(), because the tests in TestSeriesOperators call np.abs(), but that doesn't work for the Decimal class.

divbyzerotrap = context.traps[decimal.DivisionByZero]
invalidoptrap = context.traps[decimal.InvalidOperation]
context.traps[decimal.DivisionByZero] = 0
context.traps[decimal.InvalidOperation] = 0
super(TestOperator, self).test_operators(absfunc)
context.traps[decimal.DivisionByZero] = divbyzerotrap
context.traps[decimal.InvalidOperation] = invalidoptrap

def test_operators_corner(self):
pytest.skip("Cannot add empty Series of float64 to DecimalArray")

def test_divmod(self):
pytest.skip("divmod not appropriate for Decimal type")
6 changes: 3 additions & 3 deletions pandas/tests/series/test_operators.py
Original file line number Diff line number Diff line change
Expand Up @@ -1216,11 +1216,11 @@ def test_neg(self):
def test_invert(self):
assert_series_equal(-(self.series < 0), ~(self.series < 0))

def test_operators(self):
def test_operators(self, absfunc=np.abs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change related to the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was leftover from the previous method of testing, so I will revert it back.

def _check_op(series, other, op, pos_only=False,
check_dtype=True):
left = np.abs(series) if pos_only else series
right = np.abs(other) if pos_only else other
left = absfunc(series) if pos_only else series
right = absfunc(other) if pos_only else other

cython_or_numpy = op(left, right)
python = left.combine(right, op)
Expand Down
7 changes: 6 additions & 1 deletion pandas/util/testing.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,8 @@
is_categorical_dtype,
is_interval_dtype,
is_sequence,
is_list_like)
is_list_like,
is_extension_array_dtype)
from pandas.io.formats.printing import pprint_thing
from pandas.core.algorithms import take_1d
import pandas.core.common as com
Expand Down Expand Up @@ -1225,6 +1226,10 @@ def assert_series_equal(left, right, check_dtype=True,
right = pd.IntervalIndex(right)
assert_index_equal(left, right, obj='{obj}.index'.format(obj=obj))

elif (is_extension_array_dtype(left) and not is_categorical_dtype(left) and
is_extension_array_dtype(right) and not is_categorical_dtype(right)):
return assert_extension_array_equal(left.values, right.values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only do this if check_dtype=True ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow your logic for this. Before, we had no code that really compared two EA's for testing. However, that code doesn't work correctly if they are categorical dtypes. The check_dtype is a separate check, so if check_dtype was False, you'd still do this test the same way.


else:
_testing.assert_almost_equal(left.get_values(), right.get_values(),
check_less_precise=check_less_precise,
Expand Down