-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support operators for ExtensionArray #20889
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -53,6 +53,13 @@ class ExtensionArray(object): | |
* factorize / _values_for_factorize | ||
* argsort / _values_for_argsort | ||
|
||
For logical operators, the default is to return a Series of boolean. | ||
However, if the underlying ExtensionDtype overrides the logical | ||
operators, then the implementer may want to have an ExtensionArray | ||
subclass contain the result. This can be done by changing the property | ||
_logical_result from its default value of None to the _from_sequence | ||
method of the ExtensionArray subclass. | ||
|
||
This class does not inherit from 'abc.ABCMeta' for performance reasons. | ||
Methods and properties required by the interface raise | ||
``pandas.errors.AbstractMethodError`` and no ``register`` method is | ||
|
@@ -567,6 +574,9 @@ def copy(self, deep=False): | |
""" | ||
raise AbstractMethodError(self) | ||
|
||
# See documentation above | ||
_logical_result = None | ||
|
||
# ------------------------------------------------------------------------ | ||
# Block-related methods | ||
# ------------------------------------------------------------------------ | ||
|
@@ -610,3 +620,14 @@ def _ndarray_values(self): | |
used for interacting with our indexers. | ||
""" | ||
return np.array(self) | ||
|
||
# ------------------------------------------------------------------------ | ||
# Utilities for use by subclasses | ||
# ------------------------------------------------------------------------ | ||
def is_sequence_of_dtype(self, seq): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For what is this needed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed, this is expected to always be true. If it isn't I'd recommend making a superclass that has all the scalar types like I do in https://github.com/ContinuumIO/cyberpandas/blob/c66bbecaf5193bd284a0fddfde65395d119aad41/cyberpandas/ip_array.py#L22 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's suppose that you don't implement the |
||
""" | ||
Given a sequence, determine whether all members have the appropriate | ||
type for this instance of an ExtensionArray | ||
""" | ||
thistype = self.dtype.type | ||
return all(isinstance(i, thistype) for i in seq) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ | |
# necessary to enforce truediv in Python 2.X | ||
from __future__ import division | ||
import operator | ||
import inspect | ||
|
||
import numpy as np | ||
import pandas as pd | ||
|
@@ -30,7 +31,7 @@ | |
is_bool_dtype, | ||
is_list_like, | ||
is_scalar, | ||
_ensure_object) | ||
_ensure_object, is_extension_array_dtype) | ||
from pandas.core.dtypes.cast import ( | ||
maybe_upcast_putmask, find_common_type, | ||
construct_1d_object_array_from_listlike) | ||
|
@@ -990,6 +991,93 @@ def _construct_divmod_result(left, result, index, name, dtype): | |
) | ||
|
||
|
||
def dispatch_to_extension_op(left, right, op_name=None, is_logical=False): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The reason for the difference is as follows. The way I implemented this, we first look to see if the operator is defined for the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the op is not defined on ExtensionArray, calling it directly ( |
||
""" | ||
Assume that left is a Series backed by an ExtensionArray, | ||
apply the operator defined by op_name. | ||
""" | ||
|
||
method = getattr(left.values, op_name, None) | ||
deflen = len(left) | ||
excons = type(left.values)._from_sequence | ||
exclass = type(left.values) | ||
testseq = left.values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm having trouble understanding these names. (method makes sense). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
if is_logical: | ||
if exclass._logical_result is not None: | ||
excons = exclass._logical_result | ||
else: | ||
excons = None # Indicates boolean | ||
|
||
# The idea here is as follows. First we see if the op is | ||
# defined in the ExtensionArray subclass, and returns a | ||
# result that is not NotImplemented. If so, we use that | ||
# result. If that fails, then we try an | ||
# element by element operator, invoking the operator | ||
# on each element | ||
|
||
# First see if the extension array object supports the op | ||
res = NotImplemented | ||
if method is not None and inspect.ismethod(method): | ||
rvalues = right | ||
if is_extension_array_dtype(right) and isinstance(right, ABCSeries): | ||
rvalues = right.values | ||
try: | ||
res = method(rvalues) | ||
except TypeError: | ||
pass | ||
except Exception as e: | ||
raise e | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not doing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See above. In the code above, I'm testing to see if the |
||
|
||
def convert_values(parm): | ||
if is_extension_array_dtype(parm): | ||
ovalues = parm.values | ||
elif is_list_like(parm): | ||
ovalues = parm | ||
else: # Assume its an object | ||
ovalues = [parm] * deflen | ||
return ovalues | ||
|
||
if res is NotImplemented: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you explain this fallback a bit more? If the EA doesn't define ops, then I'm perfectly fine with raising There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See above. The idea here is that either the EA defines the ops, or the The idea here is that if you know that your underlying I used |
||
# Try it on each element. Support operation to another | ||
# ExtensionArray, or something that is list like, or | ||
# a single object. This allows a result of an operator | ||
# to be an object or any type | ||
lvalues = convert_values(left) | ||
rvalues = convert_values(right) | ||
|
||
# Get the method for each object. | ||
def callfunc(a, b): | ||
f = getattr(a, op_name, None) | ||
if f is not None: | ||
return f(b) | ||
else: | ||
return NotImplemented | ||
res = [callfunc(a, b) for (a, b) in zip(lvalues, rvalues)] | ||
|
||
# We can't use (NotImplemented in res) because the | ||
# results might be objects that have overridden __eq__ | ||
if any(isinstance(r, type(NotImplemented)) for r in res): | ||
msg = "invalid operation {opn} between {one} and {two}" | ||
raise TypeError(msg.format(opn=op_name, | ||
one=type(lvalues), | ||
two=type(rvalues))) | ||
|
||
# At this point we have the result | ||
# always return a full value series here | ||
res_values = com._values_from_object(res) | ||
if excons is not None: | ||
if testseq.is_sequence_of_dtype(res_values): | ||
# Convert to the ExtensionArray type if each result is of that | ||
# type. If _logical_result was not None, this will then use | ||
# the function set there to return an appropriate result | ||
res_values = excons(res_values) | ||
|
||
res_name = get_op_result_name(left, right) | ||
return left._constructor(res_values, index=left.index, | ||
name=res_name) | ||
|
||
|
||
def _arith_method_SERIES(cls, op, special): | ||
""" | ||
Wrapper function for Series arithmetic operations, to avoid | ||
|
@@ -1058,6 +1146,9 @@ def wrapper(left, right): | |
raise TypeError("{typ} cannot perform the operation " | ||
"{op}".format(typ=type(left).__name__, op=str_rep)) | ||
|
||
elif is_extension_array_dtype(left): | ||
return dispatch_to_extension_op(left, right, op_name) | ||
|
||
lvalues = left.values | ||
rvalues = right | ||
if isinstance(rvalues, ABCSeries): | ||
|
@@ -1208,6 +1299,9 @@ def wrapper(self, other, axis=None): | |
return self._constructor(res_values, index=self.index, | ||
name=res_name) | ||
|
||
elif is_extension_array_dtype(self): | ||
return dispatch_to_extension_op(self, other, op_name, True) | ||
|
||
elif isinstance(other, ABCSeries): | ||
# By this point we have checked that self._indexed_same(other) | ||
res_values = na_op(self.values, other.values) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it needed to have this property? Can't we simply detect whether the result is a boolean numpy array or again an ExtensionArray ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this use-case a bit more? I think we will certainly want
Series <compare> Series
to always be an ndarray of booleans.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't speak for the author, but my assumption was that this has to do with some of the spaghetti-code in
ops._bool_method_SERIES
, where sometimes a bool-dtype is returned and other times an int-dtype is returned (and datetimelike are currently all broken, see #19972, #19759). Straightening out this mess independently of EA implementations is part of the plan referred to above.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, in my use case, I need the boolean operators to return an object that represents the relation. I'm using pandas on top of 2 different libraries (that functionally are the same) where the operators (x <= y), (x >= y) and (x == y) are not booleans, but objects representing the relations.