Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionArray Ops discussion #19577

Closed
TomAugspurger opened this issue Feb 7, 2018 · 6 comments · Fixed by #21261
Closed

ExtensionArray Ops discussion #19577

TomAugspurger opened this issue Feb 7, 2018 · 6 comments · Fixed by #21261
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Split from #19520

How to handle things like __eq__, __add__, etc.

Neither the Python 2 or Python 3 object defaults are appropriate, so I think we should make it abstract or provide a default implementation for some / all of these.

We should be able to do a default implementation for the simple case of binop(extension_array, extension_array) by

  • masking nulls in either self | other
  • cast self / other to ndarrays
  • compare and apply null mask

@jorisvandenbossche raises the point that for some extension arrays, casting to an object ndarray can be expensive. It'd be unfortunate to do the casting if we're just going to raise a TypeError when the binop is done. In this case the subclass will need to override all those ops (or we could add class attribute like _comparable and check that before doing __eq__ or __lt__, etc).

How much coercion do we want to attempt? I would vote to stay simple and only attempt a comparison if

other is an instance of the same ExtensionArray type
other is a scalar of self.dtype.type.

Otherwise we return NotImplemented. Subclasses can of course choose more or less aggressive coercion rules.

When boxed in a Series or Index, I think the goal should be

  1. dispatch to the underlying .values comparsion
  2. Reconstruct the appropriate output (box in series / index with name).
@TomAugspurger TomAugspurger added API Design Numeric Operations Arithmetic, Comparison, and Logical operations labels Feb 7, 2018
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Feb 7, 2018
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.23.0 Feb 7, 2018
@jbrockmendel
Copy link
Member

Migrated from #19520:

A thorny issue: What to do with comparison methods

I've been planning on moving the Index/Series arithmetic/comparison ops into Array subclasses, so that Series.__op__(other) could just wrap self._data.blocks[0].__op__(other) and we get consistency between Series/DataFrame ops for free.

Most of the comparison ops are ready to make that jump; I can prioritize it if getting that done quickly will help you out.

One caveat I'm concerned about is that the existing implementations in the Index subclasses have gotten tangled up with a bunch of unrelated Index machinery. Ideally I'd like these operations to go into self-contained mixin classes that rely on constructor methods, but are independent of slicing/concat/reindex/dropna/... mentioned above.

@jbrockmendel
Copy link
Member

@TomAugspurger two quick namespace-gameplan questions.

Assuming the arith/comparison methods currently in DatetimeIndexOpsMixin/DTI/TDI/PI get moved into analogous array classes, do you envision these getting a) mixed into the appropriate Index/Block subclasses or b) accessed via composition? If the latter, what name (DatetimeIndex._??_values) should these use to get at the pdarray implementation?

Because the datetimelike methods wrap some of the base Index methods, some of those will need to move up too. Where do you envision something like BaseArray living?

@jorisvandenbossche
Copy link
Member

It will be composition (that is what Block is already doing), and for Index it will be the "_values" attribute IIRC

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 23, 2018

@TomAugspurger So now I've hit upon this issue of how to deal with the binary operators, and the question is what should these operators return (in the case of the arithmetic operators). Consider the decimal example in tests/extension/decimal and the following code:

In [1]: import decimal

In [2]: import pandas as pd

In [3]: from pandas.tests.extension.decimal.array import DecimalArray
   ...:
   ...: a = DecimalArray([decimal.Decimal(str(i)) for i in range(5)])
   ...: b = a.values
   ...: sa = pd.Series(a)
   ...: sb = pd.Series(b)
   ...: tb = sb + sb
   ...:

In [4]: tb
Out[4]:
0    0
1    2
2    4
3    6
4    8
dtype: object

In [5]: ta = sa + sa
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in na_op(x, y)
   1008         try:
-> 1009             result = expressions.evaluate(op, str_rep, x, y, **eval_kwargs)
   1010         except TypeError:

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\computation\expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    204     if use_numexpr:
--> 205         return _evaluate(op, op_str, a, b, **eval_kwargs)
    206     return _evaluate_standard(op, op_str, a, b)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\computation\expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    119     if result is None:
--> 120         result = _evaluate_standard(op, op_str, a, b)
    121

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     64     with np.errstate(all='ignore'):
---> 65         return op(a, b)
     66

TypeError: unsupported operand type(s) for +: 'DecimalArray' and 'DecimalArray'

During handling of the above exception, another exception occurred:

AssertionError                            Traceback (most recent call last)
<ipython-input-5-8612afbce1b2> in <module>()
----> 1 ta = sa + sa

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in wrapper(left, right)
   1064             rvalues = rvalues.values
   1065
-> 1066         result = safe_na_op(lvalues, rvalues)
   1067         return construct_result(left, result,
   1068                                 index=left.index, name=res_name, dtype=None)

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in safe_na_op(lvalues, rvalues)
   1028         try:
   1029             with np.errstate(all='ignore'):
-> 1030                 return na_op(lvalues, rvalues)
   1031         except Exception:
   1032             if is_object_dtype(lvalues):

C:\EclipseWorkspaces\LiClipseWorkspace\pandas-dev\pandas36\pandas\core\ops.py in na_op(x, y)
   1015                 result[mask] = op(x[mask], com._values_from_object(y[mask]))
   1016             else:
-> 1017                 assert isinstance(x, np.ndarray)
   1018                 result = np.empty(len(x), dtype=x.dtype)
   1019                 mask = notna(x)

AssertionError:

So if the Decimal values are in a Series with dtype object, I can add them just fine, and I get a Series with dtype object. In the case of adding the Series that is backed by a DecimalArray, then it fails. It's not clear to me if we want to result to be the type of the subclass of ExtensionArray or if the person implementing the subclass can specify which class is used for each operator. (One could imagine that the result of a binary operator could be a different ExtensionArray subclass, or even that the result of different binary operators could be different subclasses. For example, arithmetic operators produce objects, and relational operators produce booleans).

So there are 3 options as I see it:

  1. Make it the responsibility of the person subclassing ExtensionArray to implement the operators to determine the return type
  2. Make the result of all ExtensionArraySubclass.binop(ExtensionArraySubclass) always be ExtensionArraySubclass
  3. Set up something in the ExtensionArray implementation that allows the class containing results of operators to be defined by the person implementing the subclass.

I think I'd favor (3), but I could live with (2). I'd prefer to not do (1), as that is a lot of effort that it seems people would have to repeat.

Thoughts?

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.24.0 Apr 23, 2018
@jorisvandenbossche
Copy link
Member

I think initially we should go for option 1, and make sure that pandas actually dispatches to it. But I don't fully understand your option 3. Can you clarify a bit more?

I don't think option 2 is actually an option (IMO it is also independent of deciding where the actual operation is implemented), because, as you mention, the result of an operation does not necessarily need to be of the same type (additional example: substraction of datetimes gives timedelta. That is definitely a case we need to handle)

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Apr 23, 2018

@jorisvandenbossche Yes, I see now that option 2 isn't viable.

My idea on option 3 is something like this. The subclass implements a _binop_result_class class attribute that contains a dictionary. The keys to that dictionary are the different operators, and the values for each key are the array class that the operator returns. Any operators not in the dictionary use default pandas behaviors (e.g., return np.array([], dtype='object') for arithmetic operators, and booleans for logical ones).

The default behavior is to just do an element-by-element operation. If the class of the underlying dtype has implemented the operator, then it gets called automatically, and all is well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants