
Array Interface and Categorical internals Refactor #19268

Merged
Changes from 1 commit
Commits (43)
2ef5216
REF: Define extension base classes
TomAugspurger Jan 15, 2018
57e8b0f
Updated for comments
TomAugspurger Jan 18, 2018
01bd42f
Remove metaclasses from PeriodDtype and IntervalDtype
TomAugspurger Jan 18, 2018
ce81706
Fixup form_blocks rebase
TomAugspurger Jan 18, 2018
87a70e3
Restore concat casting cat -> object
TomAugspurger Jan 18, 2018
8c61886
Remove _slice, clarify semantics around __getitem__
TomAugspurger Jan 19, 2018
cb41803
Document and use take.
TomAugspurger Jan 19, 2018
65d5a61
Clarify type, kind, init
TomAugspurger Jan 19, 2018
57c749b
Remove base
TomAugspurger Jan 19, 2018
6736b0f
API: Remove unused __iter__ and get_values
TomAugspurger Jan 21, 2018
e4acb59
API: Implement repr and str
TomAugspurger Jan 21, 2018
0e9337b
Merge remote-tracking branch 'upstream/master' into pandas-array-inte…
TomAugspurger Jan 26, 2018
df68f3b
Remove default value_counts for now
TomAugspurger Jan 26, 2018
2746a43
Fixed merge conflicts
TomAugspurger Jan 27, 2018
34d2b99
Remove implementation of construct_from_string
TomAugspurger Jan 27, 2018
a484d61
Example implementation of take
TomAugspurger Jan 27, 2018
04b2e72
Cleanup ExtensionBlock
TomAugspurger Jan 27, 2018
df0fa12
Merge remote-tracking branch 'upstream/master' into pandas-array-inte…
TomAugspurger Jan 27, 2018
e778053
Pass through ndim
TomAugspurger Jan 27, 2018
d15a722
Use series._values
TomAugspurger Jan 27, 2018
b5f736d
Removed repr, updated take doc
TomAugspurger Jan 27, 2018
240e8f6
Various cleanups
TomAugspurger Jan 28, 2018
f9b0b49
Handle get_values, to_dense, is_view
TomAugspurger Jan 28, 2018
7913186
Docs
TomAugspurger Jan 30, 2018
df18c3b
Remove is_extension, is_bool
TomAugspurger Jan 30, 2018
ab2f045
Sparse formatter
TomAugspurger Jan 30, 2018
520876f
Revert "Sparse formatter"
TomAugspurger Jan 30, 2018
4dfa39c
Unbox SparseSeries
TomAugspurger Jan 30, 2018
e252103
Added test for sparse consolidation
TomAugspurger Jan 30, 2018
7110b2a
Docs
TomAugspurger Jan 30, 2018
c59dca0
Merge remote-tracking branch 'upstream/master' into pandas-array-inte…
TomAugspurger Jan 31, 2018
fc688a5
Moved to errors
TomAugspurger Jan 31, 2018
fbc8466
Handle classmethods, properties
TomAugspurger Jan 31, 2018
030bb19
Use our AbstractMethodError
TomAugspurger Jan 31, 2018
0f4c2d7
Lint
TomAugspurger Jan 31, 2018
f9316e0
Cleanup
TomAugspurger Feb 1, 2018
9c06b13
Move ndim validation to a method.
TomAugspurger Feb 1, 2018
7d2cf9c
Try this
TomAugspurger Feb 1, 2018
afae8ae
Make ExtensionBlock._holder a property
TomAugspurger Feb 1, 2018
cd0997e
Make _holder a property for all
TomAugspurger Feb 1, 2018
1d6eb04
Refactored validate_ndim
TomAugspurger Feb 1, 2018
92aed49
fixup! Refactored validate_ndim
TomAugspurger Feb 1, 2018
34134f2
lint
TomAugspurger Feb 1, 2018
1 change: 1 addition & 0 deletions pandas/core/arrays/__init__.py
@@ -1 +1,2 @@
from .base import ExtensionArray # noqa
from .categorical import Categorical # noqa
201 changes: 201 additions & 0 deletions pandas/core/arrays/base.py
@@ -0,0 +1,201 @@
"""An interface for extending pandas with custom arrays."""
Member:
I'd advocate leaving base.py open for (near-)future usage as pandas-internal base and putting the "use this if you want to write your own" file in e.g. extension.py

Contributor Author:
I have a slight preference for base.py since it's a base class for all extension arrays. I don't think that having ExtensionArray in arrays.base precludes having a pandas-internal base there as well.

Member:
It will be publicly exposed through pd.api.extensions anyway I think

Contributor Author:
Adding stuff to the public API is waiting on #19304

import abc

import numpy as np

from pandas.compat import add_metaclass


_not_implemented_message = "{} does not implement {}."


@add_metaclass(abc.ABCMeta)
class ExtensionArray(object):
Member:
Are there any expected requirements for the constructor __init__?

Contributor Author (TomAugspurger, Jan 18, 2018):
Yeah, we should figure out what those are and document them. At the very least, we expect ExtensionArray(extension_array) to work correctly. I'll look for other assumptions we make. Or that could be pushed to another classmethod.

Contributor Author:
We also expect that ExtensionArray(), with no arguments, works so that subclasses don't have to implement construct_from_string.

Rather than imposing that on subclasses, we could require some kind of .empty alternative constructor.

"""Abstract base class for custom array types

pandas will recognize instances of this class as proper arrays
with a custom type and will not attempt to coerce them to objects.
Contributor:
needs much more detail about what is expected here (or at least some examples), e.g. 1-D array-like or whatever

Member:
I think we can leave general docs until this is actually working (the above sentence is at the moment not yet true), which will only be in follow-up PRs

Contributor Author:
Added a bit about 1-D and some high-level examples.


Subclasses are expected to implement the following methods.
"""
# ------------------------------------------------------------------------
# Must be a Sequence
# ------------------------------------------------------------------------
@abc.abstractmethod
def __getitem__(self, item):
"""Select a subset of self

Notes
-----
As a sequence, __getitem__ should expect integer or slice ``key``.
Member:
also boolean mask?


For slice ``key``, you should return an instance of yourself, even
if the slice is length 0 or 1.

For scalar ``key``, you may return a scalar suitable for your type.
The scalar need not be an instance or subclass of your array type.
Member:
Is "need not be" enough? (compared to "should not be")
I mean, we won't run into problems in the internals in pandas by seeing arrays where we expect scalars?

Contributor Author:
I'll clarify this to say

    For scalar ``item``, you should return a scalar value suitable
    for your type. This should be an instance of ``self.dtype.type``.

My earlier phrasing was to explain that the return value for scalars needn't be the type of item that's actually stored in your array. E.g. for my IPAddress example, the array holds two uint64s, but a scalar slice returns an ipaddress.IPv4Address instance.

"""
# type: (Any) -> Any
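
To make the scalar-vs-array contract discussed above concrete, here is a minimal hedged sketch of a __getitem__ for a hypothetical subclass backed by a NumPy array stored in ``self._data`` (the attribute name, storage, and constructor call are illustrative, not part of this PR):

    def __getitem__(self, item):
        # integer position -> a scalar value suitable for the type
        if isinstance(item, (int, np.integer)):
            return self._data[item]
        # slice or boolean mask -> a new instance of the same array type
        return type(self)(self._data[item])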

def __setitem__(self, key, value):
# type: (Any, Any) -> None
Contributor:
we already use AbstractMethodError elsewhere in the code base, use instead

Contributor:
__setitem__ must be defined (it certainly does not need to actually set in place), but since we use it, the array must appear as a mutable object

Member:
The point of having it here as NotImplementedError is that an ExtensionArray does not necessarily need to support setting elements (being mutable), at least that's one possible decision (leave the decision up to the extension author).
The error will then just bubble up to the user if he/she tries to set something.

Contributor Author:
Yep, I don't think we can assume that all extension arrays will implement it.

raise NotImplementedError(_not_implemented_message.format(
type(self), '__setitem__')
)
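
For arrays that do choose to be mutable, a hedged sketch of an override for a NumPy-backed subclass (again assuming an illustrative ``self._data`` ndarray):

    def __setitem__(self, key, value):
        # mutable subclasses can simply delegate to the backing array;
        # immutable ones keep the default NotImplementedError above
        self._data[key] = value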

@abc.abstractmethod
def __iter__(self):
# type: () -> Iterator
pass

@abc.abstractmethod
def __len__(self):
# type: () -> int
Contributor:
document

pass

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
@property
def base(self):
"""The base array I am a view of. None by default."""
Member:
Can you give an example here?

Perhaps it would also help to explain how is this used by pandas?

Contributor Author:
It's used in Block.is_view, which AFAICT is only used for chained assignment?

If that's correct, then I think we're OK with saying this is purely for compatibility with NumPy arrays and has no effect. I've currently defined ExtensionArray.is_view to always be False, so I don't even make use of it in the changes so far.

Member:
If that is the case, I would remove this for now (we can always later extend the interface if it turns out to be needed for something).

However, was just wondering: your ExtensionArray could be a view on another ExtensionArray (eg by slicing). Is this something we need to consider?

Member:
I would also remove this. NumPy doesn't always maintain this properly, so it can't actually be essential.


@property
@abc.abstractmethod
def dtype(self):
"""An instance of 'ExtensionDtype'."""
# type: () -> ExtensionDtype
pass
Member:
Please drop pass from all these methods. It's not needed (docstrings alone suffice).


@property
def shape(self):
# type: () -> Tuple[int, ...]
return (len(self),)

@property
def ndim(self):
# type: () -> int
"""Extension Arrays are only allowed to be 1-dimensional."""
return 1
Contributor:
this should be tested on registration of the sub-type

Contributor Author:
What do you mean "registration"? We could override ABC.register, but I don't think there's an (easy) way to validate this if they just subclass ExtensionArray.

If people want to mess with this, that's fine, their stuff just won't work with pandas.

Contributor:
what I mean is that when you register things, we should actually test that the interface is respected. If we had final methods this would not be necessary, but if someone overrides ndim, this is a problem.
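
One hedged sketch of such a check, written as a plain helper a shared test or registration hook could call on an instance of the subclass (nothing like this exists in this PR):

    def assert_is_valid_extension_array(arr):
        # catch subclasses that override ndim away from 1
        assert arr.ndim == 1, "extension arrays must be one-dimensional"
        assert arr.shape == (len(arr),)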


@property
@abc.abstractmethod
def nbytes(self):
"""The number of bytes needed to store this object in memory."""
Member:
What should a user do if this is expensive or otherwise difficult to calculate properly? For example, if it's a numpy array with dtype=object.

We should probably note that it's OK for this to be an approximate answer (a lower bound) on the number of required bytes, and consider adding another method for the benefit of memory_usage().

# type: () -> int
Member:
Type comments come before the docstring: http://mypy.readthedocs.io/en/latest/python2.html

pass

# ------------------------------------------------------------------------
# Additional Methods
# ------------------------------------------------------------------------
@abc.abstractmethod
def isna(self):
"""Boolean NumPy array indicating if each value is missing."""
Contributor:
same length as self

# type: () -> np.ndarray
pass

# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
@abc.abstractmethod
def take(self, indexer, allow_fill=True, fill_value=None):
# type: (Sequence, bool, Optional[Any]) -> ExtensionArray
"""For slicing"""
Member:
This should clarify what valid values of indexer are. Does -1 indicate a fill value?
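
Assuming the convention under discussion (-1 in ``indexer`` marks a position to fill), a hedged sketch for a NumPy-backed subclass might look like this; ``self._data`` and the fill handling are illustrative, and a real implementation may need to widen the dtype before filling with NaN:

    def take(self, indexer, allow_fill=True, fill_value=None):
        indexer = np.asarray(indexer)
        if fill_value is None:
            fill_value = self._fill_value
        result = self._data.take(indexer)
        if allow_fill:
            # positions marked -1 get the fill value
            result[indexer == -1] = fill_value
        return type(self)(result)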


def take_nd(self, indexer, allow_fill=True, fill_value=None):
"""For slicing"""
# TODO: this isn't really necessary for 1-D
Member:
I would remove this

return self.take(indexer, allow_fill=allow_fill,
fill_value=fill_value)

@abc.abstractmethod
def copy(self, deep=False):
# type: (bool) -> ExtensionArray
"""Return a copy of the array."""
Contributor:
document.


# ------------------------------------------------------------------------
# Block-related methods
# ------------------------------------------------------------------------
@property
def _fill_value(self):
Contributor:
need to list this in the very top doc-string

"""The missing value for this type, e.g. np.nan"""
# type: () -> Any
return None

@abc.abstractmethod
def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""

@classmethod
@abc.abstractmethod
def _concat_same_type(cls, to_concat):
# type: (Sequence[ExtensionArray]) -> ExtensionArray
"""Concatenate multiple array

Parameters
----------
to_concat : sequence of this type

Returns
-------
ExtensionArray
"""

@abc.abstractmethod
def get_values(self):
# type: () -> np.ndarray
"""Get the underlying values backing your data
Member:
This isn't very clear. How does this differ from .base?

Contributor Author:
IIUC, .base is used to get the base array self is a view of (which will likely be None for our purposes?).

.get_values() is to convert self to an ndarray. This may be useful if np.asarray(self), which would fall back to self.__iter__, doesn't do the right thing.

I'll try to give some examples of where this is used.

"""
pass

def _can_hold_na(self):
Contributor:
need to list this in the very top doc-string

"""Whether your array can hold missing values. True by default.

Notes
-----
Setting this to false will optimize some operations like fillna.
"""
# type: () -> bool
return True

@property
def is_sparse(self):
"""Whether your array is sparse. True by default."""
Member:
Correction: False by default :)

This should clarify what it means to be a sparse array. How does pandas treat sparse arrays differently?

I would consider dropping this if it isn't strictly necessary.

Contributor Author:
I think it's unnecessary.

# type: () -> bool
return False

def _slice(self, slicer):
# type: (Union[tuple, Sequence, int]) -> 'ExtensionArray'
"""Return a new array sliced by `slicer`.

Parameters
----------
slicer : slice or np.ndarray
If an array, it should just be a boolean mask

Returns
-------
array : ExtensionArray
Should return an ExtensionArray, even if ``self[slicer]``
would return a scalar.
"""
return type(self)(self[slicer])
Member:
This default implementation is likely to fail for some obvious implementations. Perhaps we can have a constructor method _from_scalar() instead that converts a scalar into a length 1 array?

Contributor Author:
Let me see if I can verify that this is always called with a slice object. In that case, __getitem__ will return an ExtensionArray, and we don't have to worry about the scalar case. Unless I'm missing something.

Member:
Yes, I would try to get rid of this if possible, and just ask that __getitem__ can deal with this (of course, alternative is to add separate methods for different __getitem__ functionalities like _slice, but then also _mask, but I don't really see the advantage of this).


def value_counts(self, dropna=True):
"""Optional method for computing the histogram of the counts.

Parameters
----------
dropna : bool, default True
whether to exclude missing values from the computation

Returns
-------
counts : Series
"""
from pandas.core.algorithms import value_counts
mask = ~np.asarray(self.isna())
values = self[mask] # XXX: this imposes boolean indexing
Member:
We should have a dedicated method for boolean indexing or document this as part of the expected interface for __getitem__.

return value_counts(np.asarray(values), dropna=dropna)
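
Putting the pieces together, here is a hedged end-to-end sketch of a small subclass that satisfies the abstract methods above. The class names, the float64 backing array, and the dtype name are all illustrative, not part of this PR; the imports are the modules this PR adds:

    import numpy as np

    from pandas.core.arrays import ExtensionArray
    from pandas.core.dtypes.base import ExtensionDtype


    class MyDtype(ExtensionDtype):
        name = 'myarray'
        type = float          # the scalar type returned by __getitem__


    class MyArray(ExtensionArray):
        dtype = MyDtype()

        def __init__(self, values):
            self._data = np.asarray(values, dtype='float64')

        def __getitem__(self, item):
            if isinstance(item, (int, np.integer)):
                return self._data[item]
            return type(self)(self._data[item])

        def __iter__(self):
            return iter(self._data)

        def __len__(self):
            return len(self._data)

        @property
        def nbytes(self):
            return self._data.nbytes

        def isna(self):
            return np.isnan(self._data)

        def take(self, indexer, allow_fill=True, fill_value=None):
            # fill handling omitted for brevity; see the take sketch above
            return type(self)(self._data.take(indexer))

        def copy(self, deep=False):
            return type(self)(self._data.copy())

        def get_values(self):
            return self._data

        def _formatting_values(self):
            return self._data

        @classmethod
        def _concat_same_type(cls, to_concat):
            return cls(np.concatenate([arr._data for arr in to_concat]))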
18 changes: 17 additions & 1 deletion pandas/core/arrays/categorical.py
@@ -44,6 +44,8 @@
from pandas.util._validators import validate_bool_kwarg
from pandas.core.config import get_option

from .base import ExtensionArray


def _cat_compare_op(op):
def f(self, other):
@@ -149,7 +151,7 @@ def _maybe_to_categorical(array):
"""


class Categorical(PandasObject):
class Categorical(ExtensionArray, PandasObject):
Member:
By having our internal arrays inherit from PandasObject, they also get a _constructor method. So we should either make sure this is never used (apart from in methods inside the array itself), or add this to the interface (my preference would be the first)

Contributor:
yeah, the methods in PandasObject need to be ABC in the ExtensionArray

Member:
On the other hand, I don't think all methods/attributes of PandasObject should be added to the public ExtensionArray (to keep those internal + to not clutter the ExtensionArray API)

Contributor Author:
FYI, I'm consistently testing these changes against

  1. An implementation of IntervalArray: https://github.com/TomAugspurger/pandas/compare/pandas-array-interface-3...TomAugspurger:pandas-array-upstream+interval?expand=1
  2. A branch on pandas-ip: https://github.com/ContinuumIO/pandas-ip/tree/pandas-array-upstream-compat

Neither inherits from PandasObject at the moment, so we're OK.

"""
Represents a categorical variable in classic R / S-plus fashion

@@ -2131,6 +2133,20 @@ def repeat(self, repeats, *args, **kwargs):
return self._constructor(values=codes, categories=self.categories,
ordered=self.ordered, fastpath=True)

# Interface things
# can_hold_na, concat_same_type, formatting_values
@property
def _can_hold_na(self):
return True

@classmethod
def _concat_same_type(self, to_concat):
from pandas.types.concat import union_categoricals
return union_categoricals(to_concat)

def _formatting_values(self):
return self
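
For reference, the delegation to union_categoricals above means concatenation unions the categories of the inputs. A quick illustration via the public import path (the example values are illustrative):

    import pandas as pd
    from pandas.api.types import union_categoricals

    a = pd.Categorical(['a', 'b'])
    b = pd.Categorical(['b', 'c'])
    union_categoricals([a, b])   # categories become ['a', 'b', 'c']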

# The Series.cat accessor


92 changes: 92 additions & 0 deletions pandas/core/dtypes/base.py
@@ -0,0 +1,92 @@
"""Extend pandas with custom array types"""
import abc

from pandas.compat import add_metaclass


@add_metaclass(abc.ABCMeta)
class ExtensionDtype(object):
"""A custom data type for your array.
"""
@property
def type(self):
"""Typically a metaclass inheriting from 'type' with no methods."""
return type(self.name, (), {})
Contributor Author:
I'm not really sure what array.dtype.type is used for. This passes the test suite, but may break things like

array1.dtype.type is array2.dtype.type

since the object IDs will be different (I think).

Member:
In NumPy, dtype.type is the corresponding scalar type, e.g.,

>>> np.dtype(np.float64).type
numpy.float64

I don't know where "Typically a metaclass inheriting from 'type' with no methods." comes from.

Contributor Author:
Thanks, that makes sense. Would you say that object is a good default here? I'll work through a test array that uses something meaningful like numbers.Real or int. We do use values.dtype.type in a few places, like figuring out which block type to use.

Member:
I don't think there's a good default value here. Generally the right choice is to return the corresponding scalar type, e.g., Interval for IntervalDtype.


@property
def kind(self):
"""A character code (one of 'biufcmMOSUV'), default 'O'
Member:
This should clarify how it's used. How is this useful?

Perhaps "This should match dtype.kind when arrays with this dtype are cast to numpy arrays"?


See Also
--------
numpy.dtype.kind
"""
return 'O'

@property
@abc.abstractmethod
def name(self):
"""An string identifying the data type.

Will be used in, e.g. ``Series.dtype``
"""

@property
def names(self):
"""Ordered list of field names, or None if there are no fields"""
return None

@classmethod
def construct_from_string(cls, string):
"""Attempt to construct this type from a string.

Parameters
----------
string : str

Returns
-------
self : instance of 'cls'

Raises
------
TypeError

Notes
-----
The default implementation checks if 'string' matches your
type's name. If so, it calls your class with no arguments.
"""
if string == cls.name:
return cls()
Member:
At the very least, this requirement for the constructor should be documented.

Member:
We still need to document this.

Contributor Author:
Perhaps it's best to just remove the default implementation and make it abstract? I'll document current implementation as a possible default.

else:
raise TypeError("Cannot construct a '{}' from "
"'{}'".format(cls, string))

@classmethod
def is_dtype(cls, dtype):
"""Check if we match 'dtype'

Parameters
----------
dtype : str or dtype

Returns
-------
is_dtype : bool

Notes
-----
The default implementation is True if

1. 'dtype' is a string that returns true for
Member:
"returns true" -> "does not raise" ?

``cls.construct_from_string``
2. 'dtype' is ``cls`` or a subclass of ``cls``.
"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
return issubclass(dtype, cls)
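
A hedged sketch of a third-party dtype built on this class, loosely modeled on the pandas-ip example mentioned elsewhere in this PR; the class and its scalar type are illustrative, and the results noted in the comments follow from the default implementations above:

    import ipaddress

    from pandas.core.dtypes.base import ExtensionDtype


    class IPDtype(ExtensionDtype):
        name = 'ip'
        type = ipaddress.IPv4Address   # scalar type, per the discussion of .type
        kind = 'O'

    # With the defaults inherited from ExtensionDtype:
    IPDtype.construct_from_string('ip')   # returns IPDtype()
    IPDtype.is_dtype('ip')                # True
    IPDtype.is_dtype(IPDtype)             # True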
32 changes: 32 additions & 0 deletions pandas/core/dtypes/common.py
@@ -1685,6 +1685,38 @@ def is_extension_type(arr):
return False


def is_extension_array_dtype(arr_or_dtype):
"""Check if an object is a pandas extension array type

Parameters
----------
arr_or_dtype : object

Returns
-------
bool

Notes
-----
This checks whether an object implements the pandas extension
array interface. In pandas, this includes:

* Categorical
* PeriodArray
* IntervalArray
* SparseArray

Third-party libraries may implement arrays or types satisfying
this interface as well.
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, ABCSeries):
arr_or_dtype = arr_or_dtype.values
Member:
This will only work if .values will return such a PeriodArray or IntervalArray, and I am not sure we already decided on that?

Contributor Author:
This is Series.values, what else would it return? An object-typed NumPy array? I think the ship has sailed on Series.values always being a NumPy array.

Member:
Let's use Series._values for now? That gets the values of the block, and is certainly an ExtensionArray in case the series holds one.
Then we can postpone the decision on what .values returns?

return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))
Contributor:
you need to call _get_dtype_type here, this can only have a result of ExtensionDtype and NOT ExtensionArray, which doesn't make any sense.

Member:
If you pass this function an ExtensionArray subclass, you will get that here?

Contributor Author:
this can only have a result of ExtensionDtype and NOT ExtensionArray, which doesn't make any sense.

The result is just True or False. The argument can be either an array or dtype.

I'm not sure that _get_dtype_or_type does what we want here. That grabs arr.dtype.type, which is a scalar type like str or Interval or ipaddress.IPv4Address. What would we do with that?

Contributor:
this is not following what we do in all other cases that's my point. pls use _get_dtype_or_type

Contributor Author (TomAugspurger, Feb 1, 2018):
this is not following what we do in all other cases that's my point.

That won't work unfortunately.

In [1]: import pandas as pd; import pandas_ip as ip

In [2]: arr = ip.IPAddress([1, 2, 3])

In [3]: pd.core.dtypes.common._get_dtype_type(arr)
Out[3]: pandas_ip.block.IPv4v6Base

IPv4v6Base isn't an instance of ExtensionType. It's the type scalars belong to.

In [4]: isinstance(arr[0], ip.block.IPv4v6Base)
Out[4]: True

In [5]: issubclass(ip.block.IPv4v6Base, pd.core.dtypes.base.ExtensionDtype)
Out[5]: False

_get_dtype_or_type works for our extension types, since if we get a CategoricalDtypeType we can say "this is a categorical". But we can't do that for arbitrary 3rd party types.

Contributor:
then _get_dtype_or_type needs adjustment. This is the point of compatibility, there shouldn't be the need to have special cases.

Member:
But that would make get_dtype_or_type inconsistent, as it would no longer always return a dtype type, but in certain cases a dtype

Contributor Author:
_get_dtype_type does exactly what it's supposed to do, values.dtype.type. But that's not useful here!

What's the issue with the function as defined? I need a way to tell if an array or dtype is an ExtensionArray or ExtensionDtype. Someday, when Categorical, SparseArray, IntervalArray, PeriodArray, datetimetz, etc are extension arrays then all the special cases currently in _get_dtype_type and friends can be removed, but we aren't there yet. We're doing things in small steps.
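
As defined in this diff, a short usage sketch of the new helper (Categorical qualifies because this PR makes it an ExtensionArray; a plain integer Series does not; the example values are illustrative):

    import pandas as pd
    from pandas.core.dtypes.common import is_extension_array_dtype

    cat = pd.Categorical(['a', 'b', 'a'])
    is_extension_array_dtype(cat)                    # True
    is_extension_array_dtype(pd.Series(cat))         # True, the Series is unpacked first
    is_extension_array_dtype(pd.Series([1, 2, 3]))   # False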



def is_complex_dtype(arr_or_dtype):
"""
Check whether the provided array or dtype is of a complex dtype.