ENH/API: ExtensionArray.factorize #20361

Merged
merged 64 commits into from
Mar 27, 2018
Changes from 41 commits
Commits
0ec3600
ENH: Sorting of ExtensionArrays
TomAugspurger Feb 19, 2018
4707273
REF: Split argsort into two parts
TomAugspurger Mar 2, 2018
b61fb8d
Fixed docstring
TomAugspurger Mar 2, 2018
44b6d72
Remove _values_for_argsort
TomAugspurger Mar 2, 2018
5be3917
Revert "Remove _values_for_argsort"
TomAugspurger Mar 2, 2018
e474c20
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 2, 2018
c2578c3
Workaround Py2
TomAugspurger Mar 2, 2018
b73e303
Indexer as array
TomAugspurger Mar 2, 2018
0db9e97
Fixed dtypes
TomAugspurger Mar 4, 2018
baf624c
Fixed docstring
TomAugspurger Mar 12, 2018
ce92f7b
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 12, 2018
8cbfc36
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 13, 2018
425fb2a
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 13, 2018
7bbe796
Update docs
TomAugspurger Mar 14, 2018
31ed4c9
ENH/API: ExtensionArray.factorize
TomAugspurger Mar 13, 2018
434df7d
fixup! ENH/API: ExtensionArray.factorize
TomAugspurger Mar 15, 2018
505ad44
REF: Changed ExtensionDtype inheritance
TomAugspurger Mar 15, 2018
77a10b6
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 16, 2018
b59656f
Fix factorize equivalence test
TomAugspurger Mar 16, 2018
201e029
Shared factorize doc
TomAugspurger Mar 16, 2018
9b0c2a9
Move to algorithms
TomAugspurger Mar 16, 2018
eb19488
BUG: py2 bug
TomAugspurger Mar 16, 2018
cbfee1a
Typo, ref
TomAugspurger Mar 16, 2018
35a8977
Change name
TomAugspurger Mar 17, 2018
7efece2
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 17, 2018
ef8e6cb
Fixed docs
TomAugspurger Mar 17, 2018
dd3bf1d
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 17, 2018
6a6034f
Wording
TomAugspurger Mar 17, 2018
5c758aa
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 18, 2018
5526398
fixup! Wording
TomAugspurger Mar 19, 2018
cd5c2db
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 19, 2018
d5e8198
Back to _values_for_argsort
TomAugspurger Mar 19, 2018
30941cb
Example with _from_factorize
TomAugspurger Mar 19, 2018
3574273
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 20, 2018
c776133
Unskip most JSON tests
TomAugspurger Mar 20, 2018
2a79315
Merge branch 'fu1+sort' into ea-factorize-2
TomAugspurger Mar 20, 2018
6ca65f8
Overridable na_value too.
TomAugspurger Mar 22, 2018
bbedd8c
Reverted sorting changes
TomAugspurger Mar 22, 2018
96ecab7
Remove a bit more argsort
TomAugspurger Mar 22, 2018
1010417
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 23, 2018
c288d67
Mask values going into hashtables
TomAugspurger Mar 23, 2018
55c9e31
remove stale comment
TomAugspurger Mar 23, 2018
163bfa3
wip
TomAugspurger Mar 23, 2018
872c24a
ENH: Parametrized NA sentinel for factorize
TomAugspurger Mar 23, 2018
3c18428
REF: Moved to get_labels
TomAugspurger Mar 23, 2018
703ab8a
Remove python-level use_na_value
TomAugspurger Mar 23, 2018
ab32e0f
REF: More cleanup
TomAugspurger Mar 24, 2018
62fa538
API: Make it non-public
TomAugspurger Mar 24, 2018
28fad50
Revert formatting changes in pxd
TomAugspurger Mar 24, 2018
8580754
linting
TomAugspurger Mar 24, 2018
cf14ee1
Handle bool
TomAugspurger Mar 24, 2018
8141131
Merge remote-tracking branch 'upstream/master' into parametrized-fact…
TomAugspurger Mar 24, 2018
a23d451
Specify dtypes
TomAugspurger Mar 24, 2018
b25f3d4
Remove unused variable.
TomAugspurger Mar 24, 2018
dfcda85
REF: Removed check_nulls
TomAugspurger Mar 25, 2018
eaff342
BUG: NaT for period
TomAugspurger Mar 26, 2018
c05c807
Merge remote-tracking branch 'upstream/master' into parametrized-fact…
TomAugspurger Mar 26, 2018
e786253
Other hashtable
TomAugspurger Mar 26, 2018
465d458
na_value_for_dtype PeriodDtype
TomAugspurger Mar 26, 2018
6f8036e
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 26, 2018
bca4cdf
Merge branch 'parametrized-factorize-na-value' into ea-factorize-2+pa…
TomAugspurger Mar 26, 2018
69c3ea2
Use na_value
TomAugspurger Mar 26, 2018
fa8e221
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 27, 2018
c06da3a
Typing
TomAugspurger Mar 27, 2018
140 changes: 118 additions & 22 deletions pandas/core/algorithms.py
@@ -4,6 +4,8 @@
"""
from __future__ import division
from warnings import warn, catch_warnings
from textwrap import dedent

import numpy as np

from pandas.core.dtypes.cast import (
@@ -34,7 +36,10 @@
from pandas.core import common as com
from pandas._libs import algos, lib, hashtable as htable
from pandas._libs.tslib import iNaT
from pandas.util._decorators import deprecate_kwarg
from pandas.util._decorators import (Appender, Substitution,
deprecate_kwarg)

_shared_docs = {}


# --------------- #
@@ -146,10 +151,9 @@ def _reconstruct_data(values, dtype, original):
Returns
-------
Index for extension types, otherwise ndarray casted to dtype

"""
from pandas import Index
if is_categorical_dtype(dtype):
if is_extension_array_dtype(dtype):
pass
elif is_datetime64tz_dtype(dtype) or is_period_dtype(dtype):
values = Index(original)._shallow_copy(values, name=None)
@@ -464,32 +468,124 @@ def _factorize_array(values, check_nulls, na_sentinel=-1, size_hint=None):
return labels, uniques


@deprecate_kwarg(old_arg_name='order', new_arg_name=None)
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
"""
Encode input values as an enumerated type or categorical variable
_shared_docs['factorize'] = """
Member

If it is only a single entry, can we make this a variable (eg _shared_docstring_factorize) ?

Contributor Author

I have a slight preference for keeping it as a dictionary, since it looks like the docstrings for unique and value_counts can be shared between ops and base. #20390
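
To make the trade-off concrete, here is a minimal sketch of the dictionary-of-templates pattern being discussed. `fill_doc` is a hypothetical stand-in for stacking `Substitution` and `Appender`, not the pandas decorators themselves, and the template text is shortened.

```python
# One dictionary can hold several shared templates (factorize, unique, ...).
_shared_docs = {}

_shared_docs["factorize"] = """
Encode the object as an enumerated type.

Parameters
----------
%(values)s%(sort)s
na_sentinel : int, default -1
    Value to mark "not found".
"""


def fill_doc(template, **params):
    """Hypothetical helper: fill the template and append it to __doc__."""
    def decorator(func):
        func.__doc__ = (func.__doc__ or "") + template % params
        return func
    return decorator


SORT_DOC = "sort : bool, default False\n    Sort `uniques`.\n"


@fill_doc(
    _shared_docs["factorize"],
    values="values : sequence\n    A 1-D sequence.\n",
    sort=SORT_DOC,
)
def factorize(values, sort=False, na_sentinel=-1):
    """Top-level function: documents the `values` parameter."""


# The method form has no `values` parameter, so that slot is left empty.
@fill_doc(_shared_docs["factorize"], values="", sort=SORT_DOC)
def method_factorize(self, sort=False, na_sentinel=-1):
    """Method form: `values` is implied by `self`."""
```

Each caller fills only the slots that apply to it, which is why the method version passes an empty string for `values`.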

Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an
array when all that matters is identifying distinct values. `factorize`
is available as both a top-level function :func:`pandas.factorize`,
and as a method :meth:`Series.factorize` and :meth:`Index.factorize`.

Parameters
----------
values : Sequence
ndarrays must be 1-D. Sequences that aren't pandas objects are
coerced to ndarrays before factorization.
sort : boolean, default False
Sort by values
%(values)s%(sort)s%(order)s
na_sentinel : int, default -1
Value to mark "not found"
size_hint : hint to the hashtable sizer
Contributor

need to show order as deprecated

Value to mark "not found".
%(size_hint)s\

Returns
-------
labels : the indexer to the original array
uniques : ndarray (1-d) or Index
the unique values. Index is returned when passed values is Index or
Series
labels : ndarray
An integer ndarray that's an indexer into `uniques`.
``uniques.take(labels)`` will have the same values as `values`.
uniques : ndarray, Index, or Categorical
The unique valid values. When `values` is Categorical, `uniques`
is a Categorical. When `values` is some other pandas object, an
`Index` is returned. Otherwise, a 1-D ndarray is returned.

.. note ::

Even if there's a missing value in `values`, `uniques` will
*not* contain an entry for it.

See Also
--------
pandas.cut : Discretize continuous-valued array.
Contributor

add pandas.unique here

pandas.unique : Find the unique values in an array.

Examples
--------
These examples all show factorize as a top-level method like
``pd.factorize(values)``. The results are identical for methods like
:meth:`Series.factorize`.

>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

With ``sort=True``, the `uniques` will be sorted, and `labels` will be
shuffled so that the relationship is maintained.

>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> labels
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)

Missing values are indicated in `labels` with `na_sentinel`
(``-1`` by default). Note that missing values are never
included in `uniques`.

>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> labels
array([ 0, -1, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

note: an array of Periods will ignore sort as it returns an always sorted
PeriodIndex.
Thus far, we've only factorized lists (which are internally coerced to
NumPy arrays). When factorizing pandas objects, the type of `uniques`
will differ. For Categoricals, a `Categorical` is returned.

>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]

Notice that ``'b'`` is in ``uniques.categories``, despite not being
present in ``cat.values``.

For all other pandas objects, an Index of the appropriate type is
returned.

>>> cat = pd.Series(['a', 'a', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
"""


@Substitution(
values=dedent("""\
values : sequence
A 1-D sequence. Sequences that aren't pandas objects are
coerced to ndarrays before factorization.
"""),
order=dedent("""\
order
.. deprecated:: 0.23.0

This parameter has no effect and is deprecated.
"""),
sort=dedent("""\
sort : bool, default False
Sort `uniques` and shuffle `labels` to maintain the
relationship.
"""),
size_hint=dedent("""\
size_hint : int, optional
Hint to the hashtable sizer.
"""),
)
@Appender(_shared_docs['factorize'])
@deprecate_kwarg(old_arg_name='order', new_arg_name=None)
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
# Implementation notes: This method is responsible for 3 things
# 1.) coercing data to array-like (ndarray, Index, extension array)
# 2.) factorizing labels and uniques
@@ -502,9 +598,9 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
values = _ensure_arraylike(values)
original = values

if is_categorical_dtype(values):
if is_extension_array_dtype(values):
values = getattr(values, '_values', values)
labels, uniques = values.factorize()
labels, uniques = values.factorize(na_sentinel=na_sentinel)
Contributor Author

Note: this was a bug in #19938 where I forgot to pass this through. It's covered by our extension tests.

dtype = original.dtype
else:
values, dtype, _ = _ensure_data(values)
78 changes: 78 additions & 0 deletions pandas/core/arrays/base.py
@@ -77,6 +77,24 @@ def _constructor_from_sequence(cls, scalars):
"""
raise AbstractMethodError(cls)

@classmethod
def _from_factorized(cls, values, original):
"""Reconstruct an ExtensionArray after factorization.

Parameters
----------
values : ndarray
An integer ndarray with the factorized values.
original : ExtensionArray
The original ExtensionArray that factorize was called on.

See Also
--------
pandas.factorize
ExtensionArray.factorize
"""
raise AbstractMethodError(cls)

# ------------------------------------------------------------------------
# Must be a Sequence
# ------------------------------------------------------------------------
@@ -353,6 +371,66 @@ def unique(self):
uniques = unique(self.astype(object))
return self._constructor_from_sequence(uniques)

def _values_for_factorize(self):
"""Return an array and missing value suitable for factorization.

Returns
-------
values : ndarray
An array suitable for factorization. This should maintain order
and be a supported dtype. By default, the extension array is cast
to object dtype.
"""
return self.astype(object)
Member

We should add a comment here that those values might be modified (in the current state of the PR). So that you eg don't just return the ordinals / codes for Period (which is not the case for Categorical right now, because it does astype(int64) which always copies)
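
The hazard can be sketched without pandas. `ToyPeriodArray`, `values_shared`, `values_copied`, and `mask_in_place` are all hypothetical names; plain lists stand in for the underlying buffers, and `NA` stands in for an integer marker such as `iNaT`. A NumPy view would presumably behave like the shared list here, while `astype` with its default `copy=True` behaves like the copied one.

```python
NA = -9223372036854775808  # stand-in for an integer NA marker such as iNaT


class ToyPeriodArray:
    """Hypothetical array storing integer ordinals plus a missing mask."""

    def __init__(self, ordinals, mask):
        self.ordinals = list(ordinals)
        self.mask = list(mask)

    def values_shared(self):
        # Hands out the internal buffer itself.
        return self.ordinals

    def values_copied(self):
        # Hands out a fresh copy that callers may modify.
        return list(self.ordinals)


def mask_in_place(arr, mask, na_value):
    """Overwrite masked positions, as factorize does before hashing."""
    for i, missing in enumerate(mask):
        if missing:
            arr[i] = na_value


safe = ToyPeriodArray([10, 0, 12], [False, True, False])
mask_in_place(safe.values_copied(), safe.mask, NA)

unsafe = ToyPeriodArray([10, 0, 12], [False, True, False])
mask_in_place(unsafe.values_shared(), unsafe.mask, NA)
```

After this runs, `safe.ordinals` is untouched, while `unsafe.ordinals` has had its masked slot overwritten.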


def factorize(self, na_sentinel=-1):
Contributor

you need a sort=False arg here

Contributor Author

Do we want / need that? It complicates the implementation a bit. Any idea what it's actually used for?
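
For context on what `sort` involves: it can be applied as a post-processing step on an unsorted result, which is one way the top-level function could offer it without the array method supporting it. The sketch below uses plain lists and a hypothetical helper name, and assumes `na_sentinel` is negative so it cannot collide with a real code.

```python
def sort_factorized(labels, uniques, na_sentinel=-1):
    """Sort `uniques` and remap `labels` so they still point at the same values."""
    # order[new] is the old position of the value that ends up at `new`.
    order = sorted(range(len(uniques)), key=lambda i: uniques[i])
    sorted_uniques = [uniques[i] for i in order]

    # new_code[old] is the new position of the value that was at `old`.
    new_code = [0] * len(uniques)
    for new, old in enumerate(order):
        new_code[old] = new

    # Missing entries keep the sentinel; everything else is remapped.
    new_labels = [
        na_sentinel if lab == na_sentinel else new_code[lab] for lab in labels
    ]
    return new_labels, sorted_uniques


labels, uniques = sort_factorized([0, -1, 1, 2, 0], ["b", "a", "c"])
```

Here `uniques` becomes `['a', 'b', 'c']` and `labels` becomes `[1, -1, 0, 2, 1]`, so each label still selects the same value as before.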

"""Encode the extension array as an enumerated type.
Contributor Author

Ah, need to merge this one too maybe. Will have to check on import order...


Parameters
----------
na_sentinel : int, default -1
Value to use in the `labels` array to indicate missing values.

Returns
-------
labels : ndarray
An integer NumPy array that's an indexer into the original
ExtensionArray.
uniques : ExtensionArray
An ExtensionArray containing the unique values of `self`.

.. note::

uniques should *not* contain a value for the NA sentinel,
if values in `self` are missing.

See Also
--------
pandas.factorize : top-level factorize method that dispatches here.

Notes
-----
:meth:`pandas.factorize` offers a `sort` keyword as well.
"""
from pandas.core.algorithms import _factorize_array
Member

We're OK with using this private API here?

Because extension authors might want to copy-paste this method and change the arr = self.astype(object) line? (How do you implement factorize in cyberpandas?)

Contributor Author

Quite similar to this. https://github.com/ContinuumIO/cyberpandas/blob/468644bcbdc9320a1a33b0df393d4fa4bef57dd7/cyberpandas/base.py#L72

In that case I think going to object dtypes is unavoidable, since there's no easy way to factorize a 2-D array, and I didn't want to write a new hashtable implementation :)
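
A rough illustration of why falling back to generic objects sidesteps the 2-D problem: once each row is a single hashable object, an ordinary hash table can factorize it. This is a sketch with plain tuples and a hypothetical function name, not the cyberpandas code linked above.

```python
def factorize_rows(rows):
    """Factorize multi-field records by hashing each row as one tuple."""
    seen = {}      # row tuple -> code
    labels = []
    uniques = []
    for row in rows:
        key = tuple(row)           # one hashable object per row
        if key not in seen:
            seen[key] = len(uniques)
            uniques.append(key)
        labels.append(seen[key])
    return labels, uniques


# Hypothetical two-field records, e.g. the high and low halves of a wide integer.
labels, uniques = factorize_rows([(0, 1), (0, 2), (0, 1)])
```

This gives `labels == [0, 1, 0]` and `uniques == [(0, 1), (0, 2)]`. The cost is the per-row object creation, which is the same cost an object-dtype cast pays.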

Contributor Author

@TomAugspurger TomAugspurger Mar 16, 2018

w.r.t. using _factorize_array, I don't think it's avoidable.

We might consider making it public / semi-public (keep the _ to not confuse users, but point EA authors to it).
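
To show how the two hooks in this diff fit together from an extension author's point of view, here is a toy sketch. `ToyCategorical` is a hypothetical class that only mimics the shape of the contract (`_values_for_factorize`, `_from_factorized`, `factorize`); the hashing loop stands in for `_factorize_array`, and unlike the pandas implementation it rebuilds the result from codes rather than from category values.

```python
class ToyCategorical:
    """Hypothetical stand-in, not pandas' Categorical."""

    def __init__(self, codes, categories):
        self.codes = list(codes)            # -1 marks a missing value
        self.categories = list(categories)

    def _values_for_factorize(self):
        # Hook 1: cheap, hashable values. A copy, so callers may modify it.
        return list(self.codes)

    @classmethod
    def _from_factorized(cls, uniques, original):
        # Hook 2: rebuild an array of the original type from the unique values.
        return cls(uniques, original.categories)

    def factorize(self, na_sentinel=-1):
        # Shared driver: hash the hook-1 values, then hand off to hook 2.
        labels, uniques, seen = [], [], {}
        for code in self._values_for_factorize():
            if code == -1:
                labels.append(na_sentinel)
                continue
            if code not in seen:
                seen[code] = len(uniques)
                uniques.append(code)
            labels.append(seen[code])
        return labels, self._from_factorized(uniques, self)

    def to_list(self):
        return [None if c == -1 else self.categories[c] for c in self.codes]


cat = ToyCategorical([0, 0, 2], ["a", "b", "c"])
labels, uniques = cat.factorize()
```

With this split, a subclass only overrides the two hooks and inherits the driver, which is the argument for pointing extension authors at a semi-public helper.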

import pandas.core.dtypes.common as com
from pandas._libs.tslib import iNaT

mask = self.isna()
arr = self._values_for_factorize()

# Mask values going into the hash table with the appropriate
# NA type.
if com.is_signed_integer_dtype(arr):
arr[mask] = iNaT
elif com.is_float_dtype(arr) or com.is_object_dtype(arr):
arr[mask] = np.nan
Member

@jorisvandenbossche jorisvandenbossche Mar 23, 2018

My original idea with a mask keyword for factorize, is that this if/elif code block would do this in factorize itself (but again, then we would need an extra copy ...)


labels, uniques = _factorize_array(arr, check_nulls=True,
Member

we could also use pd.factorize here if we want to limit the use of internal APIs

Contributor Author

I was just trying that. Unfortunately, I have to pass check_nulls=True for categorical to work properly. It's typically False for integers. But perhaps that's a sign we're doing something improper :/

Member

Ah, yes, because there we still know what the original dtype was ...
Hmm, that maybe shows that we rather need a na_value keyword (just from a perspective of making pd.factorize generally usable, not necessarily in this PR).
In general, the default of na_value=None could be the val != val idiom, and you can override that with val == na_value with this keyword (e.g. to specify iNaT or -1 for integer data).
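
The proposed semantics could look roughly like this in plain Python. The function name and signature are hypothetical, chosen only to illustrate the `val != val` default and the `val == na_value` override; this is not the pandas API.

```python
def factorize_with_na_value(values, na_value=None, na_sentinel=-1):
    """Factorize, treating `na_value` (or NaN-like values by default) as missing."""

    def is_missing(val):
        if na_value is None:
            return val != val       # true only for NaN-like values
        return val == na_value      # explicit marker, e.g. iNaT or -1

    seen, labels, uniques = {}, [], []
    for val in values:
        if is_missing(val):
            labels.append(na_sentinel)
            continue
        if val not in seen:
            seen[val] = len(uniques)
            uniques.append(val)
        labels.append(seen[val])
    return labels, uniques


float_labels, float_uniques = factorize_with_na_value([1.0, float("nan"), 2.0, 1.0])
int_labels, int_uniques = factorize_with_na_value([3, -1, 3, 7], na_value=-1)
```

The float call yields labels `[0, -1, 1, 0]` with uniques `[1.0, 2.0]`; the integer call yields labels `[0, -1, 0, 1]` with uniques `[3, 7]`. Note that the default check cannot flag anything in integer data, which is the case the keyword is meant to cover.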

Contributor Author

OK, this issue and the fact that UInt64Hashtable doesn't have a null value are making me come back around to the mask approach.

That seems like the only way we can ensure that we'll actually get things correct.
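
A minimal sketch of the mask approach, assuming a boolean mask is supplied alongside the values (the function name is hypothetical). Because missingness travels separately, no value in the data has to be reserved as a null marker, which is the problem with unsigned integers.

```python
def factorize_masked(values, mask, na_sentinel=-1):
    """Factorize `values`, skipping positions where `mask` is True."""
    seen, labels, uniques = {}, [], []
    for val, missing in zip(values, mask):
        if missing:
            # Whatever sits in `values` at a masked position is ignored.
            labels.append(na_sentinel)
            continue
        if val not in seen:
            seen[val] = len(uniques)
            uniques.append(val)
        labels.append(seen[val])
    return labels, uniques


# The first 0 is a real value; the second 0 is just filler under the mask.
labels, uniques = factorize_masked([0, 0, 5], [False, True, False])
```

The result is `labels == [0, -1, 1]` and `uniques == [0, 5]`: the unmasked `0` is kept as a valid value even though the masked slot holds the same number.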

na_sentinel=na_sentinel)
uniques = self._from_factorized(uniques, self)
return labels, uniques

# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
58 changes: 8 additions & 50 deletions pandas/core/arrays/categorical.py
@@ -2119,59 +2119,17 @@ def unique(self):
take_codes = sorted(take_codes)
return cat.set_categories(cat.categories.take(take_codes))

def factorize(self, na_sentinel=-1):
"""Encode the Categorical as an enumerated type.

Parameters
----------
sort : boolean, default False
Sort by values
na_sentinel: int, default -1
Value to mark "not found"

Returns
-------
labels : ndarray
An integer NumPy array that's an indexer into the original
Categorical
uniques : Categorical
A Categorical whose values are the unique values and
whose dtype matches the original CategoricalDtype. Note that any
unobserved categories in ``self`` will not be present in
``uniques.values``. They will be present in ``uniques.categories``.

Examples
--------
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = cat.factorize()
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]

Missing values are handled

>>> labels, uniques = pd.factorize(pd.Categorical(['a', 'b', None]))
>>> labels
array([ 0, 1, -1])
>>> uniques
[a, b]
Categories (2, object): [a, b]
"""
from pandas.core.algorithms import _factorize_array

def _values_for_factorize(self):
codes = self.codes.astype('int64')
codes[codes == -1] = iNaT
# We set missing codes, normally -1, to iNaT so that the
# Int64HashTable treats them as missing values.
labels, uniques = _factorize_array(codes, check_nulls=True,
na_sentinel=na_sentinel)
uniques = self._constructor(self.categories.take(uniques),
categories=self.categories,
ordered=self.ordered)
return labels, uniques
return codes

@classmethod
def _from_factorized(cls, uniques, original):
return original._constructor(original.categories.take(uniques),
categories=original.categories,
ordered=original.ordered)

def equals(self, other):
"""
27 changes: 10 additions & 17 deletions pandas/core/base.py
@@ -2,6 +2,7 @@
Base and utility classes for pandas objects.
"""
import warnings
import textwrap
from pandas import compat
from pandas.compat import builtins
import numpy as np
@@ -1151,24 +1152,16 @@ def memory_usage(self, deep=False):
v += lib.memory_usage_of_objects(self.values)
return v

@Substitution(
values='', order='', size_hint='',
sort=textwrap.dedent("""\
sort : boolean, default False
Sort `uniques` and shuffle `labels` to maintain the
relationship.
"""))
@Appender(algorithms._shared_docs['factorize'])
def factorize(self, sort=False, na_sentinel=-1):
"""
Encode the object as an enumerated type or categorical variable

Parameters
----------
sort : boolean, default False
Sort by values
na_sentinel: int, default -1
Value to mark "not found"

Returns
-------
labels : the indexer to the original array
uniques : the unique Index
"""
from pandas.core.algorithms import factorize
return factorize(self, sort=sort, na_sentinel=na_sentinel)
return algorithms.factorize(self, sort=sort, na_sentinel=na_sentinel)

_shared_docs['searchsorted'] = (
"""Find indices where elements should be inserted to maintain order.