ENH/API: ExtensionArray.factorize #20361

Merged
merged 64 commits into from
Mar 27, 2018
Commits
0ec3600
ENH: Sorting of ExtensionArrays
TomAugspurger Feb 19, 2018
4707273
REF: Split argsort into two parts
TomAugspurger Mar 2, 2018
b61fb8d
Fixed docstring
TomAugspurger Mar 2, 2018
44b6d72
Remove _values_for_argsort
TomAugspurger Mar 2, 2018
5be3917
Revert "Remove _values_for_argsort"
TomAugspurger Mar 2, 2018
e474c20
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 2, 2018
c2578c3
Workaround Py2
TomAugspurger Mar 2, 2018
b73e303
Indexer as array
TomAugspurger Mar 2, 2018
0db9e97
Fixed dtypes
TomAugspurger Mar 4, 2018
baf624c
Fixed docstring
TomAugspurger Mar 12, 2018
ce92f7b
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 12, 2018
8cbfc36
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 13, 2018
425fb2a
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 13, 2018
7bbe796
Update docs
TomAugspurger Mar 14, 2018
31ed4c9
ENH/API: ExtensionArray.factorize
TomAugspurger Mar 13, 2018
434df7d
fixup! ENH/API: ExtensionArray.factorize
TomAugspurger Mar 15, 2018
505ad44
REF: Changed ExtensionDtype inheritance
TomAugspurger Mar 15, 2018
77a10b6
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 16, 2018
b59656f
Fix factorize equivalence test
TomAugspurger Mar 16, 2018
201e029
Shared factorize doc
TomAugspurger Mar 16, 2018
9b0c2a9
Move to algorithms
TomAugspurger Mar 16, 2018
eb19488
BUG: py2 bug
TomAugspurger Mar 16, 2018
cbfee1a
Typo, ref
TomAugspurger Mar 16, 2018
35a8977
Change name
TomAugspurger Mar 17, 2018
7efece2
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 17, 2018
ef8e6cb
Fixed docs
TomAugspurger Mar 17, 2018
dd3bf1d
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 17, 2018
6a6034f
Wording
TomAugspurger Mar 17, 2018
5c758aa
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 18, 2018
5526398
fixup! Wording
TomAugspurger Mar 19, 2018
cd5c2db
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 19, 2018
d5e8198
Back to _values_for_argsort
TomAugspurger Mar 19, 2018
30941cb
Example with _from_factorize
TomAugspurger Mar 19, 2018
3574273
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 20, 2018
c776133
Unskip most JSON tests
TomAugspurger Mar 20, 2018
2a79315
Merge branch 'fu1+sort' into ea-factorize-2
TomAugspurger Mar 20, 2018
6ca65f8
Overridable na_value too.
TomAugspurger Mar 22, 2018
bbedd8c
Reverted sorting changes
TomAugspurger Mar 22, 2018
96ecab7
Remove a bit more argsort
TomAugspurger Mar 22, 2018
1010417
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 23, 2018
c288d67
Mask values going into hashtables
TomAugspurger Mar 23, 2018
55c9e31
remove stale comment
TomAugspurger Mar 23, 2018
163bfa3
wip
TomAugspurger Mar 23, 2018
872c24a
ENH: Parametrized NA sentinel for factorize
TomAugspurger Mar 23, 2018
3c18428
REF: Moved to get_labels
TomAugspurger Mar 23, 2018
703ab8a
Remove python-level use_na_value
TomAugspurger Mar 23, 2018
ab32e0f
REF: More cleanup
TomAugspurger Mar 24, 2018
62fa538
API: Make it non-public
TomAugspurger Mar 24, 2018
28fad50
Revert formatting changes in pxd
TomAugspurger Mar 24, 2018
8580754
linting
TomAugspurger Mar 24, 2018
cf14ee1
Handle bool
TomAugspurger Mar 24, 2018
8141131
Merge remote-tracking branch 'upstream/master' into parametrized-fact…
TomAugspurger Mar 24, 2018
a23d451
Specify dtypes
TomAugspurger Mar 24, 2018
b25f3d4
Remove unused variable.
TomAugspurger Mar 24, 2018
dfcda85
REF: Removed check_nulls
TomAugspurger Mar 25, 2018
eaff342
BUG: NaT for period
TomAugspurger Mar 26, 2018
c05c807
Merge remote-tracking branch 'upstream/master' into parametrized-fact…
TomAugspurger Mar 26, 2018
e786253
Other hashtable
TomAugspurger Mar 26, 2018
465d458
na_value_for_dtype PeriodDtype
TomAugspurger Mar 26, 2018
6f8036e
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 26, 2018
bca4cdf
Merge branch 'parametrized-factorize-na-value' into ea-factorize-2+pa…
TomAugspurger Mar 26, 2018
69c3ea2
Use na_value
TomAugspurger Mar 26, 2018
fa8e221
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 27, 2018
c06da3a
Typing
TomAugspurger Mar 27, 2018
140 changes: 118 additions & 22 deletions pandas/core/algorithms.py
Expand Up @@ -4,6 +4,8 @@
"""
from __future__ import division
from warnings import warn, catch_warnings
from textwrap import dedent

import numpy as np

from pandas.core.dtypes.cast import (
Expand Down Expand Up @@ -34,7 +36,10 @@
from pandas.core import common as com
from pandas._libs import algos, lib, hashtable as htable
from pandas._libs.tslib import iNaT
from pandas.util._decorators import deprecate_kwarg
from pandas.util._decorators import (Appender, Substitution,
deprecate_kwarg)

_shared_docs = {}


# --------------- #
Expand Down Expand Up @@ -146,10 +151,9 @@ def _reconstruct_data(values, dtype, original):
Returns
-------
Index for extension types, otherwise ndarray casted to dtype
"""
from pandas import Index
if is_categorical_dtype(dtype):
if is_extension_array_dtype(dtype):
pass
elif is_datetime64tz_dtype(dtype) or is_period_dtype(dtype):
values = Index(original)._shallow_copy(values, name=None)
Expand Down Expand Up @@ -469,32 +473,124 @@ def _factorize_array(values, na_sentinel=-1, size_hint=None,
return labels, uniques


@deprecate_kwarg(old_arg_name='order', new_arg_name=None)
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
"""
Encode input values as an enumerated type or categorical variable
_shared_docs['factorize'] = """
Member
If it is only a single entry, can we make this a variable (eg _shared_docstring_factorize) ?

Contributor Author
I have a slight preference for keeping it as a dictionary, since it looks like the docstrings for unique and value_counts can be shared between ops and base. #20390

Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an
array when all that matters is identifying distinct values. `factorize`
is available as both a top-level function :func:`pandas.factorize`,
and as a method :meth:`Series.factorize` and :meth:`Index.factorize`.
Parameters
----------
values : Sequence
ndarrays must be 1-D. Sequences that aren't pandas objects are
coerced to ndarrays before factorization.
sort : boolean, default False
Sort by values
%(values)s%(sort)s%(order)s
na_sentinel : int, default -1
Value to mark "not found"
size_hint : hint to the hashtable sizer
Value to mark "not found".
%(size_hint)s\
Returns
-------
labels : the indexer to the original array
uniques : ndarray (1-d) or Index
the unique values. Index is returned when passed values is Index or
Series
labels : ndarray
An integer ndarray that's an indexer into `uniques`.
``uniques.take(labels)`` will have the same values as `values`.
uniques : ndarray, Index, or Categorical
The unique valid values. When `values` is Categorical, `uniques`
is a Categorical. When `values` is some other pandas object, an
`Index` is returned. Otherwise, a 1-D ndarray is returned.
.. note ::
Even if there's a missing value in `values`, `uniques` will
*not* contain an entry for it.
See Also
--------
pandas.cut : Discretize continuous-valued array.
Contributor
add pandas.unique here

pandas.unique : Find the unique values in an array.
Examples
--------
These examples all show factorize as a top-level method like
``pd.factorize(values)``. The results are identical for methods like
:meth:`Series.factorize`.
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
With ``sort=True``, the `uniques` will be sorted, and `labels` will be
shuffled so that the relationship is maintained.
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> labels
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)
Missing values are indicated in `labels` with `na_sentinel`
(``-1`` by default). Note that missing values are never
included in `uniques`.
>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> labels
array([ 0, -1, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
note: an array of Periods will ignore sort as it returns an always sorted
PeriodIndex.
Thus far, we've only factorized lists (which are internally coerced to
NumPy arrays). When factorizing pandas objects, the type of `uniques`
will differ. For Categoricals, a `Categorical` is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]
Notice that ``'b'`` is in ``uniques.categories``, despite not being
present in ``cat.values``.
For all other pandas objects, an Index of the appropriate type is
returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
"""


@Substitution(
values=dedent("""\
values : sequence
A 1-D sequence. Sequences that aren't pandas objects are
coerced to ndarrays before factorization.
"""),
order=dedent("""\
order
.. deprecated:: 0.23.0
This parameter has no effect and is deprecated.
"""),
sort=dedent("""\
sort : bool, default False
Sort `uniques` and shuffle `labels` to maintain the
relationship.
"""),
size_hint=dedent("""\
size_hint : int, optional
Hint to the hashtable sizer.
"""),
)
@Appender(_shared_docs['factorize'])
@deprecate_kwarg(old_arg_name='order', new_arg_name=None)
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
# Implementation notes: This method is responsible for 3 things
# 1.) coercing data to array-like (ndarray, Index, extension array)
# 2.) factorizing labels and uniques
Expand All @@ -507,9 +603,9 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
values = _ensure_arraylike(values)
original = values

if is_categorical_dtype(values):
if is_extension_array_dtype(values):
values = getattr(values, '_values', values)
labels, uniques = values.factorize()
labels, uniques = values.factorize(na_sentinel=na_sentinel)
Contributor Author
Note: this was a bug in #19938 where I forgot to pass this through. It's covered by our extension tests.

dtype = original.dtype
else:
values, dtype, _ = _ensure_data(values)
Expand Down
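Note on the contract documented above: the shared docstring promises first-seen order for `uniques`, `na_sentinel` in `labels` for missing values, and relabeling under `sort=True`. A pure-Python sketch of that contract follows. `factorize_sketch` is a hypothetical helper written for this note, not pandas code; it treats only `None` as missing, whereas the real implementation relies on the hashtables in `pandas._libs`.

```python
def factorize_sketch(values, sort=False, na_sentinel=-1):
    """Toy factorize: return (labels, uniques) as plain lists."""
    positions = {}  # value -> index into uniques
    uniques = []
    labels = []     # None marks a missing value until the final step
    for value in values:
        if value is None:
            labels.append(None)
            continue
        if value not in positions:
            positions[value] = len(uniques)
            uniques.append(value)
        labels.append(positions[value])

    if sort:
        # order[new_label] == old_label after sorting the uniques
        order = sorted(range(len(uniques)), key=lambda i: uniques[i])
        remap = {old: new for new, old in enumerate(order)}
        uniques = [uniques[i] for i in order]
        labels = [None if lab is None else remap[lab] for lab in labels]

    # Missing values never appear in uniques; they only get the sentinel.
    labels = [na_sentinel if lab is None else lab for lab in labels]
    return labels, uniques


labels, uniques = factorize_sketch(['b', 'b', 'a', 'c', 'b'], sort=True)
# Taking uniques at labels reproduces the input when nothing is missing.
roundtrip = [uniques[i] for i in labels]
```

The sketch keeps missing entries as `None` until the last step so that a caller-supplied `na_sentinel` can never be confused with a real label during the `sort=True` remapping.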
85 changes: 85 additions & 0 deletions pandas/core/arrays/base.py
Expand Up @@ -77,6 +77,24 @@ def _constructor_from_sequence(cls, scalars):
"""
raise AbstractMethodError(cls)

@classmethod
def _from_factorized(cls, values, original):
"""Reconstruct an ExtensionArray after factorization.
Parameters
----------
values : ndarray
An integer ndarray with the factorized values.
original : ExtensionArray
The original ExtensionArray that factorize was called on.
See Also
--------
pandas.factorize
ExtensionArray.factorize
"""
raise AbstractMethodError(cls)

# ------------------------------------------------------------------------
# Must be a Sequence
# ------------------------------------------------------------------------
Expand Down Expand Up @@ -353,6 +371,73 @@ def unique(self):
uniques = unique(self.astype(object))
return self._constructor_from_sequence(uniques)

def _values_for_factorize(self):
# type: () -> Tuple[ndarray, Any]
"""Return an array and missing value suitable for factorization.
Returns
-------
values : ndarray
An array suitable for factorization. This should maintain order
and be a supported dtype (Float64, Int64, UInt64, String, Object).
By default, the extension array is cast to object dtype.
na_value : object
The value in `values` to consider missing. This will be treated
as NA in the factorization routines, so it will be coded as
`na_sentinel` and not included in `uniques`. By default,
``np.nan`` is used.
"""
return self.astype(object), np.nan

def factorize(self, na_sentinel=-1):
# type: (int) -> Tuple[ndarray, ExtensionArray]
"""Encode the extension array as an enumerated type.
Parameters
----------
na_sentinel : int, default -1
Value to use in the `labels` array to indicate missing values.
Returns
-------
labels : ndarray
An integer NumPy array that's an indexer into the original
ExtensionArray.
uniques : ExtensionArray
An ExtensionArray containing the unique values of `self`.
.. note::
uniques will *not* contain an entry for the NA value of
the ExtensionArray if there are any missing values present
in `self`.
See Also
--------
pandas.factorize : Top-level factorize method that dispatches here.
Notes
-----
:meth:`pandas.factorize` offers a `sort` keyword as well.
"""
# Implementer note: There are two ways to override the behavior of
# pandas.factorize
# 1. _values_for_factorize and _from_factorized.
# Specify the values passed to pandas' internal factorization
# routines, and how to convert from those values back to the
# original ExtensionArray.
# 2. ExtensionArray.factorize.
# Complete control over factorization.
from pandas.core.algorithms import _factorize_array

arr, na_value = self._values_for_factorize()

labels, uniques = _factorize_array(arr, na_sentinel=na_sentinel,
na_value=na_value)

uniques = self._from_factorized(uniques, self)
return labels, uniques

# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
Expand Down
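Note on the two hooks added in this file: the base `factorize` asks `_values_for_factorize` for plain values plus an NA marker, runs the generic routine, and hands the unique values to `_from_factorized` to rebuild an array of the original type. The sketch below mirrors that flow without pandas. `toy_factorize_array`, `ToyArray`, and `ToyStringArray` are hypothetical names invented for this note, and the helper is only a stand-in for the internal `_factorize_array`.

```python
def toy_factorize_array(values, na_sentinel=-1, na_value=None):
    """Hypothetical stand-in for pandas' internal _factorize_array."""
    seen = {}    # value -> position in uniques, kept in first-seen order
    labels = []
    for value in values:
        if value == na_value:
            labels.append(na_sentinel)
        else:
            labels.append(seen.setdefault(value, len(seen)))
    return labels, list(seen)


class ToyArray:
    """Hypothetical array type showing the default hook behaviour."""

    def __init__(self, data):
        self.data = list(data)

    def _values_for_factorize(self):
        # Plain values, plus the value that should be treated as missing.
        return self.data, None

    @classmethod
    def _from_factorized(cls, values, original):
        # Rebuild an array of the same type from the unique values.
        return cls(values)

    def factorize(self, na_sentinel=-1):
        values, na_value = self._values_for_factorize()
        labels, uniques = toy_factorize_array(values, na_sentinel, na_value)
        return labels, self._from_factorized(uniques, self)


class ToyStringArray(ToyArray):
    """Overrides only the hook: the empty string counts as missing."""

    def _values_for_factorize(self):
        return self.data, ""


labels, uniques = ToyStringArray(["x", "", "y", "x"]).factorize()
```

Overriding only `_values_for_factorize` is enough here because `_from_factorized` is a classmethod, so `uniques` comes back as a `ToyStringArray` without further work. A subclass that needs full control would override `factorize` itself, which is the second option listed in the implementer note.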
59 changes: 8 additions & 51 deletions pandas/core/arrays/categorical.py
Expand Up @@ -2118,58 +2118,15 @@ def unique(self):
take_codes = sorted(take_codes)
return cat.set_categories(cat.categories.take(take_codes))

def factorize(self, na_sentinel=-1):
"""Encode the Categorical as an enumerated type.
Parameters
----------
sort : boolean, default False
Sort by values
na_sentinel: int, default -1
Value to mark "not found"
Returns
-------
labels : ndarray
An integer NumPy array that's an indexer into the original
Categorical
uniques : Categorical
A Categorical whose values are the unique values and
whose dtype matches the original CategoricalDtype. Note that any
unobserved categories in ``self`` will not be present
in ``uniques.values``. They will be present in
``uniques.categories``
Examples
--------
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = cat.factorize()
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]
Missing values are handled
>>> labels, uniques = pd.factorize(pd.Categorical(['a', 'b', None]))
>>> labels
array([ 0, 1, -1])
>>> uniques
[a, b]
Categories (2, object): [a, b]
"""
from pandas.core.algorithms import _factorize_array

def _values_for_factorize(self):
codes = self.codes.astype('int64')
# We set missing codes, normally -1, to iNaT so that the
# Int64HashTable treats them as missing values.
labels, uniques = _factorize_array(codes, na_sentinel=na_sentinel,
na_value=-1)
uniques = self._constructor(self.categories.take(uniques),
categories=self.categories,
ordered=self.ordered)
return labels, uniques
return codes, -1

@classmethod
def _from_factorized(cls, uniques, original):
return original._constructor(original.categories.take(uniques),
categories=original.categories,
ordered=original.ordered)

def equals(self, other):
"""
Expand Down
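Note on the Categorical change: after this diff, Categorical no longer carries its own `factorize`. It factorizes its integer codes with `-1` as the NA value, then rebuilds a Categorical from the unique codes while keeping the full category list. A minimal sketch of that idea, assuming a hypothetical `ToyCategorical` class that stores codes and categories as plain lists:

```python
class ToyCategorical:
    """Hypothetical stand-in for Categorical: codes index into categories."""

    def __init__(self, codes, categories):
        self.codes = list(codes)            # -1 marks a missing value
        self.categories = list(categories)

    @property
    def values(self):
        return [None if c == -1 else self.categories[c] for c in self.codes]

    def _values_for_factorize(self):
        # Factorize the integer codes; -1 is the missing-value marker.
        return self.codes, -1

    @classmethod
    def _from_factorized(cls, uniques, original):
        # The unique codes still index into the original categories,
        # so unobserved categories are preserved.
        return cls(uniques, original.categories)

    def factorize(self, na_sentinel=-1):
        codes, na_value = self._values_for_factorize()
        seen = {}    # code -> label, in first-seen order
        labels = []
        for code in codes:
            if code == na_value:
                labels.append(na_sentinel)
            else:
                labels.append(seen.setdefault(code, len(seen)))
        return labels, self._from_factorized(list(seen), self)


cat = ToyCategorical([0, 0, 2], categories=["a", "b", "c"])
labels, uniques = cat.factorize()
```

As in the docstring example for `pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])`, the unobserved category `'b'` stays in `uniques.categories` even though it is absent from `uniques.values`.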
27 changes: 10 additions & 17 deletions pandas/core/base.py
Expand Up @@ -2,6 +2,7 @@
Base and utility classes for pandas objects.
"""
import warnings
import textwrap
from pandas import compat
from pandas.compat import builtins
import numpy as np
Expand Down Expand Up @@ -1151,24 +1152,16 @@ def memory_usage(self, deep=False):
v += lib.memory_usage_of_objects(self.values)
return v

@Substitution(
values='', order='', size_hint='',
sort=textwrap.dedent("""\
sort : boolean, default False
Sort `uniques` and shuffle `labels` to maintain the
relationship.
"""))
@Appender(algorithms._shared_docs['factorize'])
def factorize(self, sort=False, na_sentinel=-1):
"""
Encode the object as an enumerated type or categorical variable
Parameters
----------
sort : boolean, default False
Sort by values
na_sentinel: int, default -1
Value to mark "not found"
Returns
-------
labels : the indexer to the original array
uniques : the unique Index
"""
from pandas.core.algorithms import factorize
return factorize(self, sort=sort, na_sentinel=na_sentinel)
return algorithms.factorize(self, sort=sort, na_sentinel=na_sentinel)

_shared_docs['searchsorted'] = (
"""Find indices where elements should be inserted to maintain order.
Expand Down
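Note on the shared docstring: both `pandas.factorize` and the `factorize` method in this file fill the same template from `_shared_docs['factorize']`, with the method passing empty strings for the parameters it does not accept. The sketch below shows the mechanism with plain `%` formatting. `shared_doc` is a hypothetical decorator written for this note; it is not the `Substitution`/`Appender` pair from `pandas.util._decorators`, and the template is abbreviated.

```python
from textwrap import dedent

_shared_docs = {}
_shared_docs["factorize"] = """\
Encode the object as an enumerated type or categorical variable.

Parameters
----------
%(values)s%(sort)s
na_sentinel : int, default -1
    Value to mark "not found".
"""


def shared_doc(template, **params):
    """Hypothetical decorator: fill the template and attach it as __doc__."""
    def decorator(func):
        func.__doc__ = template % params
        return func
    return decorator


_sort_doc = dedent("""\
    sort : bool, default False
        Sort `uniques` and shuffle `labels` to maintain the
        relationship.
    """)


@shared_doc(
    _shared_docs["factorize"],
    values=dedent("""\
        values : sequence
            A 1-D sequence.
        """),
    sort=_sort_doc,
)
def factorize(values, sort=False, na_sentinel=-1):
    raise NotImplementedError  # body omitted; only the docstring matters here


# The method form has no `values` parameter, so that slot is left empty.
@shared_doc(_shared_docs["factorize"], values="", sort=_sort_doc)
def series_factorize(self, sort=False, na_sentinel=-1):
    raise NotImplementedError
```

Keeping the templates in a dictionary, as the author prefers in the review thread, lets further shared docstrings be registered under new keys later.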