ENH/API: ExtensionArray.factorize #20361
Merged
64 commits
0ec3600 ENH: Sorting of ExtensionArrays (TomAugspurger)
4707273 REF: Split argsort into two parts (TomAugspurger)
b61fb8d Fixed docstring (TomAugspurger)
44b6d72 Remove _values_for_argsort (TomAugspurger)
5be3917 Revert "Remove _values_for_argsort" (TomAugspurger)
e474c20 Merge remote-tracking branch 'upstream/master' into fu1+sort (TomAugspurger)
c2578c3 Workaround Py2 (TomAugspurger)
b73e303 Indexer as array (TomAugspurger)
0db9e97 Fixed dtypes (TomAugspurger)
baf624c Fixed docstring (TomAugspurger)
ce92f7b Merge remote-tracking branch 'upstream/master' into fu1+sort (TomAugspurger)
8cbfc36 Merge remote-tracking branch 'upstream/master' into fu1+sort (TomAugspurger)
425fb2a Merge remote-tracking branch 'upstream/master' into fu1+sort (TomAugspurger)
7bbe796 Update docs (TomAugspurger)
31ed4c9 ENH/API: ExtensionArray.factorize (TomAugspurger)
434df7d fixup! ENH/API: ExtensionArray.factorize (TomAugspurger)
505ad44 REF: Changed ExtensionDtype inheritance (TomAugspurger)
77a10b6 Merge remote-tracking branch 'upstream/master' into ea-factorize-2 (TomAugspurger)
b59656f Fix factorize equivalence test (TomAugspurger)
201e029 Shared factorize doc (TomAugspurger)
9b0c2a9 Move to algorithms (TomAugspurger)
eb19488 BUG: py2 bug (TomAugspurger)
cbfee1a Typo, ref (TomAugspurger)
35a8977 Change name (TomAugspurger)
7efece2 Merge remote-tracking branch 'upstream/master' into fu1+sort (TomAugspurger)
ef8e6cb Fixed docs (TomAugspurger)
dd3bf1d Merge remote-tracking branch 'upstream/master' into ea-factorize-2 (TomAugspurger)
6a6034f Wording (TomAugspurger)
5c758aa Merge remote-tracking branch 'upstream/master' into fu1+sort (TomAugspurger)
5526398 fixup! Wording (TomAugspurger)
cd5c2db Merge remote-tracking branch 'upstream/master' into ea-factorize-2 (TomAugspurger)
d5e8198 Back to _values_for_argsort (TomAugspurger)
30941cb Example with _from_factorize (TomAugspurger)
3574273 Merge remote-tracking branch 'upstream/master' into ea-factorize-2 (TomAugspurger)
c776133 Unskip most JSON tests (TomAugspurger)
2a79315 Merge branch 'fu1+sort' into ea-factorize-2 (TomAugspurger)
6ca65f8 Overridable na_value too. (TomAugspurger)
bbedd8c Reverted sorting changes (TomAugspurger)
96ecab7 Remove a bit more argsort (TomAugspurger)
1010417 Merge remote-tracking branch 'upstream/master' into ea-factorize-2 (TomAugspurger)
c288d67 Mask values going into hashtables (TomAugspurger)
55c9e31 remove stale comment (TomAugspurger)
163bfa3 wip (TomAugspurger)
872c24a ENH: Parametrized NA sentinel for factorize (TomAugspurger)
3c18428 REF: Moved to get_labels (TomAugspurger)
703ab8a Remove python-level use_na_value (TomAugspurger)
ab32e0f REF: More cleanup (TomAugspurger)
62fa538 API: Make it non-public (TomAugspurger)
28fad50 Revert formatting changes in pxd (TomAugspurger)
8580754 linting (TomAugspurger)
cf14ee1 Handle bool (TomAugspurger)
8141131 Merge remote-tracking branch 'upstream/master' into parametrized-fact… (TomAugspurger)
a23d451 Specify dtypes (TomAugspurger)
b25f3d4 Remove unused variable. (TomAugspurger)
dfcda85 REF: Removed check_nulls (TomAugspurger)
eaff342 BUG: NaT for period (TomAugspurger)
c05c807 Merge remote-tracking branch 'upstream/master' into parametrized-fact… (TomAugspurger)
e786253 Other hashtable (TomAugspurger)
465d458 na_value_for_dtype PeriodDtype (TomAugspurger)
6f8036e Merge remote-tracking branch 'upstream/master' into ea-factorize-2 (TomAugspurger)
bca4cdf Merge branch 'parametrized-factorize-na-value' into ea-factorize-2+pa… (TomAugspurger)
69c3ea2 Use na_value (TomAugspurger)
fa8e221 Merge remote-tracking branch 'upstream/master' into ea-factorize-2 (TomAugspurger)
c06da3a Typing (TomAugspurger)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,6 +4,8 @@ | |
""" | ||
from __future__ import division | ||
from warnings import warn, catch_warnings | ||
from textwrap import dedent | ||
|
||
import numpy as np | ||
|
||
from pandas.core.dtypes.cast import ( | ||
|
@@ -34,7 +36,10 @@ | |
from pandas.core import common as com | ||
from pandas._libs import algos, lib, hashtable as htable | ||
from pandas._libs.tslib import iNaT | ||
from pandas.util._decorators import deprecate_kwarg | ||
from pandas.util._decorators import (Appender, Substitution, | ||
deprecate_kwarg) | ||
|
||
_shared_docs = {} | ||
|
||
|
||
# --------------- # | ||
|
@@ -146,10 +151,9 @@ def _reconstruct_data(values, dtype, original): | |
Returns | ||
------- | ||
Index for extension types, otherwise ndarray cast to dtype | ||
""" | ||
from pandas import Index | ||
if is_categorical_dtype(dtype): | ||
if is_extension_array_dtype(dtype): | ||
pass | ||
elif is_datetime64tz_dtype(dtype) or is_period_dtype(dtype): | ||
values = Index(original)._shallow_copy(values, name=None) | ||
|
@@ -469,32 +473,124 @@ def _factorize_array(values, na_sentinel=-1, size_hint=None, | |
return labels, uniques | ||
|
||
|
||
@deprecate_kwarg(old_arg_name='order', new_arg_name=None) | ||
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | ||
""" | ||
Encode input values as an enumerated type or categorical variable | ||
_shared_docs['factorize'] = """ | ||
Encode the object as an enumerated type or categorical variable. | ||
This method is useful for obtaining a numeric representation of an | ||
array when all that matters is identifying distinct values. `factorize` | ||
is available as both a top-level function :func:`pandas.factorize`, | ||
and as a method :meth:`Series.factorize` and :meth:`Index.factorize`. | ||
Parameters | ||
---------- | ||
values : Sequence | ||
ndarrays must be 1-D. Sequences that aren't pandas objects are | ||
coerced to ndarrays before factorization. | ||
sort : boolean, default False | ||
Sort by values | ||
%(values)s%(sort)s%(order)s | ||
na_sentinel : int, default -1 | ||
Value to mark "not found" | ||
size_hint : hint to the hashtable sizer | ||
Value to mark "not found". | ||
%(size_hint)s\ | ||
Returns | ||
------- | ||
labels : the indexer to the original array | ||
uniques : ndarray (1-d) or Index | ||
the unique values. Index is returned when passed values is Index or | ||
Series | ||
labels : ndarray | ||
An integer ndarray that's an indexer into `uniques`. | ||
``uniques.take(labels)`` will have the same values as `values`. | ||
uniques : ndarray, Index, or Categorical | ||
The unique valid values. When `values` is Categorical, `uniques` | ||
is a Categorical. When `values` is some other pandas object, an | ||
`Index` is returned. Otherwise, a 1-D ndarray is returned. | ||
.. note :: | ||
Even if there's a missing value in `values`, `uniques` will | ||
*not* contain an entry for it. | ||
See Also | ||
-------- | ||
pandas.cut : Discretize continuous-valued array. | ||
add |
||
pandas.unique : Find the unique values in an array. | ||
Examples | ||
-------- | ||
These examples all show factorize as a top-level method like | ||
``pd.factorize(values)``. The results are identical for methods like | ||
:meth:`Series.factorize`. | ||
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b']) | ||
>>> labels | ||
array([0, 0, 1, 2, 0]) | ||
>>> uniques | ||
array(['b', 'a', 'c'], dtype=object) | ||
With ``sort=True``, the `uniques` will be sorted, and `labels` will be | ||
shuffled so that the relationship is maintained. | ||
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True) | ||
>>> labels | ||
array([1, 1, 0, 2, 1]) | ||
>>> uniques | ||
array(['a', 'b', 'c'], dtype=object) | ||
Missing values are indicated in `labels` with `na_sentinel` | ||
(``-1`` by default). Note that missing values are never | ||
included in `uniques`. | ||
>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b']) | ||
>>> labels | ||
array([ 0, -1, 1, 2, 0]) | ||
>>> uniques | ||
array(['b', 'a', 'c'], dtype=object) | ||
note: an array of Periods will ignore sort as it returns an always sorted | ||
PeriodIndex. | ||
Thus far, we've only factorized lists (which are internally coerced to | ||
NumPy arrays). When factorizing pandas objects, the type of `uniques` | ||
will differ. For Categoricals, a `Categorical` is returned. | ||
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c']) | ||
>>> labels, uniques = pd.factorize(cat) | ||
>>> labels | ||
array([0, 0, 1]) | ||
>>> uniques | ||
[a, c] | ||
Categories (3, object): [a, b, c] | ||
Notice that ``'b'`` is in ``uniques.categories``, despite not being | ||
present in ``cat.values``. | ||
For all other pandas objects, an Index of the appropriate type is | ||
returned. | ||
>>> cat = pd.Series(['a', 'a', 'c']) | ||
>>> labels, uniques = pd.factorize(cat) | ||
>>> labels | ||
array([0, 0, 1]) | ||
>>> uniques | ||
Index(['a', 'c'], dtype='object') | ||
""" | ||
|
||
|
||
@Substitution( | ||
values=dedent("""\ | ||
values : sequence | ||
A 1-D sequence. Sequences that aren't pandas objects are | ||
coerced to ndarrays before factorization. | ||
"""), | ||
order=dedent("""\ | ||
order | ||
.. deprecated:: 0.23.0 | ||
This parameter has no effect and is deprecated. | ||
"""), | ||
sort=dedent("""\ | ||
sort : bool, default False | ||
Sort `uniques` and shuffle `labels` to maintain the | ||
relationship. | ||
"""), | ||
size_hint=dedent("""\ | ||
size_hint : int, optional | ||
Hint to the hashtable sizer. | ||
"""), | ||
) | ||
@Appender(_shared_docs['factorize']) | ||
@deprecate_kwarg(old_arg_name='order', new_arg_name=None) | ||
def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | ||
# Implementation notes: This method is responsible for 3 things | ||
# 1.) coercing data to array-like (ndarray, Index, extension array) | ||
# 2.) factorizing labels and uniques | ||
|
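The contract spelled out in the shared docstring above (labels index into uniques, missing values get `na_sentinel` and never appear in uniques, `sort=True` reorders uniques and remaps labels) can be illustrated with a minimal pure-Python sketch. This is not the pandas implementation, which relies on hashtables in `pandas._libs`; the name `factorize_sketch` and the choice to treat only `None` as missing are assumptions made for illustration.

```python
def factorize_sketch(values, sort=False, na_sentinel=-1):
    """Toy stand-in for the documented factorize contract (illustration only)."""
    positions = {}  # value -> index into uniques, in order of first appearance
    uniques = []
    codes = []      # None marks a missing slot until the final step

    for v in values:
        if v is None:  # assumption: only None counts as missing here
            codes.append(None)
            continue
        if v not in positions:
            positions[v] = len(uniques)
            uniques.append(v)
        codes.append(positions[v])

    if sort:
        # order[new_index] = old_index once uniques are sorted
        order = sorted(range(len(uniques)), key=lambda i: uniques[i])
        remap = {old: new for new, old in enumerate(order)}
        uniques = [uniques[i] for i in order]
        codes = [c if c is None else remap[c] for c in codes]

    # Missing values are marked with na_sentinel and are absent from uniques.
    labels = [na_sentinel if c is None else c for c in codes]
    return labels, uniques
```

Missing slots are tracked separately from the sentinel so that an unusual `na_sentinel` (for example `0`) cannot be confused with a real label during the sort remapping.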
@@ -507,9 +603,9 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | |
values = _ensure_arraylike(values) | ||
original = values | ||
|
||
if is_categorical_dtype(values): | ||
if is_extension_array_dtype(values): | ||
values = getattr(values, '_values', values) | ||
labels, uniques = values.factorize() | ||
labels, uniques = values.factorize(na_sentinel=na_sentinel) | ||
Note: this was a bug in #19938 where I forgot to pass this through. It's covered by our extension tests. |
||
dtype = original.dtype | ||
else: | ||
values, dtype, _ = _ensure_data(values) | ||
|
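The one-line fix in this hunk forwards `na_sentinel` to the array's own `factorize` method. A hedged sketch of why that matters follows; `ToyArray`, `dispatch_buggy`, and `dispatch_fixed` are hypothetical names invented for this example and are not pandas classes or functions.

```python
class ToyArray:
    """Hypothetical array-like with its own factorize (not a pandas class)."""

    def __init__(self, data):
        self.data = list(data)

    def factorize(self, na_sentinel=-1):
        seen = {}  # value -> label, in order of first appearance
        labels = []
        for v in self.data:
            if v is None:  # assumption: only None counts as missing
                labels.append(na_sentinel)
            else:
                labels.append(seen.setdefault(v, len(seen)))
        return labels, ToyArray(seen.keys())


def dispatch_buggy(values, na_sentinel=-1):
    # Drops the caller's sentinel, so the method silently uses its default.
    return values.factorize()


def dispatch_fixed(values, na_sentinel=-1):
    # Forwards the sentinel, mirroring the changed line in the diff.
    return values.factorize(na_sentinel=na_sentinel)
```

With `ToyArray(['a', None, 'b', 'a'])` and `na_sentinel=-99`, the buggy dispatcher still marks the missing entry with `-1`, while the fixed one marks it with `-99`.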
If it is only a single entry, can we make this a variable (eg `_shared_docstring_factorize`)?

I have a slight preference for keeping it as a dictionary, since it looks like the docstrings for `unique` and `value_counts` can be shared between ops and base. #20390