ENH/API: ExtensionArray.factorize #20361

Merged on Mar 27, 2018 (64 commits)
Changes from 1 commit
Commits
0ec3600
ENH: Sorting of ExtensionArrays
TomAugspurger Feb 19, 2018
4707273
REF: Split argsort into two parts
TomAugspurger Mar 2, 2018
b61fb8d
Fixed docstring
TomAugspurger Mar 2, 2018
44b6d72
Remove _values_for_argsort
TomAugspurger Mar 2, 2018
5be3917
Revert "Remove _values_for_argsort"
TomAugspurger Mar 2, 2018
e474c20
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 2, 2018
c2578c3
Workaround Py2
TomAugspurger Mar 2, 2018
b73e303
Indexer as array
TomAugspurger Mar 2, 2018
0db9e97
Fixed dtypes
TomAugspurger Mar 4, 2018
baf624c
Fixed docstring
TomAugspurger Mar 12, 2018
ce92f7b
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 12, 2018
8cbfc36
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 13, 2018
425fb2a
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 13, 2018
7bbe796
Update docs
TomAugspurger Mar 14, 2018
31ed4c9
ENH/API: ExtensionArray.factorize
TomAugspurger Mar 13, 2018
434df7d
fixup! ENH/API: ExtensionArray.factorize
TomAugspurger Mar 15, 2018
505ad44
REF: Changed ExtensionDtype inheritance
TomAugspurger Mar 15, 2018
77a10b6
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 16, 2018
b59656f
Fix factorize equivalence test
TomAugspurger Mar 16, 2018
201e029
Shared factorize doc
TomAugspurger Mar 16, 2018
9b0c2a9
Move to algorithms
TomAugspurger Mar 16, 2018
eb19488
BUG: py2 bug
TomAugspurger Mar 16, 2018
cbfee1a
Typo, ref
TomAugspurger Mar 16, 2018
35a8977
Change name
TomAugspurger Mar 17, 2018
7efece2
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 17, 2018
ef8e6cb
Fixed docs
TomAugspurger Mar 17, 2018
dd3bf1d
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 17, 2018
6a6034f
Wording
TomAugspurger Mar 17, 2018
5c758aa
Merge remote-tracking branch 'upstream/master' into fu1+sort
TomAugspurger Mar 18, 2018
5526398
fixup! Wording
TomAugspurger Mar 19, 2018
cd5c2db
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 19, 2018
d5e8198
Back to _values_for_argsort
TomAugspurger Mar 19, 2018
30941cb
Example with _from_factorize
TomAugspurger Mar 19, 2018
3574273
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 20, 2018
c776133
Unskip most JSON tests
TomAugspurger Mar 20, 2018
2a79315
Merge branch 'fu1+sort' into ea-factorize-2
TomAugspurger Mar 20, 2018
6ca65f8
Overridable na_value too.
TomAugspurger Mar 22, 2018
bbedd8c
Reverted sorting changes
TomAugspurger Mar 22, 2018
96ecab7
Remove a bit more argsort
TomAugspurger Mar 22, 2018
1010417
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 23, 2018
c288d67
Mask values going into hashtables
TomAugspurger Mar 23, 2018
55c9e31
remove stale comment
TomAugspurger Mar 23, 2018
163bfa3
wip
TomAugspurger Mar 23, 2018
872c24a
ENH: Parametrized NA sentinel for factorize
TomAugspurger Mar 23, 2018
3c18428
REF: Moved to get_labels
TomAugspurger Mar 23, 2018
703ab8a
Remove python-level use_na_value
TomAugspurger Mar 23, 2018
ab32e0f
REF: More cleanup
TomAugspurger Mar 24, 2018
62fa538
API: Make it non-public
TomAugspurger Mar 24, 2018
28fad50
Revert formatting changes in pxd
TomAugspurger Mar 24, 2018
8580754
linting
TomAugspurger Mar 24, 2018
cf14ee1
Handle bool
TomAugspurger Mar 24, 2018
8141131
Merge remote-tracking branch 'upstream/master' into parametrized-fact…
TomAugspurger Mar 24, 2018
a23d451
Specify dtypes
TomAugspurger Mar 24, 2018
b25f3d4
Remove unused variable.
TomAugspurger Mar 24, 2018
dfcda85
REF: Removed check_nulls
TomAugspurger Mar 25, 2018
eaff342
BUG: NaT for period
TomAugspurger Mar 26, 2018
c05c807
Merge remote-tracking branch 'upstream/master' into parametrized-fact…
TomAugspurger Mar 26, 2018
e786253
Other hashtable
TomAugspurger Mar 26, 2018
465d458
na_value_for_dtype PeriodDtype
TomAugspurger Mar 26, 2018
6f8036e
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 26, 2018
bca4cdf
Merge branch 'parametrized-factorize-na-value' into ea-factorize-2+pa…
TomAugspurger Mar 26, 2018
69c3ea2
Use na_value
TomAugspurger Mar 26, 2018
fa8e221
Merge remote-tracking branch 'upstream/master' into ea-factorize-2
TomAugspurger Mar 27, 2018
c06da3a
Typing
TomAugspurger Mar 27, 2018
7 changes: 3 additions & 4 deletions pandas/core/algorithms.py
@@ -146,10 +146,9 @@ def _reconstruct_data(values, dtype, original):
Returns
-------
Index for extension types, otherwise ndarray casted to dtype

"""
from pandas import Index
- if is_categorical_dtype(dtype):
+ if is_extension_array_dtype(dtype):
pass
elif is_datetime64tz_dtype(dtype) or is_period_dtype(dtype):
values = Index(original)._shallow_copy(values, name=None)
@@ -502,9 +501,9 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
values = _ensure_arraylike(values)
original = values

- if is_categorical_dtype(values):
+ if is_extension_array_dtype(values):
  values = getattr(values, '_values', values)
- labels, uniques = values.factorize()
+ labels, uniques = values.factorize(na_sentinel=na_sentinel)
Contributor Author

Note: this was a bug in #19938 where I forgot to pass this through. It's covered by our extension tests.

dtype = original.dtype
else:
values, dtype, _ = _ensure_data(values)
35 changes: 35 additions & 0 deletions pandas/core/arrays/base.py
@@ -248,6 +248,41 @@ def unique(self):
uniques = unique(self.astype(object))
return self._constructor_from_sequence(uniques)

def factorize(self, na_sentinel=-1):
Contributor

you need a sort=False arg here

Contributor Author

Do we want / need that? It complicates the implementation a bit. Any idea what it's actually used for?
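
For context on the question above: in the top-level `pandas.factorize`, `sort` decides whether `uniques` come back in first-appearance order or in sorted order, with the labels remapped to match. A minimal pure-Python sketch of doing that as a post-processing step on an unsorted result is shown here. `sort_factorized` is a hypothetical helper written for illustration (it is not a pandas function), and plain lists stand in for arrays.

```python
def sort_factorized(labels, uniques, na_sentinel=-1):
    """Sort uniques ascending and remap labels to match (hypothetical helper)."""
    # order[new_code] is the old code that moves to position new_code.
    order = sorted(range(len(uniques)), key=lambda code: uniques[code])
    # Invert the permutation: old code -> new code.
    old_to_new = {old: new for new, old in enumerate(order)}
    # Missing values keep the sentinel; everything else is remapped.
    new_labels = [
        na_sentinel if code == na_sentinel else old_to_new[code]
        for code in labels
    ]
    new_uniques = [uniques[code] for code in order]
    return new_labels, new_uniques


# Unsorted result for the values ["b", "b", NA, "a", "c"].
sorted_labels, sorted_uniques = sort_factorized([0, 0, -1, 1, 2], ["b", "a", "c"])
print(sorted_labels)   # [1, 1, -1, 0, 2]
print(sorted_uniques)  # ['a', 'b', 'c']
```

Because the sort only needs the unsorted `(labels, uniques)` pair, it can live in one shared place rather than in every array's `factorize`, which is one reason an extension array might not need its own `sort` argument.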

"""Encode the extension array as an enumerated type.
Contributor Author

Ah, need to merge this one too maybe. Will have to check on import order...


Parameters
----------
na_sentinel : int, default -1
Value to use in the `labels` array to indicate missing values.

Returns
-------
labels : ndarray
An integer NumPy array that's an indexer into the original
ExtensionArray
uniques : ExtensionArray
An ExtensionArray containing the unique values of `self`.

See Also
--------
pandas.factorize : top-level factorize method that dispatches here.

Notes
-----
:meth:`pandas.factorize` offers a `sort` keyword as well.
"""
from pandas.core.algorithms import _factorize_array
Member

We're OK with using this private API here?

Because extension authors might want to copy-paste this method and change the arr = self.astype(object) line? (How do you implement factorize in cyberpandas?)

Contributor Author

Quite similar to this. https://github.com/ContinuumIO/cyberpandas/blob/468644bcbdc9320a1a33b0df393d4fa4bef57dd7/cyberpandas/base.py#L72

In that case I think going to object dtypes is unavoidable, since there's no easy way to factorize a 2-D array, and I didn't want to write a new hashtable implementation :)
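
To illustrate why going through objects sidesteps the 2-D problem: each row can be turned into a single hashable object and then factorized like any scalar. The following is a pure-Python sketch with a hypothetical `factorize_rows` helper; it is not the cyberpandas implementation linked above, and the sample rows are made up.

```python
def factorize_rows(rows):
    """Factorize the rows of a 2-D sequence, one hashable tuple per row."""
    codes = {}   # row (as a tuple) -> integer code, in first-appearance order
    labels = []
    for row in rows:
        key = tuple(row)
        # setdefault assigns the next free code the first time a row is seen.
        labels.append(codes.setdefault(key, len(codes)))
    uniques = list(codes)  # dicts preserve insertion order
    return labels, uniques


rows = [[192, 168, 0, 1], [10, 0, 0, 1], [192, 168, 0, 1]]
row_labels, row_uniques = factorize_rows(rows)
print(row_labels)   # [0, 1, 0]
print(row_uniques)  # [(192, 168, 0, 1), (10, 0, 0, 1)]
```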

Contributor Author (@TomAugspurger, Mar 16, 2018)

w.r.t. using _factorize_array, I don't think it's avoidable.

We might consider making it public / semi-public (keep the _ to not confuse users, but point EA authors to it).


mask = self.isna()
arr = self.astype(object)
arr[mask] = np.nan
Contributor

shouldn't this be the self._na_value ?

Contributor Author

We removed that from the API. Anyway, that would be the type-specific NA value; we need an NA value that pandas knows about, which is just NaN or NaT.

Speaking of which, I don't really like how this is done right now. If your values_for_factorize are integers, this will force them to floating point. Will think more about it.


labels, uniques = _factorize_array(arr, check_nulls=True,
Member

we could also use pd.factorize here if we want to limit the use of internal APIs

Contributor Author

I was just trying that. Unfortunately, I have to pass check_nulls=True for categorical to work properly. It's typically False for integers. But perhaps that's a sign we're doing something improper :/

Member

Ah, yes, because there we still know what the original dtype was ...
Hmm, that maybe shows that we rather need a na_value keyword (just from a perspective of making pd.factorize generally usable, not necessarily in this PR).
In general, the default of na_value=None could be the val != val idiom, and you can override that with val == na_value with this keyword (e.g. to specify iNaT or -1 for integer data).
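
The `na_value` idea in the comment above can be sketched in pure Python as follows. This is only an illustration of the proposed semantics, not pandas code: `factorize_with_na_value` is a hypothetical name, lists stand in for arrays, and a dict stands in for the hashtable.

```python
def factorize_with_na_value(values, na_sentinel=-1, na_value=None):
    """Factorize values, treating na_value (or NaN-like values) as missing."""

    def is_missing(val):
        if na_value is None:
            # Default: the val != val idiom, true only for NaN-like values.
            return val != val
        # Override: an explicit marker such as -1 for integer data.
        return val == na_value

    codes = {}   # value -> integer code, in first-appearance order
    labels = []
    for val in values:
        if is_missing(val):
            labels.append(na_sentinel)
        else:
            labels.append(codes.setdefault(val, len(codes)))
    return labels, list(codes)


# Float data: NaN is detected by the default idiom.
print(factorize_with_na_value([1.0, float("nan"), 2.0, 1.0]))
# ([0, -1, 1, 0], [1.0, 2.0])

# Integer data: -1 is declared to mean "missing", so no cast to float is needed.
print(factorize_with_na_value([3, -1, 5, 3], na_value=-1))
# ([0, -1, 1, 0], [3, 5])
```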

Contributor Author

OK, this issue and the fact that UInt64Hashtable doesn't have a null value are making me come back around to the mask approach.

That seems like the only way we can ensure that we'll actually get things correct.
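
A minimal sketch of the mask approach mentioned above, assuming the caller supplies a boolean mask alongside the values. `factorize_masked` is a hypothetical pure-Python stand-in, not the hashtable code in this PR. The point it shows is that no value in the data has to be reserved as a null marker, which matters when every bit pattern is a valid value, as with unsigned 64-bit integers.

```python
def factorize_masked(values, mask, na_sentinel=-1):
    """Factorize values; positions where mask is True are treated as missing."""
    codes = {}   # value -> integer code, in first-appearance order
    labels = []
    for val, missing in zip(values, mask):
        if missing:
            # The value at a masked position is ignored entirely.
            labels.append(na_sentinel)
        else:
            labels.append(codes.setdefault(val, len(codes)))
    return labels, list(codes)


# 0 and 2**64 - 1 are both legitimate values; only the mask marks missing data.
values = [0, 2**64 - 1, 0, 7]
mask = [False, False, True, False]
masked_labels, masked_uniques = factorize_masked(values, mask)
print(masked_labels)   # [0, 1, -1, 2]
print(masked_uniques)  # [0, 18446744073709551615, 7]
```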

na_sentinel=na_sentinel)
uniques = self._constructor_from_sequence(uniques)
return labels, uniques

# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
3 changes: 3 additions & 0 deletions pandas/tests/extension/base/base.py
@@ -4,3 +4,6 @@
class BaseExtensionTests(object):
assert_series_equal = staticmethod(tm.assert_series_equal)
assert_frame_equal = staticmethod(tm.assert_frame_equal)
assert_extension_array_equal = staticmethod(
tm.assert_extension_array_equal
)
20 changes: 20 additions & 0 deletions pandas/tests/extension/base/methods.py
@@ -2,6 +2,7 @@
import numpy as np

import pandas as pd
import pandas.util.testing as tm

from .base import BaseExtensionTests

@@ -42,3 +43,22 @@ def test_unique(self, data, box, method):
assert len(result) == 1
assert isinstance(result, type(data))
assert result[0] == duplicated[0]

@pytest.mark.parametrize('na_sentinel', [-1, -2])
def test_factorize(self, data_for_grouping, na_sentinel):
labels, uniques = pd.factorize(data_for_grouping,
na_sentinel=na_sentinel)
expected_labels = np.array([0, 0, na_sentinel,
na_sentinel, 1, 1, 0, 2],
dtype='int64')
expected_uniques = data_for_grouping.take([0, 4, 7])

tm.assert_numpy_array_equal(labels, expected_labels)
self.assert_extension_array_equal(uniques, expected_uniques)

def test_factorize_equivalence(self, data_for_grouping):
l1, u1 = pd.factorize(data_for_grouping)
l2, u2 = pd.factorize(data_for_grouping)
Member

What is this test doing?

Contributor Author

Ha, I think I meant for one to be data_for_grouping.factorize()


tm.assert_numpy_array_equal(l1, l2)
self.assert_extension_array_equal(u1, u2)
5 changes: 5 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
@@ -34,6 +34,11 @@ def na_value():
return np.nan


@pytest.fixture
def data_for_grouping():
return Categorical(['a', 'a', None, None, 'b', 'b', 'a', 'c'])


class TestDtype(base.BaseDtypeTests):
pass

11 changes: 11 additions & 0 deletions pandas/tests/extension/conftest.py
@@ -46,3 +46,14 @@ def na_cmp():
def na_value():
"""The scalar missing value for this type. Default 'None'"""
return None


@pytest.fixture
def data_for_grouping():
"""Data for factorization, grouping, and unique tests.

Expected to be like [B, B, NA, NA, A, A, B, C]

Where A < B < C and NA is missing
"""
raise NotImplementedError
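
The fixture contract above, `[B, B, NA, NA, A, A, B, C]`, is what produces the expected labels `[0, 0, na_sentinel, na_sentinel, 1, 1, 0, 2]` and the `take([0, 4, 7])` uniques in `test_factorize`. A small pure-Python sketch makes the correspondence explicit; strings and `None` stand in for the array's values and its missing value, and `factorize_positions` is a hypothetical helper.

```python
NA = None  # stand-in for the array's missing value


def factorize_positions(values, na_sentinel=-1):
    """Return (labels, first_positions) in first-appearance order."""
    codes = {}            # value -> integer code
    first_positions = []  # index at which each unique value first appears
    labels = []
    for position, value in enumerate(values):
        if value is NA:
            labels.append(na_sentinel)
            continue
        if value not in codes:
            codes[value] = len(codes)
            first_positions.append(position)
        labels.append(codes[value])
    return labels, first_positions


data = ["B", "B", NA, NA, "A", "A", "B", "C"]
grouping_labels, grouping_positions = factorize_positions(data)
print(grouping_labels)     # [0, 0, -1, -1, 1, 1, 0, 2]
print(grouping_positions)  # [0, 4, 7]
```

Note that codes follow first appearance, not sort order: `B` gets code 0 even though `A < B`.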
9 changes: 9 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
@@ -35,6 +35,15 @@ def na_value():
return decimal.Decimal("NaN")


@pytest.fixture
def data_for_grouping():
b = decimal.Decimal('1.0')
a = decimal.Decimal('0.0')
c = decimal.Decimal('2.0')
na = decimal.Decimal('NaN')
return DecimalArray([b, b, na, na, a, a, b, c])


class TestDtype(base.BaseDtypeTests):
pass

16 changes: 16 additions & 0 deletions pandas/tests/extension/json/array.py
@@ -7,6 +7,7 @@

import numpy as np

import pandas as pd
from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.arrays import ExtensionArray

@@ -104,6 +105,21 @@ def _concat_same_type(cls, to_concat):
data = list(itertools.chain.from_iterable([x.data for x in to_concat]))
return cls(data)

def factorize(self, na_sentinel=-1):
frozen = tuple(tuple(x.items()) for x in self)
labels, uniques = pd.factorize(frozen)

# fixup NA
Contributor Author

This type of thing is going to be error prone. We expect that [B, NA, C, B] is coded as [0, -1, 1, 0], not [0, 1, 2, 0] (I ran into this with cyberpandas as well).

If library authors are using our tests, this should be caught. Otherwise, I'm not sure how to handle it. We could design a helper along the lines of def refactorize(labels, mask, uniques) that does something like this, but I'm not sure.
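
One possible shape for such a helper, as a pure-Python sketch: only the name and the `(labels, mask, uniques)` arguments come from the comment above, while the `na_sentinel` parameter and the body are assumptions made for illustration. It takes a naive result in which the missing value was given its own code and rewrites it so that missing values get the sentinel and drop out of `uniques`.

```python
def refactorize(labels, mask, uniques, na_sentinel=-1):
    """Recode a naive factorization so missing values use na_sentinel."""
    # Codes that the naive factorization handed to missing values.
    na_codes = {code for code, missing in zip(labels, mask) if missing}

    # Map each old code to the sentinel or to a compacted new code.
    remap = {}
    next_code = 0
    for code in range(len(uniques)):
        if code in na_codes:
            remap[code] = na_sentinel
        else:
            remap[code] = next_code
            next_code += 1

    new_labels = [remap[code] for code in labels]
    new_uniques = [u for code, u in enumerate(uniques) if code not in na_codes]
    return new_labels, new_uniques


# [B, NA, C, B] naively coded as [0, 1, 2, 0] with uniques [B, NA, C].
fixed_labels, fixed_uniques = refactorize(
    [0, 1, 2, 0], [False, True, False, False], ["B", None, "C"]
)
print(fixed_labels)   # [0, -1, 1, 0]
print(fixed_uniques)  # ['B', 'C']
```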

Contributor Author

Hmm apparently I messed this up, so the tests will fail. Fixing now.

Contributor Author

Fixed in 434df7d

Member

If we require this, I think we should add a stronger comment (maybe as a comment below the docstring) stating that we expect missing values not to be in the uniques.

What would go wrong if .factorize() on an extension array included its missing value in the uniques?

Contributor Author

Hmm probably not the end of the world, but it would violate our factorize (and eventually groupby) semantics. Until recently, we messed this up for Categorical. That didn't have downstream effects for groupby, since everything in groupby is special for categorical.

I'll make this clearer.

Contributor Author

Well, adding that simple note turned into a rabbit hole. I decided to merge the 3 docstrings

  • pandas.factorize
  • pandas.core.base.IndexOpsMixin.factorize
  • pandas.Categorical.factorize

if self.isna().any():
na_code = self.isna().argmax()

labels[labels == na_code] = na_sentinel
labels[labels > na_code] -= 1

uniques = JSONArray([collections.UserDict(x)
for x in uniques if x != ()])
return labels, uniques


def make_data():
# TODO: Use a regular dict. See _NDFrameIndexer._setitem_with_indexer
17 changes: 15 additions & 2 deletions pandas/tests/extension/json/test_json.py
@@ -39,6 +39,17 @@ def na_cmp():
return operator.eq


@pytest.fixture
def data_for_grouping():
return JSONArray([
{'b': 1}, {'b': 1},
{}, {},
{'a': 0, 'c': 2}, {'a': 0, 'c': 2},
{'b': 1},
{'c': 2},
])


class TestDtype(base.BaseDtypeTests):
pass

@@ -64,8 +75,10 @@ class TestMissing(base.BaseMissingTests):


class TestMethods(base.BaseMethodsTests):
- @pytest.mark.skip(reason="Unhashable")
- def test_value_counts(self, all_data, dropna):
+ unhashable = pytest.mark.skip(reason="Unhashable")
+
+ @unhashable
+ def test_factorize(self):
pass


27 changes: 27 additions & 0 deletions pandas/util/testing.py
@@ -20,6 +20,7 @@
import numpy as np

import pandas as pd
from pandas.core.arrays.base import ExtensionArray
from pandas.core.dtypes.missing import array_equivalent
from pandas.core.dtypes.common import (
is_datetimelike_v_numeric,
@@ -1083,6 +1084,32 @@ def _raise(left, right, err_msg):
return True


def assert_extension_array_equal(left, right):
"""Check that left and right ExtensionArrays are equal.

Parameters
----------
left, right : ExtensionArray
The two arrays to compare

Notes
-----
Missing values are checked separately from valid values.
A mask of missing values is computed for each and checked to match.
The remaining all-valid values are cast to object dtype and checked.
"""
Member

Why was this needed only now? (Weren't the missing values the reason you added assert_frame_equal et al. as overridable class methods?)

Contributor Author (@TomAugspurger, Mar 16, 2018)

factorize is I think the first case where we have a public method returning an ExtensionArray (the second return value).

I'll see if any old tests can make use of this.

Contributor Author

I don't see any other cases where we could apply assert_extension_array_equal.

The closest would be the __getitem__ tests, but in that case we don't have an expected array to compare to, since we're testing getitem / take / etc.

assert isinstance(left, ExtensionArray)
assert left.dtype == right.dtype
left_na = left.isna()
right_na = right.isna()
assert_numpy_array_equal(left_na, right_na)

left_valid = left[~left_na].astype(object)
right_valid = right[~right_na].astype(object)

assert_numpy_array_equal(left_valid, right_valid)
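
The two-step comparison described in the Notes section (compare the missing-value masks first, then compare only the valid values) can be sketched without pandas. Lists of `decimal.Decimal` stand in for extension arrays here, and both `is_decimal_na` and `assert_arrays_equal_with_na` are hypothetical names. The separation matters because NaN-like missing values do not compare equal to themselves, so an element-wise equality check over the whole array would fail even for identical inputs.

```python
from decimal import Decimal


def is_decimal_na(value):
    """Missing-value test for the Decimal stand-in data."""
    return value.is_nan()


def assert_arrays_equal_with_na(left, right, isna):
    """Check that two sequences match, handling missing values separately."""
    # Step 1: missing values must sit at the same positions on both sides.
    left_na = [isna(x) for x in left]
    right_na = [isna(x) for x in right]
    assert left_na == right_na, "missing-value masks differ"

    # Step 2: the remaining, all-valid values must match element by element.
    left_valid = [x for x, na in zip(left, left_na) if not na]
    right_valid = [x for x, na in zip(right, right_na) if not na]
    assert left_valid == right_valid, "valid values differ"


left = [Decimal("1.0"), Decimal("NaN"), Decimal("2.0")]
right = [Decimal("1.0"), Decimal("NaN"), Decimal("2.0")]
assert_arrays_equal_with_na(left, right, isna=is_decimal_na)  # passes silently
```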


# This could be refactored to use the NDFrame.equals method
def assert_series_equal(left, right, check_dtype=True,
check_index_type='equiv',