ENH/API: ExtensionArray.factorize #20361

TomAugspurger · 2018-03-15T12:10:41Z

Adds factorize to the interface for ExtensionArray, with a default
implementation.

This is a stepping stone to groupby.

Categorical already has a custom implementation.

Decimal reuses the default.

JSON can't use the default since an array of dictionaries isn't hashable, so we use a custom implementation.
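To illustrate the hashability point, here is a minimal pure-Python sketch. Plain dicts stand in for the elements of a JSON-like array, and the names `records`, `frozen`, and `codes` are illustrative only, not taken from the PR. Note that `tuple(d.items())` depends on insertion order, so two dicts with the same content but different key order would freeze to different tuples.

```python
# Illustrative only: plain dicts stand in for the elements of a JSON-like array.
records = [{"a": 1}, {"b": 2}, {"a": 1}]

# A dict cannot be used as a hashtable key.
try:
    hash(records[0])
    dict_is_hashable = True
except TypeError:
    dict_is_hashable = False

# Freezing each dict into a tuple of (key, value) pairs makes it hashable,
# provided the values themselves are hashable.
frozen = [tuple(d.items()) for d in records]

# Assign each distinct frozen record the next integer code, in first-seen order.
seen = {}
codes = [seen.setdefault(key, len(seen)) for key in frozen]
```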

TomAugspurger · 2018-03-15T12:13:39Z

pandas/tests/extension/json/array.py

+        frozen = tuple(tuple(x.items()) for x in self)
+        labels, uniques = pd.factorize(frozen)
+
+        # fixup NA


This type of thing is going to be error prone. We expect that [B, NA, C, B] is coded as [0, -1, 1, 0], not [0, 1, 2, 0] (I ran into this with cyberpandas as well).

If library authors are using our tests, this should be caught. Otherwise, I'm not sure how to handle it. We could design a helper along the lines of def refactorize(labels, mask, uniques) that does something like this, but I'm not sure.
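A rough sketch of what such a helper might look like, in pure Python. The `refactorize` name and signature come from the comment above, but the body is only an assumption about the intended behaviour: drop the missing value from the uniques and recode it as the sentinel. It also assumes missing values have their own code, not shared with any valid value.

```python
def refactorize(labels, mask, uniques, na_sentinel=-1):
    """Hypothetical helper: drop missing values from ``uniques`` and recode.

    ``labels[i]`` indexes into ``uniques``; ``mask[i]`` is True where the
    original value at position ``i`` was missing.
    """
    # Codes that are only reached through a missing value.
    na_codes = {label for label, is_na in zip(labels, mask) if is_na}

    # Map each surviving old code to its new, compacted code.
    remap = {}
    new_uniques = []
    for old_code, value in enumerate(uniques):
        if old_code in na_codes:
            continue
        remap[old_code] = len(new_uniques)
        new_uniques.append(value)

    new_labels = [
        na_sentinel if is_na else remap[label]
        for label, is_na in zip(labels, mask)
    ]
    return new_labels, new_uniques


# [B, NA, C, B] naively coded as [0, 1, 2, 0], with NA left in the uniques.
labels, uniques = refactorize(
    [0, 1, 2, 0], [False, True, False, False], ["B", None, "C"]
)
```

With this input the labels become `[0, -1, 1, 0]` and the uniques become `["B", "C"]`, which is the coding described above.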

Hmm apparently I messed this up, so the tests will fail. Fixing now.

Fixed in 434df7d

If we require this, I think we should add a stronger note (maybe as a comment below the docstring) stating that we expect missing values not to be in the uniques.

What would go wrong if .factorize() on an extension array would include its missing value in the uniques?

Hmm probably not the end of the world, but it would violate our factorize (and eventually groupby) semantics. Until recently, we messed this up for Categorical. That didn't have downstream effects for groupby, since everything in groupby is special for categorical.

I'll make this clearer.

Well, adding that simple note turned into a rabbit hole. I decided to merge the 3 docstrings:

pandas.factorize

pandas.core.base.IndexOpsMixin.factorize

pandas.Categorical.factorize

TomAugspurger · 2018-03-15T12:23:06Z

Marking this as a WIP for now. For JSONArray, the value_counts test is failing. I'd like to see if we can reuse factorize in value_counts. In principle, counting the labels and doing a uniques.take on the index of the counts should work.
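The idea can be sketched in pure Python as follows. This is only an illustration of "count the labels, then take from the uniques"; the function name `value_counts_from_factorize` is hypothetical, and plain lists stand in for the labels array and the uniques.

```python
from collections import Counter


def value_counts_from_factorize(labels, uniques, na_sentinel=-1):
    # Count how often each code occurs, ignoring the NA sentinel.
    counts = Counter(label for label in labels if label != na_sentinel)
    # most_common() orders by count, descending.
    ordered = counts.most_common()
    # "Take" the unique value at each counted code.
    return [(uniques[code], n) for code, n in ordered]


# Labels and uniques as factorize would return them for [B, NA, C, B].
result = value_counts_from_factorize([0, -1, 1, 0], ["B", "C"])
```

Since only the integer labels are counted, the values themselves never need to be hashable, which is what makes this attractive for JSONArray.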

TomAugspurger · 2018-03-15T13:54:49Z

The failures are caused by a "bug" in is_extension_array_dtype. #20363

`is_extension_array_dtype(dtype)` was incorrect for dtypes that haven't implemented the new interface yet. This is because they indirectly subclassed ExtensionDtype.

This PR changes the hierarchy so that PandasExtensionDtype doesn't subclass ExtensionDtype. As we implement the interface, like Categorical, we'll add ExtensionDtype as a base class.

Before:

```
DatetimeTZDtype <- PandasExtensionDtype <- ExtensionDtype (wrong)
CategoricalDtype <- PandasExtensionDtype <- ExtensionDtype (right)

After:

DatetimeTZDtype <- PandasExtensionDtype
                                        \
                                         - _DtypeOpsMixin
                                        /
ExtensionDtype ------

CategoricalDtype - PandasExtensionDtype -
                 \                       \
                  \                       -_DtypeOpsMixin
                   \                     /
                    ExtensionDtype -------
```

Once all our extension dtypes have implemented the interface we can go back to the simple, linear inheritance structure.

TomAugspurger · 2018-03-16T14:14:47Z

pandas/core/algorithms.py

        values = getattr(values, '_values', values)
-        labels, uniques = values.factorize()
+        labels, uniques = values.factorize(na_sentinel=na_sentinel)


Note: this was a bug in #19938 where I forgot to pass this through. It's covered by our extension tests.

TomAugspurger · 2018-03-16T14:15:55Z

This should pass now.

jorisvandenbossche · 2018-03-16T18:08:11Z

pandas/core/arrays/base.py

+        -----
+        :meth:`pandas.factorize` offers a `sort` keyword as well.
+        """
+        from pandas.core.algorithms import _factorize_array


We're OK with using this private API here?

Because extension authors might want to copy paste this method and change the arr = self.astype(object) line? (how do you implement factorize in cyberpandas?)

Quite similar to this. https://github.com/ContinuumIO/cyberpandas/blob/468644bcbdc9320a1a33b0df393d4fa4bef57dd7/cyberpandas/base.py#L72

In that case I think going to object dtypes is unavoidable, since there's no easy way to factorize a 2-D array, and I didn't want to write a new hashtable implementation :)

w.r.t. using _factorize_array, I don't think it's avoidable.

We might consider making it public / semi-public (keep the _ to not confuse users, but point EA authors to it).

jorisvandenbossche · 2018-03-16T18:08:34Z

pandas/tests/extension/base/methods.py

+
+    def test_factorize_equivalence(self, data_for_grouping):
+        l1, u1 = pd.factorize(data_for_grouping)
+        l2, u2 = pd.factorize(data_for_grouping)


What is this test doing?

Ha, I think I meant for one to be data_for_grouping.factorize()

jorisvandenbossche · 2018-03-16T18:10:38Z

pandas/tests/extension/json/array.py

+        frozen = tuple(tuple(x.items()) for x in self)
+        labels, uniques = pd.factorize(frozen)
+
+        # fixup NA


If we require this, I think we should add a stronger note (maybe as a comment below the docstring) stating that we expect missing values not to be in the uniques.

What would go wrong if .factorize() on an extension array would include its missing value in the uniques?

jorisvandenbossche · 2018-03-16T18:12:39Z

pandas/util/testing.py

+    Missing values are checked separately from valid values.
+    A mask of missing values is computed for each and checked to match.
+    The remaining all-valid values are cast to object dtype and checked.
+    """


Why was this needed only now? (weren't the missing values the reason you added assert_frame_equal et al as overridable class methods?)

factorize is I think the first case where we have a public method returning an ExtensionArray (the second return value).

I'll see if any old tests can make use of this.

I don't see any other cases where we could apply assert_extension_array_equal.

The closest would be the __getitem__ tests, but in that case we don't have an expected array to compare to, since we're testing getitem / take / etc.

TomAugspurger · 2018-03-24T11:59:45Z

@jreback you can ignore all the back and forth between Joris and me yesterday about mask. I think #20473 will make things nice for EA authors. They'll just need to specify:

The array to be used for factorization
The scalar in their array to be considered NA.

Some examples:

default: self.astype(object), np.nan
Categorical: self.codes.astype(int64), -1
MACArray: self.data, 0 (this is in cyberpandas. A uint64 array, where 0 is NA)
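To make the "array plus NA scalar" idea concrete, here is a pure-Python sketch. The function `factorize_values` is hypothetical and uses a dict where pandas would use its hashtables; plain lists stand in for the arrays. The only subtlety is that NaN never compares equal to itself, so it needs a dedicated check.

```python
import math


def _is_na(value, na_value):
    # NaN never compares equal to itself, so it needs a dedicated check.
    if isinstance(na_value, float) and math.isnan(na_value):
        return isinstance(value, float) and math.isnan(value)
    return value == na_value


def factorize_values(values, na_value, na_sentinel=-1):
    codes, uniques, seen = [], [], {}
    for value in values:
        if _is_na(value, na_value):
            # Missing values get the sentinel and never enter the uniques.
            codes.append(na_sentinel)
            continue
        if value not in seen:
            seen[value] = len(uniques)
            uniques.append(value)
        codes.append(seen[value])
    return codes, uniques


# MACArray-style: unsigned integers where 0 is treated as NA.
mac_codes, mac_uniques = factorize_values([5, 0, 9, 5], na_value=0)

# Default-style: objects where NaN is treated as NA.
obj_codes, obj_uniques = factorize_values(
    ["b", float("nan"), "c", "b"], na_value=float("nan")
)
```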

jreback · 2018-03-24T22:22:33Z

ok let's work on #20473 first then; that is exactly the kind of thing I would like to see, e.g. moving around pandas internals to make them more friendly by essentially pushing things down rather than doing them solely in the EA.

TomAugspurger · 2018-03-27T11:23:17Z

Updated on top of the parametrized NA value. Things look relatively clean now I think (most of the diff is due to moving docstrings to a shared spot). In terms of API:

def _values_for_factorize(self) -> Tuple[ndarray, Any]:
    # array to factorize and na_value
    ...

@classmethod
@abstractmethod
def _from_factorized(cls, values, original) -> ExtensionArray:
    # Reconstruct uniques from values & original
    ...

This is adequate for all our examples in pandas, including JSONArray which isn't otherwise factorizable (non-hashable). It also works well for IPArray and MACArray (Uint64 with 0 as the NA value).

@jorisvandenbossche do you think this will work well for geopandas? Can you have missing geometries?

The categorical tests are an example of using original in _from_factorized, as we copy over the unobserved categories to the uniques dtype.
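As a rough sketch of how these two hooks might fit together, here is a toy array class in pure Python. `ToyMACArray`, its `label` attribute, and the dict-based `factorize` body are all assumptions made for illustration; they are not the pandas or cyberpandas implementations. The `label` attribute plays the role of metadata that has to be copied from `original` onto the uniques.

```python
class ToyMACArray:
    """Toy stand-in for an extension array where 0 means missing."""

    def __init__(self, data, label="mac"):
        self.data = list(data)
        self.label = label  # metadata that should survive factorization

    def _values_for_factorize(self):
        # The values to hash, and the scalar that marks a missing value.
        return self.data, 0

    @classmethod
    def _from_factorized(cls, values, original):
        # Rebuild an array of uniques, copying metadata from the original.
        return cls(values, label=original.label)

    def factorize(self, na_sentinel=-1):
        values, na_value = self._values_for_factorize()
        codes, uniques, seen = [], [], {}
        for value in values:
            if value == na_value:
                codes.append(na_sentinel)
                continue
            if value not in seen:
                seen[value] = len(uniques)
                uniques.append(value)
            codes.append(seen[value])
        return codes, self._from_factorized(uniques, self)


arr = ToyMACArray([7, 0, 3, 7], label="office")
codes, uniques = arr.factorize()
```

The subclass only supplies the two small hooks; the generic loop lives in one place, which is the "pushing things down" design discussed above.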

jorisvandenbossche · 2018-03-27T11:56:51Z

> @jorisvandenbossche do you think this will work well for geopandas? Can you have missing geometries?

Yes, that should be fine. I am not fully sure yet, but I think I will use a binary blob for each geometry ("well known binary" format), which will give an object array of bytes, and with eg None as missing value. So this should work (that should already work with pd.factorize itself without the new infrastructure I think).
I will also need the original to add back metadata about the coordinate reference system.

TomAugspurger · 2018-03-27T11:59:38Z

From Python your array of pointers looks like Int64s, right? Could those be factorized? I think it should work, though I haven't thought it through fully.

jorisvandenbossche · 2018-03-27T12:05:16Z

No, because two separate Points (separate objects in C) with the same coordinates will have different pointers. So the pointer information is not necessarily relevant for factorize.
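A small pure-Python sketch of the difference between factorizing by identity and by value. The `Point` class and the `to_blob` encoding are hypothetical stand-ins (a fixed `struct` layout, not real "well known binary"), and `id()` plays the role of the pointer.

```python
import struct


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


def to_blob(point):
    # Hypothetical stand-in for a binary encoding; missing stays None.
    if point is None:
        return None
    return struct.pack("<dd", point.x, point.y)


def factorize_keys(keys, na_sentinel=-1):
    codes, seen = [], {}
    for key in keys:
        if key is None:
            codes.append(na_sentinel)
        else:
            # New keys get the next code; known keys reuse theirs.
            codes.append(seen.setdefault(key, len(seen)))
    return codes


points = [Point(1.0, 2.0), None, Point(3.0, 4.0), Point(1.0, 2.0)]

# Keyed on object identity: equal points look like different values.
by_identity = factorize_keys([None if p is None else id(p) for p in points])

# Keyed on the encoded coordinates: equal points share a code.
by_value = factorize_keys([to_blob(p) for p in points])
```

Keyed on identity, the first and last points get different codes even though their coordinates match; keyed on the bytes, they share code 0.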

TomAugspurger · 2018-03-27T12:52:12Z

Understood.

CI is all green.

jreback · 2018-03-27T15:40:54Z

thanks @TomAugspurger

TomAugspurger added 15 commits March 1, 2018 14:45

ENH: Sorting of ExtensionArrays

0ec3600

This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort

REF: Split argsort into two parts

4707273

Fixed docstring

b61fb8d

Remove _values_for_argsort

44b6d72

Revert "Remove _values_for_argsort"

5be3917

This reverts commit 44b6d72.

Merge remote-tracking branch 'upstream/master' into fu1+sort

e474c20

Workaround Py2

c2578c3

Indexer as array

b73e303

Fixed dtypes

0db9e97

Fixed docstring

baf624c

Merge remote-tracking branch 'upstream/master' into fu1+sort

ce92f7b

Merge remote-tracking branch 'upstream/master' into fu1+sort

8cbfc36

Merge remote-tracking branch 'upstream/master' into fu1+sort

425fb2a

Update docs

7bbe796

ENH/API: ExtensionArray.factorize

31ed4c9

Adds factorize to the interface for ExtensionArray, with a default implementation. This is a stepping stone to groupby.

TomAugspurger added the API Design, Algos (non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), and ExtensionArray (extending pandas with custom dtypes or arrays) labels Mar 15, 2018

fixup! ENH/API: ExtensionArray.factorize

434df7d

TomAugspurger changed the title ~~ENH/API: ExtensionArray.factorize~~ [WIP]ENH/API: ExtensionArray.factorize Mar 15, 2018

TomAugspurger added 2 commits March 15, 2018 08:59

Merge remote-tracking branch 'upstream/master' into ea-factorize-2

77a10b6

TomAugspurger mentioned this pull request Mar 16, 2018

ExtensionArray meta-issue #19696

Closed

15 tasks

Fix factorize equivalence test

b59656f

TomAugspurger mentioned this pull request Mar 24, 2018

Parametrized NA sentinel for factorize #20473

Merged

TomAugspurger added 5 commits March 24, 2018 07:00

linting

8580754

Handle bool

cf14ee1

Merge remote-tracking branch 'upstream/master' into parametrized-fact…

8141131

…orize-na-value

Specify dtypes

a23d451

Remove unused variable.

b25f3d4

Added PyObject hashtable test

TomAugspurger added 10 commits March 24, 2018 19:09

REF: Removed check_nulls

dfcda85

BUG: NaT for period

eaff342

Merge remote-tracking branch 'upstream/master' into parametrized-fact…

c05c807

…orize-na-value

Other hashtable

e786253

na_value_for_dtype PeriodDtype

465d458

Merge remote-tracking branch 'upstream/master' into ea-factorize-2

6f8036e

Merge branch 'parametrized-factorize-na-value' into ea-factorize-2+pa…

bca4cdf

…rametrized

Use na_value

69c3ea2

Merge remote-tracking branch 'upstream/master' into ea-factorize-2

fa8e221

Typing

c06da3a

Spacing Docs

jreback merged commit 766a480 into pandas-dev:master Mar 27, 2018

TomAugspurger deleted the ea-factorize-2 branch March 27, 2018 16:06

javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018

ENH/API: ExtensionArray.factorize (pandas-dev#20361)

b0143e5

dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018

ENH/API: ExtensionArray.factorize (pandas-dev#20361)

24977f0

kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018

ENH/API: ExtensionArray.factorize (pandas-dev#20361)

7128665

TomAugspurger mentioned this pull request Mar 12, 2020

EA: revisit interface #32586

Closed
