-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH/API: ExtensionArray.factorize #20361
Conversation
This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort
This reverts commit 44b6d72.
Adds factorize to the interface for ExtensionArray, with a default implementation. This is a stepping stone to groupby.
pandas/tests/extension/json/array.py
Outdated
frozen = tuple(tuple(x.items()) for x in self) | ||
labels, uniques = pd.factorize(frozen) | ||
|
||
# fixup NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type of thing is going to be error prone. We expect that [B, NA, C, B]
is coded as [0, -1, 1, 0]
, not [0, 1, 2, 0]
(I ran into this with cyberpandas as well).
If library authors are using our tests, this should be caught. Otherwise, I'm not sure how to handle it. We could design a method like around def refactorize(labels, mask, uniques)
that does something like this, but I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm apparently I messed this up, so the tests will fail. Fixing now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 434df7d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we require this, I think we should put a stronger comment (maybe as a comment below the docstring) about that we expect that missing values are not in the uniques.
What would go wrong if .factorize()
on an extension array would include its missing value in the uniques?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm probably not the end of the world, but it would violate our factorize (and eventually groupby) semantics. Until recently, we messed this up for Categorical. That didn't have downstream effects for groupby, since everything in groupby is special for categorical.
I'll make this clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well adding that simple note turned into a rabit hole. I decided to merge the 3 docstrings
- pandas.factorize
- pandas.core.base.IndexOpsMixin.factorize
- pandas.Categorical.factorize
Marking this as a WIP for now. For JSONArray, the |
The failures are caused by a "bug" in |
`is_extension_array_dtype(dtype)` was incorrect for dtypes that haven't implemented the new interface yet. This is because they indirectly subclassed ExtensionDtype. This PR changes the hierarchy so that PandasExtensionDtype doesn't subclass ExtensionDtype. As we implement the interface, like Categorical, we'll add ExtensionDtype as a base class. Before: ``` DatetimeTZDtype <- PandasExtensionDtype <- ExtensionDtype (wrong) CategoricalDtype <- PandasExtensionDtype <- ExtensionDtype (right) After: DatetimeTZDtype <- PandasExtensionDtype \ - _DtypeOpsMixin / ExtensionDtype ------ CategoricalDtype - PandasExtensionDtype - \ \ \ -_DtypeOpsMixin \ / ExtensionDtype ------- ``` Once all our extension dtypes have implemented the interface we can go back to the simple, linear inheritance structure.
values = getattr(values, '_values', values) | ||
labels, uniques = values.factorize() | ||
labels, uniques = values.factorize(na_sentinel=na_sentinel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this was a bug in #19938 where I forgot to pass this through. It's covered by our extension tests.
This should pass now. |
pandas/core/arrays/base.py
Outdated
----- | ||
:meth:`pandas.factorize` offers a `sort` keyword as well. | ||
""" | ||
from pandas.core.algorithms import _factorize_array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're OK with using this private API here?
Because an extension authors might want to copy paste this method and change the arr = self.astype(object)
line? (how do you implement factorize
in cyberpandas?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quite similar to this. https://github.com/ContinuumIO/cyberpandas/blob/468644bcbdc9320a1a33b0df393d4fa4bef57dd7/cyberpandas/base.py#L72
In that case I think going to object dtypes is unavoidable, since there's no easy way to factorize a 2-D array, and I didn't want to write a new hashtable implementation :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
w.r.t. using _factorize_array
, I don't think it's avoidable.
We might consider making it public / semi-public (keep the _
to not confuse users, but point EA authors to it).
|
||
def test_factorize_equivalence(self, data_for_grouping): | ||
l1, u1 = pd.factorize(data_for_grouping) | ||
l2, u2 = pd.factorize(data_for_grouping) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this test doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, I think I meant for one to be data_for_grouping.factorize()
pandas/tests/extension/json/array.py
Outdated
frozen = tuple(tuple(x.items()) for x in self) | ||
labels, uniques = pd.factorize(frozen) | ||
|
||
# fixup NA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we require this, I think we should put a stronger comment (maybe as a comment below the docstring) about that we expect that missing values are not in the uniques.
What would go wrong if .factorize()
on an extension array would include its missing value in the uniques?
pandas/util/testing.py
Outdated
Missing values are checked separately from valid values. | ||
A mask of missing values is computed for each and checked to match. | ||
The remaining all-valid values are cast to object dtype and checked. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this needed only now? (wasn't the missing values the reason you added assert_frame_equal
et al as overridable class methods?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factorize
is I think the first case where we have a public method returning an ExtensionArray (the second return value).
I'll see if any old tests can make use of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any other cases where we could apply assert_extension_array_equal
.
The closes would be the __getitem__
tests, but in that case we don't have an expected array to compare to, since we're testing getitem / take / etc.
@jreback you can ignore all the back and forth between Joris and I yesterday about
Some examples:
|
Added PyObject hashtable test
ok let's work on #20473 first then, that is exactly the kind of things would like to see. e.g. moving around pandas internals to make more friendly by essentially pushing things down rather than doing them solely in the EA. |
Updated on top of the parametrized NA value. Things look relatively clean now I think (most of the diff is due to moving docstrings to a shared spot). In terms of API: def _values_for_argsort(): -> Tuple[ndarray, Any]
# array to factorize and na_value
@classmethod
@abstractmethod
def _from_factorized(cls, values, original): -> ExtensionArray
# Reconstruct uniques from values & original This is adequate for all our examples in pandas, including @jorisvandenbossche do you think this will work well for geopandas? Can you have missing geometries? The categorical tests are an example of using |
Yes, that should be fine. I am not fully sure yet, but I think I will use a binary blob for each geometry ("well known binary" format), which will give an object array of bytes, and with eg None as missing value. So this should work (that should already work with |
From Python your array of pointers look like Int64s, right? Could those be factorized? I think it should work, though I haven't thought it through fully. |
No, because two separate Points (separate objects in C) but with same coordinates, will have different pointer. So the pointer information is not necessarily relevant for factorize. |
Understood. CI is all green. |
thanks @TomAugspurger |
Adds factorize to the interface for ExtensionArray, with a default
implementation.
This is a stepping stone to groupby.
Categorical already has a custom implementation.
Decimal reuses the default.
JSON can't use the default since an array of dictionaries isn't hashable, so we use a custom implementation.