ENH/API: ExtensionArray.factorize #20361

Merged: 64 commits, Mar 27, 2018

TomAugspurger
Contributor

Adds factorize to the interface for ExtensionArray, with a default
implementation.

This is a stepping stone to groupby.

Categorical already has a custom implementation.

Decimal reuses the default.

JSON can't use the default since an array of dictionaries isn't hashable, so we use a custom implementation.
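A minimal sketch of why freezing each dictionary helps, written in plain Python rather than taken from the PR's JSONArray code. `factorize_dicts` is a hypothetical helper, and treating an empty dict as the missing value is an assumption made for illustration.

```python
def factorize_dicts(records, na_sentinel=-1):
    """Factorize a list of dicts by freezing each one into a hashable tuple."""
    codes = {}   # frozen record -> integer code, in order of first appearance
    labels = []
    for record in records:
        if not record:
            # Assumption: an empty dict plays the role of NA.
            labels.append(na_sentinel)
            continue
        key = tuple(record.items())  # dicts are unhashable, tuples of items are not
        labels.append(codes.setdefault(key, len(codes)))
    uniques = [dict(key) for key in codes]
    return labels, uniques


labels, uniques = factorize_dicts([{"a": 1}, {}, {"b": 2}, {"a": 1}])
```

Note that `tuple(record.items())` is order-sensitive and requires hashable values, so this sketch treats `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` as different keys.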

@TomAugspurger added the API Design, Algos, and ExtensionArray labels Mar 15, 2018
frozen = tuple(tuple(x.items()) for x in self)
labels, uniques = pd.factorize(frozen)

# fixup NA
Contributor Author

This type of thing is going to be error prone. We expect that [B, NA, C, B] is coded as [0, -1, 1, 0], not [0, 1, 2, 0] (I ran into this with cyberpandas as well).

If library authors are using our tests, this should be caught. Otherwise, I'm not sure how to handle it. We could design a helper along the lines of def refactorize(labels, mask, uniques) that does this fixup, but I'm not sure.
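One possible shape for such a helper, as a sketch only. Nothing in this thread shows that `refactorize` was ever added to pandas; the body below is a hypothetical NumPy implementation of the fixup described above, which recodes NA positions to the sentinel and drops the NA entry from the uniques.

```python
import numpy as np


def refactorize(labels, mask, uniques, na_sentinel=-1):
    """Hypothetical fixup: give masked (NA) positions na_sentinel and
    remove the NA entry from uniques, renumbering the other codes."""
    labels = np.asarray(labels, dtype=np.intp)
    mask = np.asarray(mask, dtype=bool)
    uniques = np.asarray(uniques, dtype=object)

    # Codes that were (wrongly) assigned to missing values.
    na_codes = np.unique(labels[mask])
    keep = np.ones(len(uniques), dtype=bool)
    keep[na_codes] = False

    # Map each old code to its position among the kept uniques.
    remap = np.cumsum(keep) - 1
    remap[~keep] = na_sentinel

    new_labels = remap[labels]
    new_labels[mask] = na_sentinel
    return new_labels, uniques[keep]


# [B, NA, C, B] naively coded as [0, 1, 2, 0]
new_labels, new_uniques = refactorize(
    [0, 1, 2, 0], [False, True, False, False], ["B", None, "C"]
)
```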

Contributor Author

Hmm apparently I messed this up, so the tests will fail. Fixing now.

Contributor Author

Fixed in 434df7d

Member

If we require this, I think we should add a stronger note (maybe as a comment below the docstring) stating that we expect missing values not to be in the uniques.

What would go wrong if .factorize() on an extension array included its missing value in the uniques?

Contributor Author

Hmm, probably not the end of the world, but it would violate our factorize (and eventually groupby) semantics. Until recently, we messed this up for Categorical. That didn't have downstream effects for groupby, since everything in groupby is special-cased for Categorical.

I'll make this clearer.
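The semantics in question can be seen with the public `pandas.factorize` on a plain object array. This is only an illustration of the existing contract, not code from this PR.

```python
import numpy as np
import pandas as pd

values = np.array(["b", np.nan, "c", "b"], dtype=object)
codes, uniques = pd.factorize(values)

# The missing entry is coded with the sentinel -1 and is absent from uniques,
# so taking the non-negative codes from uniques restores the valid values.
restored = uniques.take(codes[codes >= 0])
```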

Contributor Author

Well, adding that simple note turned into a rabbit hole. I decided to merge the 3 docstrings:

  • pandas.factorize
  • pandas.core.base.IndexOpsMixin.factorize
  • pandas.Categorical.factorize

@TomAugspurger changed the title from "ENH/API: ExtensionArray.factorize" to "[WIP]ENH/API: ExtensionArray.factorize" Mar 15, 2018
@TomAugspurger
Contributor Author

TomAugspurger commented Mar 15, 2018

Marking this as a WIP for now. For JSONArray, the value_counts test is failing. I'd like to see if we can reuse factorize in value_counts. In principle, counting the labels and doing a uniques.take on the index of the counts should work.
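A rough sketch of that idea using NumPy only. `value_counts_from_factorize` is a hypothetical name, and the sketch assumes the labels use -1 for missing values and that missing values are left out of the counts.

```python
import numpy as np


def value_counts_from_factorize(labels, uniques):
    """Count each code, then take the uniques in order of descending count."""
    labels = np.asarray(labels, dtype=np.intp)
    uniques = np.asarray(uniques, dtype=object)

    # Ignore the NA sentinel (-1) and count occurrences of every code.
    counts = np.bincount(labels[labels >= 0], minlength=len(uniques))

    # Most frequent first; a stable sort keeps first-seen order for ties.
    order = np.argsort(-counts, kind="stable")
    return uniques.take(order), counts[order]


values, counts = value_counts_from_factorize([0, -1, 1, 0, 1, 1], ["b", "c"])
```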

@TomAugspurger
Contributor Author

TomAugspurger commented Mar 15, 2018

The failures are caused by a "bug" in is_extension_array_dtype. #20363

`is_extension_array_dtype(dtype)` was incorrect for dtypes that haven't
implemented the new interface yet. This is because they indirectly subclassed
ExtensionDtype.

This PR changes the hierarchy so that PandasExtensionDtype doesn't subclass
ExtensionDtype. As we implement the interface, like Categorical, we'll add
ExtensionDtype as a base class.

Before:

```
DatetimeTZDtype <- PandasExtensionDtype <- ExtensionDtype (wrong)
CategoricalDtype <- PandasExtensionDtype <- ExtensionDtype (right)
```

After:

```
DatetimeTZDtype <- PandasExtensionDtype
                                        \
                                         - _DtypeOpsMixin
                                        /
                   ExtensionDtype ------

CategoricalDtype - PandasExtensionDtype -
                \                        \
                 \                        -_DtypeOpsMixin
                  \                      /
                   ExtensionDtype -------
```

Once all our extension dtypes have implemented the interface we can go back
to the simple, linear inheritance structure.
@TomAugspurger TomAugspurger mentioned this pull request Mar 16, 2018
15 tasks
  values = getattr(values, '_values', values)
- labels, uniques = values.factorize()
+ labels, uniques = values.factorize(na_sentinel=na_sentinel)
Contributor Author

Note: this was a bug in #19938 where I forgot to pass this through. It's covered by our extension tests.

@TomAugspurger
Contributor Author

This should pass now.

-----
:meth:`pandas.factorize` offers a `sort` keyword as well.
"""
from pandas.core.algorithms import _factorize_array
Member

We're OK with using this private API here?

Because an extension author might want to copy-paste this method and change the arr = self.astype(object) line? (How do you implement factorize in cyberpandas?)

Contributor Author

Quite similar to this. https://github.com/ContinuumIO/cyberpandas/blob/468644bcbdc9320a1a33b0df393d4fa4bef57dd7/cyberpandas/base.py#L72

In that case I think going to object dtypes is unavoidable, since there's no easy way to factorize a 2-D array, and I didn't want to write a new hashtable implementation :)
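A small sketch of the object-dtype route for data stored as a 2-D array. The `(n, 2)` uint64 layout and the helper name `factorize_rows` are assumptions for illustration and are not taken from cyberpandas; the sketch goes through the public `pandas.factorize` rather than the private `_factorize_array`.

```python
import numpy as np
import pandas as pd


def factorize_rows(data):
    """Factorize the rows of a 2-D array by boxing each row as a tuple."""
    boxed = np.empty(len(data), dtype=object)
    for i, row in enumerate(data.tolist()):
        boxed[i] = tuple(row)  # one hashable Python object per row
    return pd.factorize(boxed)


data = np.array([[0, 1], [0, 2], [0, 1]], dtype=np.uint64)
labels, uniques = factorize_rows(data)
```

Boxing every row costs memory and speed, which is the trade-off being accepted in the comment above.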

Contributor Author

@TomAugspurger TomAugspurger Mar 16, 2018

w.r.t. using _factorize_array, I don't think it's avoidable.

We might consider making it public / semi-public (keep the _ to not confuse users, but point EA authors to it).


def test_factorize_equivalence(self, data_for_grouping):
    l1, u1 = pd.factorize(data_for_grouping)
    l2, u2 = pd.factorize(data_for_grouping)
Member

What is this test doing?

Contributor Author

Ha, I think I meant for one to be data_for_grouping.factorize()
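A sketch of what the intended comparison might look like, using `pd.Categorical` as a stand-in for the `data_for_grouping` fixture (the fixture itself is not shown in this thread, so the sample data is an assumption).

```python
import numpy as np
import pandas as pd

data = pd.Categorical(["b", None, "a", "b"])

# The top-level function and the array method should agree.
l1, u1 = pd.factorize(data)
l2, u2 = data.factorize()

np.testing.assert_array_equal(l1, l2)
assert list(u1) == list(u2)
```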

Missing values are checked separately from valid values.
A mask of missing values is computed for each and checked to match.
The remaining all-valid values are cast to object dtype and checked.
"""
Member

Why was this needed only now? (Weren't the missing values the reason you added assert_frame_equal et al. as overridable class methods?)

Contributor Author

@TomAugspurger TomAugspurger Mar 16, 2018

factorize is I think the first case where we have a public method returning an ExtensionArray (the second return value).

I'll see if any old tests can make use of this.

Contributor Author

I don't see any other cases where we could apply assert_extension_array_equal.

The closest would be the __getitem__ tests, but in that case we don't have an expected array to compare to, since we're testing getitem / take / etc.

@TomAugspurger
Contributor Author

TomAugspurger commented Mar 24, 2018

@jreback you can ignore all the back and forth between Joris and me yesterday about mask. I think #20473 will make things nice for EA authors. They'll just need to specify:

  • The array to be used for factorization
  • The scalar in their array to be considered NA.

Some examples:

  • default: self.astype(object), np.nan
  • Categorical: self.codes.astype(int64), -1
  • MACArray: self.data, 0 (this is in cyberpandas. A uint64 array, where 0 is NA)
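To make the "array plus NA scalar" idea concrete, here is a sketch of factorizing with a caller-supplied NA value, modelled on the MACArray case above. `factorize_with_na` is a hypothetical helper rather than a pandas function, and it compares with `==`, so it would not handle `np.nan` as the NA scalar.

```python
import numpy as np


def factorize_with_na(values, na_value, na_sentinel=-1):
    """Factorize values, coding entries equal to na_value as na_sentinel."""
    values = np.asarray(values)
    codes = {}  # value -> code, in order of first appearance
    labels = np.empty(len(values), dtype=np.intp)
    for i, value in enumerate(values.tolist()):
        if value == na_value:
            labels[i] = na_sentinel
        else:
            labels[i] = codes.setdefault(value, len(codes))
    uniques = np.array(list(codes), dtype=values.dtype)
    return labels, uniques


# A uint64 array where 0 stands for NA.
macs = np.array([5, 0, 7, 5], dtype=np.uint64)
labels, uniques = factorize_with_na(macs, na_value=0)
```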

@jreback
Contributor

jreback commented Mar 24, 2018

ok let's work on #20473 first then, that is exactly the kind of thing I would like to see, e.g. moving around pandas internals to make them more friendly by essentially pushing things down rather than doing them solely in the EA.

@TomAugspurger
Contributor Author

Updated on top of the parametrized NA value. Things look relatively clean now I think (most of the diff is due to moving docstrings to a shared spot). In terms of API:

def _values_for_factorize(self) -> Tuple[ndarray, Any]:
    # array to factorize and na_value

@classmethod
@abstractmethod
def _from_factorized(cls, values, original) -> ExtensionArray:
    # Reconstruct uniques from values & original

This is adequate for all our examples in pandas, including JSONArray which isn't otherwise factorizable (non-hashable). It also works well for IPArray and MACArray (Uint64 with 0 as the NA value).

@jorisvandenbossche do you think this will work well for geopandas? Can you have missing geometries?

The categorical tests are an example of using original in _from_factorized, as we copy over the unobserved categories to the uniques dtype.
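A toy class showing how the two hooks could fit together, including the use of `original` to carry metadata over to the uniques. `ToyArray` does not subclass the pandas `ExtensionArray`, the `unit` attribute is invented metadata, and the `factorize` body is an assumption about how a default implementation might use the hooks.

```python
import numpy as np


class ToyArray:
    """Toy stand-in for an extension array backed by uint64, with 0 as NA."""

    def __init__(self, data, unit):
        self.data = np.asarray(data, dtype=np.uint64)
        self.unit = unit  # hypothetical metadata that must survive factorize

    def _values_for_factorize(self):
        # The array to factorize and the scalar that marks NA.
        return self.data, 0

    @classmethod
    def _from_factorized(cls, values, original):
        # Rebuild the uniques, copying metadata from the original array.
        return cls(values, unit=original.unit)

    def factorize(self, na_sentinel=-1):
        values, na_value = self._values_for_factorize()
        valid = values != na_value
        uniq, first_idx, inverse = np.unique(
            values[valid], return_index=True, return_inverse=True
        )
        # np.unique sorts; reorder so codes follow order of first appearance.
        order = np.argsort(first_idx)
        rank = np.empty_like(order)
        rank[order] = np.arange(len(order))
        labels = np.full(len(values), na_sentinel, dtype=np.intp)
        labels[valid] = rank[inverse]
        return labels, self._from_factorized(uniq[order], self)


arr = ToyArray([5, 0, 7, 5, 3], unit="mac")
labels, uniques = arr.factorize()
```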

@jorisvandenbossche
Member

@jorisvandenbossche do you think this will work well for geopandas? Can you have missing geometries?

Yes, that should be fine. I am not fully sure yet, but I think I will use a binary blob for each geometry ("well known binary" format), which will give an object array of bytes, with e.g. None as the missing value. So this should work (I think that should already work with pd.factorize itself, without the new infrastructure).
I will also need the original to add back metadata about the coordinate reference system.

@TomAugspurger
Contributor Author

TomAugspurger commented Mar 27, 2018

From Python your array of pointers looks like Int64s, right? Could those be factorized? I think it should work, though I haven't thought it through fully.

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 27, 2018

No, because two separate Points (separate objects in C) with the same coordinates will have different pointers. So the pointer information is not necessarily relevant for factorize.
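The same effect can be reproduced in pure Python, with `id()` playing the role of the C pointer. The `Point` class and its `to_bytes` method are hypothetical; `struct.pack` is used here as a simple stand-in for a real "well known binary" encoding.

```python
import struct


class Point:
    """Toy geometry; equal coordinates do not imply the same object."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def to_bytes(self):
        # Stand-in for a WKB blob: two little-endian doubles.
        return struct.pack("<dd", self.x, self.y)


def factorize_by(items, key):
    """Assign codes in order of first appearance of key(item)."""
    codes = {}
    return [codes.setdefault(key(item), len(codes)) for item in items]


points = [Point(1.0, 2.0), Point(1.0, 2.0), Point(3.0, 4.0)]

by_identity = factorize_by(points, id)            # like comparing pointers
by_value = factorize_by(points, Point.to_bytes)   # like comparing WKB blobs
```

Factorizing by identity puts the two equal points in separate groups, while factorizing by the byte representation groups them together.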

@TomAugspurger
Contributor Author

Understood.

CI is all green.

@jreback jreback merged commit 766a480 into pandas-dev:master Mar 27, 2018
@jreback
Contributor

jreback commented Mar 27, 2018

thanks @TomAugspurger

@TomAugspurger TomAugspurger deleted the ea-factorize-2 branch March 27, 2018 16:06
javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018
dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018
kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API Design ExtensionArray Extending pandas with custom dtypes or arrays.

4 participants