Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/BUG/API: factorizing categorical data #19938

Merged
merged 14 commits into from
Mar 15, 2018

Conversation

TomAugspurger
Copy link
Contributor

This changes / fixes how Categorical data are factorized. The return value of a
factorized categorical is now Tuple[ndarray[int], Categorical].

Before

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]: array([0, 1])

after

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]:
[a, b]
Categories (2, object): [a, b]

The implementation is similar to .unique.

  1. The algo (pd.factorize, pd.unique) handles unboxing / dtype coercion
  2. The algo dispatches the actual array factorization for extension types
  3. The algo boxes the output if necessary, depending on the input.

I've implemented this as a new public method on Categorical, mainly since
this is what we do for unique, and I think it's a useful method to have.

This fixes a bug in factorizing categoricals with missing values. Previously, we
included -1 in the uniques.

Before

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]: array([ 0,  1, -1])

After

In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]:
[a, b]
Categories (2, object): [a, b]

xref #19721. This doesn't close it, as I haven't addressed the issue of what factorize(Series[EA]) should return. Just fixing the categorical bugs.

This changes / fixes how Categorical data are factorized. The return value of a
factorized categorical is now `Tuple[ndarray[int], Categorical]`.

Before

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]: array([0, 1])
```

after

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b']))

In [3]: l
Out[3]: array([0, 0, 1])

In [4]: u
Out[4]:
[a, b]
Categories (2, object): [a, b]
```

The implementation is similar to `.unique`.

1. The algo (`pd.factorize`, `pd.unique`) handles unboxing / dtype coercion
2. The algo dispatches the actual array factorization for extension types
3. The algo boxes the output if necessary, depending on the input.

I've implemented this as a new public method on ``Categorical``, mainly since
this is what we do for unique, and I think it's a useful method to have.

This fixes a bug in factorizing categoricals with missing values. Previously, we
included -1 in the uniques.

Before

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]: array([ 0,  1, -1])
```

After

```python
In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None]))

In [3]: u
Out[3]:
[a, b]
Categories (2, object): [a, b]
```
@TomAugspurger TomAugspurger added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Categorical Categorical Data Type labels Feb 28, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Feb 28, 2018
@codecov
Copy link

codecov bot commented Feb 28, 2018

Codecov Report

Merging #19938 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19938      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         152      152              
  Lines       49185    49206      +21     
==========================================
+ Hits        45140    45163      +23     
+ Misses       4045     4043       -2
Flag Coverage Δ
#multiple 90.16% <100%> (ø) ⬆️
#single 41.83% <62.5%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.29% <100%> (+0.12%) ⬆️
pandas/core/arrays/categorical.py 95.11% <100%> (+0.05%) ⬆️
pandas/util/testing.py 83.95% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cad6dc7...1e006d1. Read the comment docs.

import pandas as pd
import pandas.util.testing as tm


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should prob pull the categorical tests out of test_algos.py then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, test_algos.py didn't have any tests for Categorical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, yeah i guess just lots of tests for unique.

from pandas.core.algorithms import _factorize_array, take_1d

codes = self.codes.astype('int64')
# We set missing codes, normally -1, to iNaT so that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the astype after the comment, looks awkward otherwise

why do you think you need to do this? the point of the na_sentinel is to select the missing values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

na_sentinel controls the missing marker for the output. We're modifying the input, since the Int64HashTable sees that they're missing, instead of the value -1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be fixed generally, e.g. we should have a way to pass in the missing value. can you fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the hashtable code, but at a glance it looks like the null condition is included in the class definition template?

{{py:

# name, dtype, null_condition, float_group
dtypes = [('Float64', 'float64', 'val != val', True),
          ('UInt64', 'uint64', 'False', False),
          ('Int64', 'int64', 'val == iNaT', False)]

I'm not sure how to pass expressions down to cython as a parameter.

Anyway, do we actually need this to be parameterized? Do we have other cases where we've needed to pass the null condition down?

uniques = self._constructor(self.categories.take(uniques),
categories=self.categories,
ordered=self.ordered)
if sort:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the sort from here and just do it in factorize part (after the else), otherwise logic is in multiple places here

if is_categorical_dtype(values):
values = getattr(values, '_values', values)
labels, uniques = values.factorize(sort=sort)
dtype = original.dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comment below, but you might simpy dispatch on categricals and just return, mixing the impl is really confusing here.

@jreback
Copy link
Contributor

jreback commented Mar 4, 2018

can you show what exactly is the 'bug' here?

@jorisvandenbossche
Copy link
Member

can you show what exactly is the 'bug' here?

Did you see Tom's top level post with 'before' and 'after' ? (and the discussion in #19721)

@jreback
Copy link
Contributor

jreback commented Mar 7, 2018

ok I see the rationale behind this, which is ok. still some question regarding the impl. this looks like a bit of a work around.


if sort and len(uniques) > 0:
from pandas.core.sorting import safe_sort
uniques, labels = safe_sort(uniques, labels, na_sentinel=na_sentinel,
assume_unique=True)
try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all the sorting from Categorical.factorize. All that logic is here.

I don't think we want to just call safe_sort for two reasons

  1. that function does a lot of unnescessary work when we know that uniques is an ndarray or EA.
  2. It coerces categoricals to object ndarrays.
  3. EAs (like Categorical) may have special sorting rules.

On some small bencharks (10,000 elements) this is about 25-40% faster. The only slow case, for which we still need safe_sort, is when the array is mixed. In that case things are about 10% slower.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 12, 2018

Fixed the bug in my logic from the last commit.

ASV looks good. The first time I ran it I had a 1.20 ratio (slowdown) relative to master on algorithms.Factorize.time_factorize_int(True). 0.8 ratio (speedup) the second time. No change 3rd time.

check_nulls = not is_integer_dtype(original)
labels, uniques = _factorize_array(values, check_nulls,
na_sentinel=na_sentinel,
size_hint=size_hint)

if sort and len(uniques) > 0:
from pandas.core.sorting import safe_sort
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could move to the except (but no big deal)

values = getattr(values, '_values', values)
labels, uniques = values.factorize()
dtype = original.dtype
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this actually be a check on the values if they have a .factorize() method (or check is_extension_array)? instead of specifically checking for categorical? (of course categorical will pass these checks). as this will then make pd.factorize(an_extension_array) work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is just a bugfix for categorical. But the structure will be very similar (I'll just change is_categorical_dtype to is_extension_array_dtype.)

I'll implement EA.factorize today hopefully, but have to get things like unique and argsort working first.

from pandas.core.algorithms import _factorize_array

codes = self.codes.astype('int64')
codes[codes == -1] = iNaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface we have to hashtable.get_labels() is very odd right now, IOW we have a check_null flag which then makes the caller know to substitute values to iNaT (for int64) and know which are the sentinels. This is breaking the abstrastion. Rather would either like to be able to pass in the actual sentinel (not the output sentinel, but that's another confusion). e.g . you would simply pass -1 here.

I think its worth re-factoring this (maybe before this PR), though I suppose could be after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#20328

Yes, that'd be nicer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually want this to be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factorize in general? I don’t see why not. It’s present on series and index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#19938 (comment) was in reference to the API docs. We whitelist the methods on Categorical that are included in the API docs (just __array__ and from_codes for now).

@TomAugspurger
Copy link
Contributor Author

@jorisvandenbossche just confirming, but I assume we don't want to add Categorical.factorize to the API docs?

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche just confirming, but I assume we don't want to add Categorical.factorize to the API docs?

That maybe depends on the discussion we will have on that for ExtensionArray?

from pandas.core.algorithms import _factorize_array

codes = self.codes.astype('int64')
codes[codes == -1] = iNaT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we actually want this to be public?

@jreback
Copy link
Contributor

jreback commented Mar 14, 2018

ahh I see we are already exposing factorize on Series. ok, then this should be a part of the EA interface. (can be separate PR).

otherwise this ok? can you rebase

@TomAugspurger
Copy link
Contributor Author

Yeah, this should be good to go.

@jreback jreback merged commit 38afa93 into pandas-dev:master Mar 15, 2018
@jreback
Copy link
Contributor

jreback commented Mar 15, 2018

thanks @TomAugspurger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants