-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF/BUG/API: factorizing categorical data #19938
REF/BUG/API: factorizing categorical data #19938
Conversation
This changes / fixes how Categorical data are factorized. The return value of a factorized categorical is now `Tuple[ndarray[int], Categorical]`. Before ```python In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b'])) In [3]: l Out[3]: array([0, 0, 1]) In [4]: u Out[4]: array([0, 1]) ``` after ```python In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b'])) In [3]: l Out[3]: array([0, 0, 1]) In [4]: u Out[4]: [a, b] Categories (2, object): [a, b] ``` The implementation is similar to `.unique`. 1. The algo (`pd.factorize`, `pd.unique`) handles unboxing / dtype coercion 2. The algo dispatches the actual array factorization for extension types 3. The algo boxes the output if necessary, depending on the input. I've implemented this as a new public method on ``Categorical``, mainly since this is what we do for unique, and I think it's a useful method to have. This fixes a bug in factorizing categoricals with missing values. Previously, we included -1 in the uniques. Before ```python In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None])) In [3]: u Out[3]: array([ 0, 1, -1]) ``` After ```python In [2]: l, u = pd.factorize(pd.Categorical(['a', 'a', 'b', None])) In [3]: u Out[3]: [a, b] Categories (2, object): [a, b] ```
Codecov Report
@@ Coverage Diff @@
## master #19938 +/- ##
==========================================
+ Coverage 91.77% 91.78% +<.01%
==========================================
Files 152 152
Lines 49185 49206 +21
==========================================
+ Hits 45140 45163 +23
+ Misses 4045 4043 -2
Continue to review full report at Codecov.
|
import pandas as pd | ||
import pandas.util.testing as tm | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should prob pull the categorical tests out of test_algos.py then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better or worse, test_algos.py
didn't have any tests for Categorical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, yeah i guess just lots of tests for unique.
pandas/core/arrays/categorical.py
Outdated
from pandas.core.algorithms import _factorize_array, take_1d | ||
|
||
codes = self.codes.astype('int64') | ||
# We set missing codes, normally -1, to iNaT so that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the astype after the comment, looks awkward otherwise
why do you think you need to do this? the point of the na_sentinel is to select the missing values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
na_sentinel controls the missing marker for the output. We're modifying the input, since the Int64HashTable sees that they're missing, instead of the value -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be fixed generally, e.g. we should have a way to pass in the missing value. can you fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the hashtable code, but at a glance it looks like the null condition is included in the class definition template?
{{py:
# name, dtype, null_condition, float_group
dtypes = [('Float64', 'float64', 'val != val', True),
('UInt64', 'uint64', 'False', False),
('Int64', 'int64', 'val == iNaT', False)]
I'm not sure how to pass expressions down to cython as a parameter.
Anyway, do we actually need this to be parameterized? Do we have other cases where we've needed to pass the null condition down?
pandas/core/arrays/categorical.py
Outdated
uniques = self._constructor(self.categories.take(uniques), | ||
categories=self.categories, | ||
ordered=self.ordered) | ||
if sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the sort from here and just do it in factorize part (after the else), otherwise logic is in multiple places here
pandas/core/algorithms.py
Outdated
if is_categorical_dtype(values): | ||
values = getattr(values, '_values', values) | ||
labels, uniques = values.factorize(sort=sort) | ||
dtype = original.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comment below, but you might simpy dispatch on categricals and just return, mixing the impl is really confusing here.
can you show what exactly is the 'bug' here? |
Did you see Tom's top level post with 'before' and 'after' ? (and the discussion in #19721) |
ok I see the rationale behind this, which is ok. still some question regarding the impl. this looks like a bit of a work around. |
|
||
if sort and len(uniques) > 0: | ||
from pandas.core.sorting import safe_sort | ||
uniques, labels = safe_sort(uniques, labels, na_sentinel=na_sentinel, | ||
assume_unique=True) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed all the sorting from Categorical.factorize. All that logic is here.
I don't think we want to just call safe_sort
for two reasons
- that function does a lot of unnescessary work when we know that
uniques
is an ndarray or EA. - It coerces categoricals to object ndarrays.
- EAs (like Categorical) may have special sorting rules.
On some small bencharks (10,000 elements) this is about 25-40% faster. The only slow case, for which we still need safe_sort, is when the array is mixed. In that case things are about 10% slower.
Fixed the bug in my logic from the last commit. ASV looks good. The first time I ran it I had a 1.20 ratio (slowdown) relative to master on |
check_nulls = not is_integer_dtype(original) | ||
labels, uniques = _factorize_array(values, check_nulls, | ||
na_sentinel=na_sentinel, | ||
size_hint=size_hint) | ||
|
||
if sort and len(uniques) > 0: | ||
from pandas.core.sorting import safe_sort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could move to the except (but no big deal)
values = getattr(values, '_values', values) | ||
labels, uniques = values.factorize() | ||
dtype = original.dtype | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this actually be a check on the values if they have a .factorize()
method (or check is_extension_array
)? instead of specifically checking for categorical? (of course categorical will pass these checks). as this will then make pd.factorize(an_extension_array)
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is just a bugfix for categorical. But the structure will be very similar (I'll just change is_categorical_dtype
to is_extension_array_dtype
.)
I'll implement EA.factorize today hopefully, but have to get things like unique and argsort working first.
from pandas.core.algorithms import _factorize_array | ||
|
||
codes = self.codes.astype('int64') | ||
codes[codes == -1] = iNaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface we have to hashtable.get_labels()
is very odd right now, IOW we have a check_null
flag which then makes the caller know to substitute values to iNaT (for int64) and know which are the sentinels. This is breaking the abstrastion. Rather would either like to be able to pass in the actual sentinel (not the output sentinel, but that's another confusion). e.g . you would simply pass -1 here.
I think its worth re-factoring this (maybe before this PR), though I suppose could be after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that'd be nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually want this to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factorize in general? I don’t see why not. It’s present on series and index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#19938 (comment) was in reference to the API docs. We whitelist the methods on Categorical that are included in the API docs (just __array__
and from_codes
for now).
@jorisvandenbossche just confirming, but I assume we don't want to add |
That maybe depends on the discussion we will have on that for ExtensionArray? |
from pandas.core.algorithms import _factorize_array | ||
|
||
codes = self.codes.astype('int64') | ||
codes[codes == -1] = iNaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we actually want this to be public?
ahh I see we are already exposing factorize on otherwise this ok? can you rebase |
Yeah, this should be good to go. |
thanks @TomAugspurger |
This changes / fixes how Categorical data are factorized. The return value of a
factorized categorical is now
Tuple[ndarray[int], Categorical]
.Before
after
The implementation is similar to
.unique
.pd.factorize
,pd.unique
) handles unboxing / dtype coercionI've implemented this as a new public method on
Categorical
, mainly sincethis is what we do for unique, and I think it's a useful method to have.
This fixes a bug in factorizing categoricals with missing values. Previously, we
included -1 in the uniques.
Before
After
xref #19721. This doesn't close it, as I haven't addressed the issue of what
factorize(Series[EA])
should return. Just fixing the categorical bugs.