BUG: Categorical.values_for_(factorize|argsort) dont preserve order #33245
Comments
Can you explain how |
Or you mean the -1 for missing values? |
For ordered Categoricals, this preserves order fine. For non-ordered, order-requiring functions want the behavior of values_for_rank. |
I understand there is a difference in implementations between those functions, but I don't understand why For example,
|
Is sorting an unordered categorical just a bad / ill-defined idea? We don't allow taking |
We certainly had that discussion regularly before ;) Example use case: sort a set of values to see the duplicates next to each other (e.g. when sorting a dataframe by a categorical column). The act of sorting (putting equal values together) is still useful in such a case, even without a clear order being defined. BTW, looking at the output of rank vs sort_values:
I think at least one of the two is wrong, as IMO both should match in the way they order the values? |
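The original snippet for this comparison did not survive in the thread, so here is a minimal sketch of the kind of comparison being described. The sample values and the category order are made up for illustration, and the exact rank output may depend on the pandas version, so no expected ranks are shown.

```python
import pandas as pd

# Hypothetical data: unordered categorical whose categories are
# deliberately NOT in lexicographic order.
s = pd.Series(pd.Categorical(["b", "a", "c", "a"], categories=["c", "b", "a"]))

# sort_values on a categorical follows the category (code) order.
sorted_values = s.sort_values().tolist()

# rank goes through a separate code path (values_for_rank in this thread),
# so its ordering is not guaranteed to agree with sort_values.
ranks = s.rank(method="dense")

print(sorted_values)
print(ranks.tolist())
```

If the two agree on the ordering, the value that sorts first should also get the lowest rank; comparing the two printed lists makes any mismatch visible.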
This isn't crazy, but if we go down that road we need a way to make it general for non-ordered EAs and de-special-case Categorical.
Agreed that they should match on how they sort values. #15422 implemented |
But, so if you think (to be clear, apparently I also found back in time that this was the logical behaviour for rank, but just trying to get the issue clear) |
|
FWIW, I would have expected lexicographical sorting for unordered categoricals. Seeing this issue, now I understand what's going on, but I just stared at some results that surprised me given that the codes ordering is arbitrary for unordered categoricals by definition. I'm certain it's more complicated than just sorting the categories on instantiation but 🤷 . I'll have to remember to use reorder_categories or to use ordered categories. |
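For anyone landing here with the same surprise, a small sketch of the reorder_categories workaround mentioned above. The data and category order are invented for the example; the idea is only to put the categories into lexicographic order so that code order and lexical order coincide.

```python
import pandas as pd

# Hypothetical data: categories given in a non-lexicographic order.
s = pd.Series(["b", "a", "c", "a"]).astype(pd.CategoricalDtype(["c", "b", "a"]))

# Sorting as-is follows the category order: c, b, a.
as_is = s.sort_values().tolist()

# Reorder the categories lexicographically; the values themselves are unchanged.
lexical = s.cat.reorder_categories(sorted(s.cat.categories))
lexical_sorted = lexical.sort_values().tolist()

print(as_is)
print(lexical_sorted)
```

This only changes the category order (and therefore the codes), not the data, so it should be safe to apply right before sorting.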
The docstrings for these two methods say that values_for_foo should preserve order, but for non-ordered Categoricals, the current implementation using self.codes is not order-preserving. Is this a problem? cc @jorisvandenbossche
Categorical.values_for_rank exists pretty much to solve this problem for rank and value_counts. We have a special case in core.algorithms using values_for_rank that we should ideally avoid.
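To make "not order-preserving" concrete, here is a rough sketch using only the public codes attribute rather than the internal values_for_* methods. The sample data is hypothetical; the point is that argsorting the codes and argsorting the underlying values can give different permutations when the categories are not in lexicographic order.

```python
import numpy as np
import pandas as pd

# Hypothetical unordered categorical with categories in a non-lexical order.
cat = pd.Categorical(["b", "a", "c", "a"], categories=["c", "b", "a"])

# Codes are positions in `categories`: c -> 0, b -> 1, a -> 2.
codes = cat.codes

# Stable argsort of the codes (what a codes-based implementation would see).
by_codes = np.argsort(codes, kind="stable").tolist()

# Stable argsort of the actual values, done in plain Python.
values = cat.tolist()
by_values = sorted(range(len(values)), key=values.__getitem__)

print(codes.tolist())  # [1, 2, 0, 2]
print(by_codes)        # [2, 0, 1, 3]
print(by_values)       # [1, 3, 0, 2]
```

The two permutations differ, which is the mismatch the docstrings' "preserve order" wording would seem to rule out.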