Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Categorical.values_for_(factorize|argsort) dont preserve order #33245

Open
jbrockmendel opened this issue Apr 2, 2020 · 10 comments
Open
Labels
Bug Categorical Categorical Data Type ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@jbrockmendel
Copy link
Member

The docstrings for these two methods say that values_for_foo should preserve order, but for non-ordered Categoricals, the current implementation using self.codes is not order-preserving. Is this a problem cc @jorisvandenbossche ?

Categorical.values_for_rank exists pretty much to solve this problem for rank and value_counts. We have a special case in core.algorithms using values_for_rank that we should ideally avoid.

@jorisvandenbossche
Copy link
Member

Can you explain how values_for_factorize/argsort don't preserve order for Categorical?

@jorisvandenbossche
Copy link
Member

Or you mean the -1 for missing values?

@jbrockmendel
Copy link
Member Author

Can you explain how values_for_factorize/argsort don't preserve order for Categorical?

    def _values_for_argsort(self):
        return self._codes

    def _values_for_factorize(self):
        codes = self.codes.astype("int64")
        return codes, -1

    def _values_for_rank(self):
        """
        For correctly ranking ordered categorical data. See GH#15420

        Ordered categorical data should be ranked on the basis of
        codes with -1 translated to NaN.

        Returns
        -------
        numpy.array

        """
        from pandas import Series

        if self.ordered:
            values = self.codes
            mask = values == -1
            if mask.any():
                values = values.astype("float64")
                values[mask] = np.nan
        elif self.categories.is_numeric():
            values = np.array(self)
        else:
            #  reorder the categories (so rank can use the float codes)
            #  instead of passing an object array to rank
            values = np.array(
                self.rename_categories(Series(self.categories).rank().values)
            )
        return values

For ordered Categoricals, this preserve order fine. For non-ordered, order-requiring functions want the behavior of values_for_rank.

@jorisvandenbossche
Copy link
Member

I understand there is a difference in implementations between those functions, but I don't understand why _values_for_factorize/argsort wouldn't preserve order? What aspect of behaviour is wrong then?

For example, sort_values requires order, but is using _values_for_argsort, and not _values_for_rank. So is the return value of sort_values wrong?

In [1]: cat = pd.Categorical(['a', 'b', 'a', 'b', None, 'c'], categories=['a', 'c', 'b'])                                                                     

In [2]: pd.Series(cat).sort_values()                                                                                                                          
> /home/joris/scipy/pandas/pandas/core/arrays/categorical.py(1488)_values_for_argsort()
-> return self._codes
(Pdb) c
Out[2]: 
0      a
2      a
5      c
1      b
3      b
4    NaN
dtype: category
Categories (3, object): [a, c, b]

@jorisvandenbossche jorisvandenbossche added Categorical Categorical Data Type ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 2, 2020
@TomAugspurger
Copy link
Contributor

Is sorting an unordered categorical just a bad / ill-defined idea? We don't allow taking min / max, so perhaps they should just be considered unorderable.

@jorisvandenbossche
Copy link
Member

We cetainly had that discussion regularly before ;)
I don't remember the exact arguments, but IMO it is useful to be able to sort values (and the sort order is just the categories order), without there being a meaning of min/max to the order.

Example use case: sort a set of values to see the duplicates next too each other (eg when sorting a dataframe by a categorical column). The act of sorting (putting equal values together) is still useful in such a case, even without being an clear order being defined.

BTW, looking at the outpug of rank vs sort_values:

In [1]: cat = pd.Categorical(['a', 'b', 'a', 'b', None, 'c'], categories=['a', 'c', 'b']) 

In [2]: pd.Series(cat).sort_values()   
Out[2]: 
0      a
2      a
5      c
1      b
3      b
4    NaN
dtype: category
Categories (3, object): [a, c, b]

In [3]: pd.Series(cat).rank()   
Out[3]: 
0    1.5
1    3.5
2    1.5
3    3.5
4    NaN
5    5.0
dtype: float64

I think at least one of the two is wrong, as IMO both should match in the way they order the values?
But personally, I would rather say that rank is the wrong one ..

@jbrockmendel
Copy link
Member Author

Is sorting an unordered categorical just a bad / ill-defined idea?

This isn't crazy, but if we go down that road we need a way to make it general for non-ordered EAs and de-special-case Categorical.

I think at least one of the two is wrong, as IMO both should match in the way they order the values?
But personally, I would rather say that rank is the wrong one ..

Agreed that they should match on how they sort values. #15422 implemented values_for_rank to use the ordering of the underlying categories for non-ordered. That seems like a reasonable fallback.

@jorisvandenbossche
Copy link
Member

But, so if you think rank is fine, then that means you think the output of sort_values is wrong / should be changed?

(to be clear, apparently I also found back in time that this was the logical behaviour for rank, but just trying to get the issue clear)

@jorisvandenbossche
Copy link
Member

rank and sort_values still give a different answer, which I think is someting we should solve eventually?

@jseabold
Copy link
Contributor

jseabold commented Sep 1, 2021

FWIW, I would have expected lexicographical sorting for unordered categoricals. Seeing this issue, now I understand what's going on, but I just stared at some results that surprised me given that the codes ordering is arbitrary for unordered categoricals by definition. I'm certain it's more complicated than just sorting the categories on instantiation but 🤷 . I'll have to remember to use reorder_categories or to use ordered categories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

No branches or pull requests

5 participants