Concatenation of category value counts mixes up the index order #14600

Closed
jbao opened this issue Nov 7, 2016 · 11 comments · Fixed by #31161
Labels
Bug Categorical Categorical Data Type good first issue Needs Tests Unit test(s) needed to prevent regressions

@jbao

jbao commented Nov 7, 2016

Let's say we have two Series, s1 and s2, such as the output of pd.value_counts(), and we want to combine them into one DataFrame:

import pandas as pd

s1 = pd.Series([39, 6, 4], index=pd.CategoricalIndex(['female', 'male', 'unknown']))
s2 = pd.Series([2, 152, 2, 242, 150], index=pd.CategoricalIndex(['f', 'female', 'm', 'male', 'unknown']))
pd.DataFrame([s1, s2])

The result is

    female   male  unknown      f      m
0     NaN   39.0      NaN    6.0    4.0
1     2.0  152.0      2.0  242.0  150.0

where the values in the first row end up under the wrong categories.

And the current workaround is

pd.DataFrame([
    pd.Series(s1.values, index=s1.index.astype(object)),
    pd.Series(s2.values, index=s2.index.astype(object)),
])

which gives the correct result

     f  female    m   male  unknown
0  NaN    39.0  NaN    6.0      4.0
1  2.0   152.0  2.0  242.0    150.0
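
A minimal sketch of the same workaround as a reusable helper, assuming the index labels are unique and mutually sortable. The name combine_counts is hypothetical (it is not a pandas function). It casts each index to object and aligns every Series on an explicitly sorted label list, so the column order does not depend on how union behaves in a given pandas version:

```python
import pandas as pd


def combine_counts(*counts):
    """Stack value-count Series as rows of one DataFrame (hypothetical helper)."""
    # Drop the categorical index so that alignment uses plain labels.
    plain = [pd.Series(s.to_numpy(), index=s.index.astype(object)) for s in counts]
    # Build the column order explicitly instead of relying on Index.union.
    labels = sorted(set().union(*(p.index for p in plain)))
    # Reindex each Series to the full label list; missing labels become NaN.
    rows = [p.reindex(labels).to_numpy() for p in plain]
    return pd.DataFrame(rows, columns=labels)


s1 = pd.Series([39, 6, 4], index=pd.CategoricalIndex(['female', 'male', 'unknown']))
s2 = pd.Series([2, 152, 2, 242, 150],
               index=pd.CategoricalIndex(['f', 'female', 'm', 'male', 'unknown']))
combined = combine_counts(s1, s2)
print(combined)
```

Sorting the labels up front is a design choice made for this sketch; any fixed label order could be passed to reindex instead.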

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-31-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jorisvandenbossche
Member

The underlying reason is possibly that union does not return a sorted result in the case of Categoricals (which I think is a bug):

In [116]: s1.index.union(s2.index)    ## <--- not sorted result
Out[116]: CategoricalIndex(['female', 'male', 'unknown', 'f', 'm'], categories=['f', 'female', 'm', 'male', 'unknown'], ordered=False, dtype='category')

In [117]: s1.index.union(s2.index).sort_values()
Out[117]: CategoricalIndex(['f', 'female', 'm', 'male', 'unknown'], categories=['f', 'female', 'm', 'male', 'unknown'], ordered=False, dtype='category')

In [119]: s1.index.astype(object).union(s2.index.astype(object))    ## <--- sorted result
Out[119]: Index(['f', 'female', 'm', 'male', 'unknown'], dtype='object')
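
A small way to check this on a given pandas installation is sketched below. It only inspects label order and makes no assumption about what the categorical union returns, since that behavior differs between pandas versions; the variable names are illustrative:

```python
import pandas as pd

i1 = pd.CategoricalIndex(['female', 'male', 'unknown'])
i2 = pd.CategoricalIndex(['f', 'female', 'm', 'male', 'unknown'])

# Union on the categorical indexes: the label order is what is in question here.
cat_union = list(i1.union(i2))
# Union after casting both sides to object, as in the last call above.
obj_union = list(i1.astype(object).union(i2.astype(object)))

print("categorical union:", cat_union)
print("object union:     ", obj_union)
print("categorical union sorted?", cat_union == sorted(cat_union))
```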

@nathalier
Contributor

Hello!
I looked at this bug and found that the source of this particular error is the following section of code (in union()):

if self.is_monotonic and other.is_monotonic:
   try:
      result = self._outer_indexer(self._values, other._values)[0]
   except TypeError:
      # incomparable objects
.......

where _outer_indexer() expects an ndarray, not a Categorical, which causes the TypeError.
Skipping this section for categorical values resolves the bug.

But after I wrote tests, another problem turned up.
In union(), the positions of the values that differ between self and other are computed from the categories list, while the differing values themselves are taken from the values of the CategoricalIndex. So an incorrect mask is used to pick out the differences, and if the order of the CategoricalIndex values does not match the order of the categories list, various errors appear.
I'm not sure I'm being clear enough :-)
The example could be as follows:

s1 = pd.Series([6, 39, 4], index=pd.CategoricalIndex(['male', 'female', 'unknown']))
s2 = pd.Series([150, 2, 1, 242], index=pd.CategoricalIndex(['unknown', 'm', 'f', 'male'],
                categories=['m', 'male', 'f', 'unknown'], ordered=True))
res = pd.DataFrame([s1, s2])

This causes IndexError: list index out of range.

So what should union() be based on: the index values or the underlying categories? Or should both be joined, taking into account that the list of categories may be wider than the list of index values? Or maybe it is worth sorting the indices first and then passing them to union()?

@jreback
Contributor

jreback commented Jan 11, 2017

@nathalier thanks for having a look!

Yes, both of these things appear wrong. The general principle that we try to follow is that Categoricals (or CategoricalIndex) can be combined if they are dtype equal (is_dtype_equal is True). IOW, their categories and ordered flags match.

Then they should be combined and stay categoricals.

Otherwise we still allow combinations, BUT coerce to object, then follow union / diff logic.

So happy to have more tests / fixes.
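
The rule described above can be sketched roughly as follows. This is an illustration only, not the pandas implementation: the function name union_by_dtype_rule is hypothetical, and a plain dtype comparison stands in for is_dtype_equal:

```python
import pandas as pd


def union_by_dtype_rule(left, right):
    """Union two CategoricalIndex objects following the rule described above."""
    if left.dtype == right.dtype:
        # Same categories and ordered flag: the result stays categorical.
        labels = list(dict.fromkeys([*left, *right]))  # de-duplicate, keep order
        return pd.CategoricalIndex(labels, dtype=left.dtype)
    # Otherwise coerce both sides to object and use the regular union logic.
    return left.astype(object).union(right.astype(object))


cats = ['A', 'B', 'C']
same = union_by_dtype_rule(pd.CategoricalIndex(['A', 'B'], categories=cats),
                           pd.CategoricalIndex(['B', 'C'], categories=cats))
mixed = union_by_dtype_rule(pd.CategoricalIndex(['A', 'B']),
                            pd.CategoricalIndex(['B', 'C']))
print(same)   # stays a CategoricalIndex
print(mixed)  # falls back to a non-categorical Index
```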

@jcontesti
Contributor

I'll start working on this issue.

@jcontesti
Contributor

This bug was fixed in version 0.20.1.

Using the current development version, this code:

s1 = pd.Series([39,6,4], index=pd.CategoricalIndex(['female','male','unknown']))
s2 = pd.Series([2,152,2,242,150], index=pd.CategoricalIndex(['f','female','m','male','unknown']))
pd.DataFrame([s1,s2])

returns the correct result:

   female   male  unknown    f    m
0    39.0    6.0      4.0  NaN  NaN
1   152.0  242.0    150.0  2.0  2.0

nathalier's code also runs without error:

s1 = pd.Series([6, 39, 4], index=pd.CategoricalIndex(['male', 'female', 'unknown']))
s2 = pd.Series([150, 2, 1, 242], index=pd.CategoricalIndex(['unknown', 'm', 'f', 'male'],
                categories=['m', 'male', 'f', 'unknown'], ordered=True))
pd.DataFrame([s1, s2])

The result is:

     f  female    m   male  unknown
0  NaN    39.0  NaN    6.0      4.0
1  1.0     NaN  2.0  242.0    150.0

Please, let me know if I can help with anything else!

@jorisvandenbossche
Member

@jcontesti Thanks for checking this!
To close this issue, it would be good to still add some tests to ensure the cases from this issue keep working (or at least investigate if such tests have been added when it was fixed). PR with tests welcome!

@jorisvandenbossche jorisvandenbossche added Needs Tests Unit test(s) needed to prevent regressions and removed Effort Low labels Aug 15, 2018
@jcontesti
Contributor

I'll continue with this one.

@jcontesti
Contributor

Hi, I have the tests ready to commit, but now the bug has struck back :-(

This code:

s1 = pd.Series([1, 2], index=pd.CategoricalIndex(['A', 'B']))
s2 = pd.Series([3, 4], index=pd.CategoricalIndex(['B', 'C']))
pd.DataFrame([s1,s2])

returns:

     A    B  NaN
0  1.0  2.0  NaN
1  NaN  3.0  NaN

instead of:

     A    B    C
0  1.0  2.0  NaN
1  NaN  3.0  4.0

Version 0.23.4 executes it correctly, but v0.24.0rc1 and the development version fail. Recall that this bug had been fixed since version 0.20.1.

I can help with the solution, but I need some guidance on how to proceed because I know little about the internals of this project.

Thank you!
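
A regression test for this case might look like the sketch below. The test name is hypothetical and this is not the test that was eventually merged. It checks values by label rather than by column position, so it does not assume any particular column order. Per the report above it would fail on v0.24.0rc1; whether it passes elsewhere depends on the installed pandas version:

```python
import pandas as pd


def test_frame_from_series_with_categorical_index():
    # Hypothetical test name; the inputs are the ones from the report above.
    s1 = pd.Series([1, 2], index=pd.CategoricalIndex(['A', 'B']))
    s2 = pd.Series([3, 4], index=pd.CategoricalIndex(['B', 'C']))
    result = pd.DataFrame([s1, s2])

    # Every label must survive as a column, with no NaN column label.
    assert set(result.columns) == {'A', 'B', 'C'}
    # Each value must sit under its own label.
    assert result.loc[0, 'A'] == 1 and result.loc[0, 'B'] == 2
    assert result.loc[1, 'B'] == 3 and result.loc[1, 'C'] == 4
    # Labels missing from a Series must be NaN in that row.
    assert pd.isna(result.loc[0, 'C']) and pd.isna(result.loc[1, 'A'])
```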

@TomAugspurger
Contributor

@jcontesti can you open a new issue for that constructor bug?

@jcontesti
Contributor

@TomAugspurger Could this already be covered by #24845? It's a very similar bug. Let me know if you want me to open a new issue anyway.

@rbenes
Contributor

rbenes commented Apr 13, 2019

I tried to investigate this issue. I agree with the previous findings that the problem is in union. But what is the expected behavior of union in general? I analyzed it on a normal Index, not a Categorical, and I see that the results differ depending on the sort parameter. See below:

>>> import pandas as pd
>>> import numpy as np
>>> i1 = pd.Index([1, 3])
>>> i2 = pd.Index([2, 3, 3])
>>> i_u = i1.union(i2, sort=False)
>>> print(i_u)
Int64Index([1, 3, 2], dtype='int64')
>>> i_u = i1.union(i2, sort=None)
>>> print(i_u)
Int64Index([1, 2, 3, 3], dtype='int64')

Which is correct?

@jreback jreback modified the milestones: Contributions Welcome, 1.1 Jan 20, 2020