Concatenation of category value counts mixes up the index order #14600

jbao · 2016-11-07T09:24:01Z

Let's say we have two Series s1 and s2, which could be the output of the pd.value_counts() function, and we want to combine them into one DataFrame:

s1 = pd.Series([39,6,4], index=pd.CategoricalIndex(['female','male','unknown']))
s2 = pd.Series([2,152,2,242,150], index=pd.CategoricalIndex(['f','female','m','male','unknown']))
pd.DataFrame([s1,s2])

The result is

    female   male  unknown      f      m
0     NaN   39.0      NaN    6.0    4.0
1     2.0  152.0      2.0  242.0  150.0

where the values in the first row end up under the wrong category columns.

And the current workaround is

pd.DataFrame([pd.Series(s1.values,index=s1.index.astype(list)),pd.Series(s2.values,index=s2.index.astype(list))])

which gives the correct result

     f  female    m   male  unknown
0  NaN    39.0  NaN    6.0      4.0
1  2.0   152.0  2.0  242.0    150.0
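
A variant of the workaround above, as a minimal sketch. It assumes `astype(object)` in place of `astype(list)` (which recent pandas versions may not accept) and adds an explicit `sort_index(axis=1)` so that the column order does not depend on how the constructor combines the indexes. Both choices, and the helper name `with_object_index`, are assumptions and not part of the original report.

```python
import pandas as pd

s1 = pd.Series([39, 6, 4], index=pd.CategoricalIndex(['female', 'male', 'unknown']))
s2 = pd.Series([2, 152, 2, 242, 150],
               index=pd.CategoricalIndex(['f', 'female', 'm', 'male', 'unknown']))


def with_object_index(s):
    # Hypothetical helper: keep the values, swap the CategoricalIndex for plain labels
    return pd.Series(s.to_numpy(), index=s.index.astype(object))


# Build the frame from the object-indexed Series, then sort the columns explicitly
df = pd.DataFrame([with_object_index(s1), with_object_index(s2)]).sort_index(axis=1)
print(df)
```

Checking a few cells by label (for example `df.loc[0, 'female']`) is a quick way to confirm that each count landed under the right category.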

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.0-31-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: None
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None


jorisvandenbossche · 2016-11-07T10:35:17Z

The underlying reason is possibly that union does not return a sorted result in case of Categoricals (which is a bug I think):

In [116]: s1.index.union(s2.index)    ## <--- not sorted result
Out[116]: CategoricalIndex(['female', 'male', 'unknown', 'f', 'm'], categories=['f', 'female', 'm', 'male', 'unknown'], ordered=False, dtype='category')

In [117]: s1.index.union(s2.index).sort_values()
Out[117]: CategoricalIndex(['f', 'female', 'm', 'male', 'unknown'], categories=['f', 'female', 'm', 'male', 'unknown'], ordered=False, dtype='category')

In [119]: s1.index.astype(object).union(s2.index.astype(object))    ## <--- sorted result
Out[119]: Index(['f', 'female', 'm', 'male', 'unknown'], dtype='object')
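
The object-dtype path from the last session line can be reproduced as a standalone script. This is only a sketch of that one call. It makes no claim about what the categorical `union` returns, since that is the behavior under discussion and may differ between pandas versions.

```python
import pandas as pd

idx1 = pd.CategoricalIndex(['female', 'male', 'unknown'])
idx2 = pd.CategoricalIndex(['f', 'female', 'm', 'male', 'unknown'])

# Convert both indexes to plain object labels first, then take the union;
# with the default sort=None the labels come back sorted
combined = idx1.astype(object).union(idx2.astype(object))
print(list(combined))
```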

nathalier · 2017-01-10T16:02:54Z

Hello!
I looked at this bug and found that the source of this particular error is in the following section of code (in union()):

if self.is_monotonic and other.is_monotonic:
   try:
      result = self._outer_indexer(self._values, other._values)[0]
   except TypeError:
      # incomparable objects
.......

where _outer_indexer() expects an ndarray, not a Categorical. That causes the TypeError.
Skipping this section for categorical values resolves the bug.

But after I wrote tests, another problem was discovered.
In union(), the indices of the differing values (of self and other) are determined from the categories list, while the differing values themselves are taken from the values of the CategoricalIndex. Thus an incorrect mask is used to compute the differences, and if the order of the CategoricalIndex values does not match the order of the categories list, various errors appear.
I'm not sure I'm clear enough :-)
An example could be as follows:

s1 = pd.Series([6, 39, 4], index=pd.CategoricalIndex(['male', 'female', 'unknown']))
s2 = pd.Series([150, 2, 1, 242], index=pd.CategoricalIndex(['unknown', 'm', 'f', 'male'],
                categories=['m', 'male', 'f', 'unknown'], ordered=True))
res = pd.DataFrame([s1, s2])

This causes IndexError: list index out of range.

So on what basis should union() work: on the index values or on the underlying categories? Or should both be joined, taking into account that the list of categories may be wider than the list of index values? Or maybe it is worth sorting the indices first and then passing them to union()?
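
The distinction between an index's values and its categories list can be made concrete with the second index from the example. This is a small illustrative sketch that only inspects the index. It does not reproduce the internals of `union()`.

```python
import pandas as pd

ci = pd.CategoricalIndex(['unknown', 'm', 'f', 'male'],
                         categories=['m', 'male', 'f', 'unknown'], ordered=True)

values = list(ci)                        # labels in index order
categories = list(ci.categories)         # labels in category order
codes = list(ci.codes)                   # position of each value within `categories`
sorted_values = list(ci.sort_values())   # ordered by category order, not alphabetically

print(values, categories, codes, sorted_values, sep='\n')
```

Any logic that computes positions against `categories` but then applies them to `values` (or the other way round) mixes two different orderings whenever `codes` is not already in ascending order.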

jreback · 2017-01-11T00:04:45Z

@nathalier thanks for having a look!

Yes, both of these things appear wrong. The general principle that we try to follow is that Categoricals (or CategoricalIndex) can be combined if they are dtype equal (is_dtype_equal is True). IOW, their categories and ordered flags match.

Then they should be combined and stay categoricals.

Otherwise we still allow combinations, BUT coerce to object, then follow union / diff logic.

So happy to have more tests / fixes.
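
The rule described above can be sketched as a small standalone function. `combine_categorical_indexes` is a hypothetical helper written for illustration, not pandas' implementation. It assumes that comparing the two `dtype` objects with `==` is an adequate stand-in for the dtype-equality check.

```python
import pandas as pd


def combine_categorical_indexes(left, right):
    """Hypothetical helper illustrating the rule; not pandas' own code."""
    # Union of the plain labels; with the default sort=None they are sorted as strings
    labels = left.astype(object).union(right.astype(object))
    if left.dtype == right.dtype:
        # Same categories and ordered flag: the result stays categorical
        return pd.CategoricalIndex(labels, dtype=left.dtype)
    # Different dtypes: fall back to an object-dtype Index
    return labels


same_a = pd.CategoricalIndex(['x', 'y'], categories=['x', 'y', 'z'])
same_b = pd.CategoricalIndex(['z', 'y'], categories=['x', 'y', 'z'])
stays_categorical = combine_categorical_indexes(same_a, same_b)

diff_a = pd.CategoricalIndex(['A', 'B'])
diff_b = pd.CategoricalIndex(['B', 'C'])
becomes_object = combine_categorical_indexes(diff_a, diff_b)

print(stays_categorical)
print(becomes_object)
```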

jcontesti · 2018-06-09T16:55:41Z

I'm starting to work on this issue.

jcontesti · 2018-08-15T20:06:07Z

This bug was fixed in version 0.20.1.

Using the current development version, this code:

s1 = pd.Series([39,6,4], index=pd.CategoricalIndex(['female','male','unknown']))
s2 = pd.Series([2,152,2,242,150], index=pd.CategoricalIndex(['f','female','m','male','unknown']))
pd.DataFrame([s1,s2])

returns the correct result:

   female   male  unknown    f    m
0    39.0    6.0      4.0  NaN  NaN
1   152.0  242.0    150.0  2.0  2.0

And nathalier's code runs without error too:

s1 = pd.Series([6, 39, 4], index=pd.CategoricalIndex(['male', 'female', 'unknown']))
s2 = pd.Series([150, 2, 1, 242], index=pd.CategoricalIndex(['unknown', 'm', 'f', 'male'],
                categories=['m', 'male', 'f', 'unknown'], ordered=True))
pd.DataFrame([s1, s2])

The result is:

     f  female    m   male  unknown
0  NaN    39.0  NaN    6.0      4.0
1  1.0     NaN  2.0  242.0    150.0

Please, let me know if I can help with anything else!

jorisvandenbossche · 2018-08-15T21:36:14Z

@jcontesti Thanks for checking this!
To close this issue, it would be good to still add some tests to ensure the cases from this issue keep working (or at least investigate if such tests have been added when it was fixed). PR with tests welcome!
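
A regression test for the first case in this issue might look roughly like the sketch below. The test name is hypothetical, and this is not the test that was eventually added to pandas. It checks values by label only, so it does not depend on the resulting column order.

```python
import pandas as pd


def test_frame_from_series_with_categorical_index():
    # Hypothetical test name; values are checked by label, not by position
    s1 = pd.Series([39, 6, 4],
                   index=pd.CategoricalIndex(['female', 'male', 'unknown']))
    s2 = pd.Series([2, 152, 2, 242, 150],
                   index=pd.CategoricalIndex(['f', 'female', 'm', 'male', 'unknown']))

    df = pd.DataFrame([s1, s2])

    assert set(df.columns) == {'f', 'female', 'm', 'male', 'unknown'}
    # Every count must sit under the category it was counted for
    for label, value in zip(s1.index, s1):
        assert df.loc[0, label] == value
    for label, value in zip(s2.index, s2):
        assert df.loc[1, label] == value
    # Categories missing from s1 must be NaN in the first row
    assert df.loc[0, ['f', 'm']].isna().all()


test_frame_from_series_with_categorical_index()
```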

jcontesti · 2018-12-26T14:51:19Z

I'll continue with this one.

jcontesti · 2019-01-20T16:15:05Z

Hi, I have the tests ready to commit, but now the bug is back :-(

This code:

s1 = pd.Series([1, 2], index=pd.CategoricalIndex(['A', 'B']))
s2 = pd.Series([3, 4], index=pd.CategoricalIndex(['B', 'C']))
pd.DataFrame([s1,s2])

returns:

     A    B  NaN
0  1.0  2.0  NaN
1  NaN  3.0  NaN

instead of:

     A    B    C
0  1.0  2.0  NaN
1  NaN  3.0  4.0

Version 0.23.4 executes it correctly, but v0.24.0rc1 and the development version fail. Recall that this bug had been fixed since version 0.20.1.

I can help with the solution, but I need some guidance on how to proceed, since I know little about the internals of this project.

Thank you!

TomAugspurger · 2019-01-20T17:37:00Z

@jcontesti can you open a new issue for that constructor bug?

jcontesti · 2019-01-29T21:55:33Z

@TomAugspurger Could it already be covered by #24845? It's a very similar bug. Let me know if you want me to open a new issue anyway.

rbenes · 2019-04-13T19:29:48Z

I tried to investigate this issue. I agree with the previous findings that the problem is in union. But what is the expected behavior of union in general? I tried to analyze it on a normal Index, not a Categorical, and I see that the results differ depending on the sort parameter. See below:

>>> import pandas as pd
>>> import numpy as np
>>> i1 = pd.Index([1, 3])
>>> i2 = pd.Index([2, 3, 3])
>>> i_u = i1.union(i2, sort=False)
>>> print(i_u)
Int64Index([1, 3, 2], dtype='int64')
>>> i_u = i1.union(i2, sort=None)
>>> print(i_u)
Int64Index([1, 2, 3, 3], dtype='int64')

What is correct?
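
For reference, the effect of the `sort` parameter on its own can be seen in a case without duplicates. The sketch below deliberately avoids repeated values, because how duplicates should be handled is the open question here and may depend on the pandas version.

```python
import pandas as pd

left = pd.Index([3, 1])
right = pd.Index([2, 3])

# sort=False keeps the order of first appearance: left's values, then new ones from right
unsorted_union = left.union(right, sort=False)

# sort=None (the default) sorts the result when the values are comparable
sorted_union = left.union(right, sort=None)

print(list(unsorted_union), list(sorted_union))
```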

jorisvandenbossche added the Categorical label Nov 7, 2016

jbao mentioned this issue Nov 7, 2016

frequency table in the chi square test doesn't respect the order of categories zalando/expan#56

Closed

jorisvandenbossche added Bug Difficulty Novice labels Nov 23, 2016

jorisvandenbossche added this to the Next Major Release milestone Nov 23, 2016

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

jorisvandenbossche added Needs Tests and removed Effort Low labels Aug 15, 2018

mroeschke mentioned this issue Jan 20, 2020

TST: Add regression tests for fixed issues #31161

Merged

10 tasks

jreback modified the milestones: Contributions Welcome, 1.1 Jan 20, 2020

mroeschke closed this as completed in #31161 Jan 21, 2020
