Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix for TypeError: unorderable types" in when using set_index with multiple column names #22072

Merged
merged 13 commits into from
Aug 12, 2018
Merged

fix for TypeError: unorderable types" in when using set_index with multiple column names #22072

merged 13 commits into from
Aug 12, 2018

Conversation

dickreuter
Copy link
Contributor

@dickreuter dickreuter commented Jul 27, 2018

@dickreuter dickreuter changed the title fix for issue #15457 for for TypeError: unorderable types" in when using set_index with list Jul 27, 2018
@dickreuter dickreuter changed the title for for TypeError: unorderable types" in when using set_index with list for for TypeError: unorderable types" in when using set_index with multiple column names Jul 27, 2018
@dickreuter dickreuter changed the title for for TypeError: unorderable types" in when using set_index with multiple column names fix for TypeError: unorderable types" in when using set_index with multiple column names Jul 27, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 27, 2018

Going to need tests for this as well.

@gfyoung gfyoung requested review from jreback and toobaz July 27, 2018 04:39
@toobaz
Copy link
Member

toobaz commented Jul 27, 2018

closes #16718

Thanks for your attempt! But is this really the bug you're after?! (feel free to edit the description)

@codecov
Copy link

codecov bot commented Jul 27, 2018

Codecov Report

Merging #22072 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22072   +/-   ##
=======================================
  Coverage   92.08%   92.08%           
=======================================
  Files         169      169           
  Lines       50694    50694           
=======================================
  Hits        46682    46682           
  Misses       4012     4012
Flag Coverage Δ
#multiple 90.49% <100%> (ø) ⬆️
#single 42.34% <100%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.75% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0370740...213c4ea. Read the comment docs.

Copy link
Member

@toobaz toobaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is good, some comments.

@@ -430,7 +430,7 @@ Bug Fixes
Categorical
^^^^^^^^^^^

-
- Fixes an issue with creating multiple indices with for example .set_index([a,b]) giving an error "TypeError: unorderable types" (:issue:'15457')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not telling the user what the problem really was (that example given works fine in 99% of cases). Moreover, the bug (in the sense of API misbehavior) is unrelated to categorical Consider e.g. Fix issue when creating :class:`MultiIndex` from mixed type arrays including tuples (:issue:'15457')

@@ -2520,7 +2520,7 @@ def _factorize_from_iterable(values):
ordered=values.ordered)
codes = values.codes
else:
cat = Categorical(values, ordered=True)
cat = Categorical(values, ordered=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart fix... but not really simple to understand. Please add a comment stating that the value of ordered is irrelevant since we don't use cat as such, but only the resulting categories, the order of which is independent from ordered.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always write tests first - how else can you tell if you fixed something?

@pep8speaks
Copy link

pep8speaks commented Jul 27, 2018

Hello @dickreuter! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 11, 2018 at 12:07 Hours UTC

@dickreuter
Copy link
Contributor Author

dickreuter commented Jul 27, 2018

@dickreuter
Copy link
Contributor Author

dickreuter commented Jul 28, 2018

Would like to check it locally but building pandas seems impossible (cl.exe failed with exit status 2). Is there no additional info what version of visual studio dev tools is needed exactly? https://stackoverflow.com/questions/51566895/unable-to-build-pandas-from-source-cl-exe-failed-with-exit-status-2

@dickreuter
Copy link
Contributor Author

I don't see how the error on the ci is related to my change (especially also since the tests passed yesterday)

@toobaz
Copy link
Member

toobaz commented Jul 28, 2018

I don't see how the error on the ci is related to my change (especially also since the tests passed yesterday)

At least one is explained by

pandas/tests/arrays/categorical/test_indexing.py:131:41: W292 no newline at end of file

(see also the automatic comment by @pep8speaks )

@dickreuter
Copy link
Contributor Author

dickreuter commented Jul 28, 2018

Still getting an error which I can't see how it's related to this: BufferError: could not get scalar buffer information. Running this test locally passes fine.

@@ -51,7 +51,7 @@ Bug Fixes

**Categorical**

-
- Fix issue when creating :class:`MultiIndex` from mixed type arrays including tuples (:issue:'15457')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be even more explicit here, a user won't under stand this.

@@ -2520,7 +2520,10 @@ def _factorize_from_iterable(values):
ordered=values.ordered)
codes = values.codes
else:
cat = Categorical(values, ordered=True)
# The value of ordered is irrelevant since we don't use cat as such,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue reference

@@ -121,3 +121,11 @@ def test_get_indexer_non_unique(self, idx_values, key_values, key_class):

tm.assert_numpy_array_equal(expected, result)
tm.assert_numpy_array_equal(exp_miss, res_miss)


class TestSetIndex(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put in pandas/tests/frame/test_alter_axes.py.

put in a full comparison on the result table and use assert_fram_equal

Copy link
Contributor Author

@dickreuter dickreuter Jul 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df = DataFrame([[2, 1, 2], [4, (1, 2), 3]]).set_index([0, 1])
gives me

Out[97]: 
          2
0 1        
2 1       2
4 (1, 2)  3

How I construct this directly from the DataFrame constructor? The first fow seems empty?
df.to_dict() gives {2: {(2, 1): 2, (4, (1, 2)): 3}}

I tried:

DataFrame(['',2,3],
          columns=['2'],
          index=[['0','2','4'], ['1','1','(1,2)']])

but it's not equal (even though it looks equal), but
DataFrame shape mismatch

[left]:  (2, 1)
[right]: (3, 1)

Implemented with dict instead:
assert df.to_dict() == {2: {(2, 1): 2, (4, (1, 2)): 3}}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's not equal (even though it looks equal), but

Yes, they look equal, but you can see from how you built them they are not. [0, 1] are the index names, not index values.

pd.DataFrame([2,3], columns=['2'],
             index=pd.MultiIndex.from_arrays([['2','4'], ['1','(1,2)']], names=[0, 1]))

is what you're looking for.

Copy link
Contributor Author

@dickreuter dickreuter Jul 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        df = DataFrame([[2, 1, 2], [4, (1, 2), 3]]).set_index([0, 1])
        assert_frame_equal(df, pd.DataFrame([2, 3],
                                            columns=['2'],
                                            index=pd.MultiIndex.
                                            from_arrays([['2', '4'],
                                                         ['1', '(1,2)']],
                                                        names=[0, 1])))

MultiIndex level [0] classes are not equivalent
[left]:  Int64Index([2, 4], dtype='int64', name=0)
[right]: Index(['2', '4'], dtype='object', name=0)

I also tried

df = DataFrame([[2, 1, 2], [4, (1, 2), 3]]).set_index([0, 1])
assert_frame_equal(df, pd.DataFrame([2, 3],
                                    columns=['2'],
                                    index=pd.MultiIndex.
                                    from_arrays([[2, 4],
                                                 ['1', '(1,2)']],
                                                names=[0, 1])))
Attribute "inferred_type" are different
[left]:  mixed-integer
[right]: string

Maybe the dictionary is a more feasible solution,.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also tried

The original df has no strings at all, just integers. Why are you putting strings in the other (e.g. '2', '(1,2)'?!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the two strings I get ValueError: setting an array element with a sequence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the two strings I get ValueError: setting an array element with a sequence.

OK, pd.MultiIndex.from_tuples([(2, 1), (4, (1,2))], names=[0, 1])

@@ -51,7 +51,7 @@ Bug Fixes

**Categorical**

-
- Fix issue when creating :class:`MultiIndex` from mixed type arrays including tuples (:issue:'15457'). In some cases df.set_index(['A','B']) raised a TypeError unorderable types when trying to set multiple columns as indices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.24

@@ -1186,3 +1186,11 @@ def test_set_axis_prior_to_deprecation_signature(self):
with tm.assert_produces_warning(FutureWarning):
result = df.set_axis(axis, list('abc'), inplace=False)
tm.assert_frame_equal(result, expected[axis])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't add a new class, rather put this test next to the other set_index tests

def test_set_index_tuple(self):
# test for fix of issue #15457, where the following raised a TypeError
df = DataFrame([[2, 1, 2], [4, (1, 2), 3]]).set_index([0, 1])
assert df.to_dict() == {2: {(2, 1): 2, (4, (1, 2)): 3}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assert_frame_equal, IOW you must construct the expected frame. use result= and expected= here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @dickreuter since the bug is in MultiIndex.from_arrays, I think that only reconstructing the index (as I suggested above, comparing from_arrays with from_tuples, which already works) rather than the full DataFrame - and moving the test to pandas/tests/indexes/multi/test_constructor.py - is OK (actually even better)

Copy link
Contributor Author

@dickreuter dickreuter Jul 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toobaz Comparing from_arrays with from_tuples? Not sure I understand. As you said, this already works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dickreuter from_tuples already works, from_arrays doesn't. With your patch, both do, so you can compare one to the other. Something like (untested):

result = pd.MultiIndex.from_arrays([[2, 4], [1, (1, 2)]])
expected = pd.MultiIndex.from_tuples([(2, 1), (4, (1,2))])
...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toobaz I reverted to the test via dict as otherwise line 908 of cast.py causes an error when trying to create an array from [1, (1, 2)], raising {ValueError}setting an array element with a sequence. This may be an unrelated issue but not sure the test can be constructed in that way at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dickreuter this has to be an assert on a frame, otherwise this test is not testing what we need here

Copy link
Member

@toobaz toobaz Aug 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ValueError}setting an array element with a sequence

@dickreuter You are right, sorry. This should work instead:

expected = pd.DataFrame([[2, 1], [4, (1, 2)]]).set_index([0, 1]).index
result = pd.MultiIndex.from_tuples([(2, 1), (4, (1,2))], names=(0, 1))

@jreback see my comment above, the bug is unrelated to DataFrame

@@ -430,7 +430,7 @@ Bug Fixes
Categorical
^^^^^^^^^^^

-
- Fix issue when creating :class:`MultiIndex` from mixed type arrays including tuples (:issue:'15457'). In some cases df.set_index(['A','B']) raised a TypeError unorderable types when trying to set multiple columns as indices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around TypeError, list the issue number here as well. Move this to MultiIndex section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want thesame issue number mentioned twice? I'm not aware of any second issue number.

@dickreuter
Copy link
Contributor Author

dickreuter commented Aug 1, 2018

@toobaz: I'm getting ValueError: setting an array element with a sequence when I test that. Not sure my fix is related to this bug?

@dickreuter
Copy link
Contributor Author

@toobaz I reverted to the test via dict as otherwise line 908 of cast.py causes an error when trying to create an array from [1, (1, 2)], raising {ValueError}setting an array element with a sequence.

@dickreuter
Copy link
Contributor Author

@toobaz @jreback What's the next step? Can this be merged? tx

@gfyoung
Copy link
Member

gfyoung commented Aug 8, 2018

https://travis-ci.org/pandas-dev/pandas/jobs/413344917#L3557

@dickreuter : You have linter failures. I would fix those first before pinging.

@jreback
Copy link
Contributor

jreback commented Aug 9, 2018

rebased

@toobaz merge when satisfied

@@ -626,7 +626,7 @@ MultiIndex

- Removed compatibility for :class:`MultiIndex` pickles prior to version 0.8.0; compatibility with :class:`MultiIndex` pickles from version 0.13 forward is maintained (:issue:`21654`)
- :meth:`MultiIndex.get_loc_level` (and as a consequence, ``.loc`` on a :class:``MultiIndex``ed object) will now raise a ``KeyError``, rather than returning an empty ``slice``, if asked a label which is present in the ``levels`` but is unused (:issue:`22221`)
-
- Fix `TypeError` in Python 3 when creating `MultiIndex` in which some columns are of object type, e.g. when labels are tuples (:issue:'15457')
Copy link
Member

@toobaz toobaz Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from my comment on "object type" ( #22072 ), you have lost the ":class:" before "MultiIndex", and you still have quotes rather than backticks around the issue number

@@ -626,7 +626,7 @@ MultiIndex

- Removed compatibility for :class:`MultiIndex` pickles prior to version 0.8.0; compatibility with :class:`MultiIndex` pickles from version 0.13 forward is maintained (:issue:`21654`)
- :meth:`MultiIndex.get_loc_level` (and as a consequence, ``.loc`` on a :class:``MultiIndex``ed object) will now raise a ``KeyError``, rather than returning an empty ``slice``, if asked a label which is present in the ``levels`` but is unused (:issue:`22221`)
-
- Fix `TypeError` in Python 3 when creating :class: `MultiIndex` in which some columns are of object type, e.g. when labels are tuples (:issue:`15457`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dickreuter OK, I think I see your point. What I'm saying is "not all object" dtype give problems, what you are saying is "other object dtypes than tuples give problems". I think a more precise description then is "in which some levels have mixed types", as I guess happens in your case (no error comes if all labels are ctype items, right?).

Then i think we're ready.

Copy link
Contributor Author

@dickreuter dickreuter Aug 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toobaz I believe in the example I first hit the error there where 3 columns. All of them were dtype and all the rows were homogenous of the same type as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@toobaz When you say mixed type, is the column name also relevant for that for each for? If yes, then I assume it’s a mixed type because the column names were strings in my case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your new version is almost ready (see my last comment), but for the records:

I believe in the example I first hit the error there where 3 columns. All of them were dtype and all the rows were homogenous of the same type as well.

The error only comes out when two elements (in a same column) cannot be ordered, which in Python 3 happens when they are of non-comparable types. If you got an error with columns which homogeneous types, I doubt it was the same error.
Notice that I'm referring to (Python) types, not dtypes. If you put mixed types in a single column, then its dtype will be just object, but it is still a mixed type column.

When you say mixed type, is the column name also relevant for that for each for?

No, column names are stored separately and should be unrelated to this bug and to any issue with mixed type columns/levels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume in my case the elements in the same column were unorderable not because they had multiple non-comparable types but because they contained just references to memory of python ctype objects.

@@ -626,7 +626,7 @@ MultiIndex

- Removed compatibility for :class:`MultiIndex` pickles prior to version 0.8.0; compatibility with :class:`MultiIndex` pickles from version 0.13 forward is maintained (:issue:`21654`)
- :meth:`MultiIndex.get_loc_level` (and as a consequence, ``.loc`` on a :class:``MultiIndex``ed object) will now raise a ``KeyError``, rather than returning an empty ``slice``, if asked a label which is present in the ``levels`` but is unused (:issue:`22221`)
-
- Fix `TypeError` in Python 3 when creating :class: `MultiIndex` in which some levels have mixed types, e.g. when labels are tuples (:issue:`15457`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there... "when some labels are tuples". If (for instance) all labels are tuples of ints, there is no problem.

@@ -626,7 +626,7 @@ MultiIndex

- Removed compatibility for :class:`MultiIndex` pickles prior to version 0.8.0; compatibility with :class:`MultiIndex` pickles from version 0.13 forward is maintained (:issue:`21654`)
- :meth:`MultiIndex.get_loc_level` (and as a consequence, ``.loc`` on a :class:``MultiIndex``ed object) will now raise a ``KeyError``, rather than returning an empty ``slice``, if asked a label which is present in the ``levels`` but is unused (:issue:`22221`)
-
- Fix `TypeError` in Python 3 when creating :class: `MultiIndex` in which some levels have mixed types. If (for instance) all labels are tuples of ints, there is no problem (:issue:`15457`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dickreuter the previous one was almost OK, this one is not... in the whatsnew, we don't need examples in which "there is no problem" :-)

Please revert to the previous one, just adding "some", that is

Fix `TypeError` in Python 3 when creating :class: `MultiIndex` in which some levels have mixed types, e.g. when some labels are tuples (:issue:`15457`)

@@ -626,7 +626,7 @@ MultiIndex

- Removed compatibility for :class:`MultiIndex` pickles prior to version 0.8.0; compatibility with :class:`MultiIndex` pickles from version 0.13 forward is maintained (:issue:`21654`)
- :meth:`MultiIndex.get_loc_level` (and as a consequence, ``.loc`` on a :class:``MultiIndex``ed object) will now raise a ``KeyError``, rather than returning an empty ``slice``, if asked a label which is present in the ``levels`` but is unused (:issue:`22221`)
-
- Fix `TypeError` in Python 3 when creating :class: `MultiIndex` in which some levels have mixed types, e.g. when some labels are tuples (:issue:`15457`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double backticks around TypeError, no space after :class:

@dickreuter
Copy link
Contributor Author

@toobaz Why does travis fail with "Unable to download 3.5 archive. The archive may not exist. Please consider a different version"?

@toobaz toobaz merged commit d09db1f into pandas-dev:master Aug 12, 2018
@toobaz
Copy link
Member

toobaz commented Aug 12, 2018

Thanks @dickreuter !

@dickreuter dickreuter deleted the feature/fix-#15457 branch August 12, 2018 08:29
@gfyoung
Copy link
Member

gfyoung commented Aug 12, 2018

@dickreuter : FYI, those Travis failures that you saw earlier are spurious. I restarted the builds to get them passing and this PR merged 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"TypeError: unorderable types" in Python3 when column for MultiIndex contains tuple and int
6 participants