-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for use of Enums in MultiIndex #21348
Conversation
Due to an unnecessary "orderedness" check in `arrays/categorical.py`'s method `_factorize_from_iterable`, MultiIndex's constructor inappropriately threw a `TypeError` when passed an Enum as one of its iterables. This has been fixed.
CircleCI failed due to unavailability of module |
@benediamond : I believe the build that's failing is a Python 2.7 build. I would use |
pandas/tests/indexes/test_multi.py
Outdated
# cannot actually be built yet. | ||
from pandas.core.algorithms import take_1d | ||
|
||
def _get_ilevel_values(index, level): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is all this for?
simply construct the expected index directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thought here is that the expected index cannot be constructed directly, due to the exact bug I'm trying to fix. Thus instead I create an expected "level", and verify that it matches the relevant level of df
's index. This is done by replicating a sub-step of tm.assert_index_equal
.
pandas/tests/indexes/test_multi.py
Outdated
values = unique._shallow_copy(filled, name=index.names[level]) | ||
return values | ||
|
||
from enum import Enum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all imports go at the top
pandas/tests/indexes/test_multi.py
Outdated
return values | ||
|
||
from enum import Enum | ||
MyEnum = Enum("MyEnum", "A B") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you would have to add enum34
(the 2.7 backport) in our 2.7 builds
(pandas) bash-3.2$ grep python=2.7 ci/*.yaml
appveyor-27.yaml: - python=2.7.*
circle-27-compat.yaml: - python=2.7*
travis-27-locale.yaml: - python=2.7
travis-27.yaml: - python=2.7*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I've hopefully done this correctly.
Codecov Report
@@ Coverage Diff @@
## master #21348 +/- ##
=========================================
- Coverage 92.28% 91.89% -0.4%
=========================================
Files 161 153 -8
Lines 51500 49596 -1904
=========================================
- Hits 47528 45576 -1952
- Misses 3972 4020 +48
Continue to review full report at Codecov.
|
Added `enum34` dependency
@gfyoung I'm stumped by this |
doc/source/whatsnew/v0.23.1.txt
Outdated
@@ -88,6 +88,7 @@ Indexing | |||
- Bug in :meth:`MultiIndex.set_names` where error raised for a ``MultiIndex`` with ``nlevels == 1`` (:issue:`21149`) | |||
- Bug in :class:`IntervalIndex` constructors where creating an ``IntervalIndex`` from categorical data was not fully supported (:issue:`21243`, issue:`21253`) | |||
- Bug in :meth:`MultiIndex.sort_index` which was not guaranteed to sort correctly with ``level=1``; this was also causing data misalignment in particular :meth:`DataFrame.stack` operations (:issue:`20994`, :issue:`20945`, :issue:`21052`) | |||
- Bug in :func:`pandas.core.arrays.categorical._factorize_from_iterable` inappropriately caused ``TypeError`` to be raised when an ``Enum`` was used as a factor in a ``MultiIndex`` (:issue:`21298`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung I'm stumped by this whatsnew conflict, any thoughts?
@benediamond : I would actually move your whatsnew
addition to 0.24.0
. That will suffice. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone ahead and just removed it---I'm not sure where it belongs, as it appeared that this PR (when it was passing) was slated for the next milestone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benediamond : I would put it in 0.24.0
, as I suggested above. We will need a whatsnew
for this for sure, and that would be the best place to put it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung Done. this one didn't create merge conflicts, thankfully
pandas/tests/indexes/test_multi.py
Outdated
labels, fill_value=unique._na_value | ||
) | ||
values = unique._shallow_copy(filled, name=index.names[level]) | ||
return values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you use this function only once, I think this abstraction is unnecessary as @jreback alluded to in an earlier comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, amended. Thanks for your guidance.
pandas/tests/indexes/test_multi.py
Outdated
# GH 21298 | ||
# Allow use of Enums as one of the factors in a MultiIndex. | ||
from enum import Enum | ||
from pandas.core.algorithms import take_1d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "top", @jreback meant move the imports to "top of the file" 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah whoops. done
This nested function had been drawn directly from _within_ `pands.util.testing.assert_index_equal`. As there is no way to construct the target dataframe's (multi) index directly (hence this PR), I simply compare this index's _0th level_ to a manually constructed (simple) index. This comparison is given by a subroutine of `assert_index_equal` (which applies the subroutine to _each_ level).
pandas/core/arrays/categorical.py
Outdated
@@ -2523,7 +2523,7 @@ def _factorize_from_iterable(values): | |||
ordered=values.ordered) | |||
codes = values.codes | |||
else: | |||
cat = Categorical(values, ordered=True) | |||
cat = Categorical(values, ordered=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting that this didn't cause any testing issues...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's striking. It's possible that thatindexes.multi
is essentially the only caller of categorical._factorize_from_iterable
, and that the calls that multi
makes happen to be such that within Categorical
's constructor, control is always routed so that the only time that ordered
is used is to incorrectly raise the TypeError
on lines 352-354. That's my best-guess explanation, at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this looks pretty good to me!
cc @jreback
@jreback bumping this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a whatsnew note for 0.24
ci/circle-27-compat.yaml
Outdated
@@ -22,6 +22,7 @@ dependencies: | |||
# universal | |||
- pytest | |||
- pytest-xdist | |||
- enum34 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove from all but 1 of the 27 compat tests (the travis-27) one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/tests/indexes/test_multi.py
Outdated
@@ -3307,3 +3308,28 @@ def test_duplicate_multiindex_labels(self): | |||
with pytest.raises(ValueError): | |||
ind.set_levels([['A', 'B', 'A', 'A', 'B'], [2, 1, 3, -2, 5]], | |||
inplace=True) | |||
|
|||
def test_use_enum_in_multiindex(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use @td.skip_if_no('enum')
here (then do the actual import inside), we have to guard against the tests being run on 2.7 w/o the compat library (by a user).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/tests/indexes/test_multi.py
Outdated
filled = take_1d(unique.values, labels, fill_value=unique._na_value) | ||
df_index_0 = unique._shallow_copy(filled, name=df.columns.names[0]) | ||
exp_index_0 = pd.Index([MyEnum.A, MyEnum.A, MyEnum.B, MyEnum.B], | ||
dtype='object') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty complicated, why cannot you just directly construct the Index here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback implemented here: @td.skip_if_no('enum')
def test_use_enum_in_multiindex(self):
# GH 21298
# Allow use of Enums as one of the factors in a MultiIndex.
from enum import Enum
MyEnum = Enum("MyEnum", "A B")
... but yields:
not sure what's going on 😩 |
I think you need to |
closing as stale, if you'd like to continue, pls ping. |
@jreback, my understanding was that this commit was ready, but had been waiting for your approval. I found it hard to get responses from you. if you're ready to go, i'm ready to play my part. it should be almost there. |
i left multiple comments |
Hello @benediamond! Thanks for updating the PR.
|
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -132,7 +132,7 @@ Strings | |||
Indexing | |||
^^^^^^^^ | |||
|
|||
- | |||
- Bug in :func:`pandas.core.arrays.categorical._factorize_from_iterable` inappropriately caused ``TypeError`` to be raised when an ``Enum`` was used as a factor in a ``MultiIndex`` (:issue:`21298`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Release notes shouldn't reference private methods.
You should be able to reference Enum with :class:enum.Enum
, and :class:MultiIndex
.
@jreback and others: looks like this issue has since been fixed by PR #22072, see also issue #15457. I have gone ahead and preserved the test I wrote in my rebase, in case you'd like to merge it in---if not no worries and this (as well as issue #21298) can be closed! (note that issue #15457 was a somewhat different problem). thanks for everyone's time and patience. |
@benediamond closing per your latest comment. Ping if we need to reopen |
I want to point out that when merging dataframes with enums as multi-index it throws |
you could open a separate issue with a reproducible example but we have no official support for enums and would need a champion in the community |
test_use_enum_in_multiindex
intests/indexes/test_multi.py
, see diffgit diff upstream/master -u -- "*.py" | flake8 --diff
The boolean flag in
pandas/pandas/core/arrays/categorical.py
Lines 2526 to 2528 in 9f95f7d
TypeError
to be raised inpandas/pandas/core/arrays/categorical.py
Lines 352 to 354 in 9f95f7d
when an Enum was used in a MultiIndex. No tests fail when the flag is flipped back. This is fixed in this PR.