API: better error-handling for df.set_index #22486

h-vetinari · 2018-08-23T15:18:46Z

closes API: better error-handling for df.set_index #22484
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

split off of #22236, and so builds on top of it.

gfyoung · 2018-08-25T08:42:53Z

pandas/core/frame.py

+                            'following: valid column keys, Series, Index, '
+                            'MultiIndex, list or np.ndarray')
+
+        inplace = validate_bool_kwarg(inplace, 'inplace')


Leave the inplace validation where it was.

gfyoung · 2018-08-25T08:45:32Z

pandas/core/frame.py

+                                  list, np.ndarray)) for x in keys):
+            raise TypeError('keys may only contain a combination of the '
+                            'following: valid column keys, Series, Index, '
+                            'MultiIndex, list or np.ndarray')


This looks like a lot of iterations over the keys (iterating over col_labels is equivalent time-complexity-wise). Could we have a single for-loop where we iterate over keys and fail fast?

About the number of times iterating over the columns, it would IMO be very complicated and unreadable to do it in one loop, and - to me - not worth the performance gain, because the number of columns used for an index is generally so negligibly small, that traversing that list is a non-issue performance-wise.

@h-vetinari : That's fair, but I'm not sure I agree that the code would become that much more complicated e.g.:

    for x in keys:
        if not (is_scalar(x) or isinstance(x, tuple)):
            if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex,
                                  list, np.ndarray)):
                raise TypeError(...)
        else:
            if x not in self:
                raise KeyError(x)

How does that look?

Not bad, but doesn't handle the duplicate column keys yet.

I don't quite see the difference between what you wrote and what I did. Where do you handle duplicate column keys in your changes above?

Nevermind. Forgot that this case had been removed now by this PR

In that case, unless there are other objections, I would suggest using the single for-loop instead. It seems pretty readable IMO.
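For readers following along, the single-loop check discussed above can be sketched as a standalone function. This is a minimal sketch and not the code that was merged: `check_set_index_keys` is a hypothetical name, and the public `pd.Series` and `pd.Index` classes stand in for the internal `ABCSeries`, `ABCIndexClass` and `ABCMultiIndex` used in the diff.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_scalar


def check_set_index_keys(df, keys):
    """Hypothetical helper: validate set_index keys in a single pass."""
    if not isinstance(keys, list):
        keys = [keys]
    for x in keys:
        if is_scalar(x) or isinstance(x, tuple):
            # Scalars and tuples are treated as column labels.
            if x not in df.columns:
                raise KeyError(x)
        elif not isinstance(x, (pd.Series, pd.Index, list, np.ndarray)):
            # pd.MultiIndex is a subclass of pd.Index, so it is covered too.
            raise TypeError(
                "keys may only contain a combination of the following: "
                "valid column keys, Series, Index, MultiIndex, list or "
                "np.ndarray; got {}".format(type(x).__name__)
            )


df = pd.DataFrame({"a": [1, 2], "b": [11, 12]})
check_set_index_keys(df, ["a", [101, 102]])  # passes silently
```

The loop stops at the first offending key, which is the fail-fast behaviour asked for in the review; collecting all missing labels before raising would be a small variation on the same loop.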

gfyoung · 2018-08-25T08:47:56Z

pandas/tests/frame/conftest.py

@@ -0,0 +1,121 @@
+import pytest


A couple of things about this file:

Do you use all of these fixtures in your changes?

I would prefer if the naming is a little more consistent e.g.:

You have the word "frame" in some of your fixture names but not others

You have underscores between words in some names but not others

This is based on #22236, and so it will be easier to review once that is in.

I'm not sure I follow you here. How does adding currently unused fixtures in this PR get explained in #22236? I'm sensing you might want to break off some of the fixtures for a separate PR.

It's buried in the review there and related to #22471 - replacing the attributes of TestData with fixtures. I did it all in one go as a start towards #22471. It's a direct translation of the attributes, including their names.

Hmm...I'm sorry, but I think this is a little too much for one PR. Keep in mind that the purpose of the PR, per your title, is better error reporting for df.set_index. This is a lot of new code that is arguably not pertinent to that aim.

The diff of this PR will be tiny after #22236. Feel free to comment there about the changes of that PR

Fair enough. We can re-evaluate after #22236 gets merged.

FYI, I literally have the same questions in that PR as I do in this one for that file.

I'll answer in more detail in the other thread. But I already mentioned what it's about:

It's [...] related to #22471 - replacing the attributes of TestData with fixtures. I did it all in one go as a start towards #22471. It's a direct translation of the attributes, including their names.

h-vetinari · 2018-08-25T10:13:56Z

@gfyoung, thanks for the review. This is based on #22236, and so it will be easier to review once that is in.

About the number of times iterating over the columns, it would IMO be very complicated and unreadable to do it in one loop, and - to me - not worth the performance gain, because the number of columns used for an index is generally so negligibly small, that traversing that list is a non-issue performance-wise.

h-vetinari · 2018-08-27T07:33:03Z

@gfyoung

In that case, unless there are other objections, I would suggest using the single for-loop instead. It seems pretty readable IMO.

Incorporated your feedback, coming with the next commit after rebasing on top of #22236

pep8speaks · 2018-09-15T14:13:19Z

Hello @h-vetinari! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/frame.py !
There are no PEP8 issues in the file pandas/tests/frame/conftest.py !
There are no PEP8 issues in the file pandas/tests/frame/test_alter_axes.py !

Comment last updated on October 05, 2018 at 21:48 Hours UTC

h-vetinari

Now that #22236 is in, we can continue here. :)

h-vetinari · 2018-09-16T00:02:40Z

pandas/core/frame.py

                names.append(col.name)
-            elif isinstance(col, Index):
-                level = col
+            elif isinstance(col, ABCSeries):


The ABC forms are a left-over from the review of #22236 by @jreback (#22236 (comment)):

huh? pls use the ABC version, we do this everywhere else.

jreback · 2018-09-16T00:07:52Z

pandas/tests/frame/test_alter_axes.py

-            with tm.assert_raises_regex(KeyError, '.*'):
-                df.set_index(keys, drop=drop, append=append)
+            # can't drop same column twice
+            first_drop = False


why are you changing these?

these tests are not very hard to interpret

This changes because now we're not failing for drop=True and keys=['A', 'A'] (for example), see #22484

this test is very confusing. pls make it simpler.

The diff might be confusing, but it doesn't look so bad in practice:

pandas/pandas/tests/frame/test_alter_axes.py

Line 187 in e861dcd

# == gives ambiguous Boolean for Series

It doesn't get much simpler - to test against previously tested behaviour, I have to do the set_index-calls sequentially, but it's not possible to drop the same key twice (which this test is enabling), so I have to take that into account.
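The sequential comparison described here can be illustrated in isolation. The sketch below uses a made-up three-column frame and `drop=False`, so that no column is ever removed; it only shows that setting two keys at once should match setting them one after the other with `append=True`. The `drop=True` case with a repeated key is the one that needs the extra bookkeeping mentioned above and is not covered here.

```python
import pandas as pd
import pandas.testing as tm

# Illustrative frame; the column names are assumptions, not the PR's fixture.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [0.1, 0.2, 0.3]})

# Passing both keys at once ...
combined = df.set_index(["A", "B"], drop=False)

# ... should match setting them one after the other with append=True.
sequential = df.set_index("A", drop=False).set_index("B", drop=False, append=True)

tm.assert_frame_equal(combined, sequential)
```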

jreback · 2018-09-16T00:09:26Z

pandas/tests/frame/test_alter_axes.py

+        rgx = 'keys may only contain a combination of the following:.*'
+        # forbidden type, e.g. set
+        with tm.assert_raises_regex(TypeError, rgx):
+            df.set_index(set(df['A']), drop=drop, append=append)


why are you singling out an iterable (set) is there some reason?

No, just any of the types that are not allowed (I've tried to indicate as much by using "e.g.")

jreback · 2018-09-16T00:10:04Z

pandas/core/frame.py

@@ -3892,6 +3893,22 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
        if not isinstance(keys, list):
            keys = [keys]

+        missing = []
+        for x in keys:
+            if not (is_scalar(x) or isinstance(x, tuple)):


why this overly complicated condition? can u show a simple example which fails now?

This came out of the 2./3. outlined in #22484 and then the feedback from @gfyoung further up the thread

jreback

ok, theme is reasonable, but need to simplify the code to avoid special casing things

jreback · 2018-09-18T12:38:32Z

pandas/core/frame.py

+            if not (is_scalar(x) or isinstance(x, tuple)):
+                if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex,
+                                      list, np.ndarray)):
+                    raise TypeError('keys may only contain a combination of '


this is not true. either all have to be scalars, or all list-likes (yes they don't have to be the same, just list-convertible).

Please make this more readable from a user perspective. If I got this I would have no idea what to do.

See longer comment in the main thread.

Tuples are allowed because they might be valid column labels.
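As an illustration of why tuples cannot simply be treated as list-likes: with MultiIndex columns, every column label is itself a tuple. The frame below is made up for the example, and the behaviour shown is what I would expect from a reasonably recent pandas; treat it as a sketch rather than a guarantee across versions.

```python
import pandas as pd

# Tuple keys in the dict produce MultiIndex columns, so each label is a tuple.
df = pd.DataFrame({("x", "y"): [1, 2], ("x", "z"): [3, 4]})

# A bare tuple is not a list, so set_index treats it as one column label.
result = df.set_index(("x", "y"))

print(result.index.tolist())
print(result.columns.tolist())
```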

jreback · 2018-09-18T12:39:44Z

pandas/tests/frame/test_alter_axes.py

-            with tm.assert_raises_regex(KeyError, '.*'):
-                df.set_index(keys, drop=drop, append=append)
+            # can't drop same column twice
+            first_drop = False


this test is very confusing. pls make it simpler.

h-vetinari · 2018-09-18T17:04:37Z

this is not true. either all have to be scalars, or all list-likes (yes they don't have to be the same, just list-convertible).

This is wrong on both counts. On master, a mix of column labels and a list:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [11, 12]})
>>> df
   a   b
0  1  11
1  2  12
>>> df.set_index(['a', [101, 102]])
        b
a
1 101  11
2 102  12

Regarding the argument types, the list I wrote (and test) is exhaustive (see https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L3908) - currently only Series/Index/MultiIndex/list/ndarray are allowed, everything else gets tried as a frame key. This is one of the points in #22484.

>>> from pandas.core.dtypes.common import is_list_like
>>> LL = iter([21, 22])
>>> is_list_like(LL)
True
>>> df.set_index(LL)
KeyError: "None of [Int64Index([21, 22], dtype='int64')] are in the [columns]"

codecov · 2018-09-19T08:54:33Z

Codecov Report

Merging #22486 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22486      +/-   ##
==========================================
+ Coverage   92.19%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50956    50966      +10     
==========================================
+ Hits        46978    46988      +10     
  Misses       3978     3978

Flag	Coverage Δ
#multiple	`90.61% <100%> (ø)`	⬆️
#single	`42.27% <54.16%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.12% <100%> (+0.01%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5e06c84...42d5f2a.

h-vetinari · 2018-09-19T08:55:45Z

@jreback
Reworded the warning and rewrote the test - hopefully both clearer now.

h-vetinari · 2018-09-20T13:59:35Z

@jreback
Green

h-vetinari · 2018-09-23T16:12:05Z

@jreback
While you're at it with the reviewing, please don't forget this one. :)

jreback · 2018-09-23T16:53:39Z

pandas/core/frame.py

+        missing = []
+        for x in keys:
+            if not (is_scalar(x) or isinstance(x, tuple)):
+                if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex,


you can use is_list_like here

OK. How to deal with tuples then. Can be valid column keys and valid list-likes.

This would also change the capabilities of set_index. e.g. an iterator has is_list_like = True - do you want to support that? Currently, only the types that I'm instance-checking (plus column keys) are allowed.

I don't want to explicitly list types here as it's verbose and fragile; use is_list_like and not is_iterator if you must (though to be honest an iterator is actually ok)

I don't want to explicitly list types here as it's verbose and fragile

I have a feeling the implementation you request will be much more verbose, but OK

OK. How to deal with tuples then. Can be valid column keys and valid list-likes.

You want to try tuples as column keys first and then as an array (would make the most sense to me), or something else?
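To make the trade-off in this exchange concrete, the sketch below compares the explicit `isinstance` check with `is_list_like` for a handful of inputs. `EXPLICIT_TYPES` and `candidates` are names invented for the example, and the public `pd.Series`/`pd.Index` classes stand in for the ABC variants used in the diff.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_list_like

# Stand-in for the explicit type list in the diff (MultiIndex is an Index).
EXPLICIT_TYPES = (pd.Series, pd.Index, list, np.ndarray)

candidates = {
    "list": [1, 2],
    "ndarray": np.array([1, 2]),
    "Series": pd.Series([1, 2]),
    "Index": pd.Index([1, 2]),
    "tuple": (1, 2),
    "set": {1, 2},
    "iterator": iter([1, 2]),
    "str": "ab",
}

# Map each name to (passes explicit check, passes is_list_like).
results = {
    name: (isinstance(obj, EXPLICIT_TYPES), is_list_like(obj))
    for name, obj in candidates.items()
}

for name, (explicit, list_like) in results.items():
    print(name, explicit, list_like)
```

The two predicates disagree on tuples, sets and iterators, which are exactly the cases being argued about in this thread.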

jreback · 2018-09-23T16:55:09Z

pandas/tests/frame/test_alter_axes.py

            df.set_index([df['A'], df['B'], 'X'], drop=drop, append=append)

+        rgx = 'The parameter "keys" may only contain a combination of.*'


@jreback Changed. Please let me know what you want me to do about the allowed argument signature. Allowing all list-likes also means checking ndim==1 for np.ndarrays, etc.

See also:
#22486 (comment)
#22486 (comment)
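The `ndim` concern can be sketched as a small predicate. `is_one_dim_list_like` is a hypothetical helper written for this illustration, not a pandas function; it assumes that objects without an `ndim` attribute (plain lists, tuples) should be treated as one-dimensional.

```python
import numpy as np
from pandas.api.types import is_list_like


def is_one_dim_list_like(obj):
    """Hypothetical predicate: list-like and not multi-dimensional."""
    # Objects without an `ndim` attribute (lists, tuples) count as 1-D here.
    return is_list_like(obj) and getattr(obj, "ndim", 1) == 1


print(is_one_dim_list_like(np.array([1, 2])))   # 1-D array
print(is_one_dim_list_like(np.zeros((2, 2))))   # 2-D array
print(is_one_dim_list_like([1, 2]))             # plain list
print(is_one_dim_list_like("ab"))               # string is not list-like
```

Note that a nested list such as `[[1, 2], [3, 4]]` would still pass this check, since a plain list carries no `ndim`; catching that would need further inspection.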

jreback · 2018-09-25T13:38:29Z

pandas/core/frame.py

+        missing = []
+        for x in keys:
+            if not (is_scalar(x) or isinstance(x, tuple)):
+                if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex,


I don't want to explicitly list types here as it's verbose and fragile; use is_list_like and not is_iterator if you must (though to be honest an iterator is actually ok)

jreback · 2018-10-06T15:31:35Z

pandas/core/frame.py

+                # tuples that are not column keys are considered list-like,
+                # not considered missing
+                missing.append(col)
+            elif (not is_list_like(col) or isinstance(col, set)


i am not averse to excluding, but i also don't want ad infinitum special cases here. This code is already too complex. We accept an iterable, so this is a valid input, if the user wants to do it that's their issue; these are not currently excluded.

jreback · 2018-10-06T15:37:44Z

pandas/tests/frame/test_alter_axes.py

@@ -126,21 +131,29 @@ def test_set_index_pass_single_array(self, frame_of_index_cols,
        df.index.name = index_name

        key = box(df['B'])
-        # np.array and list "forget" the name of B
-        name = [None if box in [np.array, list] else 'B']
+        if box == list:


why exactly is a list interpreted differently here? this makes this test really really odd. I am worried something changed here, as this appears to work just fine in the master.

You will see above (line 123) that list wasn't tested - for precisely this reason. df.set_index(['A', 'B']) interprets A & B as column keys, so (assuming this df was length 2) it would not use the list ['A', 'B'] as the index. To do that, one would have to pass [['A', 'B']]. This PR proposes to add tests for the current behaviour.

@jreback
In case the above was not very clearly worded, this corresponds exactly to behaviour on master:

>>> df = pd.DataFrame(np.random.randn(2,3))
>>> df.set_index(['A', 'B'])
KeyError: 'A'
>>> df.set_index([['A', 'B']])
          0         1         2
A  1.962370 -1.150770  0.843600
B -0.417274  0.509781 -0.697802
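The same distinction as a self-contained script, using zeros instead of random numbers so the result is reproducible. The exact wording of the `KeyError` differs between pandas versions, so it is only caught here, not compared.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 3)))  # columns are 0, 1, 2

# A flat list is read as a list of column labels; 'A' and 'B' are not columns.
raised = False
try:
    df.set_index(["A", "B"])
except KeyError:
    raised = True

# A nested list supplies the values of a single index level.
result = df.set_index([["A", "B"]])
print(result.index.tolist())  # ['A', 'B']
```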

h-vetinari

Thanks for review; added responses

h-vetinari · 2018-10-06T16:25:41Z

pandas/core/frame.py

+                # tuples that are not column keys are considered list-like,
+                # not considered missing
+                missing.append(col)
+            elif (not is_list_like(col) or isinstance(col, set)


The larger issue IMO is that is_list_like(<set>) should be False and not True. It's the only exclusion here, and it makes sense. On many issues, pandas prides itself on enforcing sensible defaults and behaviour. Why would we want to enable something that's broken by design?

I may also add that this is not something that's currently allowed, and it shouldn't be IMO

h-vetinari · 2018-10-06T16:28:57Z

pandas/tests/frame/test_alter_axes.py

@@ -126,21 +131,29 @@ def test_set_index_pass_single_array(self, frame_of_index_cols,
        df.index.name = index_name

        key = box(df['B'])
-        # np.array and list "forget" the name of B
-        name = [None if box in [np.array, list] else 'B']
+        if box == list:


You will see above (line 123) that list wasn't tested - for precisely this reason. df.set_index(['A', 'B']) interprets A & B as column keys, so (assuming this df was length 2) it would not use the list ['A', 'B'] as the index. To do that, one would have to pass [['A', 'B']]. This PR proposes to add tests for the current behaviour.

h-vetinari · 2018-10-07T23:22:27Z

@jreback
PTAL here :)

jreback · 2018-10-09T11:39:44Z

@h-vetinari why don't you try (separate PR) excluding set from is_list_like and see what the implications of that are.

h-vetinari · 2018-10-09T22:59:51Z

@jreback

@h-vetinari why don't you try (separate PR) excluding set from is_list_like and see what the implications of that are.

I did, seems doable: #23061 #23065
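A short sketch of what the second of those PRs provides, assuming a pandas version that includes the `allow_sets` keyword from #23065. The sample values are made up; the first print illustrates why a set is an awkward index source (duplicates are dropped and there is no positional order to align with rows).

```python
from pandas.api.types import is_list_like

values = [3, 1, 1, 2]
as_set = set(values)

# A set drops duplicates, so its length no longer matches the row count.
print(len(values), len(as_set))  # 4 3

print(is_list_like(as_set))                    # True: sets count by default
print(is_list_like(as_set, allow_sets=False))  # False: sets excluded
print(is_list_like(values, allow_sets=False))  # True: lists are unaffected
```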

h-vetinari · 2018-10-18T20:45:31Z

@jreback
All green. After merging and incorporating #23065, all comments here should now be addressed.

jreback · 2018-10-19T13:13:41Z

thanks!

h-vetinari mentioned this pull request Aug 23, 2018

TST/CLN: break up & parametrize tests for df.set_index #22236

Merged

h-vetinari force-pushed the df_set_index_warn branch 2 times, most recently from 964e79c to 4027825 Compare August 23, 2018 15:24

gfyoung added the Indexing (Related to indexing on series/frames, not to indexes themselves) and Error Reporting (Incorrect or improved errors from pandas) labels Aug 25, 2018

gfyoung reviewed Aug 25, 2018

h-vetinari mentioned this pull request Aug 28, 2018

ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645

Closed

h-vetinari force-pushed the df_set_index_warn branch from 4027825 to 04928d9 Compare September 15, 2018 14:13

h-vetinari force-pushed the df_set_index_warn branch from 04928d9 to 229e72d Compare September 15, 2018 15:56

h-vetinari commented Sep 16, 2018

jreback requested changes Sep 16, 2018

h-vetinari force-pushed the df_set_index_warn branch from fecb731 to a5f6a3e Compare September 16, 2018 00:20

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Sep 18, 2018

Rebased version of pandas-dev#22486

a6c708a

h-vetinari mentioned this pull request Sep 18, 2018

ENH: Add set_index to Series #22225

Closed

5 tasks

jreback requested changes Sep 18, 2018

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Sep 18, 2018

Rebased version of pandas-dev#22486

e997faf

jreback requested changes Sep 23, 2018

jreback requested changes Sep 25, 2018

h-vetinari added 2 commits September 26, 2018 18:42

API: better error-handling for df.set_index

28ec3a9

Review (jreback)

8dd1453

jreback requested changes Oct 6, 2018

h-vetinari commented Oct 6, 2018

h-vetinari closed this Oct 7, 2018

h-vetinari reopened this Oct 7, 2018

Improve comment

47c4e74

This was referenced Oct 9, 2018

API: set should not be considered list_like #23061

Closed

Add allow_sets-kwarg to is_list_like #23065

Merged

Merge remote-tracking branch 'upstream/master' into df_set_index_warn

06ed33a

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Oct 16, 2018

Rebased version of pandas-dev#22486

8ab863b

h-vetinari added 4 commits October 18, 2018 08:24

Merge remote-tracking branch 'upstream/master' into df_set_index_warn

9871be2

Retrigger Circle

9da6e6c

Merge remote-tracking branch 'upstream/master' into df_set_index_warn

361a905

Incorporate allow_sets-kwarg for is_list_like

42d5f2a

jreback approved these changes Oct 19, 2018

jreback merged commit 145c227 into pandas-dev:master Oct 19, 2018

h-vetinari deleted the df_set_index_warn branch October 22, 2018 14:16

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

API: better error-handling for df.set_index (pandas-dev#22486)

b0966b3

This was referenced Jan 9, 2019

API: capabilities of df.set_index #24046

Open

DEPR/API: disallow lists within list for set_index #24697

Closed

DOC: update DF.set_index #24762

Merged

jorisvandenbossche mentioned this pull request Jan 28, 2019

Regression in DataFrame.set_index with class instance column keys #24969

Closed

h-vetinari mentioned this pull request Jan 28, 2019

API/ERR: allow iterators in df.set_index & improve errors #24984

Merged

3 tasks

h-vetinari added a commit to h-vetinari/pandas that referenced this pull request Feb 1, 2019

Revert "API: better error-handling for df.set_index (pandas-dev#22486)"

4a211e9

This reverts commit 145c227.

This was referenced Feb 1, 2019

Revert set_index inspection/error handling for 0.24.1 #25085

Merged

CLN: Use ABCs in set_index #25128

Merged
