-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: better error-handling for df.set_index #22486
Conversation
964e79c
to
4027825
Compare
pandas/core/frame.py
Outdated
'following: valid column keys, Series, Index, ' | ||
'MultiIndex, list or np.ndarray') | ||
|
||
inplace = validate_bool_kwarg(inplace, 'inplace') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave the inplace
invalidation where it was.
pandas/core/frame.py
Outdated
list, np.ndarray)) for x in keys): | ||
raise TypeError('keys may only contain a combination of the ' | ||
'following: valid column keys, Series, Index, ' | ||
'MultiIndex, list or np.ndarray') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a lot of iterations over the keys
(iterating over col_labels
is equivalent time-complexity-wise). Could we have a single for-loop where we iterate over keys
and fail fast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About the number of times iterating over the columns, it would IMO be very complicated and unreadable to do it in one loop, and - to me - not worth the performance gain, because the number of columns used for an index is generally so negligibly small, that traversing that list is a non-issue performance-wise.
@h-vetinari : That's fair, but I'm not sure I agree that the code would become that much more complicated e.g.:
for x in keys:
if not (is_scalar(x) or isinstance(x, tuple)):
if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex, list, np.ndarray):
raise TypeError(...)
else:
if x not in self:
raise KeyError(x)
How does that look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not bad, but doesn't handle the duplicate column keys yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite see the difference between what you wrote and what I did. Where do you handle duplicate column keys in your changes above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind. Forgot that this case had been removed now by this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, unless there are other objections, I would suggest using the single for-loop instead. It seems pretty readable IMO.
pandas/tests/frame/conftest.py
Outdated
@@ -0,0 +1,121 @@ | |||
import pytest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of things about this file:
- Do you use all of these fixtures in your changes?
- I would prefer if the naming is a little more consistent e.g.:
- You have the word "frame" is some of your fixture names but not othres
- You have underscores between words in some names but not others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...I'm sorry, but I think this is a little too much for one PR. Keep in mind that the purpose of the PR, per your title, is better error reporting for df.set_index
. This is a lot of new code that is arguably not pertinent to that aim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff of this PR will be tiny after #22236. Feel free to comment there about the changes of that PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. We can re-evaluate after #22236 gets merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, I literally have the same questions in that PR as I do in this one for that file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung, thanks for the review. This is based on #22236, and so it will be easier to review once that is in. About the number of times iterating over the columns, it would IMO be very complicated and unreadable to do it in one loop, and - to me - not worth the performance gain, because the number of columns used for an index is generally so negligibly small, that traversing that list is a non-issue performance-wise. |
4027825
to
04928d9
Compare
Hello @h-vetinari! Thanks for updating the PR.
Comment last updated on October 05, 2018 at 21:48 Hours UTC |
04928d9
to
229e72d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that #22236 is in, we can continue here. :)
pandas/core/frame.py
Outdated
names.append(col.name) | ||
elif isinstance(col, Index): | ||
level = col | ||
elif isinstance(col, ABCSeries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ABC forms are a left-over from the review of #22236 by @jreback (#22236 (comment)):
huh? pls use the ABC version, we do this everywhere else.
with tm.assert_raises_regex(KeyError, '.*'): | ||
df.set_index(keys, drop=drop, append=append) | ||
# can't drop same column twice | ||
first_drop = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you changing these?
these tests are not very hard to interpret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This changes because now we're not failing for drop=True
and keys=['A', 'A']
(for example), see #22484
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests is very confusing. pls make it simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff might be confusing, but it doesn't look so bad in practice:
pandas/pandas/tests/frame/test_alter_axes.py
Line 187 in e861dcd
# == gives ambiguous Boolean for Series |
It doesn't get much simpler - to test against previously tested behaviour, I have to do the set_index
-calls sequentially, but it's not possible to drop the same key twice (which this test is enabling), so I have to take that into account.
rgx = 'keys may only contain a combination of the following:.*' | ||
# forbidden type, e.g. set | ||
with tm.assert_raises_regex(TypeError, rgx): | ||
df.set_index(set(df['A']), drop=drop, append=append) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you singling out an iterable (set) is there some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just any of the types that are not allowed (I've tried to indicate as much by using "e.g.")
pandas/core/frame.py
Outdated
@@ -3892,6 +3893,22 @@ def set_index(self, keys, drop=True, append=False, inplace=False, | |||
if not isinstance(keys, list): | |||
keys = [keys] | |||
|
|||
missing = [] | |||
for x in keys: | |||
if not (is_scalar(x) or isinstance(x, tuple)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this overly complicated condition? can u show a simple example which fails now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fecb731
to
a5f6a3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, theme is reasonable, but need to simplify the code to avoid special casing things
pandas/core/frame.py
Outdated
if not (is_scalar(x) or isinstance(x, tuple)): | ||
if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex, | ||
list, np.ndarray)): | ||
raise TypeError('keys may only contain a combination of ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not true. either all have to be scalars, or all list-likes (yes they don't have to be the same, just list-convertible).
Please make this more readable from a user perspective. If I got this I would have no idea what to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See longer comment in the main thread.
Tuples are allowed because they might be valid column labels.
with tm.assert_raises_regex(KeyError, '.*'): | ||
df.set_index(keys, drop=drop, append=append) | ||
# can't drop same column twice | ||
first_drop = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests is very confusing. pls make it simpler.
This is wrong on both counts. On master, a mix of column labels and a list:
Regarding the argument types, the list I wrote (and test) is exhaustive (see https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L3908) - currently only
|
Codecov Report
@@ Coverage Diff @@
## master #22486 +/- ##
==========================================
+ Coverage 92.19% 92.19% +<.01%
==========================================
Files 169 169
Lines 50956 50966 +10
==========================================
+ Hits 46978 46988 +10
Misses 3978 3978
Continue to review full report at Codecov.
|
@jreback |
@jreback |
@jreback |
pandas/core/frame.py
Outdated
missing = [] | ||
for x in keys: | ||
if not (is_scalar(x) or isinstance(x, tuple)): | ||
if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use is_list_like here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. How to deal with tuples then. Can be valid column keys and valid list-likes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would also change the capabilities of set_index
. e.g. an iterator has is_list_like = True
- do you want to support that? Currently, only the types that I'm instance-checking (plus column keys) are allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to explicity list types here as its verbose and fragile, use is_list_like and not is_iterator if you must (though to be honest an iterator is actually ok)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to explicity list types here as its verbose and fragile
I have a feeling the implementation you request will be much more verbose, but OK
OK. How to deal with tuples then. Can be valid column keys and valid list-likes.
You want to try tuples as column keys first and then as an array (would make the most sense to me), or something else?
df.set_index([df['A'], df['B'], 'X'], drop=drop, append=append) | ||
|
||
rgx = 'The parameter "keys" may only contain a combination of.*' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rgx -> msg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Changed. Please let me know what you want me to do about the allowed argument signature. Allowing all list-likes also means checking ndim==1
for np.ndarrays
, etc.
See also:
#22486 (comment)
#22486 (comment)
pandas/core/frame.py
Outdated
missing = [] | ||
for x in keys: | ||
if not (is_scalar(x) or isinstance(x, tuple)): | ||
if not isinstance(x, (ABCSeries, ABCIndexClass, ABCMultiIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to explicity list types here as its verbose and fragile, use is_list_like and not is_iterator if you must (though to be honest an iterator is actually ok)
pandas/core/frame.py
Outdated
# tuples that are not column keys are considered list-like, | ||
# not considered missing | ||
missing.append(col) | ||
elif (not is_list_like(col) or isinstance(col, set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am not averse to excluding, but i also don't want ad-infinitem special cases here. This code is already too complex. We accept an iterable, so this is a valid input, if the user wants to do it that's there issue; these are not currently excluded.
@@ -126,21 +131,29 @@ def test_set_index_pass_single_array(self, frame_of_index_cols, | |||
df.index.name = index_name | |||
|
|||
key = box(df['B']) | |||
# np.array and list "forget" the name of B | |||
name = [None if box in [np.array, list] else 'B'] | |||
if box == list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why exactly is a list interpreted differently here? this makes this test really really odd. I am worried something changed here, as this appears to work just fine in the master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will see above (line 123) that list
wasn't tested - for precisely this reason. df.set_index(['A', 'B'])
interprets a A
& B
as column keys, so (assuming this df
was length 2) it would not use the list ['A', 'B']
as the index. To do that, one would have to pass [['A', 'B']]
. This PR proposes to add tests for the current behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback
In case the above was not very clearly worded, this corresponds exactly to behaviour on master
:
>>> df = pd.DataFrame(np.random.randn(2,3))
>>> df.set_index(['A', 'B'])
KeyError: 'A'
>>> df.set_index([['A', 'B']])
0 1 2
A 1.962370 -1.150770 0.843600
B -0.417274 0.509781 -0.697802
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review; added responses
pandas/core/frame.py
Outdated
# tuples that are not column keys are considered list-like, | ||
# not considered missing | ||
missing.append(col) | ||
elif (not is_list_like(col) or isinstance(col, set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The larger issue IMO is that is_list_like(<set>)
should be False and not True. It's the only exclusion here, and it makes sense. On many issues, pandas prides itself on enforcing sensible defaults and behaviour. Why would we want to enable something that's broken by design?
I may also add that this is not something that's currently allowed, and it shouldn't be IMO
@@ -126,21 +131,29 @@ def test_set_index_pass_single_array(self, frame_of_index_cols, | |||
df.index.name = index_name | |||
|
|||
key = box(df['B']) | |||
# np.array and list "forget" the name of B | |||
name = [None if box in [np.array, list] else 'B'] | |||
if box == list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will see above (line 123) that list
wasn't tested - for precisely this reason. df.set_index(['A', 'B'])
interprets a A
& B
as column keys, so (assuming this df
was length 2) it would not use the list ['A', 'B']
as the index. To do that, one would have to pass [['A', 'B']]
. This PR proposes to add tests for the current behaviour.
@jreback |
@h-vetinari why don't you try (separate PR) excluding set from is_list_like and see what the implications of that are. |
|
thanks! |
This reverts commit 145c227.
git diff upstream/master -u -- "*.py" | flake8 --diff
split off of #22236, and so builds on top of it.