-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert DataFrame.rename to keyword only; simplify axis validation #29140
Conversation
@@ -4031,7 +4033,25 @@ def drop( | |||
"mapper", | |||
[("copy", True), ("inplace", False), ("level", None), ("errors", "ignore")], | |||
) | |||
def rename(self, *args, **kwargs): | |||
def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change here was to be explicit about what is accepted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to being explicit. The only reason we had *args, **kwargs in the first place was to support the ambiguous case.
pandas/core/generic.py
Outdated
raise TypeError("Cannot specify both 'mapper' and any of 'index' or 'columns'") | ||
else: | ||
# use the mapper argument | ||
if axis in {1, "columns"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We accept 3 arguments, mapper
, index
and columns
. mapper
can conflict with index
or columns
if either is provided, which the checks before this take care of. At this point if there aren't any conflicts, I think easiest to just combine the object
Note that this is arguably a little too strict, as this is no longer valid:
df.rename({"key": "val"}, axis=1, index={"key2": "val2"})
But I don't think there is a lot to gain in supporting that given the variety of other options available here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is arguably a little too strict, as this is no longer valid:
I don't recall if that case was originally discussed, but I think it's OK to break it.
for axis in range(self._AXIS_LEN): | ||
v = axes.get(self._AXIS_NAMES[axis]) | ||
if v is None: | ||
for axis_no, replacements in enumerate((index, columns)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultimately I am trying to get rid of all of the internal _AXIS_*
functions and mappings in core.generic, which seem like relics of the panel days
If we buy into just having index
and columns
being the potential axes in that order things can be greatly simplified
@@ -4079,7 +4080,18 @@ def align( | |||
broadcast_axis=broadcast_axis, | |||
) | |||
|
|||
def rename(self, index=None, **kwargs): | |||
def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Series method doesn't completely align with the DataFrame method, but I think this is an easy way to go about it while maintain backwards compat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value in accepting axis
here, if we don't also have mapper
and columns?
I don't see an easy way to accept mapper
. That's essentially fulfilled by index
.
So maybe my question is: should we also accept columns=
? (Not arguing for this btw. I don't know what's best here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we validate axis at all, or just ignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here axis gets ignored. Looks like this was implemented as part of #18589
So maybe my question is: should we also accept
columns=
? (Not arguing for this btw. I don't know what's best here).
Yea I'm not sure myself. I was thinking ideally it would have been great to align the signatures better but for where we are that would cause some API churn. So I guess best to leave unless there's a ringing need to do something here (?)
@@ -1391,14 +1417,6 @@ def test_set_index_preserve_categorical_dtype(self): | |||
result = result.reindex(columns=df.columns) | |||
tm.assert_frame_equal(result, df) | |||
|
|||
def test_ambiguous_warns(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These use cases are now covered by the test_rename_positional_raises
test case above, so deleted as duplicative
@@ -83,8 +83,9 @@ def test_rename_axis_supported(self): | |||
s = Series(range(5)) | |||
s.rename({}, axis=0) | |||
s.rename({}, axis="index") | |||
with pytest.raises(ValueError, match="No axis named 5"): | |||
s.rename({}, axis=5) | |||
# TODO: clean up shared index validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented this out for now as I didn't think it was hugely critical and seemed somewhat orthoganal to the test, but can add back in if there is a huge objection.
Ideally though we would have a centralized place for index validation that doesn't mutate the arguments of the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine to move this elsewhere. I do have a slight preference for not completely ignoring arguments that are there just for compatibility, but I don't know what our general rule is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea let me take another look I think should be able to keep this. Both self._get_axis_number
and self._get_axis_name
have some kind of validation in generic.
I don't know if there is one consistent way we do this right now, but want to march towards that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the preference should be that the axis parameter should be validated if specified on a Series for consistency with most Series methods that accept axis (e.g. fillna) and all DataFrame methods that accept axis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both
self._get_axis_number
andself._get_axis_name
have some kind of validation in generic.
normally you can just pass axis to super and it'll get validated? in this case there is also the call to self._set_name (cause of #28920). so probably will need the call to self._get_axis_number.
@@ -4140,12 +4160,8 @@ def rename(self, *args, **kwargs): | |||
2 2 5 | |||
4 3 6 | |||
""" | |||
axes = validate_axis_style_args(self, args, kwargs, "mapper", "rename") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note - there is only one other use now of this validate_axis_style_args
function (see DataFrame.reindex
). The intent here is good, but I think the fact that it mutates the args and kwargs being passed it makes it very difficult to reason about. As a follow up to this would like to replace entirely with something simple that just operates on the axis argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything blocking a similar PR for .reindex
?
As a follow up to this would like to replace entirely with something simple that just operates on the axis argument
Is this referring to DataFrame.reindex, or validate_axis_style_args? Ideally we can completely remove validate_axis_style_args
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't look before because I was trying to keep this focused, but it might not be too much extra effort. I'll take a look and if it gets out of hand let you know
pandas/core/frame.py
Outdated
Union[Mapping[Hashable, Hashable], Callable[[Hashable], Hashable]] | ||
] = None, | ||
columns: Optional[ | ||
Union[Mapping[Hashable, Hashable], Callable[[Hashable], Hashable]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth making a name for this in _typing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking the same thing. Any suggestions? LabelReplacement
?
@@ -4031,7 +4033,25 @@ def drop( | |||
"mapper", | |||
[("copy", True), ("inplace", False), ("level", None), ("errors", "ignore")], | |||
) | |||
def rename(self, *args, **kwargs): | |||
def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to being explicit. The only reason we had *args, **kwargs in the first place was to support the ambiguous case.
@@ -4140,12 +4160,8 @@ def rename(self, *args, **kwargs): | |||
2 2 5 | |||
4 3 6 | |||
""" | |||
axes = validate_axis_style_args(self, args, kwargs, "mapper", "rename") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything blocking a similar PR for .reindex
?
As a follow up to this would like to replace entirely with something simple that just operates on the axis argument
Is this referring to DataFrame.reindex, or validate_axis_style_args? Ideally we can completely remove validate_axis_style_args
.
pandas/core/generic.py
Outdated
raise TypeError("Cannot specify both 'mapper' and any of 'index' or 'columns'") | ||
else: | ||
# use the mapper argument | ||
if axis in {1, "columns"}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this is arguably a little too strict, as this is no longer valid:
I don't recall if that case was originally discussed, but I think it's OK to break it.
@@ -4079,7 +4080,18 @@ def align( | |||
broadcast_axis=broadcast_axis, | |||
) | |||
|
|||
def rename(self, index=None, **kwargs): | |||
def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value in accepting axis
here, if we don't also have mapper
and columns?
I don't see an easy way to accept mapper
. That's essentially fulfilled by index
.
So maybe my question is: should we also accept columns=
? (Not arguing for this btw. I don't know what's best here).
@@ -4079,7 +4080,18 @@ def align( | |||
broadcast_axis=broadcast_axis, | |||
) | |||
|
|||
def rename(self, index=None, **kwargs): | |||
def rename( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we validate axis at all, or just ignore?
@@ -1294,7 +1294,7 @@ def test_rename_mapper_multi(self): | |||
def test_rename_positional_named(self): | |||
# https://github.com/pandas-dev/pandas/issues/12392 | |||
df = DataFrame({"a": [1, 2], "b": [1, 2]}, index=["X", "Y"]) | |||
result = df.rename(str.lower, columns=str.upper) | |||
result = df.rename(index=str.lower, columns=str.upper) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So did this break? If so, is it solely because of the keyword only change to index
and columns
?
If so, I might prefer a proper deprecation warning for this change... But could be convinced otherwise, since I'd consider the old code bad style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be related to https://github.com/pandas-dev/pandas/pull/29140/files#r337242915
This is a little tough to support without keeping the ambiguity, and it doesn't really work outside of this one niche application.
For instance, if you tried to apply str.lower
to the columns and str.upper
to the index you would get the following error:
>>> df.rename(str.lower, index=str.upper, axis=1)
TypeError: Cannot specify both 'axis' and any of 'index' or 'columns'.
Even explicitly stating the axis in this case would raise an error:
>>> df.rename(str.lower, columns=str.upper, axis=0)
TypeError: Cannot specify both 'axis' and any of 'index' or 'columns'.
So we could still support this but it would take some gymnastics and not sure it makes things clearer, though open to the thoughts of others.
Maybe can be clarified further in the whatsnew to mitigate confusion?
@@ -83,8 +83,9 @@ def test_rename_axis_supported(self): | |||
s = Series(range(5)) | |||
s.rename({}, axis=0) | |||
s.rename({}, axis="index") | |||
with pytest.raises(ValueError, match="No axis named 5"): | |||
s.rename({}, axis=5) | |||
# TODO: clean up shared index validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine to move this elsewhere. I do have a slight preference for not completely ignoring arguments that are there just for compatibility, but I don't know what our general rule is.
Closing to clear queue; will pick back up when I get a little more free time |
Re-open? This looks like a nice bugfix |
Yea sure I'll try to take another look |
@@ -4032,7 +4058,7 @@ def add_prefix(self: FrameOrSeries, prefix: str) -> FrameOrSeries: | |||
f = functools.partial("{prefix}{}".format, prefix=prefix) | |||
|
|||
mapper = {self._info_axis_name: f} | |||
return self.rename(**mapper) | |||
return self.rename(**mapper) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't seem to get dict unpacking to place nicely with mypy. @simonjayhawkins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @TomAugspurger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glancing through, https://github.com/pandas-dev/pandas/pull/29140/files#r337263499 is I think my main outstanding concern.
df.reanme(str.lower, columns=str.upper)
will I think be broken, which is unfortunate. I think I'm OK with it if others are though, since we'd recommend .rename(index=str.lower, columns=str.upper)
anyway.
@WillAyd conflict and some comments |
@TomAugspurger yea that is the one case where this would break things, but the flip side is that also invites ambiguous uses like |
Mmm I'd forgotten that case. So +1 to whatever the group thinks here :) |
Planning to merge this in an hour, just before tagging the RC. |
Thanks @WillAyd! |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
I think we can greatly simplify core if we update our assumptions around the axes of dimensions we deal with. I came to this change trying to get rid of
NDFrame._construct_axes_from_arguments
which appears to do some argument mutation that isn't very clear. This was one of the few methods that called that, while also callingvalidate_axis_style_args
which is guilty of the same kind of mutationBy being more explicit about the signature and converting this to keyword only arguments (which a FutureWarning detailed anyway) we can simplify the code and I think make more readable
cc @TomAugspurger