Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert DataFrame.rename to keyword only; simplify axis validation #29140

Merged
merged 18 commits into from
Jan 9, 2020

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Oct 21, 2019

I think we can greatly simplify core if we update our assumptions around the axes of dimensions we deal with. I came to this change trying to get rid of NDFrame._construct_axes_from_arguments which appears to do some argument mutation that isn't very clear. This was one of the few methods that called that, while also calling validate_axis_style_args which is guilty of the same kind of mutation

By being more explicit about the signature and converting this to keyword only arguments (which a FutureWarning detailed anyway) we can simplify the code and I think make more readable

cc @TomAugspurger

@WillAyd WillAyd added the Clean label Oct 21, 2019
@pep8speaks
Copy link

pep8speaks commented Oct 21, 2019

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-09 17:06:47 UTC

@@ -4031,7 +4033,25 @@ def drop(
"mapper",
[("copy", True), ("inplace", False), ("level", None), ("errors", "ignore")],
)
def rename(self, *args, **kwargs):
def rename(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change here was to be explicit about what is accepted

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to being explicit. The only reason we had *args, **kwargs in the first place was to support the ambiguous case.

raise TypeError("Cannot specify both 'mapper' and any of 'index' or 'columns'")
else:
# use the mapper argument
if axis in {1, "columns"}:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We accept 3 arguments, mapper, index and columns. mapper can conflict with index or columns if either is provided, which the checks before this take care of. At this point if there aren't any conflicts, I think easiest to just combine the object

Note that this is arguably a little too strict, as this is no longer valid:

df.rename({"key": "val"}, axis=1, index={"key2": "val2"})

But I don't think there is a lot to gain in supporting that given the variety of other options available here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is arguably a little too strict, as this is no longer valid:

I don't recall if that case was originally discussed, but I think it's OK to break it.

for axis in range(self._AXIS_LEN):
v = axes.get(self._AXIS_NAMES[axis])
if v is None:
for axis_no, replacements in enumerate((index, columns)):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately I am trying to get rid of all of the internal _AXIS_* functions and mappings in core.generic, which seem like relics of the panel days

If we buy into just having index and columns being the potential axes in that order things can be greatly simplified

@@ -4079,7 +4080,18 @@ def align(
broadcast_axis=broadcast_axis,
)

def rename(self, index=None, **kwargs):
def rename(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Series method doesn't completely align with the DataFrame method, but I think this is an easy way to go about it while maintain backwards compat

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the value in accepting axis here, if we don't also have mapper and columns?

I don't see an easy way to accept mapper. That's essentially fulfilled by index.

So maybe my question is: should we also accept columns=? (Not arguing for this btw. I don't know what's best here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we validate axis at all, or just ignore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here axis gets ignored. Looks like this was implemented as part of #18589

So maybe my question is: should we also accept columns=? (Not arguing for this btw. I don't know what's best here).

Yea I'm not sure myself. I was thinking ideally it would have been great to align the signatures better but for where we are that would cause some API churn. So I guess best to leave unless there's a ringing need to do something here (?)

@@ -1391,14 +1417,6 @@ def test_set_index_preserve_categorical_dtype(self):
result = result.reindex(columns=df.columns)
tm.assert_frame_equal(result, df)

def test_ambiguous_warns(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These use cases are now covered by the test_rename_positional_raises test case above, so deleted as duplicative

@@ -83,8 +83,9 @@ def test_rename_axis_supported(self):
s = Series(range(5))
s.rename({}, axis=0)
s.rename({}, axis="index")
with pytest.raises(ValueError, match="No axis named 5"):
s.rename({}, axis=5)
# TODO: clean up shared index validation
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented this out for now as I didn't think it was hugely critical and seemed somewhat orthoganal to the test, but can add back in if there is a huge objection.

Ideally though we would have a centralized place for index validation that doesn't mutate the arguments of the caller.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine to move this elsewhere. I do have a slight preference for not completely ignoring arguments that are there just for compatibility, but I don't know what our general rule is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea let me take another look I think should be able to keep this. Both self._get_axis_number and self._get_axis_name have some kind of validation in generic.

I don't know if there is one consistent way we do this right now, but want to march towards that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the preference should be that the axis parameter should be validated if specified on a Series for consistency with most Series methods that accept axis (e.g. fillna) and all DataFrame methods that accept axis.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both self._get_axis_number and self._get_axis_name have some kind of validation in generic.

normally you can just pass axis to super and it'll get validated? in this case there is also the call to self._set_name (cause of #28920). so probably will need the call to self._get_axis_number.

@@ -4140,12 +4160,8 @@ def rename(self, *args, **kwargs):
2 2 5
4 3 6
"""
axes = validate_axis_style_args(self, args, kwargs, "mapper", "rename")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note - there is only one other use now of this validate_axis_style_args function (see DataFrame.reindex). The intent here is good, but I think the fact that it mutates the args and kwargs being passed it makes it very difficult to reason about. As a follow up to this would like to replace entirely with something simple that just operates on the axis argument

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything blocking a similar PR for .reindex?

As a follow up to this would like to replace entirely with something simple that just operates on the axis argument

Is this referring to DataFrame.reindex, or validate_axis_style_args? Ideally we can completely remove validate_axis_style_args.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look before because I was trying to keep this focused, but it might not be too much extra effort. I'll take a look and if it gets out of hand let you know

Union[Mapping[Hashable, Hashable], Callable[[Hashable], Hashable]]
] = None,
columns: Optional[
Union[Mapping[Hashable, Hashable], Callable[[Hashable], Hashable]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth making a name for this in _typing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking the same thing. Any suggestions? LabelReplacement?

@@ -4031,7 +4033,25 @@ def drop(
"mapper",
[("copy", True), ("inplace", False), ("level", None), ("errors", "ignore")],
)
def rename(self, *args, **kwargs):
def rename(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to being explicit. The only reason we had *args, **kwargs in the first place was to support the ambiguous case.

@@ -4140,12 +4160,8 @@ def rename(self, *args, **kwargs):
2 2 5
4 3 6
"""
axes = validate_axis_style_args(self, args, kwargs, "mapper", "rename")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything blocking a similar PR for .reindex?

As a follow up to this would like to replace entirely with something simple that just operates on the axis argument

Is this referring to DataFrame.reindex, or validate_axis_style_args? Ideally we can completely remove validate_axis_style_args.

raise TypeError("Cannot specify both 'mapper' and any of 'index' or 'columns'")
else:
# use the mapper argument
if axis in {1, "columns"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is arguably a little too strict, as this is no longer valid:

I don't recall if that case was originally discussed, but I think it's OK to break it.

@@ -4079,7 +4080,18 @@ def align(
broadcast_axis=broadcast_axis,
)

def rename(self, index=None, **kwargs):
def rename(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the value in accepting axis here, if we don't also have mapper and columns?

I don't see an easy way to accept mapper. That's essentially fulfilled by index.

So maybe my question is: should we also accept columns=? (Not arguing for this btw. I don't know what's best here).

@@ -4079,7 +4080,18 @@ def align(
broadcast_axis=broadcast_axis,
)

def rename(self, index=None, **kwargs):
def rename(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we validate axis at all, or just ignore?

@@ -1294,7 +1294,7 @@ def test_rename_mapper_multi(self):
def test_rename_positional_named(self):
# https://github.com/pandas-dev/pandas/issues/12392
df = DataFrame({"a": [1, 2], "b": [1, 2]}, index=["X", "Y"])
result = df.rename(str.lower, columns=str.upper)
result = df.rename(index=str.lower, columns=str.upper)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So did this break? If so, is it solely because of the keyword only change to index and columns?

If so, I might prefer a proper deprecation warning for this change... But could be convinced otherwise, since I'd consider the old code bad style.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be related to https://github.com/pandas-dev/pandas/pull/29140/files#r337242915

This is a little tough to support without keeping the ambiguity, and it doesn't really work outside of this one niche application.

For instance, if you tried to apply str.lower to the columns and str.upper to the index you would get the following error:

>>> df.rename(str.lower, index=str.upper, axis=1)
TypeError: Cannot specify both 'axis' and any of 'index' or 'columns'.

Even explicitly stating the axis in this case would raise an error:

>>> df.rename(str.lower, columns=str.upper, axis=0)
TypeError: Cannot specify both 'axis' and any of 'index' or 'columns'.

So we could still support this but it would take some gymnastics and not sure it makes things clearer, though open to the thoughts of others.

Maybe can be clarified further in the whatsnew to mitigate confusion?

@@ -83,8 +83,9 @@ def test_rename_axis_supported(self):
s = Series(range(5))
s.rename({}, axis=0)
s.rename({}, axis="index")
with pytest.raises(ValueError, match="No axis named 5"):
s.rename({}, axis=5)
# TODO: clean up shared index validation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine to move this elsewhere. I do have a slight preference for not completely ignoring arguments that are there just for compatibility, but I don't know what our general rule is.

@WillAyd
Copy link
Member Author

WillAyd commented Nov 7, 2019

Closing to clear queue; will pick back up when I get a little more free time

@WillAyd WillAyd closed this Nov 7, 2019
@jbrockmendel
Copy link
Member

Re-open? This looks like a nice bugfix

@WillAyd WillAyd reopened this Dec 27, 2019
@WillAyd
Copy link
Member Author

WillAyd commented Dec 27, 2019

Yea sure I'll try to take another look

@@ -4032,7 +4058,7 @@ def add_prefix(self: FrameOrSeries, prefix: str) -> FrameOrSeries:
f = functools.partial("{prefix}{}".format, prefix=prefix)

mapper = {self._info_axis_name: f}
return self.rename(**mapper)
return self.rename(**mapper) # type: ignore
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't seem to get dict unpacking to place nicely with mypy. @simonjayhawkins

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback jreback added this to the 1.0 milestone Jan 9, 2020
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glancing through, https://github.com/pandas-dev/pandas/pull/29140/files#r337263499 is I think my main outstanding concern.

df.reanme(str.lower, columns=str.upper)

will I think be broken, which is unfortunate. I think I'm OK with it if others are though, since we'd recommend .rename(index=str.lower, columns=str.upper) anyway.

@jreback
Copy link
Contributor

jreback commented Jan 9, 2020

@WillAyd conflict and some comments

@WillAyd
Copy link
Member Author

WillAyd commented Jan 9, 2020

@TomAugspurger yea that is the one case where this would break things, but the flip side is that also invites ambiguous uses like df.rename(str.lower, index=str.upper) (on master silently discards index) so not sure there's a perfect solution

@TomAugspurger
Copy link
Contributor

Mmm I'd forgotten that case. So +1 to whatever the group thinks here :)

@TomAugspurger TomAugspurger mentioned this pull request Jan 9, 2020
@TomAugspurger
Copy link
Contributor

Planning to merge this in an hour, just before tagging the RC.

@TomAugspurger TomAugspurger merged commit df75ea6 into pandas-dev:master Jan 9, 2020
@TomAugspurger
Copy link
Contributor

Thanks @WillAyd!

@WillAyd WillAyd deleted the fix-rename branch February 13, 2020 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.rename only validates column arguments
6 participants