Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check API changes in whatsnew #34801

Closed
TomAugspurger opened this issue Jun 15, 2020 · 11 comments
Closed

Check API changes in whatsnew #34801

TomAugspurger opened this issue Jun 15, 2020 · 11 comments
Labels
Blocker Blocking issue or pull request for an upcoming release Docs
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Our whatsnew has several sections detailing API changes:

Many of these, like adding DataFrame.value_counts are fine and should just be moved to a different section.

Others like #31905 will need to be looked at closely, and we'll need to determine if they're API breaking or bug fixes.

@TomAugspurger TomAugspurger added Docs Blocker Blocking issue or pull request for an upcoming release labels Jun 15, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone Jun 15, 2020
@TomAugspurger
Copy link
Contributor Author

I think a section like "Bug Fixes with potential API implications" might make sense. e.g. https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html#consistency-across-groupby-reductions might fit there.

@jorisvandenbossche
Copy link
Member

There are indeed a lot of items in there that are not actually API changes. For the ones in the bullet points, made an overview here with a first judgment on what I think they are.

In addition to a potential section "Bug Fixes with potential API implications", it might also make sense to have a section with "changed errors" for all cases where we made raised errors more consistent (there are quite a few in the list below). As that is only relevant for you if you actually catch errors (so most users could fully skip such a section).

  • - Series.describe() will now show distribution percentiles for datetime dtypes, statistics first and last will now be min and max to match with numeric dtypes in DataFrame.describe() (GH30164)
    • Actual API change, currently being discussed
  • - Added DataFrame.value_counts() (GH5377)
    • Enhancement, move there
  • - Groupby.groups() now returns an abbreviated representation when called on large dataframes (GH1135)
    • This is only about the repr, not an actual API change
  • - loc lookups with an object-dtype Index and an integer key will now raise KeyError instead of TypeError when key is missing (GH31905)
    • Changed error
  • - Using a pandas.api.indexers.BaseIndexer() with count, min, max, median, skew, cov, corr will now return correct results for any monotonic pandas.api.indexers.BaseIndexer() descendant (GH32865)
    • Bug fix??
  • - Added a pandas.api.indexers.FixedForwardWindowIndexer() class to support forward-looking windows during rolling operations.
    • Enhancement, move there
  • - Added pandas.errors.InvalidIndexError (GH34570).
    • Enhancement, move there
  • - DataFrame.swaplevels() now raises a TypeError if the axis is not a MultiIndex. Previously an AttributeError was raised (GH31126)
    • Changed error
  • - DataFrame.xs() now raises a TypeError if a level keyword is supplied and the axis is not a MultiIndex. Previously an AttributeError was raised (GH33610)
    • Changed error
  • - DataFrameGroupby.mean() and SeriesGroupby.mean() (and similarly for median(), std() and var()) now raise a TypeError if a not-accepted keyword argument is passed into it. Previously a UnsupportedFunctionCall was raised (AssertionError if min_count passed into median()) (GH31485)
    • Changed error
  • - DataFrame.at() and Series.at() will raise a TypeError instead of a ValueError if an incompatible key is passed, and KeyError if a missing key is passed, matching the behavior of .loc[] (GH31722)
    • Changed error
  • - Passing an integer dtype other than int64 to np.array(period_index, dtype=...) will now raise TypeError instead of incorrectly using int64 (GH32255)
    • Actual API change, OK
  • - Passing an invalid fill_value to Categorical.take() raises a ValueError instead of TypeError (GH33660)
    • Changed error
  • - Combining a Categorical with integer categories and which contains missing values with a float dtype column in operations such as concat() or append() will now result in a float column instead of an object dtyped column (GH33607)
    • Actual API change, OK
  • - Series.to_timestamp() now raises a TypeError if the axis is not a PeriodIndex. Previously an AttributeError was raised (GH33327)
    • Changed error
  • - Series.to_period() now raises a TypeError if the axis is not a DatetimeIndex. Previously an AttributeError was raised (GH33327)
    • Changed error
  • - func pandas.api.dtypes.is_string_dtype no longer incorrectly identifies categorical series as string.
    • Actual API change, OK
  • - read_excel() no longer takes **kwds arguments. This means that passing in keyword chunksize now raises a TypeError (previously raised a NotImplementedError), while passing in keyword encoding now raises a TypeError (GH34464)
    • Changed error for chunksize, API change for encoding
  • - func merge now checks suffixes parameter type to be tuple and raises TypeError, whereas before a list or set were accepted and that the set could produce unexpected results (GH33740)
    • Actual API change, being discussed to revert/deprecate
  • - Period no longer accepts tuples for the freq argument (GH34658)
    • Actual API change, OK
  • - Series.interpolate() and DataFrame.interpolate() now raises ValueError if limit_direction is ‘forward’ or ‘both’ and method is ‘backfill’ or ‘bfill’ or limit_direction is ‘backward’ or ‘both’ and method is ‘pad’ or ‘ffill’ (GH34746)
    • Bug fix ?

This only includes the "small ones", not the API changes that have its own subsection.

@jorisvandenbossche
Copy link
Member

About this one:

read_excel() no longer takes **kwds arguments. This means that passing in keyword chunksize now raises a TypeError (previously raised a NotImplementedError), while passing in keyword encoding now raises a TypeError (GH34464)

People are certainly passing encoding (incorrectly or not) to read_excel (see eg https://stackoverflow.com/q/60157255/653364, https://stackoverflow.com/a/60753334/653364). So we might want to raise a warning that encoding is ignored and will raise in the future instead.

@TomAugspurger
Copy link
Contributor Author

it might also make sense to have a section with "changed errors" for all cases where we made raised errors more consistent

Do we have a stance on whether changing the exception raised counts as an API breaking change?

@jreback
Copy link
Contributor

jreback commented Jun 16, 2020

I think changed errors is enough, these are api breaking, but not worth deprecating

@jorisvandenbossche
Copy link
Member

An overview of the other items that each have their own subsections:

@alonme
Copy link
Contributor

alonme commented Jul 9, 2020

apply and applymap on DataFrame evaluates first row/column only once

Not sure how to exactly call this,
This behavior was once noted in the documentation
and then it was removed, but not actually fixed.

@TomAugspurger
Copy link
Contributor Author

apply and applymap on DataFrame evaluates first row/column only once

I think the docs made it clear that that was an implementation detail that could change. Let's call it a bugfix.

@jorisvandenbossche
Copy link
Member

To summarize my thoughts here from what I recall from the dev chat about this (but probably more my opinion though).

  • As proposed above, add a section "Bug Fixes with potential API implications" -> here we can put bug fixes that seem important enough to give a before / after example in their own subsection (thus part of the ones that now are in the API changes section), instead of just a bullet point like the other bug fixes.
    This are typically things were the previous behaviour was working but wrong (like a wrong number in the output, and not something that was erroring before), and thus more likely impact users that unknowingly used the wrong output.

  • Add a section to group all the changed error types: this typically doesn't impact a end-user, but rather a library developer that might be catching specific errors. For those library developers, it are technically speaking api changes, but since they all fix inconsistencies in error messages, it can also be seen as bug fixes (and as said before doesn't necessarily impact end users).

  • Move some of the enhancements to the enhancement section.


@Dr-Irv you also mentioned something about more clearly putting it in the policy that certain kinds of api changes/bug fixes can still happen. There is already some text about this: https://pandas.pydata.org/docs/dev/development/policies.html (the note), but if you have concrete feedback, certainly welcome.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jul 15, 2020

@jorisvandenbossche That policy only says "API breaking change". What I was talking about is that there are different kinds of "API breaking changes". Here are different kinds, based on what I could think of at the moment. The list is probably longer:

  1. Parameters to a method are different in name, order, or type
  2. Return value of a method is different in type
  3. Method/Class is removed or renamed due to deprecation cycle being reached
  4. Method modifies an object in a different way, so the behavior is different, creating a type of side effect
    • Could again be due to a bug fix
  5. Return value of a method is different in value
    • Could be because we fixed a bug or changed how a computation is done, so a slightly different result might appear (e.g., due to order of numerical operations)
  6. Documentation was wrong, so the "API breaking change" from a user perspective is that we said things would work one way, but found out that the behavior was correct, but the documentation was wrong, but user code assumed the (incorrectly) documented behavior.

I think the first 3 are ones we want to reserve for major releases. The second two are a bit of grey area. I don't think the last one is an API breaking change.

So if you consider that list (and maybe others), then how you document those changes for each minor (or technical) release for each of those categories needs to be considered.

Hope this helps.

@TomAugspurger
Copy link
Contributor Author

The at on that page staring with "pandas will sometimes make behavior changing" attempts to describe that nuance. Expanding that to have concrete examples would probably be welcome.

We now have a "Notable bug fixes", so for my purposes this issue can be closed. I think additional improvements to the whatsnew would be welcome though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Blocking issue or pull request for an upcoming release Docs
Projects
None yet
Development

No branches or pull requests

5 participants