Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Handle IntegerArray in pd.cut #31290

Merged
merged 4 commits into from
Jan 28, 2020
Merged

Conversation

TomAugspurger
Copy link
Contributor

xref #30944.
I think this doesn't close it, since only the pd.cut compoment
is fixed.

cc @jorisvandenbossche @jreback. The changes here attempt to be extremely conservative, since we're backporting stuff. I'm trying to not change behavior for anything other than IntegerArray. In particular, I'm not trying to support arbitrary EAs in pd.cut. This leads to some code that's fairly ugly & specific to IntegerArray. I think we should attempt to clean that up in 1.1.

xref pandas-dev#30944.
I think this doesn't close it, since only the pd.cut compoment
is fixed.
)

if is_nullable_integer:
# TODO: Support other extension types somehow. We don't currently
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit interesting. We need to get integers for searchsorted and it doesn't really matter how the NA values are encoded since we mask them out later on.

I don't think we have anything in the interface like this right now. The closest is factorize. But that has specific restrictions on

  1. The array being an enumeration from 0, 1, ... number of uniques
  2. NA being -1.

Which is more work than we need here. Worth thinking about for the future.

@TomAugspurger TomAugspurger added this to the 1.0.0 milestone Jan 24, 2020
@gfyoung gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jan 25, 2020
@@ -209,16 +211,28 @@ def cut(
if is_scalar(bins) and bins < 1:
raise ValueError("`bins` should be a positive integer.")

try: # for array-like
sz = x.size
# TODO: Support arbitrary Extension Arrays. We need
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: Support arbitrary Extension Arrays. We need
# TODO: Support arbitrary Extension Arrays.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know you are going for compatibility, but can this be simplified?

# TODO: Support arbitrary Extension Arrays. We need
# For now, we're only attempting to support IntegerArray.
# See the note on _bins_to_cuts about what is needed.
is_nullable_integer = is_extension_array_dtype(x.dtype) and is_integer_dtype(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't is_integer_dtype suffice here?

x.dtype
)
try:
if is_extension_array_dtype(x) and is_integer_dtype(x):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just do len(x)?

except AttributeError:
x = np.asarray(x)
sz = x.size

if sz == 0:
raise ValueError("Cannot cut empty array")

rng = (nanops.nanmin(x), nanops.nanmax(x))
if is_nullable_integer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does just (x.min(), x.max()) work here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IntegerArray doesn't have a min / max yet.

@@ -383,10 +397,26 @@ def _bins_to_cuts(
bins = unique_bins

side = "left" if right else "right"
ids = ensure_int64(bins.searchsorted(x, side=side))
is_nullable_integer = is_extension_array_dtype(x.dtype) and is_integer_dtype(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

@TomAugspurger
Copy link
Contributor Author

i know you are going for compatibility, but can this be simplified?

The answer to the rest of the inline comments is the same as one: Not especially, if we're trying to make this change for just IntegerArray and not other objects. Swapping in len(x) for x.size definitely works for both EAs and ndarrays. But we may have been unknowingly relying on the .size raising an AttributeError to do a conversion to an ndarray, and so changing this (might) change behavior.

@jreback
Copy link
Contributor

jreback commented Jan 27, 2020

i know you are going for compatibility, but can this be simplified?

The answer to the rest of the inline comments is the same as one: Not especially, if we're trying to make this change for just IntegerArray and not other objects. Swapping in len(x) for x.size definitely works for both EAs and ndarrays. But we may have been unknowingly relying on the .size raising an AttributeError to do a conversion to an ndarray, and so changing this (might) change behavior.

on the size issue

But we may have been unknowingly relying on the .size raising an AttributeError to do a conversion to an ndarray, and so changing this (might) change behavior.

I would simply change this, if it breaks better to do it now

@TomAugspurger
Copy link
Contributor Author

@jreback here's a POC for my proposed followup: https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:cut-2?expand=1, which is much cleaner. The remaining TODO is to draft an API / expectations for what ExtensionArray._ndarray_values is for 3rd party EAs. For now, I fall back to _factorize() which has the right semantics but does more work than necessary.

But I'm uncomfortable including that large of changes in 1.0.0.

@jorisvandenbossche
Copy link
Member

Question: what was the reason that cut worked before for IntegerArray? Because it converted it to an object array with NaNs / or to a float array? Just asking, but if that was the way it worked, the short term hack can be to convert IntegerArray again to that format. And then indeed do a proper fix later.

@TomAugspurger
Copy link
Contributor Author

Because it converted it to an object array with NaNs / or to a float array?

It converted to object dtype with NaN. So, I suppose the most minimal fix is to just restore that. I'll modify this PR (and we still have https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:cut-2?expand=1 to do things properly).

@jreback jreback merged commit a5daff2 into pandas-dev:master Jan 28, 2020
@jreback
Copy link
Contributor

jreback commented Jan 28, 2020

great thanks @TomAugspurger . do we have an issue for updating this to a more general soln?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jan 28, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants