BUG: Handle IntegerArray in pd.cut #31290

TomAugspurger · 2020-01-24T18:18:11Z

xref #30944. I think this doesn't close it, since only the pd.cut component is fixed.

cc @jorisvandenbossche @jreback. The changes here attempt to be extremely conservative, since we're backporting stuff. I'm trying to not change behavior for anything other than IntegerArray. In particular, I'm not trying to support arbitrary EAs in pd.cut. This leads to some code that's fairly ugly & specific to IntegerArray. I think we should attempt to clean that up in 1.1.

TomAugspurger · 2020-01-24T18:21:29Z

pandas/core/reshape/tile.py

+    )
+
+    if is_nullable_integer:
+        # TODO: Support other extension types somehow. We don't currently


This is a bit interesting. We need to get integers for searchsorted and it doesn't really matter how the NA values are encoded since we mask them out later on.
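A minimal sketch of that point, using plain NumPy rather than the actual pandas internals. The `data`/`mask` pair and the `bin_codes` helper are hypothetical stand-ins for a masked integer array and the binning step; the sketch only shows that whatever placeholder sits in a masked slot has no effect once the mask is applied.

```python
import numpy as np


def bin_codes(data, mask, bins):
    """Hypothetical helper: bin integers, then blank out masked slots."""
    # side="left" matches the default right-closed intervals of cut.
    ids = bins.searchsorted(data, side="left")
    # Masked entries and values outside the bin edges get no bin.
    na_mask = mask | (ids == 0) | (ids == len(bins))
    return np.where(na_mask, -1, ids - 1)


bins = np.array([0, 5, 10])
mask = np.array([False, True, False, False])

# Same logical values, two different placeholders in the masked slot.
codes_a = bin_codes(np.array([1, 0, 7, 10]), mask, bins)
codes_b = bin_codes(np.array([1, 99, 7, 10]), mask, bins)
```

Both calls produce the same codes, which is why the encoding of NA does not matter for this step.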

I don't think we have anything in the interface like this right now. The closest is factorize. But that has specific restrictions on

- The array being an enumeration from 0, 1, ... number of uniques
- NA being -1.

Which is more work than we need here. Worth thinking about for the future.
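For illustration, the two factorize restrictions can be seen with the public `pd.factorize` function on a small nullable integer array (the sample values are made up for this sketch):

```python
import pandas as pd

arr = pd.array([10, None, 20, 10], dtype="Int64")

# codes enumerate the unique values in order of first appearance,
# and the missing entry is encoded as -1.
codes, uniques = pd.factorize(arr)

codes_list = codes.tolist()
n_uniques = len(uniques)
```

Computing the uniques is the extra work mentioned above: binning only needs some integer per element, not a dense enumeration.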

gfyoung · 2020-01-25T04:44:26Z

pandas/core/reshape/tile.py

@@ -209,16 +211,28 @@ def cut(
        if is_scalar(bins) and bins < 1:
            raise ValueError("`bins` should be a positive integer.")

-        try:  # for array-like
-            sz = x.size
+        # TODO: Support arbitrary Extension Arrays. We need


Suggested change

# TODO: Support arbitrary Extension Arrays. We need

# TODO: Support arbitrary Extension Arrays.

jreback

i know you are going for compatibility, but can this be simplified?

jreback · 2020-01-25T15:54:08Z

pandas/core/reshape/tile.py

+        # TODO: Support arbitrary Extension Arrays. We need
+        # For now, we're only attempting to support IntegerArray.
+        # See the note on _bins_to_cuts about what is needed.
+        is_nullable_integer = is_extension_array_dtype(x.dtype) and is_integer_dtype(


shouldn't is_integer_dtype suffice here?
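As a point of reference for this question, `is_integer_dtype` is also true for plain NumPy integer dtypes, so on its own it does not single out the nullable case. A small check using the public `pandas.api.types` helpers (the sample arrays are made up):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_integer_dtype

plain = np.array([1, 2, 3], dtype="int64")
nullable = pd.array([1, None, 3], dtype="Int64")

# Both dtypes count as "integer" ...
plain_is_int = is_integer_dtype(plain.dtype)
nullable_is_int = is_integer_dtype(nullable.dtype)

# ... but only the nullable one is an extension dtype.
plain_is_ea = is_extension_array_dtype(plain.dtype)
nullable_is_ea = is_extension_array_dtype(nullable.dtype)
```

Combining the two checks is one way to restrict a code path to the nullable integer dtypes while leaving ndarray input on the old path.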

jreback · 2020-01-25T15:54:18Z

pandas/core/reshape/tile.py

+            x.dtype
+        )
+        try:
+            if is_extension_array_dtype(x) and is_integer_dtype(x):


can we just do len(x)?

jreback · 2020-01-25T15:54:57Z

pandas/core/reshape/tile.py

        except AttributeError:
            x = np.asarray(x)
            sz = x.size

        if sz == 0:
            raise ValueError("Cannot cut empty array")

-        rng = (nanops.nanmin(x), nanops.nanmax(x))
+        if is_nullable_integer:


does just (x.min(), x.max()) work here?

IntegerArray doesn't have a min / max yet.

jreback · 2020-01-25T15:55:10Z

pandas/core/reshape/tile.py

@@ -383,10 +397,26 @@ def _bins_to_cuts(
            bins = unique_bins

    side = "left" if right else "right"
-    ids = ensure_int64(bins.searchsorted(x, side=side))
+    is_nullable_integer = is_extension_array_dtype(x.dtype) and is_integer_dtype(


same comment as above

TomAugspurger · 2020-01-27T13:03:58Z

> i know you are going for compatibility, but can this be simplified?

The answer to the rest of the inline comments is the same as one: Not especially, if we're trying to make this change for just IntegerArray and not other objects. Swapping in len(x) for x.size definitely works for both EAs and ndarrays. But we may have been unknowingly relying on the .size raising an AttributeError to do a conversion to an ndarray, and so changing this (might) change behavior.
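A small sketch of the behavior difference being described. The two helper functions are hypothetical and only mimic the shape of the size check; they are not the pandas source.

```python
import numpy as np


def size_with_fallback(x):
    """Mimics the old pattern: .size, else convert to ndarray."""
    try:
        sz = x.size
    except AttributeError:
        # Objects without .size (e.g. a list) are converted here as a side effect.
        x = np.asarray(x)
        sz = x.size
    return x, sz


def size_with_len(x):
    """Mimics the proposed pattern: len(x), no conversion."""
    return x, len(x)


converted, n1 = size_with_fallback([1, 2, 3])
untouched, n2 = size_with_len([1, 2, 3])
```

Both report the same size, but only the first leaves `x` as an ndarray for the code that follows, which is the implicit conversion that could be lost.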

jreback · 2020-01-27T14:23:10Z

> > i know you are going for compatibility, but can this be simplified?
>
> The answer to the rest of the inline comments is the same as one: Not especially, if we're trying to make this change for just IntegerArray and not other objects. Swapping in len(x) for x.size definitely works for both EAs and ndarrays. But we may have been unknowingly relying on the .size raising an AttributeError to do a conversion to an ndarray, and so changing this (might) change behavior.

on the size issue

> But we may have been unknowingly relying on the .size raising an AttributeError to do a conversion to an ndarray, and so changing this (might) change behavior.

I would simply change this; if it breaks, better to do it now.

TomAugspurger · 2020-01-27T14:31:29Z

@jreback here's a POC for my proposed followup: https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:cut-2?expand=1, which is much cleaner. The remaining TODO is to draft an API / expectations for what ExtensionArray._ndarray_values is for 3rd party EAs. For now, I fall back to _factorize() which has the right semantics but does more work than necessary.

But I'm uncomfortable including that large of changes in 1.0.0.

jorisvandenbossche · 2020-01-27T15:33:39Z

Question: what was the reason that cut worked before for IntegerArray? Because it converted it to an object array with NaNs / or to a float array? Just asking, but if that was the way it worked, the short term hack can be to convert IntegerArray again to that format. And then indeed do a proper fix later.

TomAugspurger · 2020-01-27T16:21:00Z

> Because it converted it to an object array with NaNs / or to a float array?

It converted to object dtype with NaN. So, I suppose the most minimal fix is to just restore that. I'll modify this PR (and we still have https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:cut-2?expand=1 to do things properly).
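A rough illustration of that conversion, assuming a pandas release that includes this fix and using the public `to_numpy` method on a made-up nullable integer array. How the conversion is wired into `pd.cut` internally is not shown here.

```python
import numpy as np
import pandas as pd

arr = pd.array([1, None, 7, 10], dtype="Int64")

# Object-dtype ndarray with NaN standing in for the missing value.
as_object = arr.to_numpy(dtype=object, na_value=np.nan)

# cut accepts the nullable array directly; the missing entry
# falls in no bin and gets category code -1.
result = pd.cut(arr, bins=[0, 5, 10])
codes = result.codes.tolist()
```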

jreback · 2020-01-28T01:52:33Z

great thanks @TomAugspurger . do we have an issue for updating this to a more general soln?

TomAugspurger · 2020-01-28T12:40:59Z

#31389

BUG: Handle IntegerArray in pd.cut

e6ec3b2

xref pandas-dev#30944. I think this doesn't close it, since only the pd.cut component is fixed.

TomAugspurger added this to the 1.0.0 milestone Jan 24, 2020

gfyoung added the Bug and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Jan 25, 2020

gfyoung reviewed Jan 25, 2020

jreback requested changes Jan 25, 2020

TomAugspurger added 3 commits January 27, 2020 10:21

Merge remote-tracking branch 'upstream/master' into cut-intna

c9d0a6a

revert

cc1a810

restore object, NaN behavior

458b19f

jorisvandenbossche approved these changes Jan 27, 2020

jreback approved these changes Jan 28, 2020

jreback merged commit a5daff2 into pandas-dev:master Jan 28, 2020

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 28, 2020

Backport PR pandas-dev#31290: BUG: Handle IntegerArray in pd.cut

6998ea4

meeseeksmachine mentioned this pull request Jan 28, 2020

Backport PR #31290 on branch 1.0.x (BUG: Handle IntegerArray in pd.cut) #31374

Merged

jreback pushed a commit that referenced this pull request Jan 28, 2020

Backport PR #31290: BUG: Handle IntegerArray in pd.cut (#31374)

d80045e

Co-authored-by: Tom Augspurger

TomAugspurger deleted the cut-intna branch January 28, 2020 12:32

TomAugspurger mentioned this pull request Jan 28, 2020

Handle ExtensionArrays in cut #31389

Open

selasley mentioned this pull request Feb 4, 2020

BUG: pandas.cut does not give the right answer with nullable integer Series input #31643

Closed
