BUG: Handle IntegerArray in pd.cut #31290
xref pandas-dev#30944. I think this doesn't close it, since only the pd.cut component is fixed.
pandas/core/reshape/tile.py
    )

    if is_nullable_integer:
        # TODO: Support other extension types somehow. We don't currently
This is a bit interesting. We need to get integers for searchsorted and it doesn't really matter how the NA values are encoded since we mask them out later on.
I don't think we have anything in the interface like this right now. The closest is factorize. But that has specific restrictions on
- The array being an enumeration from 0, 1, ... number of uniques
- NA being -1.
Which is more work than we need here. Worth thinking about for the future.
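To make the point about NA encoding concrete, here is a minimal pure-Python sketch, not the code in this PR. It uses None as a stand-in for pd.NA and bisect_left as a stand-in for searchsorted with side="left"; the helper names bin_ids_with_mask and apply_mask and the fill parameter are hypothetical. It shows that the integer used to encode missing entries does not affect the final result once those positions are masked out.

```python
from bisect import bisect_left


def bin_ids_with_mask(values, bins, fill=0):
    """Return (ids, mask) for `values` searched against sorted `bins`.

    `fill` is an arbitrary integer used to encode missing entries so that
    the sorted search only ever sees integers.
    """
    mask = [v is None for v in values]
    encoded = [fill if m else v for v, m in zip(values, mask)]
    # bisect_left plays the role of searchsorted(..., side="left") here.
    ids = [bisect_left(bins, v) for v in encoded]
    return ids, mask


def apply_mask(ids, mask):
    """Blank out the ids at missing positions."""
    return [None if m else i for i, m in zip(ids, mask)]


values = [1, None, 5, 10]
bins = [0, 5, 10]

ids_a, mask_a = bin_ids_with_mask(values, bins, fill=0)
ids_b, mask_b = bin_ids_with_mask(values, bins, fill=99)

# The raw ids differ at the missing position, but the masked results agree.
result_a = apply_mask(ids_a, mask_a)
result_b = apply_mask(ids_b, mask_b)
```

Unlike factorize, nothing here requires the codes to enumerate the uniques or NA to be encoded as -1.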
pandas/core/reshape/tile.py
@@ -209,16 +211,28 @@ def cut(
    if is_scalar(bins) and bins < 1:
        raise ValueError("`bins` should be a positive integer.")

    try: # for array-like
        sz = x.size
    # TODO: Support arbitrary Extension Arrays. We need
# TODO: Support arbitrary Extension Arrays. We need
# TODO: Support arbitrary Extension Arrays.
i know you are going for compatibility, but can this be simplified?
pandas/core/reshape/tile.py
    # TODO: Support arbitrary Extension Arrays. We need
    # For now, we're only attempting to support IntegerArray.
    # See the note on _bins_to_cuts about what is needed.
    is_nullable_integer = is_extension_array_dtype(x.dtype) and is_integer_dtype(
shouldn't is_integer_dtype suffice here?
pandas/core/reshape/tile.py
        x.dtype
    )
    try:
        if is_extension_array_dtype(x) and is_integer_dtype(x):
can we just do len(x)?
pandas/core/reshape/tile.py
    except AttributeError:
        x = np.asarray(x)
        sz = x.size

    if sz == 0:
        raise ValueError("Cannot cut empty array")

    rng = (nanops.nanmin(x), nanops.nanmax(x))
    if is_nullable_integer:
does just (x.min(), x.max()) work here?
IntegerArray doesn't have a min / max yet.
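Without min / max on the array itself, the range has to be computed while skipping missing entries. A rough pure-Python sketch of that step, not the code in this PR: None stands in for pd.NA, the helper name value_range is hypothetical, and the behavior for an all-missing input is an assumption made only for this illustration.

```python
def value_range(values):
    """Return (min, max) over the non-missing entries of `values`."""
    if len(values) == 0:
        # Same check and message as the `sz == 0` branch in the diff.
        raise ValueError("Cannot cut empty array")
    present = [v for v in values if v is not None]
    if not present:
        # Assumption for this sketch only: refuse all-missing input.
        raise ValueError("No non-missing values to compute a range from")
    return min(present), max(present)


rng = value_range([3, None, 1, 7])
```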
pandas/core/reshape/tile.py
@@ -383,10 +397,26 @@ def _bins_to_cuts(
        bins = unique_bins

    side = "left" if right else "right"
    ids = ensure_int64(bins.searchsorted(x, side=side))
    is_nullable_integer = is_extension_array_dtype(x.dtype) and is_integer_dtype(
same comment as above
The answer to the rest of the inline comments is the same as one: Not especially, if we're trying to make this change for just IntegerArray and not other objects. Swapping in
on the size issue
I would simply change this, if it breaks better to do it now
@jreback here's a POC for my proposed followup: https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:cut-2?expand=1, which is much cleaner. The remaining TODO is to draft an API / expectations for what But I'm uncomfortable including that large of changes in 1.0.0.
Question: what was the reason that
It converted to object dtype with NaN. So, I suppose the most minimal fix is to just restore that. I'll modify this PR (and we still have https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:cut-2?expand=1 to do things properly).
great thanks @TomAugspurger . do we have an issue for updating this to a more general soln?
Co-authored-by: Tom Augspurger <[email protected]>
xref #30944.
I think this doesn't close it, since only the pd.cut component is fixed.
cc @jorisvandenbossche @jreback. The changes here attempt to be extremely conservative, since we're backporting stuff. I'm trying to not change behavior for anything other than IntegerArray. In particular, I'm not trying to support arbitrary EAs in pd.cut. This leads to some code that's fairly ugly & specific to IntegerArray. I think we should attempt to clean that up in 1.1.