-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: Groupby.quantile allow EA dispatch #51003
Conversation
@mroeschke any big-picture thoughts here and/or #51116? i think something like this is inevitable, but am not married to this particular design pattern |
My one concern is how much "groupby" implementation the EA needs to be aware of in the signature of a groupby specific EA method. In an ideal world it would be great if the groupby operation just dispatches to the EA's _reduce. If this is needed/useful in the near term, it would be great if the implementation was "private" since how much groupby implementation the array needs to be aware of could change. |
updated to privatize |
FWIW the groupby kernels (like A bit off-topic, but specifically for the pyarrow dtypes, I also think that we could actually use the support for masked arrays in our cython groupby algos. The only conversion step then is the bitmask to a bytemask, but which should be cheaper than converting to a numpy array (if there are nulls). |
+1. i think something like a |
Closing, as I've convinced myself the cython quantile function is unecessary xref #51385 |
cc @jorisvandenbossche @mroeschke
This is partially a proof of concept for something I think we need to do for most of the groupby reductions (more generally, just about anywhere we call Manager.grouped_reduce and friends)
Many groupby reductions(ish) have special handling to convert EAs to numpy, then pass them to our cython code, then convert back. This would be pretty inefficient for the new pyarrow dtypes given that IIUC they have native implementations of some of the relevant methods. (I'm also imagining a future in which dask/modin/etc implement distributed EAs or cudf has as GPUEA)
I think this will also make it easier to fix e.g. dt64tz and PeriodDtype cases which are currently broken.
This puts the whole thing in an EA method groupby_quantile and calls that in EA cases. Lots of other reasonable ways this can be refactored; I'm open to suggestions.