Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Groupby.quantile allow EA dispatch #51003

Closed
wants to merge 14 commits into from

Conversation

jbrockmendel
Copy link
Member

cc @jorisvandenbossche @mroeschke

This is partially a proof of concept for something I think we need to do for most of the groupby reductions (more generally, just about anywhere we call Manager.grouped_reduce and friends)

Many groupby reductions(ish) have special handling to convert EAs to numpy, then pass them to our cython code, then convert back. This would be pretty inefficient for the new pyarrow dtypes given that IIUC they have native implementations of some of the relevant methods. (I'm also imagining a future in which dask/modin/etc implement distributed EAs or cudf has as GPUEA)

I think this will also make it easier to fix e.g. dt64tz and PeriodDtype cases which are currently broken.

This puts the whole thing in an EA method groupby_quantile and calls that in EA cases. Lots of other reasonable ways this can be refactored; I'm open to suggestions.

@jbrockmendel jbrockmendel added Groupby quantile quantile method labels Feb 1, 2023
@jbrockmendel
Copy link
Member Author

@mroeschke any big-picture thoughts here and/or #51116? i think something like this is inevitable, but am not married to this particular design pattern

@mroeschke
Copy link
Member

My one concern is how much "groupby" implementation the EA needs to be aware of in the signature of a groupby specific EA method. In an ideal world it would be great if the groupby operation just dispatches to the EA's _reduce.

If this is needed/useful in the near term, it would be great if the implementation was "private" since how much groupby implementation the array needs to be aware of could change.

@jbrockmendel
Copy link
Member Author

updated to privatize

@jorisvandenbossche
Copy link
Member

This would be pretty inefficient for the new pyarrow dtypes given that IIUC they have native implementations of some of the relevant methods

FWIW the groupby kernels (like hash_sum) are not actually exposed right now as is in pyarrow (they are only usable in context of the full query engine, not as standalone kernels)

A bit off-topic, but specifically for the pyarrow dtypes, I also think that we could actually use the support for masked arrays in our cython groupby algos. The only conversion step then is the bitmask to a bytemask, but which should be cheaper than converting to a numpy array (if there are nulls).

@jbrockmendel
Copy link
Member Author

specifically for the pyarrow dtypes, I also think that we could actually use the support for masked arrays in our cython groupby algos

+1. i think something like a _to_masked_array and _from_masked_array would be useful for a bunch of these methods medium-term.

@jbrockmendel
Copy link
Member Author

Closing, as I've convinced myself the cython quantile function is unecessary xref #51385

@jbrockmendel jbrockmendel deleted the ref-ea-gb-quartile branch February 15, 2023 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby quantile quantile method
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants