REF: Groupby.quantile allow EA dispatch #51003

jbrockmendel · 2023-01-26T23:32:24Z

This is partially a proof of concept for something I think we need to do for most of the groupby reductions (and, more generally, just about anywhere we call Manager.grouped_reduce and friends).

Many groupby reductions(ish) have special handling to convert EAs to numpy, then pass them to our cython code, then convert back. This would be pretty inefficient for the new pyarrow dtypes given that IIUC they have native implementations of some of the relevant methods. (I'm also imagining a future in which dask/modin/etc implement distributed EAs or cudf has a GPU EA.)

I think this will also make it easier to fix e.g. dt64tz and PeriodDtype cases which are currently broken.

This puts the whole thing in an EA method groupby_quantile and calls that in EA cases. Lots of other reasonable ways this can be refactored; I'm open to suggestions.
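To illustrate the dispatch pattern described above, here is a minimal sketch. It is not the code in this PR: `ToyArray`, `grouped_quantile`, and the `_groupby_quantile` signature are names assumed for illustration, and plain Python lists stand in for ndarrays and the cython kernel.

```python
from __future__ import annotations


def _quantile(sorted_vals: list[float], q: float) -> float:
    """Linear-interpolation quantile of an already-sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def _generic_groupby_quantile(values, codes, ngroups, q):
    """Stand-in for the existing generic (numpy/cython) path."""
    buckets = [[] for _ in range(ngroups)]
    for val, code in zip(values, codes):
        if code >= 0 and val is not None:  # code -1 marks a row with no group
            buckets[code].append(val)
    return [_quantile(sorted(b), q) if b else None for b in buckets]


class ToyArray:
    """Hypothetical extension array that owns its groupby quantile."""

    def __init__(self, data):
        self.data = list(data)

    def _groupby_quantile(self, codes, ngroups, q):
        # An array with a native kernel would call it here; this toy
        # version simply reuses the generic path and re-wraps the result.
        return ToyArray(_generic_groupby_quantile(self.data, codes, ngroups, q))


def grouped_quantile(values, codes, ngroups, q):
    """Dispatch to the array's own method when it defines one."""
    if hasattr(values, "_groupby_quantile"):
        return values._groupby_quantile(codes, ngroups, q)
    return _generic_groupby_quantile(values, codes, ngroups, q)
```

With this shape the groupby code never converts the array itself; whether the array converts to numpy, uses a mask, or calls a native kernel is its own decision.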

jbrockmendel · 2023-02-03T23:36:40Z

@mroeschke any big-picture thoughts here and/or #51116? i think something like this is inevitable, but am not married to this particular design pattern

mroeschke · 2023-02-06T22:47:28Z

My one concern is how much "groupby" implementation the EA needs to be aware of in the signature of a groupby specific EA method. In an ideal world it would be great if the groupby operation just dispatches to the EA's _reduce.

If this is needed/useful in the near term, it would be great if the implementation was "private" since how much groupby implementation the array needs to be aware of could change.
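For comparison, the "ideal world" version, where groupby only ever calls the array's `_reduce`, might look roughly like the sketch below. `ToyArray` and `groupby_reduce` are hypothetical, and the `_reduce` here is a simplified stand-in rather than the real ExtensionArray method.

```python
class ToyArray:
    """Hypothetical array exposing only generic methods: take and _reduce."""

    def __init__(self, data):
        self.data = list(data)

    def take(self, indices):
        return ToyArray(self.data[i] for i in indices)

    def _reduce(self, name, skipna=True):
        vals = [v for v in self.data if v is not None] if skipna else self.data
        if not vals or any(v is None for v in vals):
            return None  # nothing to reduce, or a missing value was kept
        if name == "sum":
            return sum(vals)
        if name == "max":
            return max(vals)
        raise TypeError(f"unsupported reduction: {name}")


def groupby_reduce(values, codes, ngroups, name):
    """Reduce each group by calling _reduce on that group's rows."""
    indices = [[] for _ in range(ngroups)]
    for i, code in enumerate(codes):
        if code >= 0:  # code -1 marks a row with no group
            indices[code].append(i)
    return [values.take(idx)._reduce(name) for idx in indices]
```

The array never sees the group codes, so its signature stays free of groupby details. The likely cost is one Python-level call per group instead of a single grouped kernel.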

jbrockmendel · 2023-02-07T21:38:44Z

updated to privatize

jorisvandenbossche · 2023-02-08T15:25:22Z

> This would be pretty inefficient for the new pyarrow dtypes given that IIUC they have native implementations of some of the relevant methods

FWIW the groupby kernels (like hash_sum) are not currently exposed as standalone kernels in pyarrow (they are only usable in the context of the full query engine).

A bit off-topic, but specifically for the pyarrow dtypes, I also think that we could actually use the support for masked arrays in our cython groupby algos. The only conversion step is then from the bitmask to a bytemask, which should be cheaper than converting to a numpy array (if there are nulls).
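As a rough sketch of that bitmask-to-bytemask step, assuming an Arrow-style validity bitmap (bits packed least-significant-bit first, 1 meaning valid) and a pandas-style boolean mask where True means missing. The helper name `bitmask_to_bytemask` is hypothetical.

```python
import numpy as np


def bitmask_to_bytemask(validity: bytes, length: int, offset: int = 0) -> np.ndarray:
    """Expand a packed validity bitmap into a boolean "is missing" mask."""
    packed = np.frombuffer(validity, dtype=np.uint8)
    # One output element per bit, lowest bit of each byte first.
    bits = np.unpackbits(packed, bitorder="little")
    valid = bits[offset : offset + length].astype(bool)
    # Validity uses 1 for "present"; the mask uses True for "missing".
    return ~valid
```

For example, the single byte `0b00001101` with `length=5` describes five values of which the second and fifth are missing. The data buffer itself is left untouched; only the mask is expanded.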

jbrockmendel · 2023-02-08T15:45:04Z

> specifically for the pyarrow dtypes, I also think that we could actually use the support for masked arrays in our cython groupby algos

+1. i think something like a _to_masked_array and _from_masked_array would be useful for a bunch of these methods medium-term.
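A hypothetical sketch of what such a pair of methods could look like. Neither `_to_masked_array` nor `_from_masked_array` is specified in this thread beyond its name, so the signatures, the `ToyNullableInts` class, and the `masked_group_sum` stand-in kernel are all assumptions.

```python
import numpy as np


class ToyNullableInts:
    """Hypothetical array that stores None for missing values."""

    def __init__(self, data):
        self.data = list(data)

    def _to_masked_array(self):
        # Values plus a boolean mask where True marks a missing entry.
        mask = np.array([v is None for v in self.data], dtype=bool)
        values = np.array([0 if v is None else v for v in self.data], dtype=np.int64)
        return values, mask

    @classmethod
    def _from_masked_array(cls, values, mask):
        return cls(None if m else v.item() for v, m in zip(values, mask))


def masked_group_sum(values, mask, codes, ngroups):
    """Stand-in for a mask-aware grouped kernel: per-group sum, skipping masked rows."""
    out = np.zeros(ngroups, dtype=values.dtype)
    seen = np.zeros(ngroups, dtype=bool)
    keep = ~mask & (codes >= 0)
    np.add.at(out, codes[keep], values[keep])  # unbuffered add handles repeated codes
    seen[codes[keep]] = True
    return out, ~seen  # in this toy, a group with no valid rows comes back masked


arr = ToyNullableInts([1, None, 3, None])
codes = np.array([0, 0, 1, 2])
values, mask = arr._to_masked_array()
result = ToyNullableInts._from_masked_array(*masked_group_sum(values, mask, codes, 3))
```

Here `result.data` is `[1, 3, None]`: the kernel only ever sees a values array and a mask, and the array type decides how to produce and consume them.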

jbrockmendel · 2023-02-15T23:58:21Z

Closing, as I've convinced myself the cython quantile function is unnecessary. xref #51385

jbrockmendel added 8 commits January 26, 2023 13:09

REF: implemeent EA.groupby_quantile

89c97cf

REF: simplify groupby_quantile

df77e66

REF: separate out BaseMaskedArray.groupby_quantile

a9ee791

REF: implement groupby_quantile_ndim_compat

c441208

lint fixup

d9f053f

lint fixup

8c540fa

mypy fixup

fcc0bcb

Merge branch 'main' into ref-ea-gb-quartile

9e9c6b3

jbrockmendel added the Groupby and quantile (quantile method) labels Feb 1, 2023

jbrockmendel added 3 commits February 1, 2023 14:02

Merge branch 'main' into ref-ea-gb-quartile

5d9a9a8

Merge branch 'main' into ref-ea-gb-quartile

4b9bc6b

Merge branch 'main' into ref-ea-gb-quartile

45cb8a8

jbrockmendel mentioned this pull request Feb 1, 2023

REF: implement groupby std, min, max as EA methods #51116

Closed

jbrockmendel mentioned this pull request Feb 4, 2023

REF: let EAs override WrappedCythonOp groupby implementations #51166

Merged

jbrockmendel added 2 commits February 7, 2023 13:37

Merge branch 'main' into ref-ea-gb-quartile

1529035

REF: privatize groupby_quantile

ae135d6

typo fixup

f152220

jbrockmendel mentioned this pull request Feb 7, 2023

REF: Make function for dtype casting in pre-processing of groupby #37505

Open

jbrockmendel closed this Feb 15, 2023

jbrockmendel deleted the ref-ea-gb-quartile branch February 15, 2023 23:58
