DataFrame groupby is extremely slow when grouping by a column of pandas Period values #18053
Comments
of course, Periods are object dtypes.
well it's also related to the offset caching issues that @jbrockmendel has been looking at
We're a few steps away from having this fixed. In the interim, the workaround I've been using is casting to strings, then groupby/sort/whatever, then casting back to Period.
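A minimal sketch of that string round-trip (not code from the thread), assuming a DataFrame `df` with a monthly-Period column `'month_periods'` and a numeric column `'x'` as in the report below:

```python
import pandas as pd

# Cast the Period column to strings, group, then convert the result's
# index back to Periods. (Sketch only; `df` is assumed to exist.)
key = df['month_periods'].astype(str)                   # e.g. '2017-10'
result = df.groupby(key)['x'].sum()
result.index = pd.PeriodIndex(result.index, freq='M')   # back to monthly Periods
```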
@jreback, it is fine that a series of pandas Periods has dtype `object`. But grouping by Period values is orders of magnitude slower than grouping by other object-dtype values, such as the datetime.date column in the example.
@nmusolino and you are welcome to have a look.
Some updated timings.
On the PeriodArray PR, we're down to:

In [4]: %timeit df.groupby('month_periods')['x'].sum()
6.2 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Which I think is good enough to call this closed, though @nmusolino if you want to do more optimizations after #22862 lands, then feel free.
That's great news, thank you very much!
Pandas Period objects are slow to groupby: pandas-dev/pandas#18053. Fix by using `dt.to_period` and converting to strings before groupby.
Steps to reproduce
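The original reproduction snippet did not survive extraction; below is a minimal sketch consistent with the description that follows (120,000 rows; an integer key, a datetime.date key, and a pandas.Period key). Only `'month_periods'` and `'x'` are column names taken from the report; the others are placeholders.

```python
import numpy as np
import pandas as pd

n = 120_000
df = pd.DataFrame({
    'x': np.random.randn(n),
    'timestamp': pd.date_range('2000-01-01', periods=n, freq='H'),
})
df['month_int'] = df['timestamp'].dt.year * 100 + df['timestamp'].dt.month   # integer key
df['month_dates'] = df['timestamp'].dt.date.map(lambda d: d.replace(day=1))  # datetime.date key
df['month_periods'] = df['timestamp'].dt.to_period('M')                      # Period key

# In IPython:
# %timeit df.groupby('month_int')['x'].sum()
# %timeit df.groupby('month_dates')['x'].sum()
# %timeit df.groupby('month_periods')['x'].sum()
```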
Problem description
When a DataFrame column contains pandas.Period values and the user attempts to group by this column, the resulting operation is extremely slow compared to grouping by columns of integers or by columns of other Python objects.
In the example above, a DataFrame with 120,000 rows is created, and a groupby operation is performed on three columns. On the integer column, the groupby-sum took 2.3 milliseconds; on the column containing datetime.date objects, the groupby-sum took 6.7 milliseconds; and on the column containing pandas.Period objects, the groupby-sum took 2.4 seconds.
Note that in this case, the dtype of the `'month_periods'` column is `object`. I attempted to convert this column to a period-specific data type using `df['month_periods'].astype('period[M]')`, but this led to a TypeError: `TypeError: data type "period[M]" not understood`. In any case, the series was returned by `.dt.to_period('M')`, so I would expect this to be a well-formed series of periods.

Expected Behavior
When grouping on a period column, it should be possible to group by the underlying integer values used for storing periods, and thus the performance should roughly match the performance of grouping by integers.
In the worst case, the performance should match the performance of comparing small Python objects (i.e. those with trivial `__eq__` functions).
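To illustrate the expectation, here is a hypothetical sketch (not part of the original report) that groups by the integer ordinals backing each Period by hand, via `Period.ordinal`, and then restores Period labels on the result:

```python
import pandas as pd

# Group by the int64 ordinal behind each Period, which is roughly what a
# native period dtype could do internally. Assumes the `df` sketched above.
ordinals = df['month_periods'].map(lambda p: p.ordinal)
result = df.groupby(ordinals)['x'].sum()
result.index = pd.PeriodIndex([pd.Period(ordinal=o, freq='M') for o in result.index])
```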
Workaround
Making the column categorical avoids the performance hit and roughly matches the integer column performance:
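For example, a sketch of this workaround (the original snippet was not preserved), assuming the DataFrame from the reproduction sketch above:

```python
# Convert the object-dtype Period column to a categorical before grouping.
df['month_cat'] = df['month_periods'].astype('category')

# %timeit df.groupby('month_cat')['x'].sum()   # roughly as fast as the integer key
result = df.groupby('month_cat')['x'].sum()
```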
Output of `pd.show_versions()`