-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
equality comparison with a scalar is slow for category (performance regression) #23814
Comments
It seems like a lot of time is spent in if isinstance(data, (ABCSeries, ABCIndexClass)):
data = data._values
if isinstance(data, type(self)):
dtype = data.dtype
codes = data.codes We would need to handle not-default args for
@colinfang can you post your pandas version? |
I am using v0.23.4. The performance for v0.20.3 is good. |
Can you see if my suggestion would fix it? I think it may need to happen
even before
https://github.com/pandas-dev/pandas/blob/029d57cb828b053d112ed4e23f446ee07d147935/pandas/core/arrays/categorical.py#L331,
because we need to know whether the user passed None for all the optional
parameters.
…On Tue, Nov 20, 2018 at 6:30 AM colinfang ***@***.***> wrote:
I am using v0.23.4. The performance for v0.20.3 is good.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#23814 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHInyzbvgbkKtMKetkzNMTK_Uvb3AMks5uw_X2gaJpZM4Yq9mh>
.
|
@TomAugspurger / @colinfang , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.init, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case. I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)? if isinstance(values, (ABCSeries, ABCIndexClass)):
values = values._values
if isinstance(values, type(self)):
self._dtype = values.dtype
self._codes = values.codes
return In [1]: import pandas as pd
In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')
In [3]: %timeit s == 'a'
3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) |
I'd say go ahead and work on this.
You'd need to add an ASV in asv_bench/benchmarks/categoricals.py for timing
Categorical(categorical).
…On Tue, Nov 20, 2018 at 5:09 PM Erik ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> / @colinfang
<https://github.com/colinfang> , I tried inserting code similar to Tom's
suggestion, near the beginning of Categorical.*init*, and it appears it
does indeed fix the perf issue. I also noticed there is a "fastpath" in
that method also, not sure if that is applicable in this case.
I'd be interested in working on a fix for this issue, unless you were
wanting to @colinfang <https://github.com/colinfang>. I understand there
would need to be some more investigation about where exactly to do this
short-circuiting, and handling non-default args. Also, are there perf tests
that can be used for this type of thing (those are probably tricky since it
may be machine dependent, not sure if there is a good solution for that)?
if isinstance(values, (ABCSeries, ABCIndexClass)):
values = values._valuesif isinstance(values, type(self)):
self._dtype = values.dtype
self._codes = values.codes
return
In [1]: import pandas as pd
In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')
In [3]: %timeit s == 'a'3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23814 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIofWioeIOoIYkS8aCOjaEAU0tKVKks5uxIu9gaJpZM4Yq9mh>
.
|
One option I saw may be to short-circuit within _recode_for_categories() (not sure if this would help in any other scenarios also). It may be safer, since all of the other logic for the other parameters runs as it did before? In _recode_for_categories(), I tried adding the following code: if new_categories.equals(old_categories):
# Same categories, so no need to actually recode
return codes.copy() Which yielded: In [3]: %timeit s == 'a'
7.29 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.14 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) If I instead add the following code basically at the beginning of init() (the approach first discussed): if categories is None and ordered is None and dtype is None:
if isinstance(values, (ABCSeries, ABCIndexClass)):
values = values._values
if isinstance(values, type(self)):
self._dtype = values.dtype
self._codes = values.codes.copy()
return This yielded: In [3]: %timeit s == 'a'
5.38 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) So, not a huge difference between the two approaches. Also, I'm not sure whether I need to copy() the codes as I have in the above snippets. @TomAugspurger any thoughts? I can just send a PR and we can continue to discuss there if you prefer.. |
I think the change to The downside to adding a |
Sounds good, I will go with the init change. |
Are the following 2 ways to compare a series to a scalar equivalent (ignore missing values)? I have to write the hard way in order to take advantage of the category properties.
The text was updated successfully, but these errors were encountered: