equality comparison with a scalar is slow for category (performance regression) #23814

colinfang · 2018-11-20T12:02:03Z

Are the following 2 ways to compare a series to a scalar equivalent (ignore missing values)? I have to write the hard way in order to take advantage of the category properties.

```python
x = pd.Series(list('abcd') * 1000000).astype('category')
%timeit x == 'a'
# 10 loops, best of 3: 25.2 ms per loop
%timeit x.cat.codes == x.cat.categories.get_loc('a')
# 1000 loops, best of 3: 750 µs per loop
```

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2018-11-20T12:18:10Z

It seems like a lot of time is spent in Categorical.__init__(Series[Categorical]), when that should essentially just be

if isinstance(data, (ABCSeries, ABCIndexClass)):
    data = data._values
if isinstance(data, type(self)):
    dtype = data.dtype
    codes = data.codes

We would need to handle not-default args for categories, ordered, and codes.

Timer unit: 1e-06 s

Total time: 0.02405 s
File: /Users/taugspurger/sandbox/pandas/pandas/core/arrays/categorical.py
Function: __init__ at line 329

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   329                                               def __init__(self, values, categories=None, ordered=None, dtype=None,
   330                                                            fastpath=False):
   331                                           
   332                                                   # Ways of specifying the dtype (prioritized ordered)
   333                                                   # 1. dtype is a CategoricalDtype
   334                                                   #    a.) with known categories, use dtype.categories
   335                                                   #    b.) else with Categorical values, use values.dtype
   336                                                   #    c.) else, infer from values
   337                                                   #    d.) specifying dtype=CategoricalDtype and categories is an error
   338                                                   # 2. dtype is a string 'category'
   339                                                   #    a.) use categories, ordered
   340                                                   #    b.) use values.dtype
   341                                                   #    c.) infer from values
   342                                                   # 3. dtype is None
   343                                                   #    a.) use categories, ordered
   344                                                   #    b.) use values.dtype
   345                                                   #    c.) infer from values
   346         1          2.0      2.0      0.0          if dtype is not None:
   347                                                       # The dtype argument takes precedence over values.dtype (if any)
   348                                                       if isinstance(dtype, compat.string_types):
   349                                                           if dtype == 'category':
   350                                                               dtype = CategoricalDtype(categories, ordered)
   351                                                           else:
   352                                                               msg = "Unknown `dtype` {dtype}"
   353                                                               raise ValueError(msg.format(dtype=dtype))
   354                                                       elif categories is not None or ordered is not None:
   355                                                           raise ValueError("Cannot specify both `dtype` and `categories`"
   356                                                                            " or `ordered`.")
   357                                           
   358                                                       categories = dtype.categories
   359                                           
   360         1         13.0     13.0      0.1          elif is_categorical(values):
   361                                                       # If no "dtype" was passed, use the one from "values", but honor
   362                                                       # the "ordered" and "categories" arguments
   363         1          5.0      5.0      0.0              dtype = values.dtype._from_categorical_dtype(values.dtype,
   364         1          5.0      5.0      0.0                                                           categories, ordered)
   365                                                   else:
   366                                                       # If dtype=None and values is not categorical, create a new dtype
   367                                                       dtype = CategoricalDtype(categories, ordered)
   368                                           
   369                                                   # At this point, dtype is always a CategoricalDtype
   370                                                   # if dtype.categories is None, we are inferring
   371                                           
   372         1          1.0      1.0      0.0          if fastpath:
   373                                                       self._codes = coerce_indexer_dtype(values, categories)
   374                                                       self._dtype = self._dtype.update_dtype(dtype)
   375                                                       return
   376                                           
   377                                                   # null_mask indicates missing values we want to exclude from inference.
   378                                                   # This means: only missing values in list-likes (not arrays/ndframes).
   379         1         10.0     10.0      0.0          null_mask = np.array(False)
   380                                           
   381                                                   # sanitize input
   382         1          8.0      8.0      0.0          if is_categorical_dtype(values):
   383         1          5.0      5.0      0.0              if dtype.categories is None:
   384                                                           dtype = CategoricalDtype(values.categories, dtype.ordered)
   385                                           
   386                                                   elif not isinstance(values, (ABCIndexClass, ABCSeries)):
   387                                                       # _sanitize_array coerces np.nan to a string under certain versions
   388                                                       # of numpy
   389                                                       values = maybe_infer_to_datetimelike(values, convert_dates=True)
   390                                                       if not isinstance(values, np.ndarray):
   391                                                           values = _convert_to_list_like(values)
   392                                                           from pandas.core.series import _sanitize_array
   393                                                           # By convention, empty lists result in object dtype:
   394                                                           if len(values) == 0:
   395                                                               sanitize_dtype = 'object'
   396                                                           else:
   397                                                               sanitize_dtype = None
   398                                                           null_mask = isna(values)
   399                                                           if null_mask.any():
   400                                                               values = [values[idx] for idx in np.where(~null_mask)[0]]
   401                                                           values = _sanitize_array(values, None, dtype=sanitize_dtype)
   402                                           
   403         1          2.0      2.0      0.0          if dtype.categories is None:
   404                                                       try:
   405                                                           codes, categories = factorize(values, sort=True)
   406                                                       except TypeError:
   407                                                           codes, categories = factorize(values, sort=False)
   408                                                           if dtype.ordered:
   409                                                               # raise, as we don't have a sortable data structure and so
   410                                                               # the user should give us one by specifying categories
   411                                                               raise TypeError("'values' is not ordered, please "
   412                                                                               "explicitly specify the categories order "
   413                                                                               "by passing in a categories argument.")
   414                                                       except ValueError:
   415                                           
   416                                                           # FIXME
   417                                                           raise NotImplementedError("> 1 ndim Categorical are not "
   418                                                                                     "supported at this time")
   419                                           
   420                                                       # we're inferring from values
   421                                                       dtype = CategoricalDtype(categories, dtype.ordered)
   422                                           
   423         1          7.0      7.0      0.0          elif is_categorical_dtype(values):
   424         1        394.0    394.0      1.6              old_codes = (values.cat.codes if isinstance(values, ABCSeries)
   425                                                                    else values.codes)
   426         1          5.0      5.0      0.0              codes = _recode_for_categories(old_codes, values.dtype.categories,
   427         1      23530.0  23530.0     97.8                                             dtype.categories)
   428                                           
   429                                                   else:
   430                                                       codes = _get_codes_for_values(values, dtype.categories)
   431                                           
   432         1         27.0     27.0      0.1          if null_mask.any():
   433                                                       # Reinsert -1 placeholders for previously removed missing values
   434                                                       full_codes = - np.ones(null_mask.shape, dtype=codes.dtype)
   435                                                       full_codes[~null_mask] = codes
   436                                                       codes = full_codes
   437                                           
   438         1         27.0     27.0      0.1          self._dtype = self._dtype.update_dtype(dtype)
   439         1          9.0      9.0      0.0          self._codes = coerce_indexer_dtype(codes, dtype.categories)

@colinfang can you post your pandas version?

colinfang · 2018-11-20T12:30:41Z

I am using v0.23.4. The performance for v0.20.3 is good.

TomAugspurger · 2018-11-20T12:38:19Z

Can you see if my suggestion would fix it? I think it may need to happen even before https://github.com/pandas-dev/pandas/blob/029d57cb828b053d112ed4e23f446ee07d147935/pandas/core/arrays/categorical.py#L331, because we need to know whether the user passed None for all the optional parameters.

…

On Tue, Nov 20, 2018 at 6:30 AM colinfang ***@***.***> wrote: I am using v0.23.4. The performance for v0.20.3 is good. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#23814 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHInyzbvgbkKtMKetkzNMTK_Uvb3AMks5uw_X2gaJpZM4Yq9mh> .

eoveson · 2018-11-20T23:09:40Z

@TomAugspurger / @colinfang , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.init, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case.

I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)?

if isinstance(values, (ABCSeries, ABCIndexClass)):
    values = values._values
if isinstance(values, type(self)):
    self._dtype = values.dtype
    self._codes = values.codes
    return

In [1]: import pandas as pd

In [2]: s = pd.Series(list('abcd') * 1000000).astype('category')

In [3]: %timeit s == 'a'
3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

TomAugspurger · 2018-11-22T14:57:04Z

I'd say go ahead and work on this. You'd need to add an ASV in asv_bench/benchmarks/categoricals.py for timing Categorical(categorical).

…

On Tue, Nov 20, 2018 at 5:09 PM Erik ***@***.***> wrote: @TomAugspurger <https://github.com/TomAugspurger> / @colinfang <https://github.com/colinfang> , I tried inserting code similar to Tom's suggestion, near the beginning of Categorical.*init*, and it appears it does indeed fix the perf issue. I also noticed there is a "fastpath" in that method also, not sure if that is applicable in this case. I'd be interested in working on a fix for this issue, unless you were wanting to @colinfang <https://github.com/colinfang>. I understand there would need to be some more investigation about where exactly to do this short-circuiting, and handling non-default args. Also, are there perf tests that can be used for this type of thing (those are probably tricky since it may be machine dependent, not sure if there is a good solution for that)? if isinstance(values, (ABCSeries, ABCIndexClass)): values = values._valuesif isinstance(values, type(self)): self._dtype = values.dtype self._codes = values.codes return In [1]: import pandas as pd In [2]: s = pd.Series(list('abcd') * 1000000).astype('category') In [3]: %timeit s == 'a'3.13 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')3.25 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#23814 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIofWioeIOoIYkS8aCOjaEAU0tKVKks5uxIu9gaJpZM4Yq9mh> .

eoveson · 2018-11-24T03:04:02Z

One option I saw may be to short-circuit within _recode_for_categories() (not sure if this would help in any other scenarios also). It may be safer, since all of the other logic for the other parameters runs as it did before? In _recode_for_categories(), I tried adding the following code:

if new_categories.equals(old_categories):
    # Same categories, so no need to actually recode
    return codes.copy()

Which yielded:

In [3]: %timeit s == 'a'
7.29 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit s.cat.codes == s.cat.categories.get_loc('a')
3.14 ms ± 29.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

If I instead add the following code basically at the beginning of init() (the approach first discussed):

if categories is None and ordered is None and dtype is None:
    if isinstance(values, (ABCSeries, ABCIndexClass)):
        values = values._values
    if isinstance(values, type(self)):
        self._dtype = values.dtype
        self._codes = values.codes.copy()
        return

This yielded:

In [3]: %timeit s == 'a'
5.38 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So, not a huge difference between the two approaches. Also, I'm not sure whether I need to copy() the codes as I have in the above snippets. @TomAugspurger any thoughts? I can just send a PR and we can continue to discuss there if you prefer..

TomAugspurger · 2018-11-24T19:36:18Z

I think the change to Categorical.__init__ is clearly good.

The downside to adding a .equals check to _recode_for_categories is that you always have to pay the cost to check .equals, even if they aren't actually equal. I suspect that in many cases where you'd end up taking the shortcut in _recode_for_categories, you'll already know whether on not the categories are equal or not. Still, there may be some cases where you can't know that, so it may be OK to just check that.

eoveson · 2018-11-24T22:41:40Z

Sounds good, I will go with the init change.

TomAugspurger added Performance Memory or execution speed performance Categorical Categorical Data Type labels Nov 20, 2018

TomAugspurger added this to the Contributions Welcome milestone Nov 20, 2018

TomAugspurger added Effort Medium good first issue labels Nov 20, 2018

colinfang changed the title ~~equality comparison with a scalar is slow for category~~ equality comparison with a scalar is slow for category (performance regression) Nov 20, 2018

eoveson mentioned this issue Nov 24, 2018

PERF: For GH23814, return early in Categorical.__init__ #23888

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 0.24.0 Nov 29, 2018

jreback closed this as completed in #23888 Nov 30, 2018

This was referenced Feb 15, 2019

pd.Categorical(Series, categories=..) returns broken data with out-of-bound codes #25318

Closed

BUG: Fix potential segfault after pd.Categorical(pd.Series(...), categories=...) #25368

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

equality comparison with a scalar is slow for category (performance regression) #23814

equality comparison with a scalar is slow for category (performance regression) #23814

colinfang commented Nov 20, 2018 •

edited

Loading

TomAugspurger commented Nov 20, 2018

colinfang commented Nov 20, 2018

TomAugspurger commented Nov 20, 2018 via email

eoveson commented Nov 20, 2018

TomAugspurger commented Nov 22, 2018 via email

eoveson commented Nov 24, 2018

TomAugspurger commented Nov 24, 2018

eoveson commented Nov 24, 2018

equality comparison with a scalar is slow for category (performance regression) #23814

equality comparison with a scalar is slow for category (performance regression) #23814

Comments

colinfang commented Nov 20, 2018 • edited Loading

TomAugspurger commented Nov 20, 2018

colinfang commented Nov 20, 2018

TomAugspurger commented Nov 20, 2018 via email

eoveson commented Nov 20, 2018

TomAugspurger commented Nov 22, 2018 via email

eoveson commented Nov 24, 2018

TomAugspurger commented Nov 24, 2018

eoveson commented Nov 24, 2018

colinfang commented Nov 20, 2018 •

edited

Loading