Series.__getitem__ materializes Categorical to ndarray #19318

Closed
TomAugspurger opened this issue Jan 19, 2018 · 3 comments
Labels
Categorical (Categorical Data Type), Indexing (Related to indexing on series/frames, not to indexes themselves), Performance (Memory or execution speed performance)

@TomAugspurger
Contributor

For

In [12]: c = pd.Series(pd.Categorical(['a'] * 1000))

In [13]: c[0]

we hit

def get_value(self, series, key):
    """
    Fast lookup of value from 1-dimensional ndarray. Only use this if you
    know what you're doing
    """
    # if we have something that is Index-like, then
    # use this, e.g. DatetimeIndex
    s = getattr(series, '_values', None)
    if isinstance(s, Index) and is_scalar(key):
        try:
            return s[key]
        except (IndexError, ValueError):
            # invalid type as an indexer
            pass

    s = _values_from_object(series)
    k = _values_from_object(key)

_values_from_object calls series.get_values(), which hits Categorical.get_values, which coerces to the ndarray of values.
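
To make the cost concrete, here is a minimal sketch of the two ways to reach a single element. It uses only public API (`Categorical.codes`, `Categorical.categories`, `np.asarray`); `np.asarray` is my stand-in for the internal `get_values` path and is an assumption, not the exact call pandas makes.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a"] * 1000)

# Compact representation: one integer code per element plus the categories.
codes = cat.codes            # integer ndarray, length 1000
categories = cat.categories  # Index holding the single category 'a'

# Materializing expands all 1000 codes into a full ndarray of values ...
dense = np.asarray(cat)

# ... whereas a scalar lookup only needs one code and one category.
pos = 0
via_dense = dense[pos]
via_codes = categories[codes[pos]]
```

Both lookups return the same value; only the first pays for building an array proportional to the length of the Series.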

I have a branch based on my ExtensionArray stuff that "fixes" this by seeing if s is an instance of ExtensionArray, which has the correct semantics for what we need here. But that's not necessarily the best fix here.
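
Roughly, the idea of short-circuiting on the array type could look like the toy sketch below. Everything in it (`ToyExtensionArray`, `ToyCategorical`, `get_value`, `get_value_dense`) is hypothetical and written in plain Python for illustration; it is not the code on the branch and not pandas' interface.

```python
class ToyExtensionArray:
    """Hypothetical stand-in for an ExtensionArray-like base class."""

    def __getitem__(self, pos):
        raise NotImplementedError

    def to_dense(self):
        raise NotImplementedError


class ToyCategorical(ToyExtensionArray):
    """Stores one integer code per element plus the list of categories."""

    def __init__(self, values):
        self.categories = sorted(set(values))
        lookup = {cat: i for i, cat in enumerate(self.categories)}
        self.codes = [lookup[v] for v in values]
        self.dense_calls = 0  # counts full materializations

    def __getitem__(self, pos):
        # Scalar lookup touches a single code; no dense copy is built.
        return self.categories[self.codes[pos]]

    def to_dense(self):
        # Expands every code to its value: O(n) time and memory.
        self.dense_calls += 1
        return [self.categories[c] for c in self.codes]


def get_value(values, pos):
    """Hypothetical lookup: index extension arrays directly, else densify."""
    if isinstance(values, ToyExtensionArray):
        return values[pos]
    return list(values)[pos]


def get_value_dense(values, pos):
    """The slow path: materialize everything, then index."""
    return values.to_dense()[pos]


c = ToyCategorical(["a"] * 1000)
fast = get_value(c, 0)
calls_after_fast = c.dense_calls
slow = get_value_dense(c, 0)
calls_after_slow = c.dense_calls
```

The `dense_calls` counter is only there to show that the type check avoids the materialization entirely, while the dense path triggers it on every lookup.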

master:

In [3]: c = pd.Series(pd.Categorical(['a'] * 1000))

In [4]: %timeit c[0]
50.1 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

My branch:

In [4]: %timeit c[0]
5.76 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

TomAugspurger added the Indexing, Performance, Categorical, and Difficulty Intermediate labels on Jan 19, 2018
TomAugspurger changed the title from "Series.__getitem__ with scalar materializes Categorical to ndarray" to "Series.__getitem__ materializes Categorical to ndarray" on Jan 19, 2018

@TomAugspurger
Contributor Author

Note that this affects all Series.__getitem__ operations, since this is tried at the start of Series.__getitem__, before falling back to _get_with.

@jreback
Contributor

jreback commented Jan 19, 2018

this is the same as this: #19214

@TomAugspurger
Contributor Author

Ah indeed.
