-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: casting loc to labels dtype before searchsorted #14551
PERF: casting loc to labels dtype before searchsorted #14551
Conversation
@@ -1907,6 +1907,7 @@ def convert_indexer(start, stop, step, indexer=indexer, labels=labels): | |||
return np.array(labels == loc, dtype=bool) | |||
else: | |||
# sorted, so can return slice object -> view | |||
loc = labels.dtype.type(loc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe should do this like
loc, orig_loc = lables.dtype.type(loc), loc
if loc != orig_loc:
loc = orig_loc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am not sure how much checking would be needed here.
My understanding is that it probably is not needed in this case, as loc
is coming from loc = level_index.get_loc(key)
, and labels
and level_index
are from the same MultiIndex. So I would assume that get_loc
can only return an existing label, and so should fit in the dtype of labels
?
(but probably also not that a perf issue to do the check)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that sounds right; if the test suite passes, prob ok!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently not :-)
But it's only a partial indexing test that fails. So loc
could also be a slice, and it is expected in partial indexing that this raises an error in searchsorted, but now it raises already a slightly different error in dtype.type(loc)
. So that can easily be solved with:
try:
loc = labels.dtype.type(loc)
except TypeError:
# this occurs when loc is a slice (partial string indexing)
# but the TypeError raised by searchsorted in this case
# is catched in Index._has_valid_type()
pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, or just tests if its a scalar to begin with
Current coverage is 85.26% (diff: 100%)@@ master #14551 diff @@
==========================================
Files 140 140
Lines 50672 50676 +4
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43207 43211 +4
Misses 7465 7465
Partials 0 0
|
lgtm. just release note. I think this can only help :> |
OK, added whatsnew notice. Will merge, but will open a new issue for follow-up on this (other cases, benchmarks that capture this (at the moment there are none)). |
Version 0.19.1 * tag 'v0.19.1': (43 commits) RLS: v0.19.1 DOC: update whatsnew/release notes for 0.19.1 (pandas-dev#14573) [Backport pandas-dev#14545] BUG/API: Index.append with mixed object/Categorical indices (pandas-dev#14545) DOC: rst fixes [Backport pandas-dev#14567] DEPR: add deprecation warning for com.array_equivalent (pandas-dev#14567) [Backport pandas-dev#14551] PERF: casting loc to labels dtype before searchsorted (pandas-dev#14551) [Backport pandas-dev#14536] BUG: DataFrame.quantile with NaNs (GH14357) (pandas-dev#14536) [Backport pandas-dev#14520] BUG: don't close user-provided file handles in C parser (GH14418) (pandas-dev#14520) [Backport pandas-dev#14392] BUG: Dataframe constructor when given dict with None value (pandas-dev#14392) [Backport pandas-dev#14514] BUG: Don't parse inline quotes in skipped lines (pandas-dev#14514) [Bacport pandas-dev#14543] BUG: tseries ceil doc fix (pandas-dev#14543) [Backport pandas-dev#14541] DOC: Simplify the gbq integration testing procedure for contributors (pandas-dev#14541) [Backport pandas-dev#14527] BUG/ERR: raise correct error when sql driver is not installed (pandas-dev#14527) [Backport pandas-dev#14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (pandas-dev#14501) [Backport pandas-dev#14442] DOC: Expand on reference docs for read_json() (pandas-dev#14442) BLD: fix 3.4 build for cython to 0.24.1 [Backport pandas-dev#14492] BUG: Accept unicode quotechars again in pd.read_csv [Backport pandas-dev#14496] BLD: Support Cython 0.25 [Backport pandas-dev#14498] COMPAT/TST: fix test for range testing of negative integers to neg powers [Backport pandas-dev#14476] PERF: performance regression in Series.asof (pandas-dev#14476) ...
Intrigued by the profiling results of the below example (Multi-index
loc
indexing, based on the example in #14549), wheresearchsorted
seemed to take the majority of the computation time.And it seems that
searchsorted
casts both inputs (in this caselabels
andloc
) to a common dtype, and thelabels
of the MultiIndex were in this caseint16
, whileloc
(output fromIndex.get_loc
) is a python int.By casting
loc
to the dtype oflabels
, the specific example gets a ca 20 x speed improvementOn master:
with this PR:
Just putting it here as a showcase.
Actual PR probably needs some more work (other places where this can be done, can loc ever be out of bound for that dtype?, benchmarks, ..)