PERF: casting loc to labels dtype before searchsorted #14551

jorisvandenbossche · 2016-11-01T15:08:03Z

I was intrigued by the profiling results of the example below (MultiIndex loc indexing, based on the example in #14549), where searchsorted seemed to take the majority of the computation time.
It seems that searchsorted casts both inputs (in this case labels and loc) to a common dtype. The labels of the MultiIndex were int16 in this case, while loc (the output of Index.get_loc) is a Python int.

By casting loc to the dtype of labels, this specific example gets a roughly 20x speedup:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(500*5000)}, index=pd.MultiIndex.from_product([pd.date_range("2014-01-01", periods=500), range(5000)]))
dt = pd.Timestamp('2015-01-01')
%timeit df.loc[dt]

On master:

In [3]: %timeit df.loc[dt]
The slowest run took 5.70 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 9.39 ms per loop

with this PR:

In [3]: %timeit df.loc[dt]
The slowest run took 122.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 422 µs per loop
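The dtype effect behind these timings can be isolated with NumPy alone. The following is a minimal sketch, assuming a sorted int16 array as a stand-in for the labels of one MultiIndex level (the array contents and the value of `loc` are made up for illustration). How large the gap is, or whether there is one at all, depends on the NumPy version and its promotion rules for Python ints.

```python
import timeit

import numpy as np

# Hypothetical stand-in for the labels of one level of a sorted MultiIndex:
# 500 distinct codes, each repeated 5000 times, stored as int16.
labels = np.repeat(np.arange(500, dtype=np.int16), 5000)
loc = 365  # a plain Python int, like the value Index.get_loc returns

# Lookup with the Python int; NumPy may need to bring both inputs
# to a common dtype first.
slow = labels.searchsorted(loc, side="left")

# Same lookup after casting loc to the dtype of labels.
typed_loc = labels.dtype.type(loc)
fast = labels.searchsorted(typed_loc, side="left")

# The position found is identical; only the amount of work may differ.
assert slow == fast

t_int = timeit.timeit(lambda: labels.searchsorted(loc), number=20)
t_typed = timeit.timeit(lambda: labels.searchsorted(typed_loc), number=20)
print(f"python int: {t_int:.5f}s, int16 scalar: {t_typed:.5f}s")
```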

Just putting it here as a showcase.
The actual PR probably needs some more work (other places where this can be done, whether loc can ever be out of bounds for that dtype, benchmarks, ...)

jreback · 2016-11-01T15:18:55Z

pandas/indexes/multi.py

@@ -1907,6 +1907,7 @@ def convert_indexer(start, stop, step, indexer=indexer, labels=labels):
                return np.array(labels == loc, dtype=bool)
            else:
                # sorted, so can return slice object -> view
+                loc = labels.dtype.type(loc)


maybe should do this like

loc, orig_loc = labels.dtype.type(loc), loc
if loc != orig_loc:
    loc = orig_loc

Yeah, I am not sure how much checking would be needed here.

My understanding is that it probably is not needed in this case, as loc is coming from loc = level_index.get_loc(key), and labels and level_index are from the same MultiIndex. So I would assume that get_loc can only return an existing label, and so should fit in the dtype of labels?

(but doing the check is probably not much of a perf issue either)
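For reference, a guarded version of the cast could look like the sketch below. `cast_loc_if_fits` is a hypothetical helper, not code from this PR; it assumes loc is an integer and labels has an integer dtype, and it checks the bounds with `np.iinfo` before casting instead of comparing after the cast, since an out-of-range value can either wrap around or raise depending on the NumPy version.

```python
import numpy as np


def cast_loc_if_fits(loc, dtype):
    """Hypothetical helper: cast loc to dtype only if the value fits.

    Assumes loc is an integer and dtype is an integer NumPy dtype.
    """
    info = np.iinfo(dtype)
    if info.min <= loc <= info.max:
        return dtype.type(loc)
    # Out of range for this dtype: keep the original value untouched.
    return loc


int16 = np.dtype("int16")
small = cast_loc_if_fits(365, int16)    # fits, becomes an int16 scalar
large = cast_loc_if_fits(40000, int16)  # does not fit, stays a Python int
```

If get_loc can only return an existing label, the fallback branch is never taken and the check costs two integer comparisons.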

that sounds right; if the test suite passes, prob ok!

Apparently not :-)
But it's only a partial indexing test that fails. loc can also be a slice, and in partial indexing searchsorted is expected to raise an error in that case, but now dtype.type(loc) already raises a slightly different error first. So that can easily be solved with:

try:
    loc = labels.dtype.type(loc)
except TypeError:
    # this occurs when loc is a slice (partial string indexing)
    # but the TypeError raised by searchsorted in this case
    # is caught in Index._has_valid_type()
    pass

sure, or just test if it's a scalar to begin with
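A sketch of that alternative, checking the type up front instead of catching the TypeError. `maybe_cast_loc` is a hypothetical name and this is not the code that was merged; it assumes only integer scalars should be cast and everything else (such as a slice) should pass through unchanged.

```python
import numpy as np


def maybe_cast_loc(loc, labels):
    """Hypothetical helper: cast integer scalars to the dtype of labels."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_int_scalar = isinstance(loc, (int, np.integer)) and not isinstance(loc, bool)
    if is_int_scalar:
        return labels.dtype.type(loc)
    # Slices and other non-scalar values are returned untouched, so any
    # later error is raised by searchsorted as before.
    return loc


labels = np.array([0, 0, 1, 1, 2], dtype=np.int16)
cast_scalar = maybe_cast_loc(1, labels)          # int16 scalar
kept_slice = maybe_cast_loc(slice(0, 2), labels)  # unchanged slice
```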

codecov-io · 2016-11-01T17:14:13Z

Current coverage is 85.26% (diff: 100%)

Merging #14551 into master will increase coverage by <.01%

@@             master     #14551   diff @@
==========================================
  Files           140        140          
  Lines         50672      50676     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43207      43211     +4   
  Misses         7465       7465          
  Partials          0          0

Powered by Codecov. Last update 60a335e...6447e4c

jreback · 2016-11-01T22:19:17Z

lgtm. just release note. I think this can only help :>

jorisvandenbossche · 2016-11-02T15:16:05Z

OK, added whatsnew notice. Will merge, but will open a new issue for follow-up on this (other cases, benchmarks that capture this (at the moment there are none)).


PERF: casting loc to labels dtype before searchsorted

b580799

jreback reviewed Nov 1, 2016


catch error if loc is a slice (partial indexing)

6447e4c

jorisvandenbossche mentioned this pull request Nov 1, 2016

Pandas 0.12 is much faster than Pandas 0.18 #14549

Closed

jreback added the Indexing and Performance labels Nov 1, 2016

jreback added this to the 0.19.1 milestone Nov 1, 2016

add whatsnew notice

700afa5

jorisvandenbossche merged commit 1d95179 into pandas-dev:master Nov 2, 2016

jorisvandenbossche mentioned this pull request Nov 2, 2016

PERF: better use of searchsorted for indexing performance #14565

Closed

jorisvandenbossche added a commit that referenced this pull request Nov 3, 2016

[Backport #14551] PERF: casting loc to labels dtype before searchsort…

a95ce63

…ed (#14551) (cherry picked from commit 1d95179)

jorisvandenbossche mentioned this pull request Apr 7, 2017

DEPR: Panel deprecated #15601

Closed
