PERF: better use of searchsorted for indexing performance #14565

jorisvandenbossche · 2016-11-02T15:20:04Z

Follow-up on #14551

When searchsorted is used (eg as labels.searchsorted(loc)), it is important that loc is first cast to the same dtype as labels for performance reasons.

PR #14551 fixed one such example, but further actions:

check the other usages of searchsorted for this pattern
add benchmarks to track the improvement

The text was updated successfully, but these errors were encountered:

mroeschke · 2017-01-27T19:03:24Z

Hi @jorisvandenbossche

I've been playing around with your original speedup:

try:
    loc = labels.dtype.type(loc)
except: TypeError:
    pass

labels.searchsorted(loc)

And I noticed that normally searchsorted is usually called on a FrozenNDArray object here (for MultiIndexes at least). Would it make sense to create a _searchsorted method in FrozenNDArray like so:

class FrozenNDArray(...):
    ...
    ...
    def _searchsorted(self, v, side='left', sorter=None):
        try:
            v = self.values.dtype.type(v)
        except: TypeError:
            pass
        return self.seachsorted(self.values, v,  side=side, sorter=sorter)

And then call searchsorted like so?

labels._searchsorted(loc)

jorisvandenbossche · 2017-01-28T22:00:10Z

@mroeschke That seems to make sense, yes. Do you know if there are many occurrences of the use of searchsorted that should be adapted?
It may be cleaner to make a private helper function/method that does this instead adding it to the FrozenNDArray.

mroeschke · 2017-02-07T01:23:07Z

Sorry for the delay in response @jorisvandenbossche. Reviewing where searchsorted can be adapted:

There are four occurrences in this block that speeds up df.loc[(idx,)]
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L1636
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L1639

Two more in this block (not sure of the performance use case, but similar code path to your original fix):
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L1933

And I believe one on the _bound property (not sure where this is used):
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L2309

So maybe a private helper function in pandas/indexes/multi.py might be better here.

Also I saw that there's a Series.searchsorted() and Index.searchsorted() functionality as well. I did not dig into their implementations too much, but they may also benefit if they are subclassed from NDarray?

jorisvandenbossche · 2017-02-07T10:36:34Z

So maybe a private helper function in pandas/indexes/multi.py might be better here.

+1

The _bound I am also not directly sure what it is for (from a quick search, I actually don't find any place where this is used)

Also I saw that there's a Series.searchsorted() and Index.searchsorted() functionality as well.

I would first focus on the internal use cases. In those cases, you would need a bit more checking, as you can in principle pass a float to a searchsorted of ints, and doing a dtype.type(val) would round the float and can give different results, so you would need to check the equality before and after (in the internal multi-index case, we know that the locs are no floats). The searchsorted method on Series and Index just pass things through to numpy searchsorted, so would leave that to keep things simpler.

jorisvandenbossche · 2017-05-15T15:39:59Z

In the meantime, I think this has been fixed in the deprecation panel PR by @jreback (see discussion here: #15601 (comment), and this commit: c39453a)

So basically the approach of having a FrozenNDArray searchsorted method to handle this as proposed above

jorisvandenbossche · 2017-05-15T15:41:09Z

Would be good to still do the second point in the original issue here: add benchmarks for the affected cases.

gfyoung · 2018-10-28T08:18:12Z

@jreback @jorisvandenbossche @mroeschke : I'm planning to deprecate FrozenNDArray pretty soon (see here for rationale).

Not to mention, the performance optimization for FrozenNDArray was handled in #15601.

Given that FrozenNDArray was pretty central to this conversation, any reason to keep this open?

jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Difficulty Intermediate labels Nov 2, 2016

jorisvandenbossche modified the milestones: Next Major Release, 0.20.0 Nov 2, 2016

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jreback added the Prio-high label Mar 29, 2017

jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017

jorisvandenbossche modified the milestones: 0.20.0, Interesting Issues May 15, 2017

gfyoung removed this from the 0.20.0 milestone Oct 28, 2018

jreback mentioned this issue Oct 28, 2018

DEPR: Panel deprecated #15601

Closed

gfyoung mentioned this issue Oct 28, 2018

DEPR: Deprecate FrozenNDArray #23396

Merged

topper-123 mentioned this issue Nov 20, 2018

PERF/REF: improve performance of Series.searchsorted, PandasArray.searchsorted, collect functionality #22034

Merged

4 tasks

jreback added this to the 0.25.0 milestone Feb 2, 2019

jreback closed this as completed in #22034 Feb 24, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: better use of searchsorted for indexing performance #14565

PERF: better use of searchsorted for indexing performance #14565

jorisvandenbossche commented Nov 2, 2016

mroeschke commented Jan 27, 2017

jorisvandenbossche commented Jan 28, 2017

mroeschke commented Feb 7, 2017 •

edited

Loading

jorisvandenbossche commented Feb 7, 2017

jorisvandenbossche commented May 15, 2017

jorisvandenbossche commented May 15, 2017

gfyoung commented Oct 28, 2018 •

edited

Loading

PERF: better use of searchsorted for indexing performance #14565

PERF: better use of searchsorted for indexing performance #14565

Comments

jorisvandenbossche commented Nov 2, 2016

mroeschke commented Jan 27, 2017

jorisvandenbossche commented Jan 28, 2017

mroeschke commented Feb 7, 2017 • edited Loading

jorisvandenbossche commented Feb 7, 2017

jorisvandenbossche commented May 15, 2017

jorisvandenbossche commented May 15, 2017

gfyoung commented Oct 28, 2018 • edited Loading

mroeschke commented Feb 7, 2017 •

edited

Loading

gfyoung commented Oct 28, 2018 •

edited

Loading