Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: better use of searchsorted for indexing performance #14565

Closed
jorisvandenbossche opened this issue Nov 2, 2016 · 7 comments · Fixed by #22034
Closed

PERF: better use of searchsorted for indexing performance #14565

jorisvandenbossche opened this issue Nov 2, 2016 · 7 comments · Fixed by #22034
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Follow-up on #14551

When searchsorted is used (eg as labels.searchsorted(loc)), it is important that loc is first cast to the same dtype as labels for performance reasons.

PR #14551 fixed one such example, but further actions:

  • check the other usages of searchsorted for this pattern
  • add benchmarks to track the improvement
@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance Difficulty Intermediate labels Nov 2, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.20.0 Nov 2, 2016
@mroeschke
Copy link
Member

Hi @jorisvandenbossche

I've been playing around with your original speedup:

try:
    loc = labels.dtype.type(loc)
except: TypeError:
    pass

labels.searchsorted(loc)

And I noticed that normally searchsorted is usually called on a FrozenNDArray object here (for MultiIndexes at least). Would it make sense to create a _searchsorted method in FrozenNDArray like so:

class FrozenNDArray(...):
    ...
    ...
    def _searchsorted(self, v, side='left', sorter=None):
        try:
            v = self.values.dtype.type(v)
        except: TypeError:
            pass
        return self.seachsorted(self.values, v,  side=side, sorter=sorter)

And then call searchsorted like so?

labels._searchsorted(loc)

@jorisvandenbossche
Copy link
Member Author

@mroeschke That seems to make sense, yes. Do you know if there are many occurrences of the use of searchsorted that should be adapted?
It may be cleaner to make a private helper function/method that does this instead adding it to the FrozenNDArray.

@mroeschke
Copy link
Member

mroeschke commented Feb 7, 2017

Sorry for the delay in response @jorisvandenbossche. Reviewing where searchsorted can be adapted:

There are four occurrences in this block that speeds up df.loc[(idx,)]
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L1636
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L1639

Two more in this block (not sure of the performance use case, but similar code path to your original fix):
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L1933

And I believe one on the _bound property (not sure where this is used):
https://github.com/pandas-dev/pandas/blob/master/pandas/indexes/multi.py#L2309

So maybe a private helper function in pandas/indexes/multi.py might be better here.

Also I saw that there's a Series.searchsorted() and Index.searchsorted() functionality as well. I did not dig into their implementations too much, but they may also benefit if they are subclassed from NDarray?

@jorisvandenbossche
Copy link
Member Author

So maybe a private helper function in pandas/indexes/multi.py might be better here.

+1

The _bound I am also not directly sure what it is for (from a quick search, I actually don't find any place where this is used)

Also I saw that there's a Series.searchsorted() and Index.searchsorted() functionality as well.

I would first focus on the internal use cases. In those cases, you would need a bit more checking, as you can in principle pass a float to a searchsorted of ints, and doing a dtype.type(val) would round the float and can give different results, so you would need to check the equality before and after (in the internal multi-index case, we know that the locs are no floats). The searchsorted method on Series and Index just pass things through to numpy searchsorted, so would leave that to keep things simpler.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@jreback jreback modified the milestones: Next Major Release, Next Minor Release Mar 29, 2017
@jorisvandenbossche
Copy link
Member Author

In the meantime, I think this has been fixed in the deprecation panel PR by @jreback (see discussion here: #15601 (comment), and this commit: c39453a)

So basically the approach of having a FrozenNDArray searchsorted method to handle this as proposed above

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, Interesting Issues May 15, 2017
@jorisvandenbossche
Copy link
Member Author

Would be good to still do the second point in the original issue here: add benchmarks for the affected cases.

@gfyoung
Copy link
Member

gfyoung commented Oct 28, 2018

@jreback @jorisvandenbossche @mroeschke : I'm planning to deprecate FrozenNDArray pretty soon (see here for rationale).

Not to mention, the performance optimization for FrozenNDArray was handled in #15601.

Given that FrozenNDArray was pretty central to this conversation, any reason to keep this open?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance
Projects
None yet
4 participants