PERF: casting loc to labels dtype before searchsorted #14551

Merged

jorisvandenbossche (Member) commented Nov 1, 2016

I was intrigued by the profiling results of the example below (MultiIndex loc indexing, based on the example in #14549), where searchsorted seemed to take the majority of the computation time.
It seems that searchsorted casts both inputs (here labels and loc) to a common dtype. In this case the labels of the MultiIndex were int16, while loc (the output of Index.get_loc) is a Python int.

By casting loc to the dtype of labels, this specific example gets a roughly 20x speedup:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(500*5000)}, index=pd.MultiIndex.from_product([pd.date_range("2014-01-01", periods=500), range(5000)]))
dt = pd.Timestamp('2015-01-01')
%timeit df.loc[dt]

On master:

In [3]: %timeit df.loc[dt]
The slowest run took 5.70 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 9.39 ms per loop

with this PR:

In [3]: %timeit df.loc[dt]
The slowest run took 122.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 422 µs per loop
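
The effect can be isolated from pandas with a minimal NumPy-only sketch. The array below is an assumption that mimics the first-level labels of the example (500 values, each repeated 5000 times), and `loc = 365` is a stand-in for the position that `get_loc` would return. It only shows that the cast does not change the result; any timing difference depends on the NumPy version and its promotion rules.

```python
import numpy as np

# Hypothetical stand-in for the sorted int16 labels of the first MultiIndex level
labels = np.repeat(np.arange(500, dtype=np.int16), 5000)
loc = 365  # plain Python int, as returned by Index.get_loc

# Without the cast, searchsorted may have to bring both inputs to a common dtype
uncast_start = labels.searchsorted(loc, side="left")

# With the cast, the needle already has the dtype of the haystack
cast_loc = labels.dtype.type(loc)
cast_start = labels.searchsorted(cast_loc, side="left")
cast_stop = labels.searchsorted(cast_loc, side="right")

print(uncast_start, cast_start, cast_stop)
```

Both calls locate the same start position, so the cast affects only how the lookup is computed, not what it returns.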

Just putting it here as a showcase.
The actual PR probably needs some more work (other places where this can be done, whether loc can ever be out of bounds for that dtype, benchmarks, ...).

@@ -1907,6 +1907,7 @@ def convert_indexer(start, stop, step, indexer=indexer, labels=labels):
                 return np.array(labels == loc, dtype=bool)
             else:
                 # sorted, so can return slice object -> view
+                loc = labels.dtype.type(loc)
Contributor

Maybe this should be done like:

loc, orig_loc = labels.dtype.type(loc), loc
if loc != orig_loc:
    loc = orig_loc
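
A runnable sketch of this round-trip check follows. The helper name `cast_loc_if_lossless` is hypothetical, and the cast goes through `astype` rather than `labels.dtype.type(loc)` because calling the scalar type on an out-of-range Python int behaves differently across NumPy versions (it may wrap, warn, or raise), whereas an integer `astype` wraps silently.

```python
import numpy as np

def cast_loc_if_lossless(loc, labels):
    """Return loc cast to labels.dtype, or loc unchanged if the cast alters its value."""
    # 0-d array -> astype -> [()] extracts a NumPy scalar of the target dtype
    cast = np.asarray(loc).astype(labels.dtype)[()]
    return cast if cast == loc else loc

labels = np.array([0, 1, 2], dtype=np.int16)
print(type(cast_loc_if_lossless(2, labels)))      # value fits: NumPy int16 scalar
print(type(cast_loc_if_lossless(40000, labels)))  # value does not fit: original int
```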

Member Author

Yeah, I am not sure how much checking would be needed here.

My understanding is that it is probably not needed in this case, as loc comes from loc = level_index.get_loc(key), and labels and level_index belong to the same MultiIndex. So I would assume that get_loc can only return an existing label, which should therefore fit in the dtype of labels?

(but doing the check is probably not much of a perf issue either)
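
The assumption can be checked on a small example. This sketch uses the public `MultiIndex.codes` attribute of current pandas, which is what the discussion calls `labels`; the smaller second level is chosen here only to keep it cheap.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product(
    [pd.date_range("2014-01-01", periods=500), range(50)]
)
level_index = mi.levels[0]  # unique values of the first level
codes = mi.codes[0]         # integer positions into level_index ("labels" in the PR)

loc = level_index.get_loc(pd.Timestamp("2015-01-01"))
info = np.iinfo(codes.dtype)

# get_loc returns a position into level_index, and the codes are drawn from those
# same positions, so loc falls inside the range the codes dtype can represent
print(loc, codes.dtype, info.min <= loc <= info.max)
```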

Contributor

that sounds right; if the test suite passes, prob ok!

Member Author

Apparently not :-)
But it's only a partial indexing test that fails. loc can also be a slice, and in partial indexing this is expected to raise an error in searchsorted, but now a slightly different error is already raised in dtype.type(loc). That can easily be solved with:

                try:
                    loc = labels.dtype.type(loc)
                except TypeError:
                    # this occurs when loc is a slice (partial string indexing)
                    # but the TypeError raised by searchsorted in this case
                    # is caught in Index._has_valid_type()
                    pass

Contributor

sure, or just test if it's a scalar to begin with
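
A minimal sketch of that alternative, assuming an integer check is what "scalar" means here. The helper name `maybe_cast_loc` is hypothetical and is not the code that was merged.

```python
import numpy as np

def maybe_cast_loc(loc, labels):
    """Cast integer locs to labels.dtype; pass anything else (e.g. a slice) through."""
    if isinstance(loc, (int, np.integer)):
        return labels.dtype.type(loc)
    return loc

labels = np.array([0, 0, 1, 1, 2], dtype=np.int16)
scalar_loc = maybe_cast_loc(1, labels)           # becomes a NumPy int16 scalar
slice_loc = maybe_cast_loc(slice(0, 2), labels)  # returned unchanged

print(type(scalar_loc), slice_loc)
```

Compared with the try/except version, this avoids relying on the exception type raised by the dtype constructor, at the cost of an explicit type check on every call.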

codecov-io commented Nov 1, 2016

Current coverage is 85.26% (diff: 100%)

Merging #14551 into master will increase coverage by <.01%

@@             master     #14551   diff @@
==========================================
  Files           140        140          
  Lines         50672      50676     +4   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43207      43211     +4   
  Misses         7465       7465          
  Partials          0          0          

Powered by Codecov. Last update 60a335e...6447e4c

jreback added the Indexing and Performance labels Nov 1, 2016
jreback added this to the 0.19.1 milestone Nov 1, 2016
jreback (Contributor) commented Nov 1, 2016

lgtm. just needs a release note. I think this can only help :>

jorisvandenbossche (Member, Author)

OK, added a whatsnew notice. I will merge, but will open a new issue to follow up on this (other cases, and benchmarks that capture this, since at the moment there are none).

jorisvandenbossche merged commit 1d95179 into pandas-dev:master Nov 2, 2016
jorisvandenbossche added a commit that referenced this pull request Nov 3, 2016
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Nov 18, 2016
Version 0.19.1

* tag 'v0.19.1': (43 commits)
  RLS: v0.19.1
  DOC: update whatsnew/release notes for 0.19.1 (pandas-dev#14573)
  [Backport pandas-dev#14545] BUG/API: Index.append with mixed object/Categorical indices (pandas-dev#14545)
  DOC: rst fixes
  [Backport pandas-dev#14567] DEPR: add deprecation warning for com.array_equivalent (pandas-dev#14567)
  [Backport pandas-dev#14551] PERF: casting loc to labels dtype before searchsorted (pandas-dev#14551)
  [Backport pandas-dev#14536] BUG: DataFrame.quantile with NaNs (GH14357) (pandas-dev#14536)
  [Backport pandas-dev#14520] BUG: don't close user-provided file handles in C parser (GH14418) (pandas-dev#14520)
  [Backport pandas-dev#14392] BUG: Dataframe constructor when given dict with None value (pandas-dev#14392)
  [Backport pandas-dev#14514] BUG: Don't parse inline quotes in skipped lines (pandas-dev#14514)
  [Bacport pandas-dev#14543] BUG: tseries ceil doc fix (pandas-dev#14543)
  [Backport pandas-dev#14541] DOC: Simplify the gbq integration testing procedure for contributors (pandas-dev#14541)
  [Backport pandas-dev#14527] BUG/ERR: raise correct error when sql driver is not installed (pandas-dev#14527)
  [Backport pandas-dev#14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (pandas-dev#14501)
  [Backport pandas-dev#14442] DOC: Expand on reference docs for read_json() (pandas-dev#14442)
  BLD: fix 3.4 build for cython to 0.24.1
  [Backport pandas-dev#14492] BUG: Accept unicode quotechars again in pd.read_csv
  [Backport pandas-dev#14496] BLD: Support Cython 0.25
  [Backport pandas-dev#14498] COMPAT/TST: fix test for range testing of negative integers to neg powers
  [Backport pandas-dev#14476] PERF: performance regression in Series.asof (pandas-dev#14476)
  ...
3 participants