PERF: performance regression in Series.asof #14476
needs an asv on several different cases
this will be very slow with lots of nulls
@jreback you mean we should always pre-compute nulls for DataFrame?
the answer is probably yes
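For readers following the null-handling question, a small illustration of what `asof` does with missing values may help. The series below is made up for this example and is not taken from the PR. It shows that `asof` returns the last non-NaN value at or before the requested label, which is why the number and position of nulls affects how much work a lookup has to do.

```python
import numpy as np
import pandas as pd

# Illustrative data only (not taken from the PR).
s = pd.Series([1.0, 2.0, np.nan, 4.0], index=[10, 20, 30, 40])

at_valid = s.asof(20)     # the label holds a valid value
at_null = s.asof(30)      # the label holds NaN, so the lookup falls back to an earlier row
before_start = s.asof(5)  # nothing exists at or before this label
```

Here `at_valid` and `at_null` are both 2.0, and `before_start` is NaN.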
Current coverage is 85.26% (diff: 100%)

@@             master   #14476   diff @@
==========================================
  Files           140      140
  Lines         50667    50670     +3
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43203    43206     +3
  Misses         7464     7464
  Partials          0        0
@jreback what's asv?
There are some benchmarks for the Series case already: https://github.com/pandas-dev/pandas/blob/master/asv_bench/benchmarks/timeseries.py#L283. We should indeed add some for DataFrame. For Series I suspect that the approach in this PR will also be faster with lots of NaNs, as indexing a Series is much slower than the isnull check on single values. For DataFrame I'm not directly sure, so there it may make sense to precompute the NaNs (should be checked).
asv ... -b ^timeseries
@laudney You can selectively run only the asof ones with
asv ... -b asof
need several more asvs
    goal_time = 0.2

    def setup(self):
        self.N = 10000
these are too small, make it 100000
put nans at the beginning in one case and at the end in another
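A rough sketch of what benchmarks along these lines might look like, assuming the usual asv convention of a class with a `setup` method and `time_*` methods. The class name, attribute names, and the quarter-length NaN runs are hypothetical choices for illustration, not the code that was added in this PR.

```python
import numpy as np
import pandas as pd


class AsOfWithNaNs:
    # Hypothetical benchmark class; names and sizes are illustrative only.
    goal_time = 0.2

    def setup(self):
        self.N = 100000
        rng = pd.date_range("1990-01-01", periods=self.N, freq="min")
        self.early = rng[self.N // 8]  # falls inside the leading NaN run
        self.late = rng[-1]            # falls at the end of the trailing NaN run

        self.ts_clean = pd.Series(np.random.randn(self.N), index=rng)

        # Case 1: a block of NaNs at the beginning.
        self.ts_nan_start = self.ts_clean.copy()
        self.ts_nan_start.iloc[: self.N // 4] = np.nan

        # Case 2: a block of NaNs at the end.
        self.ts_nan_end = self.ts_clean.copy()
        self.ts_nan_end.iloc[self.N - self.N // 4:] = np.nan

    def time_asof_clean(self):
        self.ts_clean.asof(self.late)

    def time_asof_nan_start(self):
        self.ts_nan_start.asof(self.early)

    def time_asof_nan_end(self):
        self.ts_nan_end.asof(self.late)
```

The trailing-NaN case is the interesting one for a lookup that skips nulls lazily, since a query at the last timestamp has to step back over the whole NaN run before it finds a valid value.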
I've updated my pull request code to refactor existing asv tests for Series.asof and add new ones for DataFrame.asof. The main findings are:
asv results below:
can u add your explanation of perf in a comment
with a ref to this pr number in a comment as well
lgtm otherwise
@jreback I've added a comment in the PR. Please check and let me know. Off to bed.
I will add two more asv tests and a more detailed comment tonight. I will ping you when ready.
lgtm. pls add a whatsnew entry in the performance section. ping on green.
…ulls and returning value by indexing the underlying ndarray.
@jreback @jorisvandenbossche I've updated the PR with more comments and more asv tests. It's ready. Please review and merge. Impact as below:
@laudney pls add a whatsnew note
@jreback whatsnew updated for 0.19.1
@jreback @jorisvandenbossche It's all green on my side. Let me know if you need anything else.
@laudney Thanks!
…of (pandas-dev#14476) * Fix performance regression in Series.asof by avoiding pre-computing nulls and returning value by indexing the underlying ndarray. (cherry picked from commit e3d943d)
Version 0.19.1

* tag 'v0.19.1': (43 commits)
  RLS: v0.19.1
  DOC: update whatsnew/release notes for 0.19.1 (pandas-dev#14573)
  [Backport pandas-dev#14545] BUG/API: Index.append with mixed object/Categorical indices (pandas-dev#14545)
  DOC: rst fixes
  [Backport pandas-dev#14567] DEPR: add deprecation warning for com.array_equivalent (pandas-dev#14567)
  [Backport pandas-dev#14551] PERF: casting loc to labels dtype before searchsorted (pandas-dev#14551)
  [Backport pandas-dev#14536] BUG: DataFrame.quantile with NaNs (GH14357) (pandas-dev#14536)
  [Backport pandas-dev#14520] BUG: don't close user-provided file handles in C parser (GH14418) (pandas-dev#14520)
  [Backport pandas-dev#14392] BUG: Dataframe constructor when given dict with None value (pandas-dev#14392)
  [Backport pandas-dev#14514] BUG: Don't parse inline quotes in skipped lines (pandas-dev#14514)
  [Bacport pandas-dev#14543] BUG: tseries ceil doc fix (pandas-dev#14543)
  [Backport pandas-dev#14541] DOC: Simplify the gbq integration testing procedure for contributors (pandas-dev#14541)
  [Backport pandas-dev#14527] BUG/ERR: raise correct error when sql driver is not installed (pandas-dev#14527)
  [Backport pandas-dev#14501] BUG: fix DatetimeIndex._maybe_cast_slice_bound for empty index (GH14354) (pandas-dev#14501)
  [Backport pandas-dev#14442] DOC: Expand on reference docs for read_json() (pandas-dev#14442)
  BLD: fix 3.4 build for cython to 0.24.1
  [Backport pandas-dev#14492] BUG: Accept unicode quotechars again in pd.read_csv
  [Backport pandas-dev#14496] BLD: Support Cython 0.25
  [Backport pandas-dev#14498] COMPAT/TST: fix test for range testing of negative integers to neg powers
  [Backport pandas-dev#14476] PERF: performance regression in Series.asof (pandas-dev#14476)
  ...
Fix performance regression in Series.asof by avoiding pre-computing nulls and returning value by indexing the underlying ndarray.
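
The idea in that summary can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pandas source: `asof_scalar` is a hypothetical helper that takes a sorted index and a float value array as plain NumPy arrays. It locates the position with a binary search, then steps back over NaNs one element at a time on the raw ndarray, instead of first computing a null mask for the whole series.

```python
import numpy as np


def asof_scalar(index, values, where):
    """Hypothetical helper: last non-NaN value at or before `where`.

    Assumes `index` is a sorted 1-D ndarray and `values` is a float
    ndarray of the same length.
    """
    # Number of index entries that are <= where.
    loc = np.searchsorted(index, where, side="right")

    # Check nulls lazily, only for the elements actually visited,
    # reading them straight from the ndarray.
    while loc > 0 and np.isnan(values[loc - 1]):
        loc -= 1

    if loc == 0:
        # Nothing valid at or before `where`.
        return np.nan
    return values[loc - 1]


# Illustrative data only.
index = np.array([10, 20, 30, 40])
values = np.array([1.0, 2.0, np.nan, 4.0])
result = asof_scalar(index, values, 30)
```

With this data `result` is 2.0. The cost of the loop grows with the length of the NaN run just before the queried position, so it is cheap when nulls are sparse and is at its worst when a long run of NaNs precedes the lookup point.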