pd.Series.asof performance regression: 14x slower #14461

laudney · 2016-10-20T16:05:57Z

pd.Series.asof takes 14x longer in 0.19.0 than 0.18.1

In[14]: pd.__version__
Out[14]: '0.18.1'
In[15]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[16]: %timeit s.asof(4)
The slowest run took 8.25 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.96 µs per loop

In[8]: pd.__version__
Out[8]: '0.19.0'
In[9]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[10]: %timeit s.asof(4)
10000 loops, best of 3: 125 µs per loop

jorisvandenbossche · 2016-10-20T17:54:40Z

Yes, I have noticed that in our benchmarks as well, but forgot to open an issue. Thanks for reporting!
If you want to have a look what could be the cause, very welcome!

jreback · 2016-10-20T18:01:04Z

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

laudney · 2016-10-21T09:24:30Z

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

pd.Series.asof is extremely flexible. If I pass in a value before the start of the index, it returns NA. If I pass in a value after the end of the index, it returns the very last non-NA value.

Now in one of my cases I have several thousand time series and for a given date, I want to get the asof value for each one and I currently do this in a loop. The date can be anything like a weekend or a holiday.

I've thought about creating a giant DataFrame, reindex it to include every single calendar date, then fillna(method='ffill) but I still need to deal with the dates before the start or after the end of the index with if statements.

And if those time series cover different date/time ranges, the resulting DataFrame is extremely sparse.

Maybe there is a better and more idiomatic way for this particular case?

In other cases, asof has to be called on individual Series without the possibility of concatenating them into a giant DataFrame. The speed of the call is therefore very imporant.

jreback · 2016-10-21T10:47:48Z

So this is a general method which can handle both Series and DataFrames. an implementation that handles both directly in cython could certainly be done. I would encourage a pull-request to do this.

For example the nulls are pre-computed here, which makes the code much simpler, but is not necessary when iterating (which you can only do in a performant way in cython, comparing nulls as you go).

jorisvandenbossche · 2016-10-21T10:47:53Z

well < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed. so you get correct or you can have a tiny differential in perf.

Non-monotonic indices are not involved I think, as it just raises for that (the error raise is added in 0.19.0, but that is of course not the perf issue). I rather think this is a case where the generalization of the method for both Series and DataFrame has impacted the performance for (certain) Series only cases.

With a few small adaptions in the current code I can bring it back to 0.18 performance for series (not calculating all nulls in advance + indexing the values instead of Series (in case of Series you return a scalar, so don't need to index the object itself with iloc))

jreback · 2016-10-21T10:49:39Z

@jorisvandenbossche the non-monotonic checks actually are important, they do take some small amount of time. Again when iterating you can do these in-line.

@laudney would welcome a pull-request.

jorisvandenbossche · 2016-10-22T08:33:27Z

@laudney I think it would be rather simple to fix the performance issue. You have to compare the implemention of Series.asof of 0.18 with the current implementation (the differences are the pre-computations of the nulls and no longer working with the underlying array but with the series itself. Both give a performance degradation). As @jreback said, pull request is very welcome!

laudney · 2016-10-22T14:42:07Z

@jreback @jorisvandenbossche Let me take a look but no promise though!

laudney · 2016-10-22T16:22:10Z

@jreback @jorisvandenbossche please check my pull request #14476

jorisvandenbossche added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels Oct 20, 2016

jorisvandenbossche added this to the 0.19.1 milestone Oct 20, 2016

laudney mentioned this issue Oct 22, 2016

PERF: performance regression in Series.asof #14476

Merged

2 tasks

jorisvandenbossche closed this as completed in #14476 Oct 26, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pd.Series.asof performance regression: 14x slower #14461

pd.Series.asof performance regression: 14x slower #14461

laudney commented Oct 20, 2016

jorisvandenbossche commented Oct 20, 2016

jreback commented Oct 20, 2016

laudney commented Oct 21, 2016

jreback commented Oct 21, 2016

jorisvandenbossche commented Oct 21, 2016

jreback commented Oct 21, 2016

jorisvandenbossche commented Oct 22, 2016

laudney commented Oct 22, 2016

laudney commented Oct 22, 2016

pd.Series.asof performance regression: 14x slower #14461

pd.Series.asof performance regression: 14x slower #14461

Comments

laudney commented Oct 20, 2016

pd.Series.asof takes 14x longer in 0.19.0 than 0.18.1

jorisvandenbossche commented Oct 20, 2016

jreback commented Oct 20, 2016

laudney commented Oct 21, 2016

jreback commented Oct 21, 2016

jorisvandenbossche commented Oct 21, 2016

jreback commented Oct 21, 2016

jorisvandenbossche commented Oct 22, 2016

laudney commented Oct 22, 2016

laudney commented Oct 22, 2016