pd.Series.asof performance regression: 14x slower #14461

Closed
laudney opened this issue Oct 20, 2016 · 9 comments · Fixed by #14476
Labels
Performance (memory or execution speed performance), Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@laudney
Contributor

laudney commented Oct 20, 2016

pd.Series.asof takes 14x longer in 0.19.0 than 0.18.1

In[14]: pd.__version__
Out[14]: '0.18.1'
In[15]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[16]: %timeit s.asof(4)
The slowest run took 8.25 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.96 µs per loop

In[8]: pd.__version__
Out[8]: '0.19.0'
In[9]: s = pd.Series([1, 1, 1, np.nan, np.nan])
In[10]: %timeit s.asof(4)
10000 loops, best of 3: 125 µs per loop
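
The measurement above can also be reproduced outside IPython. The following is a minimal sketch using the standard `timeit` module; the loop count and the `per_call_us` name are illustrative choices, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same Series as in the report: three values followed by two NaNs.
s = pd.Series([1, 1, 1, np.nan, np.nan])

number = 10_000  # calls per timing run (arbitrary choice)
# Take the best of three runs, mirroring what %timeit reports.
best = min(timeit.repeat(lambda: s.asof(4), number=number, repeat=3))
per_call_us = best / number * 1e6

print(f"pandas {pd.__version__}: {per_call_us:.2f} µs per call")
```

Running the same script under two installed pandas versions gives a like-for-like comparison.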
@jorisvandenbossche
Member

Yes, I have noticed that in our benchmarks as well, but forgot to open an issue. Thanks for reporting!
If you want to have a look what could be the cause, very welcome!

jorisvandenbossche added the Performance and Regression labels Oct 20, 2016
jorisvandenbossche added this to the 0.19.1 milestone Oct 20, 2016
@jreback
Contributor

jreback commented Oct 20, 2016

Well, < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed, so you can have correct results or you can have a tiny differential in perf.

Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

@laudney
Contributor Author

laudney commented Oct 21, 2016

> Not sure how this actually matters anyhow except if you are calling this in a loop, which is completely non-idiomatic.

pd.Series.asof is extremely flexible. If I pass in a value before the start of the index, it returns NA. If I pass in a value after the end of the index, it returns the very last non-NA value.
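
A small illustration of the three cases just described (before the start, inside the range, after the end). The dates and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Illustrative daily series whose last two observations are missing.
index = pd.date_range("2016-10-17", periods=5, freq="D")
s = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan], index=index)

before = s.asof(pd.Timestamp("2016-10-01"))        # before the first index entry
inside = s.asof(pd.Timestamp("2016-10-18 12:00"))  # between two index entries
after = s.asof(pd.Timestamp("2016-12-25"))         # past the last index entry

print(before)  # nan: nothing is known yet at that date
print(inside)  # 2.0: most recent value at or before the timestamp
print(after)   # 3.0: last non-NaN value, skipping the trailing NaNs
```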

Now in one of my cases I have several thousand time series and for a given date, I want to get the asof value for each one and I currently do this in a loop. The date can be anything like a weekend or a holiday.

I've thought about creating a giant DataFrame, reindexing it to include every single calendar date, then calling fillna(method='ffill'), but I would still need if statements to deal with the dates before the start or after the end of the index.

And if those time series cover different date/time ranges, the resulting DataFrame is extremely sparse.

Maybe there is a better and more idiomatic way for this particular case?

In other cases, asof has to be called on individual Series without the possibility of concatenating them into a giant DataFrame. The speed of the call is therefore very important.
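
For the many-series case described above, one possible alternative to the per-series loop is sketched below. It assumes the series fit in memory as a single (sparse) DataFrame; `asof_all` is a hypothetical helper, not a pandas function. It avoids reindexing to every calendar date by forward-filling only the existing rows and then locating the date with `searchsorted`.

```python
import numpy as np
import pandas as pd


def asof_all(series_by_name, when):
    """Hypothetical helper: as-of value of every series at `when`."""
    # Align all series on the union of their indexes (one column each).
    frame = pd.concat(series_by_name, axis=1).sort_index()
    # Forward-fill so each row holds the last non-NaN value seen so far.
    filled = frame.ffill()
    # Position of the last row whose label is <= `when`.
    pos = filled.index.searchsorted(when, side="right") - 1
    if pos < 0:
        # `when` precedes every observation: nothing is known yet.
        return pd.Series(np.nan, index=filled.columns)
    return filled.iloc[pos]


# Two illustrative series covering different date ranges.
a = pd.Series([1.0, 2.0, np.nan], index=pd.date_range("2016-01-04", periods=3))
b = pd.Series([10.0, np.nan, 30.0], index=pd.date_range("2016-01-06", periods=3))
data = {"a": a, "b": b}

print(asof_all(data, pd.Timestamp("2016-01-09")))  # a weekend date
```

Columns whose series has not started by the requested date stay NaN after the forward fill, so no explicit if statements per series are needed. The memory cost of the aligned frame remains, which is the trade-off raised above.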

@jreback
Contributor

jreback commented Oct 21, 2016

So this is a general method which can handle both Series and DataFrames. An implementation that handles both directly in cython could certainly be done. I would encourage a pull request to do this.

For example the nulls are pre-computed here, which makes the code much simpler, but is not necessary when iterating (which you can only do in a performant way in cython, comparing nulls as you go).

@jorisvandenbossche
Member

> Well, < 0.19.0 was pretty much broken for non-monotonic indexes and when nulls existed, so you can have correct results or you can have a tiny differential in perf.

Non-monotonic indices are not involved I think, as it just raises for that (the error raise is added in 0.19.0, but that is of course not the perf issue). I rather think this is a case where the generalization of the method for both Series and DataFrame has impacted the performance for (certain) Series only cases.

With a few small adaptations in the current code I can bring it back to 0.18 performance for Series: not calculating all nulls in advance, and indexing the underlying values instead of the Series (for a Series you return a scalar, so there is no need to index the object itself with iloc).
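
To make the two ideas concrete (check nulls lazily, and index the raw values rather than the Series), here is a rough pure-Python sketch. `scalar_asof` is a hypothetical name and this is not the code from pandas or from any pull request; it only illustrates the approach for a scalar lookup on a sorted index.

```python
import numpy as np
import pandas as pd


def scalar_asof(series, where):
    """Hypothetical sketch of a scalar as-of lookup on a sorted index."""
    values = series.values  # plain NumPy array, no Series indexing overhead
    # Position of the last index label that is <= `where`.
    pos = series.index.searchsorted(where, side="right") - 1
    # Walk backwards, testing only the elements we actually visit,
    # instead of computing a null mask for the whole Series up front.
    while pos >= 0:
        value = values[pos]
        if not pd.isnull(value):
            return value
        pos -= 1
    return np.nan  # `where` is before the start, or everything so far is null


s = pd.Series([1, 1, 1, np.nan, np.nan])
print(scalar_asof(s, 4))  # 1.0, same as s.asof(4)
```

The backward walk is cheap when only a few trailing values are null, which is the case in the benchmark from the report.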

@jreback
Contributor

jreback commented Oct 21, 2016

@jorisvandenbossche the non-monotonic checks actually are important, they do take some small amount of time. Again when iterating you can do these in-line.
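
For reference, the monotonicity check under discussion can be observed from user code. This small sketch uses made-up data and assumes a pandas version (0.19.0 or later) in which `asof` raises on an unsorted index.

```python
import pandas as pd

# Illustrative Series whose index is deliberately out of order.
unsorted = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])
print(unsorted.index.is_monotonic_increasing)  # False

try:
    unsorted.asof(2)
    raised = False
except ValueError as exc:
    raised = True
    print(f"asof refused the unsorted index: {exc}")

# Sorting first makes the lookup well defined.
ordered = unsorted.sort_index()
print(ordered.asof(2))  # 3.0
```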

@laudney would welcome a pull-request.

@jorisvandenbossche
Member

@laudney I think it would be rather simple to fix the performance issue. You have to compare the implementation of Series.asof in 0.18 with the current implementation (the differences are the pre-computation of the nulls and no longer working with the underlying array but with the Series itself; both degrade performance). As @jreback said, a pull request is very welcome!

@laudney
Contributor Author

laudney commented Oct 22, 2016

@jreback @jorisvandenbossche Let me take a look, but no promises though!

@laudney
Contributor Author

laudney commented Oct 22, 2016

@jreback @jorisvandenbossche please check my pull request #14476
