Speedup of summary_data_from_transaction_data #237

MichaelSchreier · 2018-12-13T21:18:42Z

Replaced pd.Period instances with timestamps for the actual calculations / aggregations, which yields a speedup of several tens of times on large datasets. For a dummy dataset with 100,000 entries, execution time drops from ~7 seconds to 0.13 seconds on my machine (roughly 50x).

.to_period is still applied to truncate dates (more consistently, in fact), which means results should remain largely unaffected.

Note that starting from pandas 0.24.0, some or all of the operations involving pd.Period will also be considerably faster. However, the calculations via timestamps (i.e. integers) here should still be a few times faster.
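For readers who want to see the idea in isolation, here is a minimal sketch of the approach described above. It is not the code from this PR: the helper name truncate_to_freq, the column names, and the toy data are all assumptions made for illustration. Dates are truncated via .to_period and immediately converted back to timestamps, so the groupby aggregation runs on datetime64 values rather than on Period objects.

```python
import pandas as pd


def truncate_to_freq(dates: pd.Series, freq: str = "D") -> pd.Series:
    """Truncate datetimes to the start of their period, keeping a datetime64 dtype.

    Hypothetical helper for illustration only; not part of lifetimes.
    """
    return pd.to_datetime(dates).dt.to_period(freq).dt.to_timestamp()


# Toy transaction log (made-up data).
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2],
        "date": [
            "2018-01-01 09:30",
            "2018-01-03 17:00",
            "2018-01-02 08:00",
            "2018-01-02 20:15",
            "2018-01-05 12:00",
        ],
    }
)

# Truncate to day resolution, but stay in timestamp land for the aggregation.
transactions["date"] = truncate_to_freq(transactions["date"], freq="D")

# Aggregations now operate on datetime64 values instead of Period objects.
summary = transactions.groupby("customer_id")["date"].agg(["min", "max", "nunique"])
summary["span_days"] = (summary["max"] - summary["min"]).dt.days
```

Here the two same-day purchases of customer 2 collapse into a single distinct date after truncation, which is the effect .to_period is kept around for.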


CamDavidsonPilon · 2018-12-13T22:32:20Z

utils.py

@@ -0,0 +1,551 @@
+"""Lifetimes utils and helpers."""


hm, did you mean to edit the lifetimes.utils python file? As is, this PR adds a brand new utils.py file.

That's what you get for being lazy and using the web interface rather than a proper client -.-
This is fixed now, though there seem to be some residual issues with period values, which I will fix asap.


MichaelSchreier · 2018-12-14T20:55:09Z

For compatibility reasons, _find_first_transactions has to return datetime_col as pd.Period. This requires casting the timestamps back to periods before returning the results. This cast turns out to be relatively expensive, degrading the overall performance by a factor of 2-3 (on very large datasets) compared to when no cast is made.
The implementation remains substantially faster than before, and will probably be faster still once pandas 0.24.0 has addressed some of the performance issues with period values.
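A rough sketch of what "compute on timestamps, cast back to periods at the end" can look like is below. The function first_transactions_as_period, its parameters, and the "first" column are hypothetical stand-ins and do not reproduce the actual _find_first_transactions implementation; the point is only that the Period cast happens once, on the final result.

```python
import pandas as pd


def first_transactions_as_period(transactions, customer_col, date_col, freq="D"):
    """Flag each customer's first transaction date; return dates as pd.Period.

    Hypothetical illustration, not the lifetimes implementation.
    """
    df = transactions[[customer_col, date_col]].copy()

    # Do the heavy lifting on truncated timestamps (datetime64 values).
    df[date_col] = pd.to_datetime(df[date_col]).dt.to_period(freq).dt.to_timestamp()
    df = df.drop_duplicates().sort_values([customer_col, date_col])

    # After sorting, the first row seen for each customer is its earliest date.
    df["first"] = ~df.duplicated(subset=customer_col)

    # Single cast back to periods right before returning, for compatibility
    # with callers that expect a Period-typed date column.
    df[date_col] = df[date_col].dt.to_period(freq)
    return df.reset_index(drop=True)


# Made-up example data.
raw = pd.DataFrame(
    {
        "id": [1, 1, 1, 2],
        "date": [
            "2018-01-01 09:30",
            "2018-01-01 18:00",
            "2018-01-03 10:00",
            "2018-01-02 12:00",
        ],
    }
)
result = first_transactions_as_period(raw, "id", "date", freq="D")
```

Keeping the cast as the last step means its cost is paid once per returned row rather than inside every intermediate operation.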

CamDavidsonPilon · 2019-01-07T14:57:27Z

Thanks for the performance additions @MichaelSchreier! lgtm!

Speedup of summary_data_from_transaction_data

3eb45a0

Replaced pd.period instances with timestamps for the actual calculations / aggregations which yields a speedup of several factors x10 on large datasets.

CamDavidsonPilon reviewed Dec 13, 2018


MichaelSchreier added 2 commits December 14, 2018 06:34

Moved utils.py to proper location

3e00457

_find_first_transactions now returns datetime_col as pd.Period again …

5b4e51d

…to ensure compatibility with other methods

CamDavidsonPilon merged commit 00ca929 into CamDavidsonPilon:master Jan 7, 2019

CamDavidsonPilon mentioned this pull request Jan 7, 2019

v0.10.1 #241

Merged
