This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Speedup of summary_data_from_transaction_data #237

Merged
merged 3 commits into from
Jan 7, 2019


MichaelSchreier
Contributor

Replaced pd.Period instances with timestamps for the actual calculations and aggregations, which yields a speedup by a factor of several tens on large datasets. For a dummy dataset with 100,000 entries, execution time drops from ~7 seconds to 0.13 seconds on my machine.

.to_period is still applied to truncate dates (more consistently, in fact), which means results should remain largely unaffected.

Note that starting with pandas 0.24.0, some or all of the operations involving pd.Period will also be considerably faster; however, the calculations via timestamps (i.e. integers) here should still be a few times faster.
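
To illustrate the idea described above, here is a minimal sketch, not the code from this PR: dates are truncated once via .to_period, converted straight back to timestamps, and the aggregation then runs on datetime64 values. The function name first_purchase_dates, its parameters, and the sample data are hypothetical and chosen only for this example.

```python
import pandas as pd


def first_purchase_dates(transactions, customer_col, date_col, freq="D"):
    """Return each customer's earliest transaction date, truncated to `freq`.

    Hypothetical helper for illustration; it is not part of lifetimes.
    """
    dates = pd.to_datetime(transactions[date_col])
    # Truncate through periods once, then convert back to timestamps so the
    # groupby below operates on datetime64 values instead of Period objects.
    truncated = dates.dt.to_period(freq).dt.to_timestamp()
    return truncated.groupby(transactions[customer_col]).min()


# Made-up sample data, for illustration only.
transactions = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2],
        "date": [
            "2018-01-03 10:15",
            "2018-02-11 08:00",
            "2018-01-20 23:59",
            "2018-01-05 12:00",
            "2018-03-01 00:00",
        ],
    }
)
first_dates = first_purchase_dates(transactions, "customer_id", "date")
```

With freq="D" the time of day is dropped, so the result holds one midnight timestamp per customer.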

utils.py Outdated
@@ -0,0 +1,551 @@
"""Lifetimes utils and helpers."""
Owner

hm, did you mean to edit the lifetimes.utils python file? As is, this PR adds a brand new utils.py file.

Contributor Author

That's what you get for being lazy and using the web interface rather than a proper client -.-
This is fixed now, though there seem to be some residual issues with period values, which I will fix asap.

@MichaelSchreier
Contributor Author

For compatibility reasons, _find_first_transactions has to return datetime_col as pd.Period. This requires casting the timestamps back to periods before returning the results. This cast turns out to be relatively expensive, degrading the overall performance by a factor of 2-3 (on very large datasets) compared to when no cast is made.
The implementation remains substantially faster than before, and will probably be faster still once pandas 0.24.0 has addressed some of the performance issues with period values.
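
The pattern described in this comment can be sketched as follows, assuming a plain Series of timestamps rather than the actual lifetimes data structures: the intermediate arithmetic stays on timestamps, and the values are cast to periods only once, just before they are handed back. The variable names and sample dates are made up for illustration.

```python
import pandas as pd

# Made-up sample dates, for illustration only.
timestamps = pd.Series(pd.to_datetime(["2018-01-03", "2018-01-05", "2018-02-11"]))

# Intermediate work stays on timestamps, here the number of days between
# each date and the latest date in the series.
elapsed_days = (timestamps.max() - timestamps).dt.days

# Single cast back to periods at the very end, for callers that expect
# pd.Period values. This is the step the comment describes as expensive.
periods = timestamps.dt.to_period("D")
```

Keeping the cast as the last step means it runs once over the final, already aggregated result instead of on every intermediate operation.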

@CamDavidsonPilon
Owner

Thanks for the performance additions @MichaelSchreier! lgtm!

@CamDavidsonPilon CamDavidsonPilon merged commit 00ca929 into CamDavidsonPilon:master Jan 7, 2019
@CamDavidsonPilon CamDavidsonPilon mentioned this pull request Jan 7, 2019