This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Speed up calibration_and_holdout_data #311

Closed
wants to merge 5 commits

Conversation

orenshk

@orenshk orenshk commented Aug 13, 2019

Pandas Period objects are slow to groupby:
pandas-dev/pandas#18053

Fix by using `dt.to_period` and converting to strings before groupby.
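A minimal sketch of the idea (illustrative data and column names, not the actual diff):

import pandas as pd

# Illustrative transaction log; 'id' and 'date' mirror the columns used in the benchmark below.
transactions = pd.DataFrame({
    'id': [1, 1, 2],
    'date': pd.to_datetime(['2014-01-03', '2014-02-10', '2014-01-15']),
})

# Slow: grouping directly on Period objects (see the linked pandas issue).
# periods = transactions['date'].dt.to_period('D')
# transactions.groupby(['id', periods]).size()

# Faster: convert the periods to strings first, then group.
period_strings = transactions['date'].dt.to_period('D').astype(str)
print(transactions.groupby(['id', period_strings]).size())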
@psygo
Collaborator

psygo commented Aug 14, 2019

Thanks for the contribution, I'll hopefully review this soon.

As a best practice and a help for the reviewer(s), though, I would kindly recommend that you:

  1. Prove that what you've implemented is better than what existed.
     i. In your case, you could simply show how much time each implementation takes using the benchmark datasets of this library.
  2. Create/modify an automated test to guarantee that what you did is not going to break something else inside the library.
     i. In the same spirit of 1.i, I've been recommending that people put example scripts inside a new folder called benchmarks (at the top level). I don't know if that's going to scale very well in the far future, but it will certainly help development. It wouldn't go into the master branch, so I think it won't be that big a problem.

I don't recall if I implemented those best practices in the PRs I've created in the past (probably not), but I plan on adding them.

@psygo psygo self-requested a review August 14, 2019 10:53
@psygo psygo added the code improvement label Aug 14, 2019
@orenshk
Author

orenshk commented Aug 14, 2019

Will do. Perhaps a CONTRIBUTING.md file with these instructions?

@psygo
Collaborator

psygo commented Aug 15, 2019

Good suggestion. I was kind of doing that in the Wiki section, but I agree that it's better to add a CONTRIBUTING.md file. I've also added a reference to it in the README.md.

Feel free to add a PR to it; as stated inside it, PRs referring to documentation can be incorporated into the master branch more quickly.

@orenshk
Author

orenshk commented Aug 15, 2019

One thing I just realized I misread in your earlier comment:

Create/modify an automated test to guarantee that what you did is not going to break some other thing inside the library.

So, there is the test folder for unit tests that is run by Travis on the PR. Are you suggesting an additional benchmarks folder for performance regression tests, and adding it to Travis as well?

@orenshk
Author

orenshk commented Aug 15, 2019

Benchmark info.
Script:

import timeit
from lifetimes.utils import calibration_and_holdout_data
from lifetimes.datasets import load_transaction_data

transaction_data = load_transaction_data()

# Statement to time: build calibration/holdout summaries from the transaction log.
run = (
    '''calibration_and_holdout_data(transaction_data, 'id', 'date',
                                    calibration_period_end='2014-09-01',
                                    observation_period_end='2014-12-31')''')

print('running: {}'.format(run))
print(timeit.timeit(run, globals=globals(), number=1))

On dev branch: 0.606072392
On speed_up_calibration_and_holdout_data branch: 0.08493921299999996

I'm not quite sure how to add this and automate in a way that catches a regression.

@orenshk orenshk changed the base branch from master to dev August 15, 2019 16:07
@psygo
Collaborator

psygo commented Aug 16, 2019

So, there is the test folder for unit tests that is run by Travis on the PR. Are you suggesting an additional benchmarks folder for performance regression tests, and adding it to Travis as well?

What I described as the benchmarks folder may indeed look ambiguous or redundant. However, what I meant was more like a scientist's log or report: a way for others to understand the contributor's reasoning. In the end, it shouldn't be much extra work in most cases because, when trying to fix something, we already organize our thinking in some file that compares what exists to what we want to achieve. The only thing I'm suggesting is that we comment it more and perhaps edit the script to be more story-like. This is very common when data scientists develop Jupyter notebooks (notebooks are another option for the benchmarks folder), where extensive comments and annotations are crucial for peer review.

The tests are the more succinct version of the above and, in fact, they focus more on making sure the production version won't break. And, since these tests are written by humans, who might not prescribe the correct requirements, the benchmarks folder serves as an extra guarantee that we can always access the reasoning behind the change and add/remove/fix tests.

I'm not quite sure how to add this and automate in a way that catches a regression.

In your case, in my opinion, the benchmark proof is the different timings you've already shown.

And the test that would seem to prove your contribution won't break what we have is to compare the resulting DataFrames to see if they are equal. One way to achieve this is with pandas.testing.assert_frame_equal.

When creating the test, I would suggest you use one of the lifetimes datasets. There are already some functions that load them inside the lifetimes/datasets/__init__.py file.
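A rough sketch of what such a test might look like (the snapshot file name is hypothetical; it would be generated once from the pre-change implementation):

import pandas as pd
from pandas.testing import assert_frame_equal

from lifetimes.utils import calibration_and_holdout_data
from lifetimes.datasets import load_transaction_data


def test_calibration_and_holdout_data_matches_previous_output():
    transactions = load_transaction_data()
    result = calibration_and_holdout_data(transactions, 'id', 'date',
                                          calibration_period_end='2014-09-01',
                                          observation_period_end='2014-12-31')
    # 'expected_cal_holdout.csv' is a hypothetical snapshot produced once by
    # the pre-change (dev) implementation.
    expected = pd.read_csv('expected_cal_holdout.csv', index_col=0)
    assert_frame_equal(result, expected, check_dtype=False)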

@psygo
Collaborator

psygo commented Aug 16, 2019

Once we reach a more stable understanding of each other, I'll make sure to update the CONTRIBUTING.md file.

@orenshk
Author

orenshk commented Aug 16, 2019

OK, I think that's where my confusion is.

And the test that would seem to prove your contribution won't break what we have is to compare the resulting DataFrames to see if they are equal.

There's a battery of tests in test/test_util.py that test calibration_and_holdout_data and that are run automatically by Travis CI on the PR.

Why isn't this enough for the 'doesn't break anything' test? That's the purpose of unit tests.

The difficulty with comparing against the changed code explicitly is that, by definition, the old code doesn't exist in the branch anymore.

One thing I can see is a notebook / comment in the script, as you suggest, which references the commit hash of the code before the change. Someone wishing to run the benchmark can then check out the dev branch and the other branch and compare the two. But automation seems tricky.

In addition, a speed comparison has the added wrinkle of needing to record the 'old' speed in order to compare to the new. But on what machine, and with what tolerance?

@psygo
Collaborator

psygo commented Aug 16, 2019

There's a battery of tests in test/test_util.py that test calibration_and_holdout_data and that are run automatically by Travis CI on the PR.

If you find an existing test that guarantees your work is not breaking anything and you pass it, then there is no need to add another one. But I don't know whether such a test exists, so what I was saying came from a more general point of view. If I were in your place, in that case, I would either create a test anyway and put it in the notebook, or only reference the name of the test(s) as a comment in my benchmarks file.

My mistake for making it seem that it was "good" to create another redundant test.

In addition, a speed comparison has the added wrinkle of needing to record the 'old' speed in order to compare to the new. But on what machine, and with what tolerance?

I wouldn't say we need absolute numbers: we are comparing the speed of the old (current) implementation to the newer one, so we only need to know that one is faster than the other. As long as you perform both tests on the same machine (running the same programs), it shouldn't be a very relevant factor, I believe.

With respect to the tolerance, that is indeed tricky, but your results show a speedup of more than 7x. I would say that's more than enough, quite a lot actually. Instead of returning results with absolute numbers, you could return the ratio by which your implementation outpaces the older one.
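As a rough illustration of the ratio idea (the callables and the threshold are placeholders, not part of this PR):

import timeit

def speedup_ratio(old_impl, new_impl, number=5):
    """Return how many times faster new_impl runs compared to old_impl."""
    old_time = timeit.timeit(old_impl, number=number)
    new_time = timeit.timeit(new_impl, number=number)
    return old_time / new_time

# Placeholder callables standing in for the dev-branch and refactored versions:
# ratio = speedup_ratio(run_old_calibration, run_new_calibration)
# assert ratio > 2, "performance regression: expected at least a 2x speedup"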

@orenshk
Author

orenshk commented Aug 17, 2019

OK, I think I'm finally grokking you. I'll make the changes and recommit.

@orenshk
Author

orenshk commented Aug 17, 2019

@psygo changes added, lmk if it's what you had in mind.

@jmorten

jmorten commented Jan 13, 2022

Can this be merged?
