This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Speed up calibration_and_holdout_data #311

Closed
wants to merge 5 commits

Conversation

orenshk

@orenshk orenshk commented Aug 13, 2019

Pandas Period objects are slow to groupby:
pandas-dev/pandas#18053

Fix by using `dt.to_period` and converting to strings before groupby.
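A minimal sketch of the idea (illustrative data and column names, not the actual diff):

import pandas as pd

# Illustrative transaction log; 'id' and 'date' mirror the columns used in the benchmark below.
transactions = pd.DataFrame({
    'id': [1, 1, 2],
    'date': pd.to_datetime(['2014-01-03', '2014-02-10', '2014-01-15']),
})

# Slow: grouping directly on Period objects (see the linked pandas issue).
# periods = transactions['date'].dt.to_period('D')
# transactions.groupby(['id', periods]).size()

# Faster: convert the periods to strings first, then group.
period_strings = transactions['date'].dt.to_period('D').astype(str)
print(transactions.groupby(['id', period_strings]).size())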
@psygo
Collaborator

psygo commented Aug 14, 2019

Thanks for the contribution, I'll hopefully review this soon.

As a best practice and a help for the reviewer(s), though, I would kindly recommend that you:

  1. Prove that what you've implemented is better than what existed.
     i. In your case, you could simply show how much time each implementation takes using the benchmark datasets of this library.
  2. Create/modify an automated test to guarantee that what you did is not going to break something else inside the library.
     i. In the same spirit of 1.i, I've been recommending that people put example scripts inside a new folder called benchmarks (at the top level). I don't know if that's going to scale very well in the far future, but it will certainly help development. It wouldn't go into the master branch, so I think it won't be that big a problem.

I don't recall if I implemented those best practices in the PRs I've created in the past (probably not), but I plan on adding them.

@psygo psygo self-requested a review August 14, 2019 10:53
@psygo psygo added the code improvement label Aug 14, 2019
@orenshk
Author

orenshk commented Aug 14, 2019

Will do. Perhaps a CONTRIBUTING.md file with these instructions?

@psygo
Collaborator

psygo commented Aug 15, 2019

Good suggestion. I was kind of doing that in the Wiki section, but I agree that it's better to add a CONTRIBUTING.md file. I've also added a reference to it in the README.md.

Feel free to add a PR to it; as stated inside it, PRs referring to documentation can be incorporated into the master branch more quickly.

@orenshk
Author

orenshk commented Aug 15, 2019

One thing I just realized I misread in your earlier comment:

Create/modify an automated test to guarantee that what you did is not going to break some other thing inside the library.

So, there is the test folder for unit tests that is run by Travis on the PR. Are you suggesting an additional benchmarks folder for performance regression tests, and adding it to Travis as well?

@orenshk
Author

orenshk commented Aug 15, 2019

Benchmark info.
Script:

import timeit
from lifetimes.utils import calibration_and_holdout_data
from lifetimes.datasets import load_transaction_data

transaction_data = load_transaction_data()

# Statement to time: build calibration/holdout summaries from the transaction log.
run = (
    '''calibration_and_holdout_data(transaction_data, 'id', 'date',
                                    calibration_period_end='2014-09-01',
                                    observation_period_end='2014-12-31')''')

print('running: {}'.format(run))
print(timeit.timeit(run, globals=globals(), number=1))

On dev branch: 0.606072392
On speed_up_calibration_and_holdout_data branch: 0.08493921299999996

I'm not quite sure how to add this and automate in a way that catches a regression.

@orenshk orenshk changed the base branch from master to dev August 15, 2019 16:07
@psygo
Collaborator

psygo commented Aug 16, 2019

So, there is the test folder for unit tests that is run by Travis on the PR. Are you suggesting an additional benchmarks folder for performance regression tests, and adding it to Travis as well?

What I described as the benchmarks folder may indeed look ambiguous or redundant. However, what I meant was more like a scientist's log or report: a way for others to understand the contributor's reasoning. In the end, it shouldn't be much extra work in most cases because, when trying to fix something, we already organize our thinking in some file that compares what exists to what we want to achieve. The only thing I'm suggesting is that we comment it more and perhaps edit the script to be more story-like. This is very common when data scientists develop Jupyter notebooks (notebooks are another option for the benchmarks folder), where extensive comments and annotations are crucial for peer review.

The tests are the more succinct version of the above and, in fact, they focus more on making sure the production version won't break. And, since these tests are written by humans, who might not prescribe the correct requirements, the benchmarks folder serves as an extra guarantee that we can always access the reasoning behind the change and add/remove/fix tests.

I'm not quite sure how to add this and automate in a way that catches a regression.

In your case, in my opinion, the benchmark proof is the different timings you've already shown.

And the test that would seem to prove your contribution won't break what we have is to compare the resulting DataFrames to see if they are equal. One way to achieve this is with pandas.testing.assert_frame_equal.

When creating the test, I would suggest you use one of the lifetimes datasets. There are already some functions that load them inside the lifetimes/datasets/__init__.py file.
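A rough sketch of what such a test might look like (the snapshot file name is hypothetical; it would be generated once from the pre-change implementation):

import pandas as pd
from pandas.testing import assert_frame_equal

from lifetimes.utils import calibration_and_holdout_data
from lifetimes.datasets import load_transaction_data


def test_calibration_and_holdout_data_matches_previous_output():
    transactions = load_transaction_data()
    result = calibration_and_holdout_data(transactions, 'id', 'date',
                                          calibration_period_end='2014-09-01',
                                          observation_period_end='2014-12-31')
    # 'expected_cal_holdout.csv' is a hypothetical snapshot produced once by
    # the pre-change (dev) implementation.
    expected = pd.read_csv('expected_cal_holdout.csv', index_col=0)
    assert_frame_equal(result, expected, check_dtype=False)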

@psygo
Collaborator

psygo commented Aug 16, 2019

Once we reach a more stable understanding of each other, I'll make sure to update the CONTRIBUTING.md file.

@orenshk
Author

orenshk commented Aug 16, 2019

OK, I think that's where my confusion is.

And the test that would seem to prove your contribution won't break what we have is to compare the resulting DataFrames to see if they are equal.

There's a battery of tests in test/test_util.py that test calibration_and_holdout_data and that are run automatically by Travis CI on the PR.

Why isn't this enough for the 'doesn't break anything' test? That's the purpose of unit tests.

The difficulty with comparing against the changed code explicitly is that, by definition, the old code doesn't exist in the branch anymore.

One thing I can see is a notebook / comment in the script, as you suggest, which references the commit hash of the code before the change. Someone wishing to run the benchmark can then check out the dev branch and the other branch and compare the two. But automation seems tricky.

In addition, a speed comparison has the added wrinkle of needing to record the 'old' speed in order to compare to the new. But on what machine, and with what tolerance?

@psygo
Collaborator

psygo commented Aug 16, 2019

There's a battery of tests in test/test_util.py that test calibration_and_holdout_data and that are run automatically by Travis CI on the PR.

If you find an existing test that guarantees your work is not breaking anything and you pass it, then there is no need to add another one. But I don't know whether such a test exists, so what I was saying came from a more general point of view. If I were in your place, in that case, I would either create a test anyway and put it in the notebook, or only reference the name of the test(s) as a comment in my benchmarks file.

My mistake for making it seem that it was "good" to create another redundant test.

In addition, a speed comparison has the added wrinkle of needing to record the 'old' speed in order to compare to the new. But on what machine, and with what tolerance?

I wouldn't say we need absolute numbers: we are comparing the speed of the old (current) implementation to the newer one, so we only need to know that one is faster than the other. As long as you perform both tests on the same machine (running the same programs), it shouldn't be a very relevant factor, I believe.

With respect to the tolerance, that is indeed tricky, but your results show a speedup of more than 7x. I would say that's more than enough, quite a lot actually. Instead of returning results with absolute numbers, you could return the ratio by which your implementation outpaces the older one.
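As a rough illustration of the ratio idea (the callables and the threshold are placeholders, not part of this PR):

import timeit

def speedup_ratio(old_impl, new_impl, number=5):
    """Return how many times faster new_impl runs compared to old_impl."""
    old_time = timeit.timeit(old_impl, number=number)
    new_time = timeit.timeit(new_impl, number=number)
    return old_time / new_time

# Placeholder callables standing in for the dev-branch and refactored versions:
# ratio = speedup_ratio(run_old_calibration, run_new_calibration)
# assert ratio > 2, "performance regression: expected at least a 2x speedup"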

@orenshk
Author

orenshk commented Aug 17, 2019

OK, I think I'm finally grokking you. I'll make the changes and recommit.

@orenshk
Author

orenshk commented Aug 17, 2019

@psygo changes added, lmk if it's what you had in mind.

@jmorten

jmorten commented Jan 13, 2022

Can this be merged?
