Conversation
Pandas `Period` objects are slow to group by: pandas-dev/pandas#18053. Fix by using `dt.to_period` and converting to strings before the `groupby`.
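A minimal sketch of the optimization the PR describes (the column names here are illustrative, not the library's actual ones): instead of grouping directly on pandas `Period` objects, which is slow per pandas-dev/pandas#18053, convert the periods to strings first and group on those.

```python
import pandas as pd

# Toy transaction data standing in for the library's real dataset
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2014-01-03", "2014-01-20", "2014-02-05", "2014-02-17"]
    ),
})

# Slow path (what the PR replaces): grouping on Period objects directly
# counts = df.groupby(df["date"].dt.to_period("M"))["id"].count()

# Faster path: convert the periods to strings before grouping
period_str = df["date"].dt.to_period("M").astype(str)
counts = df.groupby(period_str)["id"].count()
print(counts)
```

Both paths produce the same monthly grouping; only the dtype of the group keys differs (strings instead of `Period` objects).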
Thanks for the contribution; I'll hopefully review this soon. As a best practice, and to help the reviewer(s), I would kindly recommend that you:
I don't recall if I implemented those best practices in the PRs I've created in the past (probably not), but I plan on adding them.
Will do. Perhaps a CONTRIBUTING.md file with these instructions?
Good suggestion. I was kind of doing that in the Wiki section, but I agree that it's better to add a `CONTRIBUTING.md`. Feel free to add a PR for it; as stated inside it, PRs referring to documentation can be incorporated more quickly.
One thing I just realized I misread in your earlier comment:
So there is the tests folder for unit tests that is run by Travis on the PR. Are you suggesting an additional one?
Benchmark info:

```python
import timeit

from lifetimes.utils import calibration_and_holdout_data
from lifetimes.datasets import load_transaction_data

transaction_data = load_transaction_data()
run = (
    "calibration_and_holdout_data(transaction_data, 'id', 'date', "
    "calibration_period_end='2014-09-01', "
    "observation_period_end='2014-12-31')"
)
print('running: {}'.format(run))
print(timeit.timeit(run, globals=globals(), number=1))
```

I'm not quite sure how to add this and automate it in a way that catches a regression.
What I described as the … The tests are a more succinct version of the above and, in fact, they focus more on making sure the production version won't break. And, since these tests are made by humans, who might not prescribe the correct requirements, the …
In your case, in my opinion, the … And the test that would seem to prove your contribution won't break what we have is to compare the resulting DataFrames to see if they are equal. One way to achieve this is with … When creating the test, I would suggest you use one of the …
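The DataFrame-equality check suggested above could be sketched like this. Note that `old_impl` and `new_impl` are hypothetical stand-ins for the pre- and post-change implementations of `calibration_and_holdout_data`, not the library's real functions; the comparison uses `pandas.testing.assert_frame_equal`.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def old_impl(df):
    # Stand-in for the original, Period-based implementation
    return df.groupby(df["date"].dt.to_period("M")).size().reset_index(name="n")


def new_impl(df):
    # Stand-in for the faster, string-based implementation
    out = df.groupby(df["date"].dt.to_period("M").astype(str)).size().reset_index(name="n")
    # Restore the period dtype so the two frames are directly comparable
    out["date"] = pd.PeriodIndex(out["date"], freq="M")
    return out


df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-03", "2014-01-20", "2014-02-05"]),
})

# Raises an AssertionError if the two implementations ever diverge
assert_frame_equal(old_impl(df), new_impl(df))
```

Dropped into the test suite, this gives a concrete "doesn't break" guarantee: if the optimized code ever produces a different DataFrame, the test fails.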
Once we reach a more stable understanding of each other, I'll make sure to update the …
OK I think that's my confusion.
There's a battery of tests in … Why isn't this enough for the "doesn't break" test? That's the purpose of unit tests. The difficulty with comparing to the changed code explicitly is that, by definition, the code doesn't exist in the branch anymore. One thing I can see is a notebook / comment in the script, as you suggest, which references the commit hash of the code before my change. Someone wishing to run the benchmark can then check out the dev branch and the other branch, and compare the two. But automation seems tricky. In addition, a speed comparison has the added wrinkle of needing to record the "old" speed in order to compare it to the new. But on what machine, and with what tolerance?
If you find an existing test that guarantees your work is not breaking anything and you pass it, then there is no need to add another one. But I don't know if such a test exists, so what I was saying came from a more general point of view. If I were in your place, in that case, I would either create a test anyway and put it in the notebook, or only reference the name of the test(s) as a comment in my … My mistake for making it seem that it was "good" to create another redundant test.
I wouldn't say we need absolute numbers; we are comparing the speed of the old (current) implementation to the newer one, so we only need to know that one is better than the other. As long as you perform both tests on the same machine (running the same programs), it shouldn't be a very relevant factor, I believe. With respect to the tolerance, indeed that's tricky, but your results prove to be more than 7x faster; I would say that's quite a lot, actually. Instead of returning results with absolute numbers, you could return the ratio by which your implementation outpaces the older one.
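The ratio-based check suggested above could look roughly like this. The two functions are placeholders (not the library's real old and new implementations); the point is that timing both on the same machine and asserting on the speedup ratio sidesteps the "which machine, which tolerance" problem of absolute numbers.

```python
import timeit


def old_impl():
    # Placeholder for the slower, pre-change implementation
    return sum(i * i for i in range(10_000))


def new_impl():
    # Placeholder for the faster, post-change implementation
    return sum(i * i for i in range(1_000))


# Time both on the same machine, in the same run
old_t = timeit.timeit(old_impl, number=50)
new_t = timeit.timeit(new_impl, number=50)

ratio = old_t / new_t
print("speedup: {:.1f}x".format(ratio))
# A regression test could then assert something like `ratio > threshold`,
# with a generous threshold to absorb machine-to-machine variance.
```

For the PR at hand, the two callables would be the old and new versions of `calibration_and_holdout_data`, and the reported ~7x speedup would leave plenty of headroom above a conservative threshold.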
OK, I think I'm finally grokking you. I'll make the changes and recommit.
@psygo changes added, lmk if it's what you had in mind.
Can this be merged?