Calculating CLV causing cluster to crash for not so large dataset #1077
I have found another interesting thread that makes me wonder if there is something wrong with the combination of Databricks and the pymc-marketing package.
Hey @haseeb1431, try using …
@ColtAllen using the … A quick question: currently I am calculating the prediction for years 1, 3, 5, 7, and 10. Does it make the predictions progressively for future periods?
Yes; it will iteratively run predictions and sum them together, so 1 year would be 12 monthly predictions, 3 years would be 36, and so on. For intermediate results you can try the following approach, adapted from the tutorial notebooks (`*args` stands in for the remaining model and data arguments):

```python
import arviz
import xarray

# Number of future periods to predict over.
time_periods = 10

# Run the CLV prediction for each horizon and stack the results
# along a new "t" dimension.
expected_clv = xarray.concat(
    objs=[
        ggm.expected_customer_lifetime_value(
            future_t=t,
            *args,
        )
        for t in range(time_periods)
    ],
    dim="t",
).transpose(..., "t")

# Plot the highest-density interval of expected CLV over time.
arviz.plot_hdi(
    range(time_periods),
    expected_clv,
    *args,
)
```
@ColtAllen Thanks for the suggestions, however it seems I couldn't explain the issue quite well. Let me try again. My goal is to calculate the CLV for 1, 3, 5, 7, and 10 years. Currently I call the prediction method once for each of those horizons.
So a lot of the calculations are repeated. What I wanted to know is: if we calculate the CLV for the 10th year, can we also get the intermediate results, such as the prediction at each month, so that we can reuse that data and avoid running the predictions again and again? I think it's not possible as per the current code unless we update it to store the monthly values. Thoughts?
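For illustration only, here is a rough user-side sketch of the kind of reuse being described: compute the discounted monthly increments once, then read every horizon off a single cumulative sum. The methods `expected_purchases` and `expected_customer_spend` and their signatures are assumptions that may differ between pymc-marketing versions, and this is not how `expected_customer_lifetime_value` currently exposes its internals:

```python
import xarray

# Sketch only: method names and signatures below are assumptions.
discount_rate = 0.01                                 # monthly discount rate (assumed)
horizons = {1: 12, 3: 36, 5: 60, 7: 84, 10: 120}     # years -> months
max_t = max(horizons.values())

# Cumulative expected purchases after t months; one closed-form call per month.
cum_purchases = {0: 0.0}
for t in range(1, max_t + 1):
    cum_purchases[t] = bgm.expected_purchases(data=rfm_data, future_t=t)

# Expected spend per purchase from the Gamma-Gamma model.
spend = ggm.expected_customer_spend(data=rfm_data)

# Discounted value generated *within* each month, stacked along a new "t" dimension.
monthly_clv = xarray.concat(
    [
        (cum_purchases[t] - cum_purchases[t - 1]) * spend / (1 + discount_rate) ** t
        for t in range(1, max_t + 1)
    ],
    dim="t",
)

# One cumulative sum gives CLV at every month; slice out the horizons needed.
clv_by_month = monthly_clv.cumsum("t")
clv_by_year = {years: clv_by_month.isel(t=months - 1) for years, months in horizons.items()}
```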
Are you able to perform the 10-year calculation with the whole (or a partial) dataset?
@wd60622 Yes, with the whole dataset for 10 years. However, I am using the thinned model. This code took over an hour for me to execute.
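For context, a minimal sketch of what thinning can look like, assuming the `thin_fit_result` method on pymc-marketing CLV models (name, signature, and availability may vary by version):

```python
# Assumption: CLV models expose `thin_fit_result`, which returns a copy of the
# fitted model keeping only every k-th posterior draw, shrinking the
# (chain, draw, customer) arrays used during prediction.
bgm_thin = bgm.thin_fit_result(keep_every=10)
ggm_thin = ggm.thin_fit_result(keep_every=10)

clv_10yr = ggm_thin.expected_customer_lifetime_value(
    transaction_model=bgm_thin,
    data=rfm_data,
    future_t=120,        # 10 years in months
    discount_rate=0.01,
)
```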
Correct; intermediate values are not currently being stored, so feel free to create an issue for that. Interval selection is possible for purchase probabilities, but not for expected purchases.
I've created an issue to speed up fitting this model: #1123. Also, have you tried using …?
@haseeb1431 the new default priors for …
I will implement it once it's released and let you know. Thanks |
Hi,
I have used the lifetimes package to write a small program that calculates CLV for some of my customers, and it has worked fine for most of my datasets. Recently I started exploring this package and implemented it following the quick start guide. Thanks for putting up an amazing guide.
I am using the BG/NBD and Gamma-Gamma models to calculate CLV. Initially, I took a subset of the dataset to complete the end-to-end implementation, and most things worked fine.
Now, I have a dataset with 20 million rows and about 200K customers (that is the size of the resulting RFM dataset). When I try to calculate the CLV for the whole dataset, it crashes the cluster. My code is as follows:
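The original code is not included in this excerpt; the following is only a minimal sketch of the BG/NBD plus Gamma-Gamma workflow described above, assuming `rfm_data` is an RFM summary with `customer_id`, `frequency`, `recency`, `T`, and `monetary_value` columns. Column and parameter names may differ between pymc-marketing versions.

```python
from pymc_marketing import clv

# Sketch only (not the original code). Fit BG/NBD on the RFM summary.
bgm = clv.BetaGeoModel(data=rfm_data)
bgm.fit()

# Gamma-Gamma is fit on repeat customers only (frequency > 0).
repeat_customers = rfm_data.query("frequency > 0")
ggm = clv.GammaGammaModel(data=repeat_customers)
ggm.fit()

# Combine both models to get CLV over a 10-year (120-month) horizon.
clv_estimates = ggm.expected_customer_lifetime_value(
    transaction_model=bgm,
    data=repeat_customers,
    future_t=120,
    discount_rate=0.01,
)
```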
When I run with 1,000 rows, I see that it calculates the clv_estimates, and even 10,000 rows works, but any more rows lead to a cluster crash. I wonder if I am missing something, since such a small dataset (around 6 MB) should not require more than 40 GB of memory.
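One possible explanation, as a back-of-the-envelope estimate rather than a measurement: the CLV methods evaluate every customer against every posterior sample, so intermediate arrays scale with customers times chains times draws, not with the size of the RFM table.

```python
# Rough estimate only (assumed sampler defaults of 4 chains x 1,000 draws).
customers, chains, draws = 200_000, 4, 1_000
bytes_per_float = 8
per_array_gb = customers * chains * draws * bytes_per_float / 1e9
print(f"~{per_array_gb:.1f} GB per broadcast intermediate array")  # ~6.4 GB
# The iterative CLV calculation can materialize several arrays of this shape
# per monthly step, so peak memory can plausibly exceed the 40 GB observed above.
```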
Currently, I am thinking of using a batch approach where I split my rfm_data into smaller batches and calculate the CLV for each batch of customers, as sketched below. However, I wanted to be sure I haven't missed something that is causing such poor performance.
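For what it's worth, a minimal sketch of that batch approach, splitting the RFM summary into chunks of customers and concatenating the results; variable and parameter names follow the assumptions in the sketch above:

```python
import pandas as pd

# Sketch only: process the RFM summary in chunks of customers to bound peak memory.
# Assumes the CLV result is an xarray DataArray with chain/draw/customer dimensions.
batch_size = 10_000
clv_parts = []
for start in range(0, len(rfm_data), batch_size):
    batch = rfm_data.iloc[start:start + batch_size]
    clv_batch = ggm.expected_customer_lifetime_value(
        transaction_model=bgm,
        data=batch,
        future_t=120,          # 10 years in months
        discount_rate=0.01,
    )
    # Reduce over the posterior and keep one value per customer in this batch.
    clv_parts.append(clv_batch.mean(("chain", "draw")).to_pandas())

clv_estimates = pd.concat(clv_parts)
```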
Some other details: