Calculating CLV causing cluster to crash for not so large dataset #1077

Open
haseeb1431 opened this issue Oct 6, 2024 · 12 comments

@haseeb1431

Hi,

I have used the lifetimes package to write a small program that calculates CLV for some of my customers, and it has worked fine for most of my datasets. Recently I started exploring this package and implemented it following the quick start guide. Thanks for putting together an amazing guide.

I am using the BG/NBD and Gamma-Gamma models to calculate the CLV. Initially, I took a subset of the dataset to complete the end-to-end implementation, and most things worked fine.

Now I have the full dataset with 20 million rows and 200K customers (the RFM dataset size). When I try to calculate the CLV for the whole dataset, it crashes the cluster. My code is as follows:

# gg: fitted Gamma-Gamma model; bgm: fitted BG/NBD model; rfm_data: RFM summary DataFrame
clv_estimate = gg.expected_customer_lifetime_value(
    transaction_model=bgm,
    data=rfm_data,
    future_t=12,  # months
    discount_rate=0.01,  # monthly discount rate ~ 12.7% annually
    time_unit="D",
)

When I run it with 1,000 rows, it calculates the clv_estimates, and 10,000 rows work as well, but any more rows crash the cluster. I wonder if I am missing something, since such a small dataset (around 6 MB) should not require more than 40 GB of memory.

Currently, I am thinking of using a batch approach where I split my rfm_data into smaller batches and calculate the CLV for each batch. However, I wanted to be sure I haven't missed something that is causing such poor performance.
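Roughly, I am imagining something like this (an untested sketch; the batch size and the reduction to point estimates are just illustrative choices, not part of the library's API):

import pandas as pd

# Illustrative batching sketch: compute CLV per chunk of the RFM frame and
# stitch the results back together to keep peak memory bounded.
batch_size = 10_000
clv_batches = []
for start in range(0, len(rfm_data), batch_size):
    batch = rfm_data.iloc[start:start + batch_size]
    clv_batch = gg.expected_customer_lifetime_value(
        transaction_model=bgm,
        data=batch,
        future_t=12,
        discount_rate=0.01,
        time_unit="D",
    )
    # Collapse the posterior (chain, draw) dimensions to a point estimate per
    # customer so only one value per customer stays in memory.
    clv_batches.append(clv_batch.mean(("chain", "draw")).to_pandas())

clv_all = pd.concat(clv_batches)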

Some other details:

  • My small cluster is two nodes with 64 GB memory and 36 cores each (c4.8xlarge on AWS Databricks).
  • The data is read using Spark but converted to pandas for the library methods.
@haseeb1431
Author

The following image shows that I initially ran with 1,000 rows and then with 100,000 rows, and the used memory also went up roughly 10 times (test cluster with 192 GB memory).
[Screenshot: cluster memory usage, 2024-10-07]

@haseeb1431
Author

I have found another interesting thread that makes me wonder if there is something wrong with the combination of Databricks and the pymc-marketing package.

@ColtAllen
Collaborator

Hey @haseeb1431,

Try using thin_fit_result() to reduce the number of computations. For a full MCMC fit, there will be (n_customers * chains * draws * time periods) CLV calculations run, which is a lot to hold in memory. For extremely large datasets you may want to just do a MAP fit instead.
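For example, something along these lines (the import path and the keep_every argument are assumptions here; check the signature in your installed pymc-marketing version):

# Assumed import path and signature -- verify against your installed version.
from pymc_marketing.clv.utils import thin_fit_result

# Keep only every 10th posterior draw so far fewer (chain x draw) samples are
# carried through the CLV computation.
fitted_bg_thinned = thin_fit_result(bgm, keep_every=10)
fitted_gg_thinned = thin_fit_result(gg, keep_every=10)

clv_estimate = fitted_gg_thinned.expected_customer_lifetime_value(
    transaction_model=fitted_bg_thinned,
    data=rfm_data,
    future_t=12,
    discount_rate=0.01,
    time_unit="D",
)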

@haseeb1431
Author

@ColtAllen using the thin_fit_result() has been really helpful.

A quick question: currently I am calculating predictions for years 1, 3, 5, 7, and 10. Does it make predictions progressively for future periods?
For example, a prediction for one year runs faster than a prediction for 10 years. If it is running progressively, is there a way to expose the intermediate results as well, so that when I run a prediction for the 10th year I also get the values for all the years from 1 through 10?

@ColtAllen
Collaborator

A quick question: currently I am calculating predictions for years 1, 3, 5, 7, and 10. Does it make predictions progressively for future periods?

Yes; it will iteratively run predictions and sum them together, so 1 year would be 12 monthly predictions, 3 years would be 36, and so on.
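For intuition, the iterative sum amounts to the standard discounted-CLV accumulation (a sketch of the logic, not the library's exact internals):

$$\mathrm{CLV}(T) \approx \sum_{t=1}^{T} \frac{m\,\big(\mathbb{E}[X(0,t)] - \mathbb{E}[X(0,t-1)]\big)}{(1+d)^{t}}$$

where m is the expected monetary value per purchase, E[X(0, t)] is the expected number of purchases through month t, and d is the monthly discount rate.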

For intermediate results you can try the following approach adapted from the tutorial notebooks:

import arviz
import xarray

# ggm: fitted Gamma-Gamma model; transaction_model/data as in the earlier snippets.
time_periods = 10

expected_clv = xarray.concat(
    objs=[
        ggm.expected_customer_lifetime_value(
            transaction_model=bgm,
            data=rfm_data,
            future_t=t,
        )
        for t in range(time_periods)
    ],
    dim="t",
).transpose(..., "t")

arviz.plot_hdi(
    range(time_periods),
    expected_clv,  # plus any plotting kwargs, e.g. hdi_prob
)

@haseeb1431
Author

@ColtAllen Thanks for the suggestions; however, it seems I didn't explain the issue quite well. Let me try again.

My goal is to calculate the CLV for 1, 3, 5, 7, and 10 years. Right now I call expected_customer_lifetime_value for each year (1, 3, 5, 7, 10) separately, and it starts calculating from month 0 every time. Since the CLV calculation is progressive, calculating the CLV for 10 years internally calculates the CLV for every year from the first year through year 10:

  • Year 1 CLV will calculate all the months from 1 to 12
  • Year 3 CLV will calculate all the months from 1 to 36 (including the 12 months we calculated last time)
  • Year 5 CLV will calculate all the months from 1 to 60 (including the 36 months we calculated last time)
  • Year 7 CLV will calculate all the months from 1 to 84 (including the 60 months we calculated last time)
  • Year 10 CLV will calculate all the months from 1 to 120 (including the 84 months we calculated last time)

So a lot of calculations are repeated: 12 + 36 + 60 + 84 + 120 = 312 monthly steps instead of the 120 needed for the longest horizon. What I wanted to ask is: if we calculate the CLV for the 10th year, can we also get the intermediate results, such as the prediction at each month, so that we can reuse them and avoid running the predictions again and again? I think that's not possible with the current code unless it is updated to store the monthly values. Thoughts?

@wd60622
Contributor

wd60622 commented Oct 17, 2024

Are you able to perform the 10-year calculation with the whole (or a partial) dataset?

@haseeb1431
Author

haseeb1431 commented Oct 17, 2024

@wd60622 Yes, with the whole dataset for 10 years. However, I am using the thinned model.

This code took over an hour for me to execute:

rfm_data_pandas = data.toPandas()
for i in [1, 3, 5, 7, 10]:
    print(rfm_data_pandas.head())
    clv_thinned = fitted_gg_thinned.expected_customer_lifetime_value(
        transaction_model=fitted_bg_thinned,
        data=rfm_data_pandas,
        future_t=12 * i,  # months
        discount_rate=0.01,  # monthly discount rate ~ 12.7% annually
        time_unit="D",  # original data is in days
    )
    print(f'predicted and saving for year {i}')
    rfm_data_pandas[f'clv_{i}_years'] = clv_thinned.mean(("chain", "draw")).values.round(2)

@ColtAllen
Collaborator

ColtAllen commented Oct 21, 2024

Can we also get the intermediate results, such as the prediction at each month, so that we can reuse them and avoid running the predictions again and again? I think that's not possible with the current code unless it is updated to store the monthly values. Thoughts?

Correct; intermediate values are not currently being stored. Feel free to create an issue for it.

Interval selection is possible for purchase probabilities, but not for expected purchases.

@ColtAllen
Collaborator

This code took over an hour for me to execute

I've created an issue to speed up fitting this model: #1123

Also, have you tried using nutpie?

model.fit(fit_method="map") would still be the most performant solution overall, and given the amount of data you're working with, the results would be no different from the default MCMC fit. However, you'll lose credibility intervals for predictions.
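For reference, the two options look roughly like this (the nuts_sampler keyword is assumed to be passed through to pm.sample, and nutpie must be installed separately; check the pymc-marketing and PyMC docs for your versions):

# Option 1: MAP fit -- fastest, point estimates only (no posterior intervals).
bgm.fit(fit_method="map")

# Option 2: full MCMC with the nutpie sampler (requires installing nutpie).
# nuts_sampler is assumed to be forwarded to pm.sample() by model.fit().
bgm.fit(fit_method="mcmc", nuts_sampler="nutpie")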

@ColtAllen
Collaborator

@haseeb1431 the new default priors for BetaGeoModel can speed up fit times by 40-50%, and will be available in the next release in January.

@haseeb1431
Author

I will implement it once it's released and let you know. Thanks

@ColtAllen ColtAllen removed their assignment Jan 18, 2025