Calculating CLV causing cluster to crash for not so large dataset #1077

Open
haseeb1431 opened this issue Oct 6, 2024 · 12 comments

@haseeb1431

Hi,

I have used the lifetimes package to write a small program that calculates CLV for some of my customers, and it has worked fine for most of my datasets. Recently I started exploring this package and implemented it following the quick start guide. Thanks for putting together an amazing guide.

I am using the BG/NBD and Gamma-Gamma models to calculate the CLV. Initially, I took a subset of the dataset to complete the end-to-end implementation, and most things worked fine.

Now I have the full dataset with 20 million rows and 200K customers (the RFM dataset size). When I try to calculate the CLV for the whole dataset, it crashes the cluster. My code is as follows:

# gg: fitted Gamma-Gamma model; bgm: fitted BG/NBD model; rfm_data: RFM summary DataFrame
clv_estimate = gg.expected_customer_lifetime_value(
    transaction_model=bgm,
    data=rfm_data,
    future_t=12,  # months
    discount_rate=0.01,  # monthly discount rate ~ 12.7% annually
    time_unit="D",
)

When I run it with 1,000 rows, it calculates the clv_estimates, and 10,000 rows work as well, but any more rows crash the cluster. I wonder if I am missing something, since such a small dataset (around 6 MB) should not require more than 40 GB of memory.

Currently, I am thinking of using a batch approach where I split my rfm_data into smaller batches and calculate the CLV for each batch. However, I wanted to be sure I haven't missed something that is causing such poor performance.
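Roughly, I am imagining something like this (an untested sketch; the batch size and the reduction to point estimates are just illustrative choices, not part of the library's API):

import pandas as pd

# Illustrative batching sketch: compute CLV per chunk of the RFM frame and
# stitch the results back together to keep peak memory bounded.
batch_size = 10_000
clv_batches = []
for start in range(0, len(rfm_data), batch_size):
    batch = rfm_data.iloc[start:start + batch_size]
    clv_batch = gg.expected_customer_lifetime_value(
        transaction_model=bgm,
        data=batch,
        future_t=12,
        discount_rate=0.01,
        time_unit="D",
    )
    # Collapse the posterior (chain, draw) dimensions to a point estimate per
    # customer so only one value per customer stays in memory.
    clv_batches.append(clv_batch.mean(("chain", "draw")).to_pandas())

clv_all = pd.concat(clv_batches)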

Some other details:

  • My small cluster is two nodes with 64 GB memory and 36 cores each (c4.8xlarge on AWS Databricks).
  • The data is read using Spark but converted to pandas for the library methods.
@haseeb1431
Author

The following image shows that I initially ran with 1,000 rows and then with 100,000 rows, and the used memory also went up roughly 10 times (test cluster with 192 GB memory).
[Screenshot: cluster memory usage, 2024-10-07]

@haseeb1431
Author

I have found another interesting thread that makes me wonder if there is something wrong with the combination of Databricks and the pymc-marketing package.

@ColtAllen
Collaborator

Hey @haseeb1431,

Try using thin_fit_result() to reduce the number of computations. For a full MCMC fit, there will be (n_customers * chains * draws * time periods) CLV calculations run, which is a lot to hold in memory. For extremely large datasets you may want to just do a MAP fit instead.
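For example, something along these lines (the import path and the keep_every argument are assumptions here; check the signature in your installed pymc-marketing version):

# Assumed import path and signature -- verify against your installed version.
from pymc_marketing.clv.utils import thin_fit_result

# Keep only every 10th posterior draw so far fewer (chain x draw) samples are
# carried through the CLV computation.
fitted_bg_thinned = thin_fit_result(bgm, keep_every=10)
fitted_gg_thinned = thin_fit_result(gg, keep_every=10)

clv_estimate = fitted_gg_thinned.expected_customer_lifetime_value(
    transaction_model=fitted_bg_thinned,
    data=rfm_data,
    future_t=12,
    discount_rate=0.01,
    time_unit="D",
)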

@haseeb1431
Author

@ColtAllen using the thin_fit_result() has been really helpful.

A quick question: currently I am calculating predictions for years 1, 3, 5, 7, and 10. Does it make predictions progressively for future periods?
For example, a prediction for one year runs faster than a prediction for 10 years. If it is running progressively, is there a way to expose the intermediate results as well, so that when I run a prediction for the 10th year I also get the values for all the years from 1 through 10?

@ColtAllen
Collaborator

A quick question: currently I am calculating predictions for years 1, 3, 5, 7, and 10. Does it make predictions progressively for future periods?

Yes; it will iteratively run predictions and sum them together, so 1 year would be 12 monthly predictions, 3 years would be 36, and so on.
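For intuition, the iterative sum amounts to the standard discounted-CLV accumulation (a sketch of the logic, not the library's exact internals):

$$\mathrm{CLV}(T) \approx \sum_{t=1}^{T} \frac{m\,\big(\mathbb{E}[X(0,t)] - \mathbb{E}[X(0,t-1)]\big)}{(1+d)^{t}}$$

where m is the expected monetary value per purchase, E[X(0, t)] is the expected number of purchases through month t, and d is the monthly discount rate.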

For intermediate results you can try the following approach adapted from the tutorial notebooks:

import arviz
import xarray

# ggm: fitted Gamma-Gamma model; transaction_model/data as in the earlier snippets.
time_periods = 10

expected_clv = xarray.concat(
    objs=[
        ggm.expected_customer_lifetime_value(
            transaction_model=bgm,
            data=rfm_data,
            future_t=t,
        )
        for t in range(time_periods)
    ],
    dim="t",
).transpose(..., "t")

arviz.plot_hdi(
    range(time_periods),
    expected_clv,  # plus any plotting kwargs, e.g. hdi_prob
)

@haseeb1431
Author

@ColtAllen Thanks for the suggestions; however, it seems I didn't explain the issue quite well. Let me try again.

My goal is to calculate the CLV for 1, 3, 5, 7, and 10 years. Right now I call expected_customer_lifetime_value for each year (1, 3, 5, 7, 10) separately, and it starts calculating from month 0 every time. Since the CLV calculation is progressive, calculating the CLV for 10 years internally calculates the CLV for every year from the first year through year 10:

  • Year 1 CLV will calculate all the months from 1 to 12
  • Year 3 CLV will calculate all the months from 1 to 36 (including the 12 months we calculated last time)
  • Year 5 CLV will calculate all the months from 1 to 60 (including the 36 months we calculated last time)
  • Year 7 CLV will calculate all the months from 1 to 84 (including the 60 months we calculated last time)
  • Year 10 CLV will calculate all the months from 1 to 120 (including the 84 months we calculated last time)

So a lot of calculations are repeated: 12 + 36 + 60 + 84 + 120 = 312 monthly steps instead of the 120 needed for the longest horizon. What I wanted to ask is: if we calculate the CLV for the 10th year, can we also get the intermediate results, such as the prediction at each month, so that we can reuse them and avoid running the predictions again and again? I think that's not possible with the current code unless it is updated to store the monthly values. Thoughts?

@wd60622
Contributor

wd60622 commented Oct 17, 2024

Are you able to perform the 10-year calculation with the whole (or a partial) dataset?

@haseeb1431
Author

haseeb1431 commented Oct 17, 2024

@wd60622 Yes, with the whole dataset for 10 years. However, I am using the thinned model.

This code took over an hour for me to execute:

rfm_data_pandas = data.toPandas()
for i in [1, 3, 5, 7, 10]:
    print(rfm_data_pandas.head())
    clv_thinned = fitted_gg_thinned.expected_customer_lifetime_value(
        transaction_model=fitted_bg_thinned,
        data=rfm_data_pandas,
        future_t=12 * i,  # months
        discount_rate=0.01,  # monthly discount rate ~ 12.7% annually
        time_unit="D",  # original data is in days
    )
    print(f'predicted and saving for year {i}')
    rfm_data_pandas[f'clv_{i}_years'] = clv_thinned.mean(("chain", "draw")).values.round(2)

@ColtAllen
Collaborator

ColtAllen commented Oct 21, 2024

Can we also get the intermediate results, such as the prediction at each month, so that we can reuse them and avoid running the predictions again and again? I think that's not possible with the current code unless it is updated to store the monthly values. Thoughts?

Correct; intermediate values are not currently being stored. Feel free to create an issue for it.

Interval selection is possible for purchase probabilities, but not for expected purchases.

@ColtAllen
Collaborator

This code took over an hour for me to execute

I've created an issue to speed up fitting this model: #1123

Also, have you tried using nutpie?

model.fit(fit_method="map") would still be the most performant solution overall, and given the amount of data you're working with, the results would be no different from the default MCMC fit. However, you'll lose credibility intervals for predictions.
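For reference, the two options look roughly like this (the nuts_sampler keyword is assumed to be passed through to pm.sample, and nutpie must be installed separately; check the pymc-marketing and PyMC docs for your versions):

# Option 1: MAP fit -- fastest, point estimates only (no posterior intervals).
bgm.fit(fit_method="map")

# Option 2: full MCMC with the nutpie sampler (requires installing nutpie).
# nuts_sampler is assumed to be forwarded to pm.sample() by model.fit().
bgm.fit(fit_method="mcmc", nuts_sampler="nutpie")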

@ColtAllen
Collaborator

@haseeb1431 the new default priors for BetaGeoModel can speed up fit times by 40-50%, and will be available in the next release in January.

@haseeb1431
Author

I will implement it once it's released and let you know. Thanks

@ColtAllen ColtAllen removed their assignment Jan 18, 2025