Functionality to let LightGBM effectively handle categorical features #1585

rijkvandermeulen · 2023-02-21T11:45:29Z

Summary

This PR allows the user to specify (when fitting the model) which of the covariates should be treated as categorical features by the underlying LGBMRegressor. It supports categorical past, future, and static covariates.
A similar approach can be taken for the other RegressionModels that use gradient boosting on decision trees (XGBoost and CatBoost), though they would require a bit more effort. CatBoost, for example, does not allow passing floating point numbers as categorical features, so we cannot simply pass X as an ndarray as we currently do. We would need an intermediate step that preprocesses the training_samples into a format the model accepts. Would you like to do this as part of this PR, or create a separate ticket?

Why have this?

For categorical features (especially with high cardinality), LightGBM's native handling of categorical features usually works better than one-hot (or any other type of) encoding. In my experience, this can have quite a big impact in practice: better performance and faster training. For more info see: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features
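As a rough illustration of why native handling can beat one-hot or ordinal encoding: the linked LightGBM docs describe sorting a categorical feature's histogram by sum_gradient / sum_hessian and then searching for the best split on that sorted order. The sketch below is a simplified pure-Python version of that idea, not LightGBM's actual implementation. It ignores regularization, smoothing, and minimum-data constraints, and `best_categorical_split` as well as the toy data are made up for this example.

```python
from collections import defaultdict


def best_categorical_split(categories, gradients, hessians):
    """Return (left_categories, gain) for the best binary partition found by
    scanning categories sorted by sum_gradient / sum_hessian."""
    grad_sum = defaultdict(float)
    hess_sum = defaultdict(float)
    for cat, g, h in zip(categories, gradients, hessians):
        grad_sum[cat] += g
        hess_sum[cat] += h

    # Order the categories by their accumulated gradient/hessian ratio.
    ordered = sorted(grad_sum, key=lambda c: grad_sum[c] / hess_sum[c])
    total_g = sum(grad_sum.values())
    total_h = sum(hess_sum.values())

    best_left, best_gain = None, float("-inf")
    left_g = left_h = 0.0
    # Try every cut point in the sorted order (the last one would leave the right side empty).
    for i, cat in enumerate(ordered[:-1]):
        left_g += grad_sum[cat]
        left_h += hess_sum[cat]
        right_g, right_h = total_g - left_g, total_h - left_h
        # Unregularized second-order split gain.
        gain = left_g**2 / left_h + right_g**2 / right_h - total_g**2 / total_h
        if gain > best_gain:
            best_left, best_gain = set(ordered[: i + 1]), gain
    return best_left, best_gain


# Toy data: categories "a" and "c" have high targets, "b" and "d" low ones.
cats = ["a", "b", "c", "d", "a", "b", "c", "d"]
targets = [10.0, 1.0, 11.0, 0.0, 9.0, 2.0, 12.0, 1.0]
mean = sum(targets) / len(targets)
grads = [mean - y for y in targets]  # squared-error gradient at the mean prediction
hess = [1.0] * len(targets)          # squared-error hessian is constant

left, gain = best_categorical_split(cats, grads, hess)
print(sorted(left), gain)
```

On this toy data the best partition groups "a" with "c" against "b" with "d". A single numeric threshold on an ordinal encoding a=0, b=1, c=2, d=3 cannot isolate that grouping.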

Practicalities

This PR is still WIP. Functionality-wise it seems to work, but it requires a bit of cleaning up, documenting, more extensive unit tests, etc. before we merge. Before going into this, I wanted to first check whether you guys agree with the approach or maybe have some other ideas.
As an alternative to passing the categorical_covariates when fitting the model (as is currently implemented), we could make these attributes of the TimeSeries class (e.g., something like categorical_components and categorical_static_covariates). In fact, I played around with this a bit, but I feel it would be a less robust and less user-friendly solution.

@hrzn @dennisbader could you have a look and share your thoughts? :)

codecov-commenter · 2023-02-21T12:19:43Z

Codecov Report

Patch coverage: 95.91% and project coverage change: -0.08 ⚠️

Comparison is base (28d3e2a) 94.14% compared to head (0836ff2) 94.07%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1585      +/-   ##
==========================================
- Coverage   94.14%   94.07%   -0.08%     
==========================================
  Files         125      125              
  Lines       11318    11350      +32     
==========================================
+ Hits        10655    10677      +22     
- Misses        663      673      +10

Impacted Files	Coverage Δ
darts/timeseries.py	`92.30% <ø> (-0.22%)`	⬇️
darts/models/forecasting/regression_model.py	`97.11% <95.34%> (-0.24%)`	⬇️
darts/models/forecasting/lgbm.py	`100.00% <100.00%> (ø)`
darts/utils/utils.py	`90.51% <100.00%> (+0.28%)`	⬆️

... and 8 files with indirect coverage changes

dennisbader

Hey @rijkvandermeulen and thanks for giving this a go! :) Looks like a good start already 🚀

I haven't fully reviewed, but want to address some points already:

I would opt for moving the categorical variable definition out of fit() and into the model constructor (__init__()).
There is a bit of code duplication which we could easily reduce; I added comments where I see this (as you mentioned in Practicalities).

darts/models/forecasting/lgbm.py

darts/models/forecasting/regression_model.py

darts/tests/models/forecasting/test_regression_models.py

rijkvandermeulen · 2023-02-28T11:21:33Z

@dennisbader Thanks for the review and the valuable suggestions! I think I've covered all of your remarks; could you do a second round of review? :)

rijkvandermeulen · 2023-02-28T12:08:00Z

BTW, the pipeline seems to be broken by a unit test (test_stationarity_tests) unrelated to this PR.

dennisbader · 2023-03-05T16:00:23Z

> @dennisbader Thanks for the review and the valuable suggestions! I think I've covered all of your remarks; could you do a second round of review? :)

Hi @rijkvandermeulen and thanks for the updates! I'll review next week.

dennisbader

Thank you @rijkvandermeulen for the updates, looks really good! 🚀

We discussed the categorical covariates support internally, and think it would be nice to add a new class inheriting from RegressionModel which implements and governs the categorical support.

With this we could easily add CatBoost (maybe including casting float to int in the model) and XGBoost (as soon as it fully supports categoricals).

Thanks as well for the unit tests 👍 I proposed using a different test for testing the model performance to keep it more light-weight (I added the code in the comments). The proposal relies on one of our existing examples, see section 6 in this example. This one only checks if categorical static covariates improve the predictions. In theory, if it works for static covariates, it should translate well to future/past covariates.

Let me know if you have any questions about the new class-based approach to categorical support, or anything else; I'll gladly help.

darts/models/forecasting/lgbm.py

dennisbader · 2023-03-09T09:55:04Z

darts/models/forecasting/lgbm.py

@@ -163,6 +183,43 @@ def fit(
            Additional kwargs passed to `lightgbm.LGBRegressor.fit()`
        """

+        # Validate that categorical covariates of the model are a subset of all covariates


We could make this a private method (like _check_categorical_covariates) of the new base class mentioned in the earlier comment, so we can later reuse it for the other models that support categorical covariates.

This method would always be called by all models inheriting from the new "base" class
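For illustration, such a subset check could look roughly like the sketch below. It is written as a standalone function rather than a method, and the name `check_categorical_covariates`, its signature, and the error messages are assumptions made for this example, not the code from the PR.

```python
from typing import Optional, Sequence


def check_categorical_covariates(
    categorical: Optional[Sequence[str]],
    available: Optional[Sequence[str]],
    kind: str,
) -> None:
    """Raise ValueError unless every declared categorical covariate exists.

    `categorical` holds the names declared on the model, `available` the
    component names of the covariates passed to fit(), and `kind` is a label
    ("past", "future" or "static") used only in the error message.
    """
    if not categorical:
        return  # nothing declared, nothing to validate
    if available is None:
        raise ValueError(
            f"Categorical {kind} covariates {list(categorical)} were declared, "
            f"but no {kind} covariates were passed to fit()."
        )
    known = set(available)
    missing = [name for name in categorical if name not in known]
    if missing:
        raise ValueError(
            f"Categorical {kind} covariates {missing} are not among the "
            f"{kind} covariates {list(available)}."
        )


# Passes silently: "weekday" is one of the supplied future covariates.
check_categorical_covariates(["weekday"], ["weekday", "temperature"], "future")
```

Calling it with a name that is not among the supplied covariates raises a ValueError that lists the offending names.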

dennisbader · 2023-03-09T10:01:50Z

darts/models/forecasting/lgbm.py

+            max_samples_per_ts,
+        )
+
+        cat_cols_indices, _ = self._get_categorical_features(


In the new "base" class you can override _fit_model() and avoid having the same logic in two places.

something like below:

    def _fit_model(..., **kwargs):
        cat_cols_indices, _ = self._get_categorical_features(...)
        kwargs["categorical_feature"] = cat_cols_indices
        super()._fit_model(..., **kwargs)

A mapping for getting the correct parameter name per model would let us provide the categorical features dynamically,

i.e. "cat_features" for CatBoost, "categorical_feature" for LightGBM:

    self.categorical_fit_param_name = "categorical_feature"
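A minimal, self-contained sketch of this override pattern is shown below. All class names (`ToyRegressionModel`, `ToyCategoricalModel`, `RecordingEstimator`, and so on) and the feature names are hypothetical stand-ins, not Darts classes, and the real mapping from covariate names to lagged feature columns is more involved. The keyword names are the ones the libraries' fit() methods accept: `categorical_feature` for LightGBM and `cat_features` for CatBoost.

```python
from typing import Any, Dict, List, Optional, Tuple


class RecordingEstimator:
    """Stand-in for a real regressor; it only records the fit() kwargs."""

    def __init__(self) -> None:
        self.fit_kwargs: Dict[str, Any] = {}

    def fit(self, X, y, **kwargs) -> None:
        self.fit_kwargs = kwargs


class ToyRegressionModel:
    """Hypothetical parent class that simply forwards to the estimator."""

    def __init__(self, estimator: RecordingEstimator) -> None:
        self.model = estimator

    def _fit_model(self, X, y, **kwargs) -> None:
        self.model.fit(X, y, **kwargs)


class ToyCategoricalModel(ToyRegressionModel):
    # Subclasses set this to the keyword their estimator's fit() expects.
    categorical_fit_param_name: str = ""

    def __init__(
        self,
        estimator: RecordingEstimator,
        feature_names: List[str],
        categorical: Optional[List[str]] = None,
    ) -> None:
        super().__init__(estimator)
        self.feature_names = feature_names
        self.categorical = categorical or []

    def _get_categorical_features(self) -> Tuple[List[int], List[str]]:
        indices = [
            i for i, name in enumerate(self.feature_names) if name in self.categorical
        ]
        return indices, [self.feature_names[i] for i in indices]

    def _fit_model(self, X, y, **kwargs) -> None:
        # Inject the categorical column indices once, then reuse the parent logic.
        cat_cols_indices, _ = self._get_categorical_features()
        if cat_cols_indices:
            kwargs[self.categorical_fit_param_name] = cat_cols_indices
        super()._fit_model(X, y, **kwargs)


class ToyLightGBMModel(ToyCategoricalModel):
    categorical_fit_param_name = "categorical_feature"


class ToyCatBoostModel(ToyCategoricalModel):
    categorical_fit_param_name = "cat_features"


features = ["target_lag-1", "weekday_lag0", "store_id"]  # made-up column names
lgbm = ToyLightGBMModel(RecordingEstimator(), features, ["weekday_lag0", "store_id"])
lgbm._fit_model(X=[[1.0, 2.0, 7.0]], y=[3.0])
print(lgbm.model.fit_kwargs)
```

With this layout, supporting another library only requires a subclass that sets `categorical_fit_param_name`; the index lookup and the kwargs injection live in one place.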

dennisbader · 2023-03-09T10:03:59Z

darts/models/forecasting/regression_model.py

+        2. Get the indices of the categorical features in the list of features.
+        """
+
+        assert isinstance(self, SupportsCategoricalCovariates), (


in the new "base" class we could drop this check and SupportsCategoricalCovariates

darts/tests/models/forecasting/test_regression_models.py

dennisbader · 2023-03-09T12:33:11Z

darts/models/forecasting/lgbm.py

@@ -34,6 +34,9 @@ def __init__(
        quantiles: List[float] = None,
        random_state: Optional[int] = None,
        multi_models: Optional[bool] = True,
+        categorical_past_covariates: Optional[List[str]] = None,
+        categorical_future_covariates: Optional[List[str]] = None,


could we also allow single strings?
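One way to accept both forms is to normalize the argument once in the constructor. The helper below is a small sketch; the name `to_name_list` is hypothetical and not taken from the PR.

```python
from typing import List, Optional, Sequence, Union


def to_name_list(value: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Normalize None, a single string, or a sequence of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is one covariate name, not an iterable of characters.
        return [value]
    return list(value)


print(to_name_list("weekday"))
print(to_name_list(["weekday", "store_id"]))
print(to_name_list(None))
```

The explicit `isinstance(value, str)` branch matters because calling `list("weekday")` would split the name into single characters.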

rijkvandermeulen · 2023-03-12T17:23:55Z

Hey @dennisbader,

Great suggestion of creating a new "base" class, thanks! I gave this a go - looking forward to hearing what you think :)

Some points for discussion still might be:

In order to make this work for CatBoost we have to cast categorical covs from float to int. Maybe I'm missing something, but I don't think this is trivial with numpy (I guess we would have to work with structured arrays or the like). So for now I took the approach of converting X to a pandas df and doing the conversion there. It works, but I'm not sure whether it's the best approach. Any better ideas?
Related question: I'm not sure whether RegressionModelWithCategoricalCovariates in its current state is generic enough. For example, I'm not using XGBoost much, but judging from their API reference docs it seems that we would have to cast the categorical covs to a categorical type. How do you see this? Maybe make a default implementation of _cast_float_to_int (probably not the best naming anymore then ;)) in RegressionModelWithCategoricalCovariates and let each subclass override it where needed (e.g., XGBoost, once its categorical support is stable and we implement it later on)?
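To make the casting step concrete, here is a library-free sketch of what a default float-to-int conversion could do. Plain nested lists stand in for X, since a single float ndarray cannot hold int values in only some columns; the function name `cast_categorical_columns` and the choice to raise on non-integer values are assumptions for this example, not the PR's implementation.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def cast_categorical_columns(
    rows: Sequence[Sequence[float]], cat_cols_indices: Sequence[int]
) -> List[List[Number]]:
    """Return a copy of `rows` whose categorical columns hold ints."""
    cat_cols = set(cat_cols_indices)
    converted: List[List[Number]] = []
    for row in rows:
        new_row: List[Number] = []
        for col, value in enumerate(row):
            if col in cat_cols:
                # Refuse to silently truncate values such as 1.5 (or NaN).
                if not float(value).is_integer():
                    raise ValueError(
                        f"Column {col} holds non-integer value {value!r}."
                    )
                new_row.append(int(value))
            else:
                new_row.append(value)
        converted.append(new_row)
    return converted


X = [[1.0, 0.25], [3.0, 0.75]]
X_cast = cast_categorical_columns(X, cat_cols_indices=[0])
print(X_cast)
```

A subclass for a library with different requirements could replace this step with its own conversion while keeping the rest of the fitting logic unchanged.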

CHANGELOG.md

darts/models/forecasting/regression_model.py

rijkvandermeulen · 2023-03-23T08:45:48Z

Hey @dennisbader, I was wondering when you'd expect to have the time to take another look at the latest iteration of this PR? Thank you!

dennisbader · 2023-03-23T09:15:31Z

Hi @rijkvandermeulen, and sorry for the long wait. From mid-April on I will have dedicated time for Darts, which will make reviewing much quicker :).

I'll review the changes this weekend. The plan is to release the new Darts version around the end of March, and we definitely want to include this one!

dennisbader

Thanks again for all the updates @rijkvandermeulen 🚀 The new covariates class for RegressionModels looks great! We're very close to merging this one, and this should be the last iteration.

Apart from some minor suggestions, I believe we should exclude categorical support for CatBoost from this PR. The reason is that we should avoid having to convert from a numpy array to a pandas DataFrame, which can get quite slow for large arrays.

For now let's put this into our backlog until we come up with a robust solution (we're also thinking about how to improve categorical support in the TimeSeries itself).

Thanks to your work, adding support at a later time will be very easy! 👍 🚀 :)

darts/models/forecasting/regression_model.py

darts/models/forecasting/catboost_model.py

rijkvandermeulen · 2023-03-27T07:02:06Z

Thanks for the feedback @dennisbader. Completely agree with the suggested approach; I've made the necessary updates in the code. Let me know if you need another small tweak or if we're all set to merge :)

Great to hear that you're planning to do a new release around the end of March and that this functionality will be included; looking forward to using it. Thx a lot for your help and support on this PR!

dennisbader

Hi @rijkvandermeulen, and thanks again for the updates.

Could you apply the suggestion that I mentioned last time? (I added a comment again.)
Also, it seems there are some merge conflicts; could you resolve them?

Thanks a lot! 🚀

darts/models/forecasting/regression_model.py

rijkvandermeulen · 2023-03-27T16:25:29Z

@dennisbader apologies; must have missed that comment earlier. I've committed your suggestion and also resolved the merge conflicts.

darts/models/forecasting/regression_model.py

dennisbader

LGTM! Thanks a lot @rijkvandermeulen 🚀

…unit8co#1585)

* unit8co#1580 exploration
* unit8co#1580 added cat_components to TimeSeries
* unit8co#1580 _fit_model method LightGBM
* unit8co#1580 included static covs in dummy unit test
* unit8co#1580 integration with lgbm
* unit8co#1580 helper func to method in RegressionModel
* unit8co#1580 different approach; pass categorical covs to fit method of lgbm directly instead of attributes TimeSeries object
* unit8co#1580 added few unit tests
* unit8co#1580 small stuff
* unit8co#1580 move categorical covs to model constructor
* unit8co#1580 avoid code duplication in unit tests
* unit8co#1580 add unit test on forecast quality with cat covs
* unit8co#1580 add column names check in _get_categorical_covs helper
* unit8co#1580 docstrings lgbm
* unit8co#1580 add changelog entry
* unit8co#1580 change check if ts has static cov
* unit8co#1580 implemented RegressionModelWithCategoricalCovariates class
* unit8co#1580 delete redundant test
* unit8co#1580 replace test_quality_forecast_with_categorical_covariates unit test by suggestion Dennis
* unit8co#1580 adjustment error messages validation method
* unit8co#1580 adding categorical feature support for CatBoost
* unit8co#1580 remove cat support CatBoost and smaller comments Dennis
* unit8co#1580 finalizing
* unit8co#1580 use parent _fit_model method
* avoid creating lagged data twice
* remove empty lines

---------

Co-authored-by: Rijk van der Meulen <[email protected]>
Co-authored-by: madtoinou <[email protected]>
Co-authored-by: Dennis Bader <[email protected]>

Rijk van der Meulen added 9 commits February 20, 2023 10:20

unit8co#1580 exploration

ec8fe6d

unit8co#1580 added cat_components to TimeSeries

5ee855d

unit8co#1580 _fit_model method LightGBM

149b2c7

unit8co#1580 included static covs in dummy unit test

b02e8b1

unit8co#1580 integration with lgbm

948be36

unit8co#1580 helper func to method in RegressionModel

3c2fee2

unit8co#1580 different approach; pass categorical covs to fit method of lgbm directly instead of attributes TimeSeries object

c3d642f

unit8co#1580 added few unit tests

5679eeb

unit8co#1580 small stuff

ef7fcf8

rijkvandermeulen requested review from hrzn and dennisbader as code owners February 21, 2023 11:45

Merge branch 'master' into feature/use_model_native_way_cat_features

5a5a09f

dennisbader requested changes Feb 26, 2023

Rijk van der Meulen added 8 commits February 27, 2023 15:49

unit8co#1580 move categorical covs to model constructor

4c5b140

unit8co#1580 avoid code duplication in unit tests

f6b25fc

unit8co#1580 add unit test on forecast quality with cat covs

e7cde27

unit8co#1580 add column names check in _get_categorical_covs helper

d8aa69f

unit8co#1580 docstrings lgbm

5be4f4c

unit8co#1580 add changelog entry

dc9ceeb

Merge branch 'feature/use_model_native_way_cat_features' of https://github.com/rijkvandermeulen/darts into feature/use_model_native_way_cat_features

713a850

unit8co#1580 change check if ts has static cov

165d1bc

rijkvandermeulen changed the title ~~[WIP] Functionality to let LightGBM effectively handle categorical features~~ Functionality to let LightGBM effectively handle categorical features Feb 28, 2023

Merge branch 'master' into feature/use_model_native_way_cat_features

d02d3a0

dennisbader mentioned this pull request Mar 2, 2023

[BUG]-Multiple timeseries(Global models) fails for Random Forest and Regression Models #1605

Closed

Merge branch 'master' into feature/use_model_native_way_cat_features

95bf521

dennisbader requested changes Mar 9, 2023

Rijk van der Meulen added 5 commits March 12, 2023 14:58

unit8co#1580 implemented RegressionModelWithCategoricalCovariates class

9df90ae

unit8co#1580 delete redundant test

36e56de

unit8co#1580 replace test_quality_forecast_with_categorical_covariates unit test by suggestion Dennis

e85bad2

unit8co#1580 adjustment error messages validation method

9ba3190

unit8co#1580 adding categorical feature support for CatBoost

5f2535b

rijkvandermeulen commented Mar 12, 2023

CHANGELOG.md

rijkvandermeulen commented Mar 12, 2023

darts/models/forecasting/regression_model.py

dennisbader requested changes Mar 26, 2023

Rijk van der Meulen added 3 commits March 27, 2023 08:15

unit8co#1580 remove cat support CatBoost and smaller comments Dennis

ae1d4df

unit8co#1580 finalizing

7cb8c72

Merge branch 'master' into feature/use_model_native_way_cat_features

20073fe

dennisbader requested changes Mar 27, 2023

darts/models/forecasting/regression_model.py

Rijk van der Meulen added 2 commits March 27, 2023 17:46

unit8co#1580 use parent _fit_model method

6eb4ed4

Merge branch 'master' into feature/use_model_native_way_cat_features

5dc1341

# Conflicts: # darts/models/forecasting/regression_model.py

dennisbader reviewed Mar 27, 2023

darts/models/forecasting/regression_model.py

avoid creating lagged data twice

fc41cd8

dennisbader reviewed Mar 27, 2023

darts/models/forecasting/regression_model.py

remove empty lines

0836ff2

dennisbader approved these changes Mar 28, 2023

dennisbader merged commit de94ef4 into unit8co:master Mar 28, 2023

dennisbader mentioned this pull request Mar 31, 2023

fixes lgb warning when not using any categorical features #1681

Merged
