Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functionality to let LightGBM effectively handle categorical features #1585

Conversation

rijkvandermeulen
Copy link
Contributor

Fixes #1580.

Summary

  • This PR allows the user to specify (when fitting the model) which of the covariates should be treated as categorical features by the underlying LightGBMRegressor. Supports categorical past-, fut-, and static covariates.
  • A similar approach can be taken for the other RegressionModels that use gradient boosting on decision trees (XGBoost and CatBoost). They would require a bit more effort though. CatBoost, for example, does not allow passing floating point numbers as categorical features. Hence, we cannot simply pass X as an ndarray as we currently do. We would have to have a step in between to preprocess the training_samples in a format that the model accepts. Would you like to this as part of this PR or create a separate ticket?

Why have this?

For categorical features (especially with high cardinality), the LightGBM native way of handling categorical features works better than using one-hot (or any other type of) encoding. In my experience, this can make quite a big impact in practice --> better performance and faster training. For more info see: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features

Practicalities

  • This PR is still WIP. Functionality wise it seems to work, but it requires a bit of cleaning up, documenting, more extensive unit tests etc before we merge. Before going into this, I wanted to first check whether you guys agree with the approach or maybe have some other ideas?
  • An alternative to passing the categorical_covariates when fitting the model (as is currently implemented) we could also make these attributes of the TimeSeries class (e.g., something like categorical_components and categorical_static_covariates). In fact, I played around with this a bit, but I feel like it would be a less robust and user-friendly solution.

@hrzn @dennisbader could you have a look and share your thoughts? :)

@codecov-commenter
Copy link

codecov-commenter commented Feb 21, 2023

Codecov Report

Patch coverage: 95.91% and project coverage change: -0.08 ⚠️

Comparison is base (28d3e2a) 94.14% compared to head (0836ff2) 94.07%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1585      +/-   ##
==========================================
- Coverage   94.14%   94.07%   -0.08%     
==========================================
  Files         125      125              
  Lines       11318    11350      +32     
==========================================
+ Hits        10655    10677      +22     
- Misses        663      673      +10     
Impacted Files Coverage Δ
darts/timeseries.py 92.30% <ø> (-0.22%) ⬇️
darts/models/forecasting/regression_model.py 97.11% <95.34%> (-0.24%) ⬇️
darts/models/forecasting/lgbm.py 100.00% <100.00%> (ø)
darts/utils/utils.py 90.51% <100.00%> (+0.28%) ⬆️

... and 8 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@dennisbader dennisbader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rijkvandermeulen and thanks for giving this a go! :) Looks like a good start already 🚀

I haven't fully reviewed, but want to address some points already:

  • I would opt for moving the categorical variable definition out of fit() and into the model constructor (__init__()).
  • there is a bit of code duplication which we could easily reduce, I added comments where I see this (as you mentionned in practicalities)

@rijkvandermeulen
Copy link
Contributor Author

@dennisbader Thanks for the review and the valuable suggestions! I think I've covered all of your remarks; could you do a second round of review? :)

@rijkvandermeulen rijkvandermeulen changed the title [WIP] Functionality to let LightGBM effectively handle categorical features Functionality to let LightGBM effectively handle categorical features Feb 28, 2023
@rijkvandermeulen
Copy link
Contributor Author

BTW, pipeline seems to be broken by an unit test (test_stationarity_tests) unrelated to this PR.

@dennisbader
Copy link
Collaborator

@dennisbader Thanks for the review and the valuable suggestions! I think I've covered all of your remarks; could you do a second round of review? :)

Hi @rijkvandermeulen and thanks for updates! I'll review next week.

Copy link
Collaborator

@dennisbader dennisbader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @rijkvandermeulen for the updates, looks really good! 🚀

We discussed the categorical covariates support internally, and think it would be nice to add a new class inheriting from RegressionModel which implements and governs the categorical support.

With this we could easily add CatBoost (including casting float to int in the model maybe) and XGBoost (as soon fully supporting categoricals).

Thanks as well for the unit tests 👍 I proposed using a different test for testing the model performance to keep it more light-weight (I added the code in the comments). The proposal relies on one of our existing examples, see section 6 in this example. This one only checks if categorical static covariates improve the predictions. In theory, if it works for static covariates, it should translate well to future/past covariates.

Let me know if you have any questions about adding the new Class-based categorical support approach or others, I'll glady help.

@@ -163,6 +183,43 @@ def fit(
Additional kwargs passed to `lightgbm.LGBRegressor.fit()`
"""

# Validate that categorical covariates of the model are a subset of all covariates
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make this a private method (like _check_categorical_covariates) of this new base class mentioned in the earlier comment, so we can later reuse it for the other models support categorical covariates.

This method would always be called by all models inheriting from the new "base" class

max_samples_per_ts,
)

cat_cols_indices, _ = self._get_categorical_features(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the new "base" class you can override _fit_model() and avoid have the same logic in two places.

something like below:

def _fit_model(..., **kwargs):
    cat_cols_indices, _ = self._get_catgorical_features(...)
    kwargs["categorical_feature"] = cat_col_indices
    super()._fit_model(..., **kwargs)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A mapping for getting the correct parameter name per model could allow to dynamically provide the categorical features.

i.e. "cat_features" for CatBoost, "categorical_features" for LightGBM

self.categorical_fit_param_name = "categorical_features"

2. Get the indices of the categorical features in the list of features.
"""

assert isinstance(self, SupportsCategoricalCovariates), (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the new "base" class we could drop this check and SupportsCategoricalCovariates

@@ -34,6 +34,9 @@ def __init__(
quantiles: List[float] = None,
random_state: Optional[int] = None,
multi_models: Optional[bool] = True,
categorical_past_covariates: Optional[List[str]] = None,
categorical_future_covariates: Optional[List[str]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we also allow single strings?

@rijkvandermeulen
Copy link
Contributor Author

Hey @dennisbader,

Great suggestion of creating a new "base" class, thanks! I gave this a go - looking forward to hearing what you think :)

Some points for discussion still might be:

  • In order to make this work for CatBoost we have to cast categorical covs from float to int. Maybe I'm missing something, but I don't think this is very trivial with numpy (guess we would have to work with strucutred arrays or the like). So for now I took the approach of converting X to a pandas df and do the conversion there. It works, but I'm not sure whether it's the best approach. Any better ideas?
  • Related question; I'm not sure whether the current RegressionModelWithCatecoricalCovariates in its current state is generic enough. For example, I'm not using XGBoost much, but judging from their API reference docs it seems that we would have to cast the categorical covs to categorical type. How do you see this? Maybe make a default implementation of _cast_float_to_int (probably not the best naming anymore then ;)) in RegressionModelWithCatecoricalCovariates and have this method potentially overwritten by each subclass (e.g., XGBoost once we implement this later on once stable)?

@rijkvandermeulen
Copy link
Contributor Author

Hey @dennisbader, I was wondering when you'd expect to have the time to take another look at the latest iteration of this PR? Thank you!

@dennisbader
Copy link
Collaborator

Ho @rijkvandermeulen, and sorry for the long waiting. From mid April on I have dedicated time for Darts, which will make reviewing much quicker :).

I'll review the changes this weekend. Plan is to release the new Darts version around the end of March and we definitely want to include this one!

Copy link
Collaborator

@dennisbader dennisbader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for all the updates @rijkvandermeulen 🚀 The new covariates class for RegressionModels looks great! We're very close to merge this one, and should be the last iteration.

Apart from some minor suggestion, I believe we should exclude categorical support for Catboost from this PR. Reason behind it is that we should avoid having to convert from numpy array to pandas DataFrame. This can get quite slow for large arrays.

For now let's put this into our backlog until we come up with a robust solution (we're also thinking about how to improve categorical support in the TimeSeries itself).

Thanks to your work, adding support at a later time will be very easy! 👍 🚀 :)

@rijkvandermeulen
Copy link
Contributor Author

Thanks for the feedback @dennisbader. Completely agree with the suggested approach; I've made the necessary updates in the code. Let me know if you need another small tweak or if we're all set to merge :)

Great to hear that you're planning to do a new release around the end of March and that this functionality will be included; looking forward to using it. Thx a lot for your help and support on this PR!

Copy link
Collaborator

@dennisbader dennisbader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rijkvandermeulen, and thanks again for the updates.

Could you apply the suggestion that I mentioned last time (I added a comment again).
Also it seems there are some merge conflicts, could you resolve them?

Thanks a lot! 🚀

Rijk van der Meulen added 2 commits March 27, 2023 17:46
@rijkvandermeulen
Copy link
Contributor Author

rijkvandermeulen commented Mar 27, 2023

@dennisbader apologies; must have missed that comment earlier. I've committed your suggestion and also resolved the merge conflicts.

Copy link
Collaborator

@dennisbader dennisbader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot @rijkvandermeulen 🚀

@dennisbader dennisbader merged commit de94ef4 into unit8co:master Mar 28, 2023
alexcolpitts96 pushed a commit to alexcolpitts96/darts that referenced this pull request May 31, 2023
…unit8co#1585)

* unit8co#1580 exploration

* unit8co#1580 added cat_components to TimeSeries

* unit8co#1580 _fit_model method LightGBM

* unit8co#1580 included static covs in dummy unit test

* unit8co#1580 integration with lgbm

* unit8co#1580 helper func to method in RegressionModel

* unit8co#1580 different approach; pass categorical covs to fit method of lgbm directly instead of attributes TimeSeries object

* unit8co#1580 added few unit tests

* unit8co#1580 small stuff

* unit8co#1580 move categorical covs to model constructor

* unit8co#1580 avoid code duplication in unit tests

* unit8co#1580 add unit test on forecast quality with cat covs

* unit8co#1580 add column names check in _get_categorical_covs helper

* unit8co#1580 docstrings lgbm

* unit8co#1580 add changelog entry

* unit8co#1580 change check if ts has static cov

* unit8co#1580 implemented RegressionModelWithCategoricalCovariates class

* unit8co#1580 delete redundant test

* unit8co#1580 replace test_quality_forecast_with_categorical_covariates unit test by suggestion Dennis

* unit8co#1580 adjustment error messages validation method

* unit8co#1580 adding categorical feature support for CatBoost

* unit8co#1580 remove cat support CatBoost and smaller comments Dennis

* unit8co#1580 finalizing

* unit8co#1580 use parent _fit_model method

* avoid creating lagged data twice

* remove empty lines

---------

Co-authored-by: Rijk van der Meulen <[email protected]>
Co-authored-by: madtoinou <[email protected]>
Co-authored-by: Dennis Bader <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for usage of native way of dealing with categorical features RegressionModels
4 participants