-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Functionality to let LightGBM effectively handle categorical features #1585
Functionality to let LightGBM effectively handle categorical features #1585
Conversation
…of lgbm directly instead of attributes TimeSeries object
Codecov ReportPatch coverage:
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more Additional details and impacted files@@ Coverage Diff @@
## master #1585 +/- ##
==========================================
- Coverage 94.14% 94.07% -0.08%
==========================================
Files 125 125
Lines 11318 11350 +32
==========================================
+ Hits 10655 10677 +22
- Misses 663 673 +10
... and 8 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @rijkvandermeulen and thanks for giving this a go! :) Looks like a good start already 🚀
I haven't fully reviewed, but want to address some points already:
- I would opt for moving the categorical variable definition out of
fit()
and into the model constructor (__init__()
). - there is a bit of code duplication which we could easily reduce, I added comments where I see this (as you mentionned in practicalities)
…ithub.com/rijkvandermeulen/darts into feature/use_model_native_way_cat_features
@dennisbader Thanks for the review and the valuable suggestions! I think I've covered all of your remarks; could you do a second round of review? :) |
BTW, pipeline seems to be broken by an unit test ( |
Hi @rijkvandermeulen and thanks for updates! I'll review next week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rijkvandermeulen for the updates, looks really good! 🚀
We discussed the categorical covariates support internally, and think it would be nice to add a new class inheriting from RegressionModel
which implements and governs the categorical support.
With this we could easily add CatBoost (including casting float to int in the model maybe) and XGBoost (as soon fully supporting categoricals).
Thanks as well for the unit tests 👍 I proposed using a different test for testing the model performance to keep it more light-weight (I added the code in the comments). The proposal relies on one of our existing examples, see section 6 in this example. This one only checks if categorical static covariates improve the predictions. In theory, if it works for static covariates, it should translate well to future/past covariates.
Let me know if you have any questions about adding the new Class-based categorical support approach or others, I'll glady help.
darts/models/forecasting/lgbm.py
Outdated
@@ -163,6 +183,43 @@ def fit( | |||
Additional kwargs passed to `lightgbm.LGBRegressor.fit()` | |||
""" | |||
|
|||
# Validate that categorical covariates of the model are a subset of all covariates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could make this a private method (like _check_categorical_covariates
) of this new base class mentioned in the earlier comment, so we can later reuse it for the other models support categorical covariates.
This method would always be called by all models inheriting from the new "base" class
darts/models/forecasting/lgbm.py
Outdated
max_samples_per_ts, | ||
) | ||
|
||
cat_cols_indices, _ = self._get_categorical_features( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the new "base" class you can override _fit_model()
and avoid have the same logic in two places.
something like below:
def _fit_model(..., **kwargs):
cat_cols_indices, _ = self._get_catgorical_features(...)
kwargs["categorical_feature"] = cat_col_indices
super()._fit_model(..., **kwargs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A mapping for getting the correct parameter name per model could allow to dynamically provide the categorical features.
i.e. "cat_features" for CatBoost, "categorical_features" for LightGBM
self.categorical_fit_param_name = "categorical_features"
2. Get the indices of the categorical features in the list of features. | ||
""" | ||
|
||
assert isinstance(self, SupportsCategoricalCovariates), ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the new "base" class we could drop this check and SupportsCategoricalCovariates
darts/models/forecasting/lgbm.py
Outdated
@@ -34,6 +34,9 @@ def __init__( | |||
quantiles: List[float] = None, | |||
random_state: Optional[int] = None, | |||
multi_models: Optional[bool] = True, | |||
categorical_past_covariates: Optional[List[str]] = None, | |||
categorical_future_covariates: Optional[List[str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we also allow single strings?
…s unit test by suggestion Dennis
Hey @dennisbader, Great suggestion of creating a new "base" class, thanks! I gave this a go - looking forward to hearing what you think :) Some points for discussion still might be:
|
Hey @dennisbader, I was wondering when you'd expect to have the time to take another look at the latest iteration of this PR? Thank you! |
Ho @rijkvandermeulen, and sorry for the long waiting. From mid April on I have dedicated time for Darts, which will make reviewing much quicker :). I'll review the changes this weekend. Plan is to release the new Darts version around the end of March and we definitely want to include this one! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for all the updates @rijkvandermeulen 🚀 The new covariates class for RegressionModels looks great! We're very close to merge this one, and should be the last iteration.
Apart from some minor suggestion, I believe we should exclude categorical support for Catboost from this PR. Reason behind it is that we should avoid having to convert from numpy array to pandas DataFrame. This can get quite slow for large arrays.
For now let's put this into our backlog until we come up with a robust solution (we're also thinking about how to improve categorical support in the TimeSeries itself).
Thanks to your work, adding support at a later time will be very easy! 👍 🚀 :)
Thanks for the feedback @dennisbader. Completely agree with the suggested approach; I've made the necessary updates in the code. Let me know if you need another small tweak or if we're all set to merge :) Great to hear that you're planning to do a new release around the end of March and that this functionality will be included; looking forward to using it. Thx a lot for your help and support on this PR! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rijkvandermeulen, and thanks again for the updates.
Could you apply the suggestion that I mentioned last time (I added a comment again).
Also it seems there are some merge conflicts, could you resolve them?
Thanks a lot! 🚀
# Conflicts: # darts/models/forecasting/regression_model.py
@dennisbader apologies; must have missed that comment earlier. I've committed your suggestion and also resolved the merge conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks a lot @rijkvandermeulen 🚀
…unit8co#1585) * unit8co#1580 exploration * unit8co#1580 added cat_components to TimeSeries * unit8co#1580 _fit_model method LightGBM * unit8co#1580 included static covs in dummy unit test * unit8co#1580 integration with lgbm * unit8co#1580 helper func to method in RegressionModel * unit8co#1580 different approach; pass categorical covs to fit method of lgbm directly instead of attributes TimeSeries object * unit8co#1580 added few unit tests * unit8co#1580 small stuff * unit8co#1580 move categorical covs to model constructor * unit8co#1580 avoid code duplication in unit tests * unit8co#1580 add unit test on forecast quality with cat covs * unit8co#1580 add column names check in _get_categorical_covs helper * unit8co#1580 docstrings lgbm * unit8co#1580 add changelog entry * unit8co#1580 change check if ts has static cov * unit8co#1580 implemented RegressionModelWithCategoricalCovariates class * unit8co#1580 delete redundant test * unit8co#1580 replace test_quality_forecast_with_categorical_covariates unit test by suggestion Dennis * unit8co#1580 adjustment error messages validation method * unit8co#1580 adding categorical feature support for CatBoost * unit8co#1580 remove cat support CatBoost and smaller comments Dennis * unit8co#1580 finalizing * unit8co#1580 use parent _fit_model method * avoid creating lagged data twice * remove empty lines --------- Co-authored-by: Rijk van der Meulen <[email protected]> Co-authored-by: madtoinou <[email protected]> Co-authored-by: Dennis Bader <[email protected]>
Fixes #1580.
Summary
LightGBMRegressor
. Supports categorical past-, fut-, and static covariates.Why have this?
For categorical features (especially with high cardinality), the LightGBM native way of handling categorical features works better than using one-hot (or any other type of) encoding. In my experience, this can make quite a big impact in practice --> better performance and faster training. For more info see: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features
Practicalities
TimeSeries
class (e.g., something likecategorical_components
andcategorical_static_covariates
). In fact, I played around with this a bit, but I feel like it would be a less robust and user-friendly solution.@hrzn @dennisbader could you have a look and share your thoughts? :)