categorical support design proposal #36

223 changes: 223 additions & 0 deletions steps/24_categorical_support/step.md
Member

Based on my understanding, this is the case:

  • design 1 requires users to specify which features are categorical.
  • design 3 requires users to encode externally.

So, a general question: is design 2 the only design that makes use of the feature_kind PR?

Author

I have clarified design 3 in #36 (comment). Please check it out.

@@ -0,0 +1,223 @@
# Adding categorical support to sktime

Contributors: @Abhay-Lejith

## Introduction

### Objectives
Allow users to pass exogenous data with categorical features to any forecaster. The outcome will depend on whether the forecaster natively supports categorical features or not.
Two broadly possible outcomes:
- error in fit, saying categorical features are not natively supported.
- internally encode and fit successfully.
Contributor

We do not always need to encode.

Author

If not natively supported, the above two are the only options, right?

Member

Between L10 and L11, is this not considered a valid option?

"Fit/predict by dropping/ignoring non-numerical features"

It's probably more useful than complete failure, and could possibly be combined with other designs, especially design 1, to increase the scope to all forecasters.

I did comment this earlier as well; I don't know what you think about this @Abhay-Lejith @fkiraly.

Author

Yes, it is a possible solution. I am updating the design document once again and will add this.


### Current status:
* Non-numeric dtypes are completely blocked off in datatype checks.
* https://github.com/sktime/sktime/pull/5886 - This PR removes these checks but has not been merged yet, as most estimators do not support categorical features and would break without informative error messages.
* https://github.com/sktime/sktime/pull/6490 - PR adding the feature_kind metadata

Contributor

I would suggest starting by outlining a conceptual model. Explain what kinds of (abstract) estimators there are, and what you want to model. Then proceed to example code that should run, trying to cover different cases.

Author

Wouldn't this 'conceptual model' heavily depend on which design we choose to go ahead with? Or do you want me to do this for all the proposed designs?

Explain what kind of (abstract) estimators there are, and what you want to model

I don't think any new estimators are being made; changes will have to be made to existing classes, and again, the change will depend on which design we go for.


### Models identified so far which support categorical natively:
- xgboost
Contributor

how does this compare to the list in
sktime/sktime#6109 (comment)
and subsequent discussion?

Author

This list is of models with native support. And I am currently looking only at categorical features in exog.

- lightgbm
- catboost

Note that all of these will have to be used via reduction.

### Sample data used in examples below
```python
import pandas as pd

from sktime.split import temporal_train_test_split

sample_data = {
    "target": [10, 20, 30, 20],
    "cat_feature": ["a", "b", "c", "b"],
    "numeric_feature": [1, 2, 3, 4],
}
sample_dataset = pd.DataFrame(data=sample_data)
y = sample_dataset["target"]
X = sample_dataset[["cat_feature", "numeric_feature"]]
# temporal_train_test_split returns y splits first, then X splits
y_train, y_test, X_train, X_test = temporal_train_test_split(y, X, test_size=0.5)
```

### Example 1: (Reduction case)
```python
from catboost import CatBoostRegressor

from sktime.forecasting.compose import make_reduction

regressor = CatBoostRegressor()
forecaster = make_reduction(regressor)
forecaster.fit(y_train, X=X_train)
y_pred = forecaster.predict(fh=[1, 2], X=X_test)
```

### Example 2:
```python
from sktime import XYZForecaster  # placeholder for any sktime forecaster

forecaster = XYZForecaster()
forecaster.fit(y_train, X=X_train)
y_pred = forecaster.predict(fh=[1, 2], X=X_test)
```

## Requirements
Contributor

Start with high-level requirements, e.g., what does the API need to do? Then proceed to deriving lower-level ones. Try to avoid mixing up requirements and solutions.

Author

I do not understand. Can you give me an example of a requirement?
I don't think any of the below points are solutions. I have listed the requirements below and then discussed solutions to each separately in the next section.

1. Need to know whether categorical data is present in X_train, and in which columns.
2. Need to know if the forecaster has native categorical support.
3. In reduction, we need to know if the model used has native categorical support.
4. In reduction, we need to pass parameters (like `enable_categorical=True` in the case of xgboost) to the model used, if natively supported, according to the requirements of the model.
Contributor

  1. is not a requirement but a solution. Try to separate these clearly. I do not see at least why this needs to be the case.

Author

It is a requirement. xgboost, for instance, 'requires' this parameter to be passed to it when using categorical features.

Contributor

I meant "specification requirement" - what are the requirements towards the piece of software that you will implement. That's a specific use of the word: https://en.wikipedia.org/wiki/Software_requirements

5. Must handle all combinations of cases
- estimator natively supports categorical: yes/no
- categorical feature specified by user: yes/no
- categorical present in data: yes/no

The important case is when categorical data is passed but is not natively supported.


## Solutions for some of the requirements:

### Requirement 1
Contributor

there are no requirements specified above, so it is a bit vague what this is solving.

Author

@fkiraly I have listed requirements above in L-41 to L-52 .

- https://github.com/sktime/sktime/pull/6490 - PR adding the feature_kind metadata

This PR adds a new field to the metadata called feature_kind. It is a list from which we can infer the dtype of the columns present in the data. Currently, a broad classification into 'float' and 'categorical' is used.
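To make this concrete, here is a minimal sketch of what feature_kind-style inference could look like. The function name and the 'float'/'categorical' labels are illustrative stand-ins, not the actual implementation in the linked PR:

```python
import pandas as pd

def infer_feature_kind(X: pd.DataFrame) -> list:
    """Classify each column as 'float' or 'categorical' from its dtype.

    Illustrative stand-in for the feature_kind metadata proposed in the
    linked PR; the real field names and values may differ.
    """
    kinds = []
    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            kinds.append("float")
        else:
            kinds.append("categorical")
    return kinds

X = pd.DataFrame({"cat_feature": ["a", "b", "c"], "numeric_feature": [1, 2, 3]})
print(infer_feature_kind(X))  # ['categorical', 'float']
```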

#### Sub-challenge: the user may want to treat a column with numerical values as a categorical feature (example: a column which contains ratings from 1-5)

This information cannot be detected using the feature_kind metadata. It has to be passed by the user in some manner.

- Possible solutions:
  - A `categorical_features` arg in the model, where the user can specify such columns.
Contributor

more solutions:

  • the user converts the data outside sktime, pandas does have this feature, for instance
  • a pipeline element that does that

Author

This line L-86 is the same, right?

  • not deal with this case and expect user to convert from numerical to categorical. We can make a transformer for this purpose.

  - Passing this info via `set_config`.
  - Not dealing with this case, and expecting the user to convert from numerical to categorical. We can make a transformer for this purpose.
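The "convert outside sktime" option mentioned in the comment above is directly supported by pandas; a minimal sketch:

```python
import pandas as pd

# A column with numeric values but categorical meaning (e.g. 1-5 ratings).
df = pd.DataFrame({"rating": [1, 5, 3, 5]})

# The user converts it before calling fit; feature_kind-style inference
# would then see a 'category' dtype rather than a numeric one.
df["rating"] = df["rating"].astype("category")
print(df["rating"].dtype)  # category
```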

### Req 2
We will use a tag `categorical_support` (=True/False), or similar, to identify whether a forecaster has native support.
Contributor

That would be a capability tag, so in line with most such tags it would be `capability:something`.
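A sketch of how such a capability tag could be consulted in the datatype checks. The tag name `capability:categorical_in_X` and the dummy class are assumptions for illustration, following sktime's `capability:<name>` naming pattern, not existing sktime names:

```python
# Hypothetical tag name following sktime's "capability:<name>" convention.
CATEGORICAL_TAG = "capability:categorical_in_X"

class DummyForecaster:
    """Minimal stand-in exposing a sktime-like get_tag interface."""
    _tags = {CATEGORICAL_TAG: False}

    def get_tag(self, tag_name, default=None):
        return self._tags.get(tag_name, default)

def check_categorical_allowed(forecaster, has_categorical: bool) -> None:
    """Raise in the datatype check if categorical data meets a non-supporting estimator."""
    supported = forecaster.get_tag(CATEGORICAL_TAG, False)
    if has_categorical and not supported:
        raise TypeError(
            f"{type(forecaster).__name__} does not natively support categorical "
            "features in X; encode them externally or use a supporting estimator."
        )
```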


### Req 3
- Set the tag depending on the model used in reduction.
- Since only a few models (3) have been identified with native support, a hard-coded method like an if/else statement checking the model could be used.
  - This would be more difficult to extend if new estimators with native support are found.
- Maintain some sort of list internally in sktime of estimators with native support, and check against it.
Contributor

further option: try/except with a minuscule dataset, can be combined with a "known" list
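The try/except probe suggested in the comment above could be combined with a known list roughly like this. The class names in the known list and the probe dataset are illustrative assumptions:

```python
import pandas as pd

# Illustrative known list; class-name matching is an assumption, a real
# implementation might match on module path or a registry instead.
KNOWN_NATIVE_SUPPORT = frozenset(
    {"XGBRegressor", "LGBMRegressor", "CatBoostRegressor"}
)

def probe_native_categorical(regressor) -> bool:
    """Heuristic: consult the known list first, else probe with a tiny fit."""
    if type(regressor).__name__ in KNOWN_NATIVE_SUPPORT:
        return True
    # Minuscule probe dataset with one categorical and one numeric column.
    X = pd.DataFrame({
        "cat": pd.Series(["a", "b", "a", "b"], dtype="category"),
        "num": [1.0, 2.0, 3.0, 4.0],
    })
    y = pd.Series([0.0, 1.0, 0.0, 1.0])
    try:
        regressor.fit(X, y)
        return True
    except (TypeError, ValueError):
        return False
```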

Contributor

sklearn also has tags, is there perhaps one for this?


### Req 4
- Should this be expected from the user, since they have to initialize the model anyway?
- Or, if we choose to pass arguments internally, then we will have to do something like this:
```
if model used is xgboost:
    xgboost-specific categorical specification
elif model used is catboost:
    catboost-specific categorical specification
elif ...:
    ... and so on
```
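The if/else chain above could equally be a lookup table, which is easier to extend when new natively supporting models are found. Only `enable_categorical=True` for xgboost comes from the text; the other entries and the error behaviour are illustrative assumptions:

```python
# Map regressor class name -> extra kwargs enabling categorical support.
# Entries other than xgboost's enable_categorical are assumptions here;
# consult each library's documentation for the authoritative parameters.
NATIVE_CATEGORICAL_KWARGS = {
    "XGBRegressor": {"enable_categorical": True},
    "LGBMRegressor": {},      # assumed: reads pandas 'category' dtype directly
    "CatBoostRegressor": {},  # assumed: takes cat_features at fit time instead
}

def categorical_kwargs_for(regressor) -> dict:
    """Return the extra kwargs for a given regressor, or raise if unknown."""
    name = type(regressor).__name__
    if name not in NATIVE_CATEGORICAL_KWARGS:
        raise ValueError(f"No native categorical support known for {name}")
    return NATIVE_CATEGORICAL_KWARGS[name]
```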

### Req 5
Discussed according to the different design options below.

---
---

### Design 1. Allow categorical features ONLY in those forecasters which support them natively. (no internal encoding)
Member

In design 1, is the idea to create XGBoostForecaster etc. similar to LightGBM in Darts, basically as a wrapper on top of make_reduction?

  • If yes
    • how do you propose to support for these specific regressors through reduction forecasters from skforecast and darts?
  • If no
    • can you please give more details?

Author

There are essentially two options: create this wrapper on top of make_reduction itself, or modify make_reduction to work with categorical features. I asked for your input on the wrapper idea in the umbrella issue and both of you were against it. So for now, the design here is based on modifying make_reduction to support it, based on the finalised solutions to req 3 and req 4.
I am also thinking about the wrapper class idea in a bit more detail and will try to add it here as well.


- A `categorical_features` parameter will be present only in forecasters that support categorical natively.
- For forecasters that do not support it natively, users can make a pipeline with an encoder of their choice and use categorical variables.

#### How tag is going to be used
- in datatype checks :
```
if categorical support tag is False:
    if a categorical feature is present:
        error - categorical features are not supported by this estimator
```

- In this case `categorical_features` will not be a parameter in models that do not support it, so (NO, YES, _) is not possible.

| **Estimator supports categorical natively** | **Cat feature specified by user in fit** | **Cat feature present in data** | **Outcome** |
|:-----------------:|:-------------------------:|:--------------:|:---------------:|
| NO | NO | NO | normal case, fits without error. |
| NO | NO | YES| informative error message to user that categorical feature is not supported by the forecaster |
| NO | YES| NO | X |
| NO | YES| YES| X |
| YES| NO | NO | normal case, fits without error. |
| YES| NO | YES| use categorical columns as detected by feature_kind and proceed to fit. |
| YES| YES| NO | informative error message that specified column was not present in data.|
| YES| YES| YES| use categorical columns as detected by feature_kind along with the user specified columns(union) and proceed to fit. |

- In the case where the specified column was not found, we could instead ignore that column and continue to fit, with a warning that the column was not found.

#### PROS:
- No need to add `categorical_feature` parameter to all models.

#### CONS:
- Fewer estimators support categorical features natively.


---
---

### Design 2. Perform some default encoding method if not natively supported.

- `categorical_feature` param will be present in all models.
Member

Thoughts: if this is available as a configuration instead of as an argument, will that be feasible? If so, will it help?

Member

@Abhay-Lejith @fkiraly any thoughts about this comment?


#### How tag is going to be used:
```
if cat support tag is False:
    encode cat features and continue
else:
    continue
```
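A minimal sketch of the "encode internally" branch, assuming label-style encoding via `pandas.factorize` as the fixed default (which default encoder to fix is an open design choice here):

```python
import pandas as pd

def encode_categoricals(X: pd.DataFrame, cat_columns) -> pd.DataFrame:
    """Label-encode the given columns; a stand-in for the fixed default encoder."""
    X = X.copy()
    for col in cat_columns:
        codes, _ = pd.factorize(X[col])  # stable integer codes per category
        X[col] = codes
    return X

X = pd.DataFrame({"cat_feature": ["a", "b", "c", "b"], "numeric_feature": [1, 2, 3, 4]})
X_enc = encode_categoricals(X, ["cat_feature"])
print(X_enc["cat_feature"].tolist())  # [0, 1, 2, 1]
```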

| **Estimator supports categorical natively** | **Cat feature specified by user in fit** | **Cat feature present in data** | **Outcome** |
|:-----------------:|:-------------------------:|:--------------:|:---------------:|
| NO | NO | NO | normal case, fits without error. |
| NO | NO | YES| encode features internally as detected by feature_kind and continue. |
| NO | YES| NO | error that specified column was not present in data |
| NO | YES| YES| use categorical features detected by feature kind and those specified by user(union) and proceed to encode and then fit |
| YES| NO | NO | normal case, fits without error.|
| YES| NO | YES| use categorical features detected by feature_kind and fit|
| YES| YES| NO | error that specified column was not present in data|
| YES| YES| YES| use categorical features detected by feature kind and those specified by user(union) and proceed|

- In the case where the specified column was not found, we could instead ignore that column and continue to fit, with a warning that the column was not found.

#### PROS:
- All forecasters will be able to accept categorical inputs.

#### CONS:
- Adding a `cat_features` parameter to all estimators is not desired.
- Not flexible in the choice of encoding (label, one-hot, etc.) if we fix a default. If users want a different type, they will have to resort to doing it externally anyway.

---
---

### Design 3. Same as design 2 but with NO `cat_features` parameter in the model
Instead, the user is required to convert the relevant numerical features to categorical. To ease the process, we can make a `transformer` for this conversion.

- no `cat_features` parameter in models.
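A minimal sketch of the proposed conversion transformer. The class name and interface are hypothetical; a real sktime transformer would subclass `BaseTransformer` rather than this bare class:

```python
import pandas as pd

class AsCategorical:
    """Hypothetical transformer: cast selected numeric columns to 'category' dtype.

    Sketch only; a real sktime transformer would subclass BaseTransformer
    and implement _fit/_transform.
    """

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        for col in self.columns:
            X[col] = X[col].astype("category")
        return X

X = pd.DataFrame({"rating": [1, 5, 3], "num": [0.1, 0.2, 0.3]})
Xt = AsCategorical(["rating"]).fit(X).transform(X)
print(Xt["rating"].dtype)  # category
```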

| **Estimator supports categorical natively** | **Cat feature present in data** | **Outcome** |
|:-------------------------------------------:|:-------------------------------:|:---------------:|
| NO | NO | normal case, fits without error. |
| NO | YES| encode features internally as detected by feature_kind and continue to fit. |
| YES| NO | normal case, fits without error.|
| YES| YES| use categorical features detected by feature_kind and fit|

#### PROS:
- All forecasters will be able to accept categorical inputs.

#### CONS:
- Not flexible in the choice of encoding (label, one-hot, etc.) if we fix a default. If users want a different type, they will have to resort to doing it externally anyway.
- Removing the parameter comes at the cost of requiring users to convert numerical to categorical themselves.
- If that is the case, then users can just convert categorical to numerical themselves and use the data; providing internal encoding is somewhat redundant.
Member

I'm quite confused about design 3. L188 and L204 mention that the user is required to own the encoding, but L195 and L203 suggest we are encoding with a fixed default encoding method. Can you please help me understand what exactly the plan is?

Author

@yarnabrina the encoding the user needs to do here is from 'numerical' to 'categorical', for the cases where they want to treat a numerical column as a categorical feature (because we cannot detect these via feature_kind). Internal encoding is from categorical to numerical and is done by a fixed default. Is it clear now?


---
---
### Design 4.
- Same as Design 2 or 3, but without fixing a default encoder.
- How much choice do we want to give the user?
  - Which encoders to use?
  - Which columns to encode?
  - Which encoder to use on which column?
- This is basically trying to provide the entire encoding ecosystem internally.
- Will be very difficult to implement. Where would the user specify all this data?
Member

Users can pass configuration for the forecaster to specify encoding choices, with a specific encoding method (say, one-hot encoding) being the default. It'll look like forecaster.set_config(encoding_configuration=...).

The values can be of 2 types:

  1. a single encoder instance -> use feature_kind to identify categorical features
  2. a dictionary with column names as keys and encoder instances as values

From my point of view, the pros are that this will not require modifying the signature of any forecaster, and users have full freedom of encoding choices if they so want. This can also be set on a pipeline instance and forwarded to all estimators in the pipeline, possibly to ensure all of them use the same encoding method (preferably the same encoder instance).

Requesting @Abhay-Lejith and @fkiraly to suggest cons.
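To illustrate the two value shapes in this proposal: the sketch below is entirely hypothetical (`encoding_configuration` is not an existing sktime config key, and the stub class and placeholder encoder strings stand in for real objects):

```python
# Hypothetical API: "encoding_configuration" is a proposed config key only.
class _ForecasterStub:
    """Stand-in exposing a sktime-like set_config interface."""

    def __init__(self):
        self._config = {}

    def set_config(self, **kwargs):
        self._config.update(kwargs)
        return self

forecaster = _ForecasterStub()

# Shape 1: a single encoder instance, applied to every column that
# feature_kind marks as categorical (encoder shown as a placeholder string).
forecaster.set_config(encoding_configuration="OneHotEncoder()")

# Shape 2: a per-column mapping from column name to encoder instance.
forecaster.set_config(encoding_configuration={
    "cat_feature": "OrdinalEncoder()",
    "other_cat": "OneHotEncoder()",
})
```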

Contributor @fkiraly (Jun 22, 2024)

My problem with this is that if we use set_config, the modelling choices here will not be tunable, or visible via get_params, which is the "big advantage" of the sklearn-like interface.

I would feel strongly that these choices - as they are modelling choices - should sit somewhere where get_params and set_params see them.

The most natural place would be existing pipelines, and I note that the above seems to replicate the functionality of ColumnEnsembleTransformer.

Member

Few clarification questions

  1. I am definitely unfamiliar with ColumnEnsembleTransformer in sktime; how is it different from ColumnTransformer (from sktime or from sklearn)?

  2. Can you please elaborate on how this transformer can be used for the orchestration requirement I mentioned earlier on Discord (use encoding with estimator-specific scope, so that we only do encoding where required and use native support wherever available)? Of course, "we are not giving that option to the user" is also an option; is that the plan?

  3. If you are suggesting that the encoding method (or methods, if we allow variation per column) needs to be tunable, which does make sense if we go by the pipeline approach, I think designs 2-3 are not suitable because they rely on a fixed encoder, and 1 is kind-of passable as it does not do encoding at all and relies on whatever encoding estimators do natively (@Abhay-Lejith please confirm).

Author @Abhay-Lejith (Jun 23, 2024)

About point 3, yes, you are right.
I did not have this 'tunable' requirement in mind until @fkiraly brought it up as a con of the set_config idea. So the fixed-default-encoder designs will not be tunable, since it is a 'fixed' default.

About point 2: this case of a pipeline where one estimator may have native support and another may not, and we wish to use the encoded features in one but the native support in the other, seems like a rather niche case to me. Why not just use the encoded features for all estimators in the pipeline? Also, just so I am clear on it, could you give an example of such a pipeline, with code?


- I don't think this is a good idea.
Member

Thanks for sharing your opinion, but just to understand better, can you please tell me whether you think it's a bad idea or a complex idea? These are not necessarily the same.

If it's the former, viz. a bad idea, can I request you to explain why? I've shared my reasoning with respect to the multi-step orchestration process above; I would like to know yours.

Contributor

For me, I do not understand what design 4 concretely is, so I cannot form an opinion, due to a lack of specificity and information.

What would help me a lot is if, in all cases, the scenario "data has categorical, estimator does not support" were concretely spelled out. How does the speculative code look, and what happens internally?

Author

@fkiraly I have mentioned (L-210) that it is the same as Design 2 or 3, but we are extending it by trying to provide more encoding options to the user.
So in the scenario "data has categorical, estimator does not support", it will encode the categorical features and fit.
I haven't gone into more detail of how it will look yet, as I was not sure how we would allow the user to specify all this data.

Author @Abhay-Lejith (Jun 22, 2024)

@yarnabrina, sorry about bluntly saying it's a bad idea without any explanation.
But my thought was that doing so much only to provide existing functionality under the hood seemed unnecessary to me. Also, I had not thought of the set_config idea you proposed, so I was also wondering how we would let the user specify so much data.
Edit: I was thinking of ColumnTransformer, but I see that sktime has its own version, ColumnEnsembleTransformer (as noted by @fkiraly as well).


---
---