Dependency housekeeping: sklearn>=1.2, numpy>2, pandas #72

Merged: 38 commits from adjust-for-sklearn-1.2.0 into master on Jul 25, 2024

Conversation

@ehudkr (Collaborator) commented on Jul 25, 2024

Adjust causallib and its tests for scikit-learn version >=1.2.0, and do additional housekeeping for the ever-evolving changes in causallib's other dependencies.

Starting with version 1.2.0, scikit-learn enforces that named inputs (i.e., dataframes) have column names of a single type.
Namely, you cannot fit an sklearn model on a dataframe that mixes an integer column name (e.g., 0) with a string column name (e.g., "x1").
This caused causallib to crash depending on the provided input (especially the inputs used in its tests).
This is because causallib often joins the covariates X with the treatment a (Standardization) and with propensity-based features (PropensityFeatureStandardization). If the name given to a and/or the propensity features is, for example, a string while X's columns are integers, sklearn would crash.
Note that non-string column names in dataframes seem to be an anti-pattern; therefore, 1) if the user provides X with string columns and a with a string name, no error should arise, and 2) the proposed solution tends to stringify columns/names.

The proposed solution is mostly a "safe join" in which the types of X's columns and a's name are evaluated. If X's columns are of mixed types, they are converted to strings. If the types of X.columns and a.name do not agree, there is a preference to convert them all to strings. After this name sanitization, they are concatenated.
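A minimal sketch of what such a safe join might look like (the function name `safe_join` and the exact stringification policy here are illustrative assumptions, not necessarily causallib's actual implementation):

```python
import pandas as pd


def safe_join(X: pd.DataFrame, a: pd.Series) -> pd.DataFrame:
    """Concatenate covariates X and treatment a with sklearn>=1.2-safe column names.

    If X's column names mix types, or their type disagrees with a's name,
    everything is converted to strings before concatenation.
    """
    name_types = {type(c) for c in X.columns}
    mixed_names = len(name_types) > 1
    name_mismatch = a.name is not None and type(a.name) not in name_types
    if mixed_names or name_mismatch:
        X = X.rename(columns=str)          # stringify covariate names
        if a.name is not None:
            a = a.rename(str(a.name))      # stringify the treatment name
    return pd.concat([X, a], axis=1)


# Integer covariate names with a string treatment name get unified to strings:
X = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})
a = pd.Series([0, 1], name="a")
print(safe_join(X, a).columns.tolist())  # ['0', '1', 'a']
```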

In addition to adjusting for scikit-learn>=1.2, there were further changes to accommodate numpy>=2 and pandas>=2.1.

The `normalize` kwarg is no longer supported and wasn't really necessary for the tests to pass. Therefore, it is removed rather than replaced with a `Pipeline` with a `StandardScaler` preprocessing step.
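For context, the removed kwarg and the pipeline-based alternative (the one deliberately not adopted here) would look roughly like this, as a hedged illustration:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pre-1.2 style, no longer supported:
#   model = LinearRegression(normalize=True)
# The documented replacement scales inputs in a preprocessing step instead:
model = make_pipeline(StandardScaler(), LinearRegression())
```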

Signed-off-by: Ehud-Karavani <[email protected]>
The use of `a.values` within the treatment encoder created a mismatch between passing named pandas objects and unnamed numpy arrays.

Signed-off-by: Ehud-Karavani <[email protected]>
Currently, this does not work.
Need to figure out what happens with the encoding when X's columns are numeric (it should not add a prefix, which makes them strings, but it also cannot be done numerically because the 0/1 treatment columns would clash with the 0/1 X columns).

Signed-off-by: Ehud-Karavani <[email protected]>
Instead, transform the resulting output to dense (dense is required to set it up as a dataframe).
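A hedged illustration of the idea, assuming a `OneHotEncoder`-style transformer (the variable names are made up for the example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

a = pd.Series([0, 1, 1, 0], name="a")
encoder = OneHotEncoder()                       # output may be a sparse matrix
encoded = encoder.fit_transform(a.to_frame())
if hasattr(encoded, "toarray"):                 # densify if needed
    encoded = encoded.toarray()
encoded = pd.DataFrame(encoded, index=a.index)  # dense is required for a DataFrame
```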

Signed-off-by: Ehud-Karavani <[email protected]>
If a wrong `feature_type` is provided, an explicit warning is raised stating which feature types are available.

Also corrects a typo in a docstring.
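A minimal sketch of that validation; the type names and message wording are purely illustrative assumptions:

```python
import warnings

AVAILABLE_FEATURE_TYPES = {"weight_vector", "propensity_vector", "propensity_matrix"}  # illustrative


def validate_feature_type(feature_type: str) -> None:
    # Warn explicitly instead of failing later with an opaque downstream error.
    if feature_type not in AVAILABLE_FEATURE_TYPES:
        warnings.warn(
            f"Unrecognized feature_type={feature_type!r}. "
            f"Available feature types are: {sorted(AVAILABLE_FEATURE_TYPES)}."
        )
```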

Signed-off-by: Ehud-Karavani <[email protected]>
Rename `a` as necessary in order to fit the type of `X.columns`.

Signed-off-by: Ehud-Karavani <[email protected]>
So sklearn>=1.2 doesn't break for `OneHotEncoder` with numpy input.

Signed-off-by: Ehud-Karavani <[email protected]>
Prefer making `a.name` a string and basing `X.columns` on that, rather than basing `a.name` on `X.columns`.
This is because `a.name` is also passed around as MultiIndex levels and therefore should be readable.

The current implementation is probably BAD.
The logic should really be simplified.

Signed-off-by: Ehud-Karavani <[email protected]>
Following scikit-learn raising an exception for mixed types:
https://github.com/scikit-learn/scikit-learn/blob/8133ecaacca77f06a8c4c560f5dbbfd654f1990f/sklearn/utils/validation.py#L2271-L2280
scikit-learn/scikit-learn@9f9f168

A dedicated Warning will make it easier to filter when testing, instead of making sure the data in all tests adheres to single-type column names.
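A sketch of that pattern: a dedicated warning class that the joining code emits and that tests can silence in one line (the class name is an illustrative assumption):

```python
import warnings


class MixedColumnNameTypesWarning(UserWarning):
    """Illustrative: emitted when column/name types get stringified for sklearn>=1.2."""


def warn_on_mixed_name_types(name_types):
    if len(name_types) > 1:
        warnings.warn(
            f"Mixed column-name types {name_types}; converting all names to str "
            "to satisfy scikit-learn>=1.2.",
            MixedColumnNameTypesWarning,
        )


# In tests, the single dedicated class is easy to filter out:
warnings.simplefilter("ignore", MixedColumnNameTypesWarning)
```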

Signed-off-by: Ehud-Karavani <[email protected]>
First, this is not Standardization-specific code, but rather maintenance code that is used by Standardization (which joins `a` to `X` as part of its function).
Second, this might be needed elsewhere in the package.

Therefore, it is best to move this functionality to some general utils.

Signed-off-by: Ehud-Karavani <[email protected]>

Make `a_name` optional

Signed-off-by: Ehud-Karavani <[email protected]>
Since the renaming of the columns will always be followed by a merge of the two Frames/Series, it is better to create a single function that does the safe merge once, rather than letting every call site sanitize the names and merge on its own.

Signed-off-by: Ehud-Karavani <[email protected]>
`Standardization` requires instantiating a fixed-value synthetic treatment assignment vector, whose name now must match the name of the treatment assignment vector seen during `fit`.
This ensures the Series' name is the same at both `fit` and `predict` time.
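Roughly, the counterfactual-prediction step builds a constant treatment Series that now has to carry the fit-time name (a hedged sketch with illustrative argument names, not the exact causallib code):

```python
import pandas as pd


def synthetic_treatment(X: pd.DataFrame, treatment_value, treatment_name) -> pd.Series:
    # The constant treatment-assignment vector used to predict a counterfactual outcome.
    # Its `name` must equal the name of `a` seen during `fit`, otherwise the joined
    # design matrix has a different column name at predict time than at fit time.
    return pd.Series(treatment_value, index=X.index, name=treatment_name)
```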

Signed-off-by: Ehud-Karavani <[email protected]>
This avoids a lot of clutter and makes the test output clearer and more intelligible.
The alternative, making the data in each test adhere to the sklearn restriction, is too big a change.

Signed-off-by: Ehud-Karavani <[email protected]>
Less applicable for `test_doubly_robust.py`, because that would require too many changes to the code to be worth it.

Signed-off-by: Ehud-Karavani <[email protected]>
scikit-learn moved from `"none"` to `None` for non-penalized logistic regression starting with version 1.2.0.
Adjust the tests using it to use either, depending on the currently installed sklearn version.
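The version-dependent choice presumably looks something like this (a sketch assuming the `packaging` package is available for the version comparison):

```python
import sklearn
from packaging.version import Version
from sklearn.linear_model import LogisticRegression

# scikit-learn >= 1.2 expects `penalty=None`; older versions expect the string "none".
no_penalty = None if Version(sklearn.__version__) >= Version("1.2") else "none"
learner = LogisticRegression(penalty=no_penalty, max_iter=1000)
```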

Signed-off-by: Ehud-Karavani <[email protected]>
Both check the same thing, one using unittest and the other using numpy testing.

Signed-off-by: Ehud-Karavani <[email protected]>
It appears scikit-learn forbids initializing its estimators with positional variable arguments.
So the dynamic CausalSearch model, which inherits from and initializes an unknown scikit-learn estimator, needs to adhere to that.
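A hedged sketch of the constraint, with a made-up wrapper factory standing in for the dynamic model:

```python
from sklearn.linear_model import LinearRegression


def wrap_estimator(estimator_cls):
    """Illustrative: dynamically subclass an arbitrary scikit-learn estimator.

    scikit-learn's parameter introspection raises an error when `__init__` accepts
    positional varargs (`*args`), so the dynamic subclass forwards keyword
    arguments only (explicit keyword parameters would be even friendlier to
    `get_params`/`clone`).
    """
    class Wrapped(estimator_cls):
        def __init__(self, **kwargs):  # no `*args`: positional varargs are forbidden
            super().__init__(**kwargs)

    return Wrapped


WrappedLR = wrap_estimator(LinearRegression)
model = WrappedLR(fit_intercept=True)  # constructed with keyword arguments only
```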

Signed-off-by: Ehud-Karavani <[email protected]>
Pandas may soon drop support for grouping on columns:
`FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.`
Fix it by transposing the dataframe before the groupby and then transposing the result back.
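For illustration, with a made-up column-grouping function:

```python
import pandas as pd


def group_key(col_name):
    """Illustrative: group columns by their prefix before the underscore."""
    return col_name.split("_")[0]


df = pd.DataFrame({"x_a": [1, 2], "x_b": [3, 4], "y_a": [5, 6]})

# Deprecated: df.groupby(group_key, axis=1).sum()
# Replacement suggested by the FutureWarning -- transpose, group, transpose back:
result = df.T.groupby(group_key).sum().T
```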

Signed-off-by: Ehud-Karavani <[email protected]>
ehudkr added 8 commits July 25, 2024 10:55
Pandas may in the near future stop supporting the option of ignoring columns that cannot be converted to numerics while keeping the input as is:
`FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead`

So instead, let pandas place NaNs wherever it did not succeed and then fill those NaNs with the original (non-numeric) values.
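A small sketch of that replacement on a single Series:

```python
import pandas as pd

s = pd.Series(["1", "2", "not-a-number"])

# Deprecated: pd.to_numeric(s, errors="ignore")
# Instead, coerce failures to NaN and then restore the original values in those spots:
numeric = pd.to_numeric(s, errors="coerce")
result = numeric.fillna(s)   # [1.0, 2.0, 'not-a-number']
```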

Signed-off-by: Ehud-Karavani <[email protected]>
It seems Pandas has stopped supporting `applymap` in favor of `map`:
`FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.`

Signed-off-by: Ehud-Karavani <[email protected]>
Pandas probably used to allow integer-based slicing/accessing even if the index was not integer, which is why the second `[0]` worked even though the index holds the value "control_to_treated".
But this behavior will be deprecated soon:
`FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]``

So replace it with fully `iloc`-based slicing.
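For illustration, with made-up values:

```python
import pandas as pd

effects = pd.Series([0.42], index=["control_to_treated"])

# Deprecated: effects[0] -- integer keys will eventually be treated as labels.
value = effects.iloc[0]    # explicit positional access
```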

Signed-off-by: Ehud-Karavani <[email protected]>
Pandas may soon stop allowing empty frames/series in `pd.concat`:
`FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.`

But it does permit concatenating `None`s, so replacing all empty frames with `None`s should do the trick.
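A sketch of that swap (`pd.concat` silently drops `None` entries, as long as not all of them are `None`):

```python
import pandas as pd

frames = [
    pd.DataFrame({"x": [1, 2]}),
    pd.DataFrame(),                 # empty: concatenating it is deprecated
    pd.DataFrame({"x": [3, 4]}),
]

# Replace empty frames with None, which pd.concat silently ignores:
frames = [f if f is not None and not f.empty else None for f in frames]
result = pd.concat(frames)
```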

Signed-off-by: Ehud-Karavani <[email protected]>
It seems the transition from `applymap` to `map` is less than a year old, starting with pandas version 2.1:
https://pandas.pydata.org/pandas-docs/version/2.1/reference/api/pandas.DataFrame.map.html

This is too new to fully rely on, especially since it breaks for Python 3.7 and 3.8, which are not supported by newer pandas versions. I'm not yet comfortable dropping them (despite their end-of-life status), because I'm not sure how well causallib supports 3.12 at the moment.

So putting in a pandas version-dependent choice makes sense at the moment and can later be discarded.
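A sketch of such a version-dependent choice (again assuming `packaging` is available):

```python
import pandas as pd
from packaging.version import Version

# `DataFrame.map` only exists from pandas 2.1; older versions only have `applymap`.
if Version(pd.__version__) >= Version("2.1"):
    elementwise_apply = pd.DataFrame.map
else:
    elementwise_apply = pd.DataFrame.applymap

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
result = elementwise_apply(df, lambda v: v * 10)
```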

Signed-off-by: Ehud-Karavani <[email protected]>
Tested locally with the newest dependencies and it seems to work.

Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Replace `row_stack` with `vstack` and `np.NaN` with `np.nan`.
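For reference, the numpy 2 friendly forms:

```python
import numpy as np

stacked = np.vstack([[1, 2], [3, 4]])  # was: np.row_stack(...), deprecated in numpy>=2
missing = np.nan                       # was: np.NaN, removed in numpy>=2
```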

Signed-off-by: Ehud-Karavani <[email protected]>
@ehudkr merged commit bdf033c into master on Jul 25, 2024
13 checks passed
@ehudkr deleted the adjust-for-sklearn-1.2.0 branch on July 25, 2024 11:28