Dependency housekeeping: sklearn>=1.2, numpy>2, pandas #72

Merged: 38 commits from adjust-for-sklearn-1.2.0 into master on Jul 25, 2024

Conversation

@ehudkr (Collaborator) commented on Jul 25, 2024

Adjust causallib and its tests for scikit-learn version >=1.2.0, and do additional housekeeping for the ever-evolving changes in causallib's other dependencies.

Starting with version 1.2.0, scikit-learn enforces that named inputs (i.e., dataframes) have column names of a single type.
Namely, you cannot fit an sklearn model on a dataframe that mixes an integer column name (e.g., 0) with a string column name (e.g., "x1").
This caused causallib to crash depending on the provided input (especially the inputs used in its tests).
This is because causallib often joins the covariates X with the treatment a (Standardization) and with propensity-based features (PropensityFeatureStandardization). If the name given to a and/or the propensity features is, for example, a string while X's columns are integers, sklearn would crash.
Note that non-string column names in dataframes seem to be an anti-pattern; therefore, 1) if the user provides X with string columns and a with a string name, no error should arise, and 2) the proposed solution tends to stringify columns/names.

The proposed solution is mostly a "safe join" in which the types of X's columns and a's name are evaluated. If X's columns are of mixed types, they are converted to strings. If the types of X.columns and a.name do not agree, there is a preference to convert them all to strings. After this name sanitization, they are concatenated.
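A minimal sketch of what such a safe join might look like (the function name `safe_join` and the exact stringification policy here are illustrative assumptions, not necessarily causallib's actual implementation):

```python
import pandas as pd


def safe_join(X: pd.DataFrame, a: pd.Series) -> pd.DataFrame:
    """Concatenate covariates X and treatment a with sklearn>=1.2-safe column names.

    If X's column names mix types, or their type disagrees with a's name,
    everything is converted to strings before concatenation.
    """
    name_types = {type(c) for c in X.columns}
    mixed_names = len(name_types) > 1
    name_mismatch = a.name is not None and type(a.name) not in name_types
    if mixed_names or name_mismatch:
        X = X.rename(columns=str)          # stringify covariate names
        if a.name is not None:
            a = a.rename(str(a.name))      # stringify the treatment name
    return pd.concat([X, a], axis=1)


# Integer covariate names with a string treatment name get unified to strings:
X = pd.DataFrame({0: [1.0, 2.0], 1: [3.0, 4.0]})
a = pd.Series([0, 1], name="a")
print(safe_join(X, a).columns.tolist())  # ['0', '1', 'a']
```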

In addition to adjusting for scikit-learn>=1.2, there were further changes to accommodate numpy>=2 and pandas>=2.1.

The `normalize` kwarg is no longer supported and wasn't really necessary for the tests to pass. Therefore, it is removed rather than replaced with a `Pipeline` with a `StandardScaler` preprocessing step.
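For context, the removed kwarg and the pipeline-based alternative (the one deliberately not adopted here) would look roughly like this, as a hedged illustration:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pre-1.2 style, no longer supported:
#   model = LinearRegression(normalize=True)
# The documented replacement scales inputs in a preprocessing step instead:
model = make_pipeline(StandardScaler(), LinearRegression())
```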

Signed-off-by: Ehud-Karavani <[email protected]>
The use of `a.values` within the treatment encoder created a mismatch between passing named pandas objects and unnamed numpy arrays.

Signed-off-by: Ehud-Karavani <[email protected]>
Currently, this does not work.
Need to figure out what happens with the encoding when X's columns are numeric (it should not add a prefix, which makes them strings, but it also cannot be done numerically because the 0/1 treatment columns would clash with the 0/1 X columns).

Signed-off-by: Ehud-Karavani <[email protected]>
Instead, transform the resulting output to dense (dense is required to set it up as a dataframe).
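A hedged illustration of the idea, assuming a `OneHotEncoder`-style transformer (the variable names are made up for the example):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

a = pd.Series([0, 1, 1, 0], name="a")
encoder = OneHotEncoder()                       # output may be a sparse matrix
encoded = encoder.fit_transform(a.to_frame())
if hasattr(encoded, "toarray"):                 # densify if needed
    encoded = encoded.toarray()
encoded = pd.DataFrame(encoded, index=a.index)  # dense is required for a DataFrame
```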

Signed-off-by: Ehud-Karavani <[email protected]>
If a wrong `feature_type` is provided, an explicit warning is raised stating which feature types are available.

Also corrects a typo in a docstring.
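A minimal sketch of that validation; the type names and message wording are purely illustrative assumptions:

```python
import warnings

AVAILABLE_FEATURE_TYPES = {"weight_vector", "propensity_vector", "propensity_matrix"}  # illustrative


def validate_feature_type(feature_type: str) -> None:
    # Warn explicitly instead of failing later with an opaque downstream error.
    if feature_type not in AVAILABLE_FEATURE_TYPES:
        warnings.warn(
            f"Unrecognized feature_type={feature_type!r}. "
            f"Available feature types are: {sorted(AVAILABLE_FEATURE_TYPES)}."
        )
```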

Signed-off-by: Ehud-Karavani <[email protected]>
Rename `a` as necessary in order to fit the type of `X.columns`.

Signed-off-by: Ehud-Karavani <[email protected]>
So sklearn>=1.2 doesn't break for `OneHotEncoder` with numpy input.

Signed-off-by: Ehud-Karavani <[email protected]>
Prefer making `a.name` a string and basing `X.columns` on that, rather than basing `a.name` on `X.columns`.
This is because `a.name` is also passed around as MultiIndex levels and therefore should be readable.

The current implementation is probably BAD.
The logic should really be simplified.

Signed-off-by: Ehud-Karavani <[email protected]>
Following scikit-learn raising an exception for mixed types:
https://github.com/scikit-learn/scikit-learn/blob/8133ecaacca77f06a8c4c560f5dbbfd654f1990f/sklearn/utils/validation.py#L2271-L2280
scikit-learn/scikit-learn@9f9f168

A dedicated Warning will make it easier to filter when testing, instead of making sure the data in all tests adheres to single-type column names.
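A sketch of that pattern: a dedicated warning class that the joining code emits and that tests can silence in one line (the class name is an illustrative assumption):

```python
import warnings


class MixedColumnNameTypesWarning(UserWarning):
    """Illustrative: emitted when column/name types get stringified for sklearn>=1.2."""


def warn_on_mixed_name_types(name_types):
    if len(name_types) > 1:
        warnings.warn(
            f"Mixed column-name types {name_types}; converting all names to str "
            "to satisfy scikit-learn>=1.2.",
            MixedColumnNameTypesWarning,
        )


# In tests, the single dedicated class is easy to filter out:
warnings.simplefilter("ignore", MixedColumnNameTypesWarning)
```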

Signed-off-by: Ehud-Karavani <[email protected]>
First, this is not Standardization-specific code, but rather maintenance code that is used by Standardization (which joins `a` to `X` as part of its function).
Second, this might be needed elsewhere in the package.

Therefore, it is best to move this functionality to some general utils.

Signed-off-by: Ehud-Karavani <[email protected]>

Make `a_name` optional

Signed-off-by: Ehud-Karavani <[email protected]>
Since the renaming of the columns will always be followed by a merge of the two Frames/Series, it is better to create a single function that does the safe merge once, rather than letting every call site sanitize the names and merge on its own.

Signed-off-by: Ehud-Karavani <[email protected]>
`Standardization` requires instantiating a fixed-value synthetic treatment assignment vector, whose name now must match the name of the treatment assignment vector seen during `fit`.
This ensures the Series' name is the same at both `fit` and `predict` time.
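Roughly, the counterfactual-prediction step builds a constant treatment Series that now has to carry the fit-time name (a hedged sketch with illustrative argument names, not the exact causallib code):

```python
import pandas as pd


def synthetic_treatment(X: pd.DataFrame, treatment_value, treatment_name) -> pd.Series:
    # The constant treatment-assignment vector used to predict a counterfactual outcome.
    # Its `name` must equal the name of `a` seen during `fit`, otherwise the joined
    # design matrix has a different column name at predict time than at fit time.
    return pd.Series(treatment_value, index=X.index, name=treatment_name)
```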

Signed-off-by: Ehud-Karavani <[email protected]>
This avoids a lot of clutter and makes the test output clearer and more intelligible.
The alternative, making the data in each test adhere to the sklearn restriction, is too big a change.

Signed-off-by: Ehud-Karavani <[email protected]>
Less applicable for `test_doubly_robust.py`, because that would require too many changes to the code to be worth it.

Signed-off-by: Ehud-Karavani <[email protected]>
scikit-learn moved from `"none"` to `None` for non-penalized logistic regression starting with version 1.2.0.
Adjust the tests using it to use either, depending on the currently installed sklearn version.
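The version-dependent choice presumably looks something like this (a sketch assuming the `packaging` package is available for the version comparison):

```python
import sklearn
from packaging.version import Version
from sklearn.linear_model import LogisticRegression

# scikit-learn >= 1.2 expects `penalty=None`; older versions expect the string "none".
no_penalty = None if Version(sklearn.__version__) >= Version("1.2") else "none"
learner = LogisticRegression(penalty=no_penalty, max_iter=1000)
```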

Signed-off-by: Ehud-Karavani <[email protected]>
Both check the same thing, one using unittest and the other using numpy testing.

Signed-off-by: Ehud-Karavani <[email protected]>
It appears scikit-learn forbids initializing its estimators with positional variable arguments.
So the dynamic CausalSearch model, which inherits from and initializes an unknown scikit-learn estimator, needs to adhere to that.
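A hedged sketch of the constraint, with a made-up wrapper factory standing in for the dynamic model:

```python
from sklearn.linear_model import LinearRegression


def wrap_estimator(estimator_cls):
    """Illustrative: dynamically subclass an arbitrary scikit-learn estimator.

    scikit-learn's parameter introspection raises an error when `__init__` accepts
    positional varargs (`*args`), so the dynamic subclass forwards keyword
    arguments only (explicit keyword parameters would be even friendlier to
    `get_params`/`clone`).
    """
    class Wrapped(estimator_cls):
        def __init__(self, **kwargs):  # no `*args`: positional varargs are forbidden
            super().__init__(**kwargs)

    return Wrapped


WrappedLR = wrap_estimator(LinearRegression)
model = WrappedLR(fit_intercept=True)  # constructed with keyword arguments only
```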

Signed-off-by: Ehud-Karavani <[email protected]>
Pandas may soon drop support for grouping on columns:
`FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.`
Fix it by transposing the dataframe before the groupby and then transposing the result back.
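For illustration, with a made-up column-grouping function:

```python
import pandas as pd


def group_key(col_name):
    """Illustrative: group columns by their prefix before the underscore."""
    return col_name.split("_")[0]


df = pd.DataFrame({"x_a": [1, 2], "x_b": [3, 4], "y_a": [5, 6]})

# Deprecated: df.groupby(group_key, axis=1).sum()
# Replacement suggested by the FutureWarning -- transpose, group, transpose back:
result = df.T.groupby(group_key).sum().T
```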

Signed-off-by: Ehud-Karavani <[email protected]>
ehudkr added 8 commits July 25, 2024 10:55
Pandas may in the near future stop supporting the option of ignoring columns that cannot be converted to numerics while keeping the input as is:
`FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead`

So instead, let pandas place NaNs wherever it did not succeed and then fill those NaNs with the original (non-numeric) values.
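A small sketch of that replacement on a single Series:

```python
import pandas as pd

s = pd.Series(["1", "2", "not-a-number"])

# Deprecated: pd.to_numeric(s, errors="ignore")
# Instead, coerce failures to NaN and then restore the original values in those spots:
numeric = pd.to_numeric(s, errors="coerce")
result = numeric.fillna(s)   # [1.0, 2.0, 'not-a-number']
```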

Signed-off-by: Ehud-Karavani <[email protected]>
It seems Pandas has stopped supporting `applymap` in favor of `map`:
`FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.`

Signed-off-by: Ehud-Karavani <[email protected]>
Pandas probably used to allow integer-based slicing/accessing even if the index was not integer, which is why the second `[0]` worked even though the index holds the value "control_to_treated".
But this behavior will be deprecated soon:
`FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]``

So replace it with fully `iloc`-based slicing.
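For illustration, with made-up values:

```python
import pandas as pd

effects = pd.Series([0.42], index=["control_to_treated"])

# Deprecated: effects[0] -- integer keys will eventually be treated as labels.
value = effects.iloc[0]    # explicit positional access
```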

Signed-off-by: Ehud-Karavani <[email protected]>
Pandas may soon stop allowing empty frames/series in `pd.concat`:
`FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.`

But it does permit concatenating `None`s, so replacing all empty frames with `None`s should do the trick.
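A sketch of that swap (`pd.concat` silently drops `None` entries, as long as not all of them are `None`):

```python
import pandas as pd

frames = [
    pd.DataFrame({"x": [1, 2]}),
    pd.DataFrame(),                 # empty: concatenating it is deprecated
    pd.DataFrame({"x": [3, 4]}),
]

# Replace empty frames with None, which pd.concat silently ignores:
frames = [f if f is not None and not f.empty else None for f in frames]
result = pd.concat(frames)
```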

Signed-off-by: Ehud-Karavani <[email protected]>
It seems the transition from `applymap` to `map` is less than a year old, starting with pandas version 2.1:
https://pandas.pydata.org/pandas-docs/version/2.1/reference/api/pandas.DataFrame.map.html

This is too new to fully rely on, especially since it breaks for Python 3.7 and 3.8, which are not supported by newer pandas versions. I'm not yet comfortable dropping them (despite their end-of-life status), because I'm not sure how well causallib supports 3.12 at the moment.

So putting in a pandas version-dependent choice makes sense at the moment and can later be discarded.
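A sketch of such a version-dependent choice (again assuming `packaging` is available):

```python
import pandas as pd
from packaging.version import Version

# `DataFrame.map` only exists from pandas 2.1; older versions only have `applymap`.
if Version(pd.__version__) >= Version("2.1"):
    elementwise_apply = pd.DataFrame.map
else:
    elementwise_apply = pd.DataFrame.applymap

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
result = elementwise_apply(df, lambda v: v * 10)
```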

Signed-off-by: Ehud-Karavani <[email protected]>
Tested locally with the newest dependencies and it seems to work.

Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Replace `row_stack` with `vstack` and `np.NaN` with `np.nan`.
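For reference, the numpy 2 friendly forms:

```python
import numpy as np

stacked = np.vstack([[1, 2], [3, 4]])  # was: np.row_stack(...), deprecated in numpy>=2
missing = np.nan                       # was: np.NaN, removed in numpy>=2
```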

Signed-off-by: Ehud-Karavani <[email protected]>
@ehudkr merged commit bdf033c into master on Jul 25, 2024
13 checks passed
@ehudkr deleted the adjust-for-sklearn-1.2.0 branch on July 25, 2024 11:28