Dependency housekeeping: sklearn>=1.2, numpy>2, pandas #72
Merged
Conversation
`normalize` kwarg is no longer supported and wasn't really necessary for the tests to pass. Therefore, it is removed rather than replaced with a `Pipeline` with a `StandardScaler` preprocessing step. Signed-off-by: Ehud-Karavani <[email protected]>
The use of `a.values` within the treatment encoder created a mismatch between specifying named pandas and unnamed numpy arrays. Signed-off-by: Ehud-Karavani <[email protected]>
Currently does not work. Need to figure out what happens with encoding when X columns are numeric (we should not add a prefix, which makes them strings, but it also cannot be done numerically because the 0/1 treatment columns would clash with 0/1 X columns). Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Instead, transform the resulting output to dense (dense is required to set it up as a dataframe). Signed-off-by: Ehud-Karavani <[email protected]>
If a wrong `feature_type` is provided, an explicit warning is raised stating which feature types are available. Also corrects a typo in the docstring. Signed-off-by: Ehud-Karavani <[email protected]>
Rename `a` as necessary so it fits the type of `X.columns`. Signed-off-by: Ehud-Karavani <[email protected]>
So sklearn>=1.2 doesn't break for `OneHotEncoder` with numpy input. Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Prefer making `a.name` a string and basing `X.columns` on that, rather than basing `a.name` on `X.columns`. This is because `a.name` is also passed around as MultiIndex levels and therefore should be readable. The current implementation is probably bad; the logic should really be simplified. Signed-off-by: Ehud-Karavani <[email protected]>
Following scikit-learn raising an exception for mixed types: https://github.com/scikit-learn/scikit-learn/blob/8133ecaacca77f06a8c4c560f5dbbfd654f1990f/sklearn/utils/validation.py#L2271-L2280 scikit-learn/scikit-learn@9f9f168 A dedicated warning will make it easier to filter when testing, instead of making sure the data in all tests adheres to single-type columns. Signed-off-by: Ehud-Karavani <[email protected]>
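The filtering idea can be sketched as follows; `MixedColumnTypesWarning` and `check_column_types` are hypothetical names for illustration, not necessarily the ones used in causallib.

```python
# Sketch: a dedicated warning class for mixed-type column names makes it
# filterable in one line in tests. Names here are illustrative.
import warnings


class MixedColumnTypesWarning(UserWarning):
    pass


def check_column_types(columns):
    if len({type(c) for c in columns}) > 1:
        warnings.warn("Mixed types in column names.", MixedColumnTypesWarning)


# In tests, the dedicated class can then be silenced with a single line:
warnings.simplefilter("ignore", MixedColumnTypesWarning)
```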
First, this is not Standardization-specific code, but rather maintenance code used by Standardization (which joins `a` to `X` as part of its function). Second, it might be needed elsewhere in the package. Therefore, it is best to move this functionality to some general utils. Signed-off-by: Ehud-Karavani <[email protected]> Make `a_name` optional Signed-off-by: Ehud-Karavani <[email protected]>
Since renaming the columns will always be followed by a merge of the two Frames/Series, it is better to create a single function that does the safe merge once, rather than let every call site sanitize the name and merge on its own. Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
`Standardization` requires instantiating a fixed-value synthetic treatment assignment vector, whose name must now match the treatment assignment vector name seen during `fit`. Ensure the Series' name is the same at both `fit` and `predict` time. Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
This avoids a lot of clutter and makes test output clearer and more intelligible. The alternative, making the data in each test adhere to the sklearn restriction, is too big a change. Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Less applicable for `test_doubly_robust.py`, because that requires too many changes to the code to be worth it. Signed-off-by: Ehud-Karavani <[email protected]>
scikit-learn moves from "none" to None for non-penalized logistic regression starting with version 1.2.0. Adjust tests to use either, depending on the currently installed sklearn version. Signed-off-by: Ehud-Karavani <[email protected]>
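A minimal sketch of such a version-dependent choice (the variable names are illustrative): before scikit-learn 1.2, unpenalized logistic regression was spelled `penalty="none"`; from 1.2 onward it is `penalty=None`.

```python
# Pick the "no penalty" spelling based on the installed scikit-learn version.
import sklearn
from sklearn.linear_model import LogisticRegression

_SKLEARN_GE_12 = tuple(int(v) for v in sklearn.__version__.split(".")[:2]) >= (1, 2)
no_penalty = None if _SKLEARN_GE_12 else "none"  # None from sklearn>=1.2

model = LogisticRegression(penalty=no_penalty)
```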
Both check the same thing, one using unittest and the other using numpy testing. Signed-off-by: Ehud-Karavani <[email protected]>
It appears scikit-learn forbids initializing its estimators with positional variable arguments, so the dynamic CausalSearch model, which inherits from and initializes an unknown scikit-learn estimator, needs to adhere to that. Signed-off-by: Ehud-Karavani <[email protected]>
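For illustration, scikit-learn estimators must declare explicit keyword parameters in `__init__` (no `*args`) so that `get_params`/`set_params` can introspect them; `WrappedEstimator` and `alpha` below are illustrative names, not causallib's.

```python
# Sketch: an sklearn-compatible estimator must expose explicit keyword
# parameters in __init__, not positional variable arguments.
from sklearn.base import BaseEstimator


class WrappedEstimator(BaseEstimator):
    def __init__(self, alpha=1.0):  # explicit keyword, not *args
        self.alpha = alpha


params = WrappedEstimator(alpha=0.5).get_params()
```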
Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Pandas may soon drop support for grouping on columns: `FutureWarning: DataFrame.groupby with axis=1 is deprecated. Do `frame.T.groupby(...)` without axis instead.` Fix it by transposing the dataframe before the groupby and then transposing the result. Signed-off-by: Ehud-Karavani <[email protected]>
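The double-transpose fix can be sketched on toy data (the column names and the grouping rule below are illustrative):

```python
# Replace the deprecated axis=1 groupby with transpose -> groupby -> transpose.
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=["a_0", "a_1", "b_0", "b_1"])

# Deprecated: df.groupby(lambda c: c.split("_")[0], axis=1).sum()
grouped = df.T.groupby(lambda c: c.split("_")[0]).sum().T
```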
Pandas may in the near future stop supporting ignoring columns that cannot be converted to numerics while keeping the input as is: `FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead` So instead, let pandas place NaNs wherever it does not succeed and then fill those NaNs with the original (non-numeric) values. Signed-off-by: Ehud-Karavani <[email protected]>
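The coerce-then-restore pattern can be sketched on toy data:

```python
# Replace errors="ignore": coerce failures to NaN, then fill NaNs back with
# the original (non-numeric) values.
import pandas as pd

s = pd.Series(["1", "2", "abc", "4.5"])
converted = pd.to_numeric(s, errors="coerce")  # "abc" becomes NaN
result = converted.fillna(s)  # restore original value where conversion failed
```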
It seems Pandas has stopped supporting `applymap` in favor of `map`: `FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.` Signed-off-by: Ehud-Karavani <[email protected]>
Pandas probably used to allow integer-based slicing/accessing even when the index was not integer, which is why the second `[0]` worked even though the index holds the "control_to_treated" value. But this behavior will soon be deprecated: `FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`` So replace it with full `iloc`-based slicing. Signed-off-by: Ehud-Karavani <[email protected]>
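A minimal sketch of the fix (the value 0.12 is made up; only the index label comes from the commit's example):

```python
# Explicit positional access replaces the deprecated integer-key __getitem__.
import pandas as pd

effects = pd.Series([0.12], index=["control_to_treated"])
# Deprecated: effects[0]  -- integer key on a non-integer (label) index
value = effects.iloc[0]   # explicit positional access
```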
Pandas may soon stop allowing empty frames/series in `pd.concat`: `FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.` But it does permit concatenating `None`s, so replacing all empty frames with `None` should do the trick. Signed-off-by: Ehud-Karavani <[email protected]>
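The empty-to-`None` replacement can be sketched with an illustrative helper (`concat_non_empty` is not a causallib name):

```python
# Map empty frames to None before concat; pd.concat silently ignores None
# entries (note it still raises if *all* entries are None).
import pandas as pd


def concat_non_empty(frames):
    frames = [f if f is not None and not f.empty else None for f in frames]
    return pd.concat(frames)


out = concat_non_empty(
    [pd.DataFrame({"x": [1, 2]}), pd.DataFrame(columns=["x"]), pd.DataFrame({"x": [3]})]
)
```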
It seems the transition from `applymap` to `map` is less than a year old, starting with pandas version 2.1: https://pandas.pydata.org/pandas-docs/version/2.1/reference/api/pandas.DataFrame.map.html This is too new to fully rely on, especially since it breaks for Python 3.7 and 3.8, which are not supported by newer pandas versions and which I'm not yet comfortable dropping (despite their end-of-life status, because I'm not sure how causallib supports 3.12 at the moment). So a pandas version-dependent choice makes sense for now and can later be discarded. Signed-off-by: Ehud-Karavani <[email protected]>
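One way to sketch the version-dependent choice, here feature-detecting `DataFrame.map` (added in pandas 2.1) rather than parsing the version string; `elementwise` is an illustrative name:

```python
# Use DataFrame.map where available (pandas >= 2.1), else fall back to applymap.
import pandas as pd


def elementwise(df, func):
    if hasattr(pd.DataFrame, "map"):  # pandas >= 2.1
        return df.map(func)
    return df.applymap(func)  # older pandas


doubled = elementwise(pd.DataFrame({"a": [1, 2]}), lambda v: v * 2)
```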
Tested locally with the newest dependencies and it seems to work. Signed-off-by: Ehud-Karavani <[email protected]>
Signed-off-by: Ehud-Karavani <[email protected]>
Replace `row_stack` with `vstack` and `np.NaN` with `np.nan`. Signed-off-by: Ehud-Karavani <[email protected]>
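For illustration, numpy 2.0 removes the `np.NaN` alias and deprecates `row_stack`, while `np.vstack` and `np.nan` work across versions:

```python
# Version-safe spellings: np.vstack instead of np.row_stack, np.nan instead
# of np.NaN.
import numpy as np

stacked = np.vstack([np.array([1.0, 2.0]), np.array([3.0, np.nan])])
```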
Adjust causallib and its tests for scikit-learn version >=1.2.0.
And do additional housekeeping adjusting for ever-evolving changes in causallib's dependencies.
Starting with version 1.2.0, scikit-learn enforces named inputs (i.e., dataframes) to have all their column names be of a single type.
Namely, you cannot fit an sklearn model on a dataframe with both an integer column name (e.g., 0) and a string column name (e.g., "x1").
This caused causallib to crash depending on the provided input (especially the inputs used in its tests).
This is because causallib may often join the covariates `X` with the treatment `a` (`Standardization`) and with some propensity-based features (`PropensityFeatureStandardization`). And if the name given to `a` and/or the propensity features is, for example, a string, while `X`'s columns are integers, it would cause sklearn to crash.
Note that 1) non-string column names in dataframes seem to be an anti-pattern, and therefore if the user provided `X` with string columns and `a` with a string name, no error would have arisen; 2) the proposed solution will tend to stringify columns/names.
The proposed solution is mostly a "safe join" in which `X`'s columns and `a`'s name are evaluated for their types. If `X`'s columns are of mixed types, they are converted to strings. If the types of `X.columns` and `a.name` do not agree, there is a preference to convert them all to strings. After this name sanitization, they are concatenated.
In addition to adjusting for scikit-learn>=1.2, there were additional changes to adjust the package for numpy>=2 and for pandas>=2.1.
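The "safe join" idea can be sketched as follows; `safe_join` and its exact logic here are illustrative, not causallib's actual implementation.

```python
# Sketch: stringify column names / the treatment name when their types are
# mixed or disagree, so sklearn>=1.2 accepts the concatenated frame.
import pandas as pd


def safe_join(X, a):
    col_types = {type(c) for c in X.columns}
    if len(col_types) > 1 or (a.name is not None and type(a.name) not in col_types):
        # Mixed or disagreeing name types: prefer converting everything to strings.
        X = X.rename(columns=str)
        a = a.rename(str(a.name))
    return pd.concat([X, a], axis=1)


joined = safe_join(pd.DataFrame({0: [1, 2], 1: [3, 4]}), pd.Series([5, 6], name="a"))
```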