Feature Request
Handling missing values is not currently supported in the gcm module. To enable it, we can leverage one of the ensemble models that natively handle missing values, e.g., HistGradientBoosting. For that, we can relax the hard requirement for complete data when the causal mechanism is set to an additive noise model with HistGradientBoosting as the prediction function.

@bloebp and I had a discussion on Discord, and we can add the support.
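As a quick illustration of the native support we want to rely on, here is a minimal sketch on synthetic data (not part of the reproduction script below): scikit-learn's HistGradientBoosting estimators accept NaN entries in the feature matrix directly, with no imputation step.

# Minimal, self-contained illustration: HistGradientBoostingRegressor accepts NaN in X out of the box.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Knock out roughly 10% of the feature cells at random
X[rng.random(X.shape) < 0.1] = np.nan

model = HistGradientBoostingRegressor().fit(X, y)  # no imputation required
print(model.predict(X[:5]))                        # predictions work despite NaNs in X

The full reproduction script against the gcm module follows.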
# Import the necessary libraries
import dowhy
import dowhy.gcm as gcm
from dowhy.gcm import StructuralCausalModel
import numpy as np
import dowhy.datasets
import pandas as pd
import networkx as nx
import random
from sklearn.ensemble import HistGradientBoostingClassifier

# Set the decimal precision to 3
pd.set_option('display.precision', 3)
# Set a global random seed for numpy and other libraries
seed = 1
random.seed(seed)
np.random.seed(seed)
### Create a synthetic dataset with linear associations
# Value of the coefficient [BETA]
BETA = 1
# Number of Common Causes
NUM_COMMON_CAUSES = 8
# Number of Discrete Common Causes
NUM_DISCRETE_COMMON_CAUSES = 5
# Number of Effect Modifiers
NUM_EFFECT_MODIFIERS = 3
# Number of Discrete Effect Modifiers
NUM_DISCRETE_EFFECT_MODIFIERS = 3
# Number of Instruments
NUM_INSTRUMENTS = 2
# Number of Samples
NUM_SAMPLES = 10000
# Treatment is Binary
TREATMENT_IS_BINARY = True
# Outcome is Binary
OUTCOME_IS_BINARY = True

if __name__ == "__main__":
    data = dowhy.datasets.linear_dataset(
        beta=BETA,
        num_common_causes=NUM_COMMON_CAUSES,
        num_instruments=NUM_INSTRUMENTS,
        num_discrete_common_causes=NUM_DISCRETE_COMMON_CAUSES,
        num_effect_modifiers=NUM_EFFECT_MODIFIERS,
        num_discrete_effect_modifiers=NUM_DISCRETE_EFFECT_MODIFIERS,
        num_samples=NUM_SAMPLES,
        treatment_is_binary=TREATMENT_IS_BINARY,
        outcome_is_binary=OUTCOME_IS_BINARY,
    )
    df = data['df']
    ### Inject missing values at random into the dataset with a ratio of 0.1
    # Create a copy of the dataframe
    df_with_missing = df.copy()

    # Calculate the number of cells to make missing
    n_cells = df_with_missing.size
    n_missing = int(0.1 * n_cells)

    # Randomly select cells to make missing
    rows = np.random.randint(0, df_with_missing.shape[0], n_missing)
    cols = np.random.randint(0, df_with_missing.shape[1], n_missing)

    # Set the selected cells to NaN
    for r, c in zip(rows, cols):
        df_with_missing.iloc[r, c] = np.nan

    # Use df_with_missing going forward
    df = df_with_missing
    # Show the fraction of missing values per column
    print(df.isnull().mean())
    ### Create a causal model
    # Parse the GML string into a networkx graph
    causal_graph = nx.parse_gml(data['gml_graph'])
    causal_model = StructuralCausalModel(causal_graph)

    # Find the root nodes
    root_nodes = [node for node in causal_graph.nodes() if len(list(causal_graph.predecessors(node))) == 0]
    print(len(root_nodes))
    print(root_nodes)
    # Build the model
    try:
        # Assign an empirical distribution to the root nodes
        for node in root_nodes:
            causal_model.set_causal_mechanism(node, gcm.EmpiricalDistribution())
        # Assign an additive noise model to the non-root nodes
        for node in causal_graph.nodes():
            if node not in root_nodes:
                # Check if the node is categorical or continuous
                if df[node].nunique() <= 10:
                    print(f"Setting {node} to AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_classifier())")
                    causal_model.set_causal_mechanism(node, gcm.AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_classifier()))
                else:
                    print(f"Setting {node} to AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_regressor())")
                    causal_model.set_causal_mechanism(node, gcm.AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_regressor()))
        gcm.fit(causal_model, df)
    except Exception as e:
        print(e)
    finally:
        print("How to fix the error above?")
        proposal = """Allow the causal model to be fitted with data that has missing values IFF the selected
        regressor/classifier can handle missing values. For example, `HistGradientBoostingClassifier` can handle
        missing values. I believe that supporting this requires relaxing the conditional statement at
        `dowhy/gcm/util/general.py:166`: if the regressor/classifier can handle missing values, drop the rows with
        missing values on the target node, and finally pass the data to the regressor/classifier fit method."""
        print(proposal)
    # To simulate the proposed behavior, we can fit a `create_hist_gradient_boost_classifier`-style model
    # on the node `v0` and then predict with missing values present.
    # First, get the parents of `v0`
    parents_of_v0 = list(causal_graph.predecessors('v0'))
    # Drop the rows where the target node `v0` itself is missing
    df_local = df.dropna(subset=['v0'])
    # Fit a `HistGradientBoostingClassifier` from sklearn on the parents of `v0`
    X = df_local[parents_of_v0].to_numpy().astype(np.float32)  # Convert to float32 (may still contain NaN)
    y = df_local['v0'].to_numpy().astype(np.float32)

    # Fit the model; the features contain random missingness
    clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
    # Make predictions on the training rows (target observed)
    y_pred_complete = clf.predict(X)
    # Make predictions on the original dataset (with missing values everywhere)
    y_pred_missing = clf.predict(df[parents_of_v0].to_numpy().astype(np.float32))
    print("Shape of the predictions: ", y_pred_complete.shape)
    print("Shape of the predictions with missing values: ", y_pred_missing.shape)
Solution
Relax the if statement at the line below when the data is not complete but the selected model supports missing values (e.g. HistGradientBoosting).
https://github.com/py-why/dowhy/blob/main/dowhy/gcm/util/general.py#L166
Update! Here is the list of all scikit-learn estimators that handle missing values out of the box: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
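Conceptually, the relaxed check could look something like the sketch below. This is only a hypothetical illustration assuming numeric arrays: `supports_missing_values` and `fit_mechanism` are made-up names, not existing dowhy APIs, and the real change would live at the linked line in `dowhy/gcm/util/general.py`.

# Hypothetical sketch only -- the names below are illustrative, not existing dowhy APIs.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier, HistGradientBoostingRegressor

NAN_TOLERANT_MODELS = (HistGradientBoostingClassifier, HistGradientBoostingRegressor)

def supports_missing_values(model) -> bool:
    # Could also consult sklearn's estimator tags (allow_nan) instead of an explicit whitelist.
    return isinstance(model, NAN_TOLERANT_MODELS)

def fit_mechanism(model, X: np.ndarray, y: np.ndarray):
    """Fit `model` on (X, y), relaxing the completeness requirement for NaN-tolerant models."""
    has_nan = np.isnan(X).any() or np.isnan(y).any()
    if has_nan and not supports_missing_values(model):
        raise ValueError("Data contains missing values and the selected model cannot handle them.")
    # Proposed relaxation: drop rows where the target is missing and let a
    # NaN-tolerant model deal with NaNs that remain in the features.
    keep = ~np.isnan(y)
    return model.fit(X[keep], y[keep])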
Alternatives
Not applicable!
Code to generate the behavior
======================================
Objectives:
This is code for experiments with the DoWhy package, especially with the gcm module when data has missing values.
Environment Setup:
I use uv for this purpose. Install the necessary libraries beforehand:
!uv pip install -U dowhy scikit-learn==1.3.2
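Optionally, a quick sanity check of the environment (not part of the original setup notes):

# Optional environment check
from importlib.metadata import version
print("dowhy:", version("dowhy"))
print("scikit-learn:", version("scikit-learn"))  # the install command above pins 1.3.2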
======================================
Code: the full reproduction script is listed above.