
In gcm, allow building a model in presence of missing values if an appropriate estimator is selected #1300

Open
saeedmehrang opened this issue Feb 26, 2025 · 0 comments
Labels
enhancement New feature or request


Feature Request
Handling missing values is not currently supported in the `gcm` module. To enable it, we can leverage one of the ensemble models that natively supports missing values, e.g. `HistGradientBoosting`. Concretely, we can relax the hard requirement for complete data when the causal mechanism is set to an additive noise model with `HistGradientBoosting` as the function.

@bloebp and I discussed this on Discord and agreed this support can be added.

Solution
Relax the if statement at the line below so that incomplete data is accepted when the selected model supports missing values (e.g. `HistGradientBoosting`).
https://github.com/py-why/dowhy/blob/main/dowhy/gcm/util/general.py#L166
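A minimal sketch of how the relaxed check could look. This is not the actual DoWhy code; the helper name `supports_missing_values` and the name-based whitelist are illustrative assumptions (a real implementation could instead inspect scikit-learn's estimator tags):

```python
import numpy as np

# Illustrative whitelist of NaN-tolerant estimators (hypothetical).
NAN_TOLERANT_MODELS = {
    "HistGradientBoostingRegressor",
    "HistGradientBoostingClassifier",
}

def supports_missing_values(model) -> bool:
    """Return True if the prediction model can handle NaNs natively."""
    return type(model).__name__ in NAN_TOLERANT_MODELS

def validate_training_data(X: np.ndarray, model) -> None:
    """Raise only when NaNs are present AND the model cannot handle them."""
    if np.isnan(X).any() and not supports_missing_values(model):
        raise ValueError(
            "Encountered NaN values in the training data, but the selected "
            "model cannot handle missing values."
        )
```

With this shape, complete data passes for any model, and incomplete data passes only for NaN-tolerant models.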

Update: here is the list of all scikit-learn estimators that handle missing values out of the box: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Alternatives
Not applicable!

Code to reproduce the behavior

======================================
Objectives:

This code runs experiments with the DoWhy package, specifically the `gcm` module, when the data has missing values.

Environment Setup:

  • Use Python 3.12.
  • Before running this code:
    • Create and activate a virtual environment; uv works well for this purpose.
    • Install ipykernel in the virtual environment if you want to use a Jupyter notebook.
  • Select the virtual environment as the notebook kernel.
  • Install dowhy inside the virtual environment from within the notebook.

Install the necessary libraries beforehand:
!uv pip install -U dowhy scikit-learn==1.3.2

======================================

Code:
# Import the necessary libraries
import dowhy
import dowhy.gcm as gcm
from dowhy.gcm import StructuralCausalModel
import numpy as np
import dowhy.datasets
import pandas as pd
import networkx as nx
import random
from sklearn.ensemble import HistGradientBoostingClassifier


# set the decimal precision to 3
pd.set_option('display.precision', 3)
# set a global random seed for numpy and other libraries
seed = 1
random.seed(seed)
np.random.seed(seed)

### Create a synthetic dataset with linear associations
# Value of the coefficient [BETA]
BETA = 1
# Number of Common Causes
NUM_COMMON_CAUSES = 8
# Number of Discrete Common Causes
NUM_DISCRETE_COMMON_CAUSES = 5
# number of effect modifiers
NUM_EFFECT_MODIFIERS = 3
# Number of Discrete Effect Modifiers
NUM_DISCRETE_EFFECT_MODIFIERS = 3
# Number of Instruments
NUM_INSTRUMENTS = 2
# Number of Samples
NUM_SAMPLES = 10000
# Treatment is Binary
TREATMENT_IS_BINARY = True
# outcome is binary
OUTCOME_IS_BINARY = True

if __name__ == "__main__":
    data = dowhy.datasets.linear_dataset(beta=BETA,
                                    num_common_causes=NUM_COMMON_CAUSES,
                                    num_instruments=NUM_INSTRUMENTS,
                                    num_discrete_common_causes=NUM_DISCRETE_COMMON_CAUSES,
                                    num_effect_modifiers=NUM_EFFECT_MODIFIERS,
                                    num_discrete_effect_modifiers=NUM_DISCRETE_EFFECT_MODIFIERS,
                                    num_samples=NUM_SAMPLES,
                                    treatment_is_binary=TREATMENT_IS_BINARY,
                                    outcome_is_binary=OUTCOME_IS_BINARY)

    df = data['df']

    ### Inject missing values at random into the dataset with a ratio of 0.1
    # Create a copy of the dataframe
    df_with_missing = df.copy()

    # Calculate number of cells to make missing
    n_cells = df_with_missing.size
    n_missing = int(0.1 * n_cells)

    # Randomly select cells to set to NaN (duplicate picks are possible,
    # so the realized missing fraction can be slightly below 0.1)
    rows = np.random.randint(0, df_with_missing.shape[0], n_missing)
    cols = np.random.randint(0, df_with_missing.shape[1], n_missing)

    # Set selected cells to NaN
    for r, c in zip(rows, cols):
        df_with_missing.iloc[r,c] = np.nan

    # Use df_with_missing going forward
    df = df_with_missing

    # show the fraction of missing values per column
    print(df.isnull().mean())

    ### Create a causal model
    # Parse the GML string into a networkx graph
    causal_graph = nx.parse_gml(data['gml_graph'])

    causal_model = StructuralCausalModel(causal_graph)

    # find the root nodes
    root_nodes = [node for node in causal_graph.nodes() if len(list(causal_graph.predecessors(node))) == 0]
    print(len(root_nodes))
    print(root_nodes)

    # build the model
    try:
        # assign empirical distribution to the root nodes
        for node in root_nodes:
            causal_model.set_causal_mechanism(node, gcm.EmpiricalDistribution())

        # assign additive noise model to the non-root nodes
        for node in causal_graph.nodes():
            if node not in root_nodes:
                # check if the node is categorical or continuous    
                if df[node].nunique() <= 10:
                    print(f"Setting {node} to AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_classifier())")
                    causal_model.set_causal_mechanism(node, gcm.AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_classifier()))
                else:
                    print(f"Setting {node} to AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_regressor())")
                    causal_model.set_causal_mechanism(node, gcm.AdditiveNoiseModel(gcm.ml.create_hist_gradient_boost_regressor()))

        gcm.fit(causal_model, df)
    
    except Exception as e:
        print(e)

    finally:
        print("How to fix the error above? ")

        proposal = """
        Allow the causal model to be fitted on data with missing values IFF the selected regressor/classifier can handle them.
        For example, `HistGradientBoostingClassifier` handles missing values natively.
        Supporting this requires relaxing the conditional statement at `dowhy/gcm/util/general.py:166` when the regressor/classifier can handle missing values:
        drop the rows where the target node is missing, then pass the data to the regressor/classifier's fit method."""
        print(proposal)
        
        # To simulate the behavior, fit a hist-gradient-boosting classifier on the node `v0` and predict in the presence of missing values.
        # first get the parents of `v0`
        parents_of_v0 = list(causal_graph.predecessors('v0'))

        # drop the rows where the target `v0` itself is missing
        df_local = df.dropna(subset=['v0'])

        # features: the parents of `v0` (these may still contain NaNs,
        # which HistGradientBoostingClassifier handles natively)
        X = df_local[parents_of_v0].to_numpy().astype(np.float32)
        y = df_local['v0'].to_numpy().astype(np.float32)

        # fit the model
        clf = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
        # predict on the training features (rows with missing target dropped)
        y_pred_complete = clf.predict(X)
        # predict on the full dataset, including rows with missing values
        y_pred_missing = clf.predict(df[parents_of_v0].to_numpy().astype(np.float32))

        print("Shape of the predictions: ", y_pred_complete.shape)
        print("Shape of the predictions with missing values: ", y_pred_missing.shape)