Update inference python file to include shap function #48
@@ -1,9 +1,10 @@
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
import pytest

from student_success_tool.modeling.inference import select_top_features_for_display
from student_success_tool.modeling.inference import calculate_shap_values

@pytest.mark.parametrize(
    [
@@ -115,3 +116,91 @@ def test_select_top_features_for_display(
    )
    assert isinstance(obs, pd.DataFrame) and not obs.empty
    assert pd.testing.assert_frame_equal(obs, exp) is None

@pytest.fixture
def sample_data():
    data = {
        'student_id': [1, 2, 3],
        'feature1': [0.1, 0.2, 0.3],
        'feature2': [0.4, 0.5, 0.6]
    }
    return pd.DataFrame(data)


# Create a dummy KernelExplainer
class SimpleKernelExplainer:
    def shap_values(self, X):
        # Simulate SHAP values: for simplicity, return random values between 0 and 0.1
        return np.random.rand(len(X), len(X.columns)) * 0.1


@pytest.fixture
def explainer():
    return SimpleKernelExplainer()


@pytest.mark.parametrize(
    "input_data, expected_shape",
    [
        ({"student_id": [1, 2, 3], "feature1": [0.1, 0.2, 0.3], "feature2": [0.4, 0.5, 0.6]}, (3, 3)),
        ({"student_id": [1, 2], "feature1": [0.1, 0.2], "feature2": [0.4, 0.5]}, (2, 3)),
    ],
)
def test_calculate_shap_values_basic(input_data, expected_shape, explainer):
    df = pd.DataFrame(input_data)
    student_id_col = 'student_id'
    model_features = ['feature1', 'feature2']
    mode = df.mode().iloc[0]

    iterator = iter([df])
Review thread on this line:
- This is the thing I mentioned before -- doesn't this function work equally well if you pass
- This is used in the context of Spark's repartition, which I believe would need an iterator. You are right that in this dummy case of our unit test, it doesn't serve a purpose.
    result = list(
        calculate_shap_values(
            iterator,
            student_id_col=student_id_col,
            model_features=model_features,
            explainer=explainer,
            mode=mode,
        )
    )

    # Check that the result contains the expected number of rows and columns
    shap_df = result[0]
    assert shap_df.shape == expected_shape

    # Ensure that the 'student_id' column is present
    assert student_id_col in shap_df.columns

    # Ensure that SHAP values are generated and are numeric
    assert is_numeric_dtype(shap_df[model_features].iloc[0, 0])
    assert is_numeric_dtype(shap_df[model_features].iloc[0, 1])

    # Ensure student IDs are correctly reattached
    assert shap_df[student_id_col].iloc[0] == 1
    assert shap_df[student_id_col].iloc[1] == 2


@pytest.mark.parametrize(
    "batch1_data, batch2_data, expected_shape1, expected_shape2",
    [
        (
            {"student_id": [1, 2, 3], "feature1": [0.1, 0.2, 0.3], "feature2": [0.4, 0.5, 0.6]},
            {"student_id": [4, 5, 6], "feature1": [0.7, 0.8, 0.9], "feature2": [0.6, 0.7, 0.8]},
            (3, 3),
            (3, 3),
        ),
        (
            {"student_id": [4, 5, 6], "feature1": [0.1, 0.2, 0.3], "feature2": [0.4, 0.5, 0.6]},
            {"student_id": [4, 5, 6], "feature1": [0.5, 0.6, 0.7], "feature2": [0.7, 0.8, 0.9]},
            (3, 3),
            (3, 3),
        ),
    ],
)
def test_calculate_shap_values_multiple_batches(batch1_data, batch2_data, expected_shape1, expected_shape2, explainer):
    batch1 = pd.DataFrame(batch1_data)
    batch2 = pd.DataFrame(batch2_data)

    student_id_col = 'student_id'
    model_features = ['feature1', 'feature2']
    mode = batch1.mode().iloc[0]

    iterator = iter([batch1, batch2])

    result = list(
        calculate_shap_values(
            iterator,
            student_id_col=student_id_col,
            model_features=model_features,
            explainer=explainer,
            mode=mode,
        )
    )

    # Ensure we have two DataFrames
    assert len(result) == 2

    # Check first batch
    shap_df1 = result[0]
    assert shap_df1.shape == expected_shape1

    # Check second batch
    shap_df2 = result[1]
    assert shap_df2.shape == expected_shape2
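For context (not part of the diff): the SimpleKernelExplainer fixture above fakes the interface of a real SHAP explainer. A minimal sketch of what the real thing might look like with the shap package -- the model, training data, and sample sizes here are purely illustrative assumptions:

import numpy as np
import pandas as pd
import shap
from sklearn.linear_model import LogisticRegression

# Toy training data with the same columns the tests use (illustrative only)
X_train = pd.DataFrame({'feature1': np.random.rand(50), 'feature2': np.random.rand(50)})
y_train = np.random.randint(0, 2, size=50)
model = LogisticRegression().fit(X_train, y_train)

# KernelExplainer wraps a prediction function plus a background dataset; for a
# single-output function, explainer.shap_values(X) returns an array of shape
# (n_rows, n_features), which is the contract SimpleKernelExplainer fakes above.
background = shap.sample(X_train, 25)
explainer = shap.KernelExplainer(lambda X: model.predict_proba(X)[:, 1], background)
shap_values = explainer.shap_values(X_train.iloc[:3])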
Dumb question: where is the parallelization happening? To me, it looks like this function is iterating over the dataframes one after the other in a for loop.

So, I just copied over our function from the private repo. I was planning on merging this PR and then working on adding the SHAP values notebook, which is where the parallelization happens using spark.repartition. Not a dumb question - you are correct that parallelization is not happening in this function; it is just an iteration. Do you think it's best to create a notebook or put all the parallelization in this function? And if we want a notebook, I can keep it outside of the PDP template notebooks or add it as the 4th template notebook. Curious about your thoughts.
I guess I don't understand how spark.repartition works? Normally I'd expect an operation to-be-parallelized to have a function that operates on one "chunk" of the iterable, and then some outside framework calls that function in parallel over chunks. How does it work if the iterable is inside the function? Apologies if I'm being dense, I am properly confused! 😅
Is this what's going on here? https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html

In all the docs, I only see them supporting Iterator[pd.Series] inputs, rather than Iterator[pd.DataFrame], so even if this is what's being used here, I'm still confused!
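For reference, the iterator pandas UDF pattern from that post looks roughly like the sketch below (illustrative, not code from this PR): Spark calls the decorated function once per partition and hands it an iterator of batches, so the loop over the iterable lives inside the function while Spark drives the parallelism across partitions.

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def plus_one(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Spark invokes this once per partition; each input batch yields one output batch
    for batch in batch_iter:
        yield batch + 1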
Actually maybe this post is the closest analogue? https://www.databricks.com/blog/2022/02/02/scaling-shap-calculations-with-pyspark-and-pandas-udf.html
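That post uses DataFrame.mapInPandas, which does accept an Iterator[pd.DataFrame] -> Iterator[pd.DataFrame] function, matching the signature of calculate_shap_values. A rough sketch of how it might be wired up in the planned notebook -- spark_df, shap_schema, the partition count, and the keyword values below are illustrative assumptions, not code from this repo:

# Assumed setup: spark_df is a Spark DataFrame of features, and shap_schema
# describes the output columns (student_id plus one SHAP column per feature).
shap_spark_df = spark_df.repartition(64).mapInPandas(
    lambda batch_iter: calculate_shap_values(
        batch_iter,
        student_id_col='student_id',
        model_features=['feature1', 'feature2'],
        explainer=explainer,
        mode=mode,
    ),
    schema=shap_schema,
)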