Update inference Python file to include SHAP function #48
Conversation
```python
def calculate_shap_values(
    iterator: Iterator[pd.DataFrame], *,
```
I believe this is actually an `Iterable`, not an `Iterator`. And those should be imported from `collections.abc` -- see https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable
Also: It's better to name variables descriptively, rather than by their type. I would probably call this `dfs: Iterable[pd.DataFrame]`. You might say, wait, isn't "df" just shorthand for `DataFrame`, the type? Yes, technically that's true, but using "df" to refer to a data frame is so generic and common at this point, I always hesitate to go against the expectation. At least "dfs" makes it clear that this variable contains multiple data frames.
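For concreteness, here's a sketch of just the signature change I'm suggesting (body elided):

```python
from collections.abc import Iterable, Iterator

import pandas as pd

def calculate_shap_values(
    dfs: Iterable[pd.DataFrame], *,
    mode: pd.Series,
) -> Iterator[pd.DataFrame]:
    # Accepts any iterable of data frames (plain lists included) and yields
    # results one data frame at a time, hence the Iterator return annotation.
    ...
```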
So this is a copy of what we have in the private repo, where the legacy function called this argument an "iterator". Perhaps it is more of an iterable. Either way, I can rename the variables descriptively rather than by type, as you suggest.
```python
    mode: pd.Series
) -> Iterator[pd.DataFrame]:
    """
    SHAP is computationally expensive, so this function enables parallelization,
```
Dumb question: Where is the parallelization happening? To me, it looks like this function is iterating over the dataframes one after the other in a `for` loop.
So, I just copied over our function from the private repo. I was planning on merging this PR and then working on adding the SHAP values notebook, which is where the parallelization happens, using `spark.repartition`. Not a dumb question -- you are correct that parallelization is not happening in this function; it is just an iteration.

Do you think it's best to create a notebook, or to put all of the parallelization in this function? And if we create a notebook, I can keep it outside of the PDP template notebooks, or make it the 4th template notebook. Curious to hear your thoughts?
I guess I don't understand how `spark.repartition` works? Normally I'd expect an operation to-be-parallelized to have a function that operates on one "chunk" of the iterable, and then some outside framework calls that function in parallel over chunks. How does it work if the iterable is inside the function?

Apologies if I'm being dense, I am properly confused! 😅
Is this what's going on here? https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html

In all the docs, I only see them supporting `Iterator[pd.Series]` inputs, rather than `Iterator[pd.DataFrame]`, so even if this is what's being used here, I'm still confused!
Actually maybe this post is the closest analogue? https://www.databricks.com/blog/2022/02/02/scaling-shap-calculations-with-pyspark-and-pandas-udf.html
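If so, I think the key is `mapInPandas`, which (unlike the scalar pandas UDFs) does accept a function over `Iterator[pd.DataFrame]` and calls it once per partition. A minimal sketch of my understanding -- `spark_df`, `model`, `features`, and `schema` are placeholders here, and this assumes a model whose `shap_values` returns a single 2-D array:

```python
from collections.abc import Iterator

import pandas as pd
import shap

def shap_per_partition(dfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # `model` and `features` are assumed defined elsewhere (placeholders).
    # The explainer is built once per partition, on the executor running it.
    explainer = shap.TreeExplainer(model)
    for df in dfs:
        # Each `df` is one Spark partition's rows, delivered as pandas.
        yield pd.DataFrame(explainer.shap_values(df[features]), columns=features)

# Spark calls shap_per_partition once per partition, in parallel across
# executors; repartition(n) controls how many chunks can run concurrently.
shap_values_df = spark_df.repartition(64).mapInPandas(shap_per_partition, schema=schema)
```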
```python
model_features = ['feature1', 'feature2']
mode = df.mode().iloc[0]

iterator = iter([df])
```
This is the thing I mentioned before -- doesn't this function work equally well if you pass `[df]` in directly, rather than converting it into an iterator?
This is used in the context of Spark's repartition, which I believe needs an iterator. You are right that in this dummy case in our unit test, it doesn't serve a purpose.
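For example (a sketch; assumes the `df` and `mode` defined in the test above, with any other keyword arguments the real function takes omitted):

```python
# A plain list is already iterable, so wrapping it in iter() adds nothing here.
shap_chunks = list(calculate_shap_values([df], mode=mode))
```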
I read the full blog post from Databricks that I linked to -- looks like we're mostly just doing what they recommend in there. imo the API for Pandas UDFs in Spark is... weird, and was the cause of my immediate confusion. I believe this should work, so please merge it. That said, I'm going to come around after you and tidy this up in a way that makes it more obvious what's going on. Does that sound okay to you?
@vishpillai123 I re-ran the CI jobs -- there's a test that's juuuuust slightly flaky -- and they all pass. I'm going to merge this in, since I believe I understand what's going on, so I can refactor/rename the logic for clarity. Please forgive me if I'm jumping the gun here!
Hey, no worries. I had to jump onto some other work, with inference generation being the priority this month. I know this is still pretty incomplete, but I think we need to add a notebook to provide some clarity; otherwise, this function is quite confusing to use on its own. I'll start working on that soon, pending some other priorities.
SHAP values need parallelization in order to scale for inference. Typically, predictions are generated for hundreds to thousands of students, so it is critical to have this functionality. Spark's repartition does not preserve row order, so the unique identifier needs to be attached to every batch, as sketched below.
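For instance (a hypothetical sketch, not the actual implementation: `student_id` stands in for whatever the unique identifier column is, and `explain` for the real per-batch SHAP computation):

```python
from collections.abc import Callable, Iterator

import pandas as pd

def calculate_shap_values_with_ids(
    dfs: Iterator[pd.DataFrame],
    explain: Callable[[pd.DataFrame], pd.DataFrame],
    features: list[str],
    id_col: str = "student_id",  # hypothetical identifier column name
) -> Iterator[pd.DataFrame]:
    for df in dfs:
        shap_df = pd.DataFrame(explain(df[features]), columns=features)
        # Re-attach the unique identifier to every batch: repartition does not
        # preserve row order, so the ID is the only reliable join key back to
        # the original predictions.
        shap_df[id_col] = df[id_col].to_numpy()
        yield shap_df
```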