
Update inference python file to include shap function #48

Merged (3 commits, Jan 8, 2025)

Conversation

vishpillai123 (Contributor):
SHAP values need parallelization in order to scale for inference. Predictions are typically generated for hundreds to thousands of students, so this functionality is critical. Spark's repartition does not preserve row order, so the unique identifier needs to be attached to every batch.
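The "attach the identifier to every batch" idea can be sketched without Spark: because a framework that repartitions data may reorder batches, each output row must carry the row's unique key so results can be joined back to the input. A minimal stdlib stand-in (`score_batches` and the field names are hypothetical, not part of this repo's API):

```python
from collections.abc import Iterator

def score_batches(batches: Iterator[list[dict]]) -> Iterator[list[dict]]:
    # Carry the unique identifier through each output row, so results can
    # be re-joined to the input even if the framework reorders batches.
    for batch in batches:
        yield [{"student_id": row["student_id"], "score": row["value"] * 10}
               for row in batch]

batches = iter([[{"student_id": "a", "value": 1}],
                [{"student_id": "b", "value": 2}]])
out = [row for batch in score_batches(batches) for row in batch]
print(out)  # [{'student_id': 'a', 'score': 10}, {'student_id': 'b', 'score': 20}]
```

If the output rows lacked `student_id`, any downstream join back to student records would depend on row order, which repartitioning does not guarantee.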

changes

  • Added "calculate_shap_values" to inference.py.
  • Added respective unit test within tests subdirectory.

context

  • SHAP values are needed for interpretability. Proper scaling is required for inference.

questions

  • No specific questions

@vishpillai123 vishpillai123 changed the title Add shap functions Updating inference python files to include shap functions Jan 2, 2025
@vishpillai123 vishpillai123 changed the title Updating inference python files to include shap functions Updating inference python file to include shap function Jan 2, 2025
@vishpillai123 vishpillai123 changed the title Updating inference python file to include shap function Update inference python file to include shap function Jan 2, 2025
src/student_success_tool/modeling/inference.py

def calculate_shap_values(
    iterator: Iterator[pd.DataFrame], *,
bdewilde (Member):
I believe this is actually an Iterable, not an Iterator. And those should be imported from collections.abc -- see https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable
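The distinction the reviewer is drawing can be shown concretely: an `Iterable` is anything you can loop over (a list, a generator), while an `Iterator` is the narrower single-pass object with `__next__`. Both ABCs live in `collections.abc`:

```python
from collections.abc import Iterable, Iterator

# A list is an Iterable but NOT an Iterator: you can loop over it
# repeatedly, but it has no __next__ of its own.
batches = [[1, 2], [3, 4]]
print(isinstance(batches, Iterable))   # True
print(isinstance(batches, Iterator))   # False

# iter() produces a single-pass Iterator; every Iterator is also Iterable.
it = iter(batches)
print(isinstance(it, Iterator))        # True
print(next(it))                        # [1, 2]
```

Annotating the parameter as `Iterable[pd.DataFrame]` would accept both plain lists of frames and true iterators; `Iterator[pd.DataFrame]` accepts only the latter.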

bdewilde (Member):

Also: It's better to name variables descriptively, rather than by their type. I would probably call this dfs: Iterable[pd.DataFrame]. You might say, wait, isn't "df" just shorthand for DataFrame, the type? Yes, technically that's true, but using "df" to refer to a data frame is so generic and common at this point, I always hesitate to go against the expectation. At least "dfs" makes it clear that this variable contains multiple data frames.

vishpillai123 (Contributor Author):

This is a copy of what we have in the private repo, where the legacy function's parameter was annotated as an "Iterator". Perhaps it is more of an iterable. Either way, I can rename the variables descriptively rather than by type, as you suggest.

    mode: pd.Series,
) -> Iterator[pd.DataFrame]:
    """
    SHAP is computationally expensive, so this function enables parallelization,
bdewilde (Member):

Dumb question: Where is the parallelization happening? To me, it looks like this function is iterating over the dataframes one after the other in a for loop.

vishpillai123 (Contributor Author):

I just copied this function over from the private repo. I was planning to merge this PR and then add the SHAP values notebook, which is where the parallelization happens via spark.repartition. Not a dumb question: you are correct that parallelization is not happening in this function; it is just an iteration.

Do you think it's best to create a notebook, or to put all the parallelization in this function? If we create a notebook, I can keep it outside the PDP template notebooks, or make it the 4th template notebook. Curious about your thoughts.

bdewilde (Member):

I guess I don't understand how spark.repartition works? Normally I'd expect an operation to-be-parallelized to have a function that operates on one "chunk" of the iterable, and then some outside framework calls that function in parallel over chunks. How does it work if the iterable is inside the function?

Apologies if I'm being dense, I am properly confused! 😅

bdewilde (Member):

Is this what's going on here? https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html

In all the docs, I only see them supporting Iterator[pd.Series] inputs, rather than Iterator[pd.DataFrame], so even if this is what's being used here, I'm still confused!
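For what it's worth, Spark 3.0+ does support the frame-at-a-time shape via `DataFrame.mapInPandas`, which takes a function of type `Iterator[pd.DataFrame] -> Iterator[pd.DataFrame]` (the `Iterator[pd.Series]` form is the scalar-iterator pandas UDF). The calling convention can be mimicked in plain Python: the framework, not the function, splits data into partitions and invokes the function once per partition, passing that partition's batches as an iterator. A stdlib stand-in (`process_batches` and the partition layout are illustrative, not Spark APIs):

```python
from collections.abc import Iterator

def process_batches(batches: Iterator[list[int]]) -> Iterator[list[int]]:
    # Spark-style pattern: receive an iterator of batches for ONE partition,
    # do per-batch work, and yield result batches lazily.
    for batch in batches:
        yield [x * 2 for x in batch]

# Stand-in for the framework: it calls the function once per partition
# (in real Spark, these calls run in parallel on executors).
partitions = [[[1, 2], [3]], [[4, 5]]]
results = [out for part in partitions for out in process_batches(iter(part))]
print(results)  # [[2, 4], [6], [8, 10]]
```

This is why the iterator lives "inside" the function: each parallel task owns one partition's iterator, and the outer framework supplies the parallelism.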

bdewilde (Member):

model_features = ['feature1', 'feature2']
mode = df.mode().iloc[0]

iterator = iter([df])
bdewilde (Member):

This is the thing I mentioned before -- doesn't this function work equally well if you pass [df] in directly, rather than converting it into an iterator?

vishpillai123 (Contributor Author):

This is used in the context of Spark's repartition, which I believe needs an iterator. You're right that in this dummy case in our unit test, it doesn't serve a purpose.

tests/modeling/test_inference.py
bdewilde (Member) left a comment:

I read the full Databricks blog post I linked to -- looks like we're mostly just doing what they recommend in there. imo the API for Pandas UDFs in Spark is... weird, and was the cause of my immediate confusion. I believe this should work, so please merge it. That said, I'm going to come around after you and tidy this up in a way that makes what's going on more obvious. Does that sound okay to you?

bdewilde (Member) commented Jan 8, 2025:

@vishpillai123 I re-ran the CI jobs -- there's a test that's juuuuust slightly flakey -- and they all pass. I'm going to merge this in, since I believe I understand what's going on, so I can refactor/rename the logic for clarity. Please forgive if I'm jumping the gun here!

@bdewilde bdewilde merged commit 17d7636 into develop Jan 8, 2025
5 checks passed
@bdewilde bdewilde deleted the add-shap-functions branch January 8, 2025 01:27
vishpillai123 (Contributor Author):

Hey, no worries. I had to jump onto some other work, with inference generation being the priority this month. I know this is still pretty incomplete, but I think we need to add a notebook to provide some clarity; otherwise, this function is quite confusing to use on its own. I'll start working on that soon, pending some other priorities.

2 participants