Update inference Python file to include SHAP function #48
Conversation
```python
def calculate_shap_values(
    iterator: Iterator[pd.DataFrame], *,
```
I believe this is actually an `Iterable`, not an `Iterator`. And those should be imported from `collections.abc` -- see https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable
Also: It's better to name variables descriptively, rather than by their type. I would probably call this `dfs: Iterable[pd.DataFrame]`. You might say, wait, isn't "df" just shorthand for `DataFrame`, the type? Yes, technically that's true, but using "df" to refer to a data frame is so generic and common at this point, I always hesitate to go against the expectation. At least "dfs" makes it clear that this variable contains multiple data frames.
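For concreteness, here's a sketch of just the signature change I'm suggesting (body elided):

```python
from collections.abc import Iterable, Iterator

import pandas as pd

def calculate_shap_values(
    dfs: Iterable[pd.DataFrame], *,
    mode: pd.Series,
) -> Iterator[pd.DataFrame]:
    # Accepts any iterable of data frames (plain lists included) and yields
    # results one data frame at a time, hence the Iterator return annotation.
    ...
```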
So this is a copy of what we have in the private repo, where the legacy function called this argument an "iterator". Perhaps it is more of an iterable. Either way, I can rename the variables descriptively rather than by type, as you suggest.
```python
    mode: pd.Series
) -> Iterator[pd.DataFrame]:
    """
    SHAP is computationally expensive, so this function enables parallelization,
```
Dumb question: Where is the parallelization happening? To me, it looks like this function is iterating over the dataframes one after the other in a `for` loop.
So, I just copied over our function from the private repo. I was planning on merging this PR and then working on adding the SHAP values notebook, which is where the parallelization happens, using `spark.repartition`. Not a dumb question -- you are correct that parallelization is not happening in this function; it is just an iteration.

Do you think it's best to create a notebook, or to put all of the parallelization in this function? And if we create a notebook, I can keep it outside of the PDP template notebooks, or make it the 4th template notebook. Curious to hear your thoughts?
I guess I don't understand how `spark.repartition` works? Normally I'd expect an operation to-be-parallelized to have a function that operates on one "chunk" of the iterable, and then some outside framework calls that function in parallel over chunks. How does it work if the iterable is inside the function?

Apologies if I'm being dense, I am properly confused! 😅
Is this what's going on here? https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html

In all the docs, I only see them supporting `Iterator[pd.Series]` inputs, rather than `Iterator[pd.DataFrame]`, so even if this is what's being used here, I'm still confused!
Actually maybe this post is the closest analogue? https://www.databricks.com/blog/2022/02/02/scaling-shap-calculations-with-pyspark-and-pandas-udf.html
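If so, I think the key is `mapInPandas`, which (unlike the scalar pandas UDFs) does accept a function over `Iterator[pd.DataFrame]` and calls it once per partition. A minimal sketch of my understanding -- `spark_df`, `model`, `features`, and `schema` are placeholders here, and this assumes a model whose `shap_values` returns a single 2-D array:

```python
from collections.abc import Iterator

import pandas as pd
import shap

def shap_per_partition(dfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # `model` and `features` are assumed defined elsewhere (placeholders).
    # The explainer is built once per partition, on the executor running it.
    explainer = shap.TreeExplainer(model)
    for df in dfs:
        # Each `df` is one Spark partition's rows, delivered as pandas.
        yield pd.DataFrame(explainer.shap_values(df[features]), columns=features)

# Spark calls shap_per_partition once per partition, in parallel across
# executors; repartition(n) controls how many chunks can run concurrently.
shap_values_df = spark_df.repartition(64).mapInPandas(shap_per_partition, schema=schema)
```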
```python
model_features = ['feature1', 'feature2']
mode = df.mode().iloc[0]

iterator = iter([df])
```
This is the thing I mentioned before -- doesn't this function work equally well if you pass `[df]` in directly, rather than converting it into an iterator?
This is used in the context of Spark's repartition, which I believe needs an iterator. You are right that in this dummy case in our unit test, it doesn't serve a purpose.
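For example (a sketch; assumes the `df` and `mode` defined in the test above, with any other keyword arguments the real function takes omitted):

```python
# A plain list is already iterable, so wrapping it in iter() adds nothing here.
shap_chunks = list(calculate_shap_values([df], mode=mode))
```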
I read the full blog post from Databricks that I linked to -- looks like we're mostly just doing what they recommend in there. imo the API for Pandas UDFs in Spark is... weird, and was the cause of my immediate confusion. I believe this should work, so please merge it. That said, I'm going to come around after you and tidy this up in a way that makes it more obvious what's going on. Does that sound okay to you?
@vishpillai123 I re-ran the CI jobs -- there's a test that's juuuuust slightly flaky -- and they all pass. I'm going to merge this in, since I believe I understand what's going on, so I can refactor/rename the logic for clarity. Please forgive me if I'm jumping the gun here!
Hey, no worries. I had to jump onto some other work, with inference generation being the priority this month. I know this is still pretty incomplete, but I think we need to add a notebook to provide some clarity; otherwise, this function is quite confusing to use on its own. I'll start working on that soon, pending some other priorities.
SHAP values need parallelization in order to scale for inference. Typically, predictions are generated for hundreds to thousands of students, so it is critical to have this functionality. Spark's repartition does not preserve row order, so the unique identifier needs to be attached to every batch, as sketched below.
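For instance (a hypothetical sketch, not the actual implementation: `student_id` stands in for whatever the unique identifier column is, and `explain` for the real per-batch SHAP computation):

```python
from collections.abc import Callable, Iterator

import pandas as pd

def calculate_shap_values_with_ids(
    dfs: Iterator[pd.DataFrame],
    explain: Callable[[pd.DataFrame], pd.DataFrame],
    features: list[str],
    id_col: str = "student_id",  # hypothetical identifier column name
) -> Iterator[pd.DataFrame]:
    for df in dfs:
        shap_df = pd.DataFrame(explain(df[features]), columns=features)
        # Re-attach the unique identifier to every batch: repartition does not
        # preserve row order, so the ID is the only reliable join key back to
        # the original predictions.
        shap_df[id_col] = df[id_col].to_numpy()
        yield shap_df
```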