Python evaluator module fix #863
Remove redundant and unnecessary sortings
Refactor get_top_k_items to return DataFrame with 'rank' column same as pyspark's
LGTM
this is great, small improvement suggested
dataframe.groupby(col_user, as_index=False)
.apply(lambda x: x.nlargest(k, col_rating))
.reset_index(drop=True)
)
top_k_items["rank"] = top_k_items.groupby(col_user).cumcount() + 1
you can avoid the repeated groupby too
groups = dataframe.groupby(col_user, as_index=False)
top_k_items = groups.apply(lambda x: x.nlargest(k, col_rating)).reset_index(drop=True)
top_k_items["rank"] = groups.cumcount() + 1
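One thing worth checking before reusing `groups` this way: `cumcount()` on the original groups numbers rows in the input frame's order and keeps the input frame's index, while `top_k_items` has been reordered and re-indexed. The sketch below is a minimal illustration with made-up data and hypothetical column names (`userID`, `itemID`, `prediction`). It builds the top-k frame with `sort_values` plus `groupby(...).head(k)` rather than the `groupby(...).apply(...)` call from the diff, purely to keep the example independent of pandas version differences in `apply`.

```python
import pandas as pd

# Made-up data; column names are assumptions for illustration only.
df = pd.DataFrame(
    {
        "userID": [1, 1, 1, 2, 2, 2],
        "itemID": [10, 11, 12, 10, 11, 12],
        "prediction": [0.1, 0.9, 0.5, 0.3, 0.2, 0.8],
    }
)
k = 2

# Top-k rows per user, best prediction first, with a fresh 0..n-1 index.
top_k = (
    df.sort_values(["userID", "prediction"], ascending=[True, False])
    .groupby("userID")
    .head(k)
    .reset_index(drop=True)
)

# Rank computed on the top-k frame itself: position within each user's rows.
rank_regrouped = (top_k.groupby("userID").cumcount() + 1).tolist()

# Rank taken from the ORIGINAL groups: the Series carries the original
# index, so column assignment aligns it by label, not by top-k position.
reused = top_k.copy()
reused["rank"] = df.groupby("userID").cumcount() + 1
rank_reused = reused["rank"].tolist()

print(rank_regrouped)
print(rank_reused)
```

With this data the two rank columns differ, which suggests the second groupby on `top_k_items` is doing real work rather than being a redundant pass.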
@@ -651,14 +648,16 @@ def get_top_k_items(
k (int): number of items for each user

Returns:
- pd.DataFrame: DataFrame of top k items for each user
+ pd.DataFrame: DataFrame of top k items for each user, sorted by `col_user` and `"rank"`
i would remove the double quotes from rank to match just the backticks like col_user
also, in the returns section of get_top_k_items =)
good catch!
Great
one more "rank" is there, if you can fix that then we're good
@gramhagen Few changes since the last review:
oh interesting, i didn't realize we use the python evaluation to validate test results for spark, we should remove that linkage, I'll add a separate feature request
oh, i take it back, I guess that's an additional check just to ensure they match. i guess it helped in this case.
@loomlike feel free to merge when you think it is convenient
* Python evaluator module fix
  Remove redundant and unnecessary sortings
  Refactor get_top_k_items to return DataFrame with 'rank' column same as pyspark's
* Update test to catch corner case
Description
The Python evaluation module's ranking metric functions contain redundant and unnecessary sorting code.
For example, there is no need to call `rank()`, since `df_hit` is already sorted by user and rating: it is generated by a groupby on user (pandas groupby's `sort` argument is `True` by default) followed by `nlargest` on ratings.
This change removes those redundant sorts and also refactors `get_top_k_items` to return a DataFrame with a 'rank' column, making its behavior the same as our pyspark evaluation module.
Related Issues
Checklist:
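For readers who want to see the ordering argument from the description in action, here is a minimal sketch, not the code merged in this PR. The helper `top_k_with_rank`, the sample data, and the column names are all hypothetical. It iterates over the groups explicitly (instead of `groupby(...).apply(...)`) so that the two properties the description relies on are visible: groups come out in sorted key order because `sort=True` is the groupby default, and `nlargest` returns each group's rows in descending rating order.

```python
import pandas as pd


def top_k_with_rank(df, col_user="userID", col_rating="prediction", k=10):
    """Hypothetical helper: top-k rows per user plus a 1-based 'rank' column."""
    # Iterating a groupby yields (key, sub-frame) pairs in sorted key order
    # (sort=True is the default); nlargest orders each sub-frame by rating.
    parts = [group.nlargest(k, col_rating) for _, group in df.groupby(col_user)]
    top_k = pd.concat(parts).reset_index(drop=True)
    # Rows are already ordered by user, then by descending rating, so the
    # within-user position is the rank; no extra sort or rank() call needed.
    top_k["rank"] = top_k.groupby(col_user).cumcount() + 1
    return top_k


# Made-up input with users deliberately interleaved.
df = pd.DataFrame(
    {
        "userID": [2, 1, 2, 1, 2, 1],
        "itemID": [10, 10, 11, 11, 12, 12],
        "prediction": [0.3, 0.1, 0.2, 0.9, 0.8, 0.5],
    }
)
result = top_k_with_rank(df, k=2)
print(result)
```

The sample data has no tied ratings; how ties are broken is left to `nlargest` defaults and is not addressed by this sketch.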