
Python evaluator module fix #863

Merged: 3 commits merged into staging from jumin/evaluation-fix on Jul 15, 2019
Conversation

@loomlike (Collaborator) commented Jul 12, 2019

Description

The Python evaluation module's ranking metric functions contain redundant and unnecessary sorting code. E.g.

df_hit["rank"] = df_hit.groupby(col_user)[col_prediction].rank(
        method="first", ascending=False
)

doesn't need rank(), since df_hit is already sorted by user and rating: it is produced by a groupby on user (pandas groupby's sort argument defaults to True) followed by nlargest on the ratings.

This change removes those redundant and unnecessary sorts, and also refactors get_top_k_items to return a DataFrame with a 'rank' column so that its behavior matches our pyspark evaluation module.
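The redundancy can be sketched as follows. This is a minimal illustration with hypothetical `userID`/`prediction` columns and k=2; the `pd.concat` over groups stands in for the module's groupby/apply pattern, which produces the same ordering:

```python
import pandas as pd

df = pd.DataFrame({
    "userID": [2, 1, 1, 2, 1],
    "prediction": [0.3, 0.9, 0.5, 0.8, 0.7],
})

# Top-2 items per user: groupby yields users in sorted order, and
# nlargest sorts each user's rows by descending prediction, so the
# concatenated result is already fully ordered.
df_hit = pd.concat(
    g.nlargest(2, "prediction") for _, g in df.groupby("userID")
).reset_index(drop=True)

# The rank() call removed by this PR...
via_rank = df_hit.groupby("userID")["prediction"].rank(
    method="first", ascending=False
)
# ...produces the same values as a plain cumulative count over the
# already-sorted rows.
via_cumcount = df_hit.groupby("userID").cumcount() + 1

assert via_rank.tolist() == [1.0, 2.0, 1.0, 2.0]
assert via_cumcount.tolist() == [1, 2, 1, 2]
```

Since rank() has to sort within each group while cumcount() only counts rows, dropping the rank() call avoids the extra per-group sort.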

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.

Remove redundant and unnecessary sorts
Refactor get_top_k_items to return DataFrame with 'rank' column,
    same as pyspark's
@miguelgfierro (Collaborator) left a comment

LGTM

@gramhagen (Collaborator) left a comment

this is great, small improvement suggested

top_k_items = (
    dataframe.groupby(col_user, as_index=False)
    .apply(lambda x: x.nlargest(k, col_rating))
    .reset_index(drop=True)
)
top_k_items["rank"] = top_k_items.groupby(col_user).cumcount() + 1

you can avoid the repeated groupby too

groups = dataframe.groupby(col_user, as_index=False)
top_k_items = groups.apply(lambda x: x.nlargest(k, col_rating)).reset_index(drop=True)
top_k_items["rank"] = groups.cumcount() + 1

@@ -651,14 +648,16 @@ def get_top_k_items(
        k (int): number of items for each user

    Returns:
-        pd.DataFrame: DataFrame of top k items for each user
+        pd.DataFrame: DataFrame of top k items for each user, sorted by `col_user` and `"rank"`

i would remove the double quotes from rank to match just the backticks like col_user

@gramhagen (Collaborator) commented Jul 12, 2019

also, in the returns section of get_top_k_items =)

@loomlike (Collaborator, Author)
good catch!

@yueguoguo (Collaborator) left a comment

Great

@gramhagen (Collaborator) left a comment

One more "rank" is there; if you can fix that then we're good.

@loomlike (Collaborator, Author)

@gramhagen A few changes since the last review:

  1. Changed "rank" to rank.
  2. Caching groups turns out not to work, since nlargest sorts the ratings while the cached group object still refers to the unsorted rows. I changed it back to use groupby again, but added sort=False so the groupby is performed efficiently (groupby without sorting still preserves the within-group order, and the rows were already sorted by nlargest).
  3. I found the above issue through Spark's unit tests, which match the Spark evaluation function results against Python's. The Python evaluation tests couldn't catch the error because the test-case users and items were already sorted; I made a simple tweak to the test case so that it can catch such errors in the future.
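The cached-groupby pitfall described in (2) can be sketched as follows. This is a minimal illustration with hypothetical `userID`/`rating` columns and k=2; the `pd.concat` over groups stands in for the module's groupby/apply call:

```python
import pandas as pd

# User 1's ratings arrive deliberately unsorted.
df = pd.DataFrame({
    "userID": [1, 1, 2, 2],
    "rating": [1.0, 5.0, 4.0, 3.0],
})

groups = df.groupby("userID")
top_k = pd.concat(
    g.nlargest(2, "rating") for _, g in groups
).reset_index(drop=True)

# Pitfall: the cached `groups` object still sees the original row
# order, so its cumcount() gives rating 5.0 a "rank" of 2 even though
# nlargest() put it first for user 1.
wrong = groups.cumcount() + 1
assert wrong[df["rating"] == 5.0].iloc[0] == 2

# Fix: re-group the sorted output. sort=False skips re-sorting the
# group keys and preserves the within-group order produced by nlargest().
top_k["rank"] = top_k.groupby("userID", sort=False).cumcount() + 1
assert top_k.loc[top_k["rating"] == 5.0, "rank"].iloc[0] == 1
```

The second groupby is cheap here because sort=False avoids re-sorting keys that are already in order, which is the efficiency point made above.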

@gramhagen (Collaborator)

Oh interesting, I didn't realize we use the Python evaluation to validate test results for Spark. We should remove that linkage; I'll add a separate feature request.

@gramhagen (Collaborator)

Oh, I take it back: I guess that's an additional check just to ensure they match, and it helped in this case.

@miguelgfierro (Collaborator)

@loomlike feel free to merge when you think it is convenient

@loomlike loomlike merged commit 793799a into staging Jul 15, 2019
@loomlike loomlike deleted the jumin/evaluation-fix branch July 15, 2019 14:49
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019
* Python evaluator module fix

Remove redundant and unnecessary sorts
Refactor get_top_k_items to return DataFrame with 'rank' column
    same as pyspark's

* Update test to catch corner case