
Python evaluator module fix #863

Merged: 3 commits merged into staging from jumin/evaluation-fix on Jul 15, 2019
Conversation

@loomlike (Collaborator) commented Jul 12, 2019

Description

The Python evaluation module's ranking metric functions contain redundant and unnecessary sorting code. E.g.

df_hit["rank"] = df_hit.groupby(col_user)[col_prediction].rank(
        method="first", ascending=False
)

doesn't need rank(), since df_hit is already sorted by user and rating: it is produced by a groupby on user (pandas groupby's sort argument defaults to True) followed by nlargest on the ratings.

This change removes those redundant and unnecessary sorts, and also refactors get_top_k_items to return a DataFrame with a 'rank' column so that its behavior matches our pyspark evaluation module.
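The redundancy can be sketched as follows. This is a minimal illustration with hypothetical `userID`/`prediction` columns and k=2; the `pd.concat` over groups stands in for the module's groupby/apply pattern, which produces the same ordering:

```python
import pandas as pd

df = pd.DataFrame({
    "userID": [2, 1, 1, 2, 1],
    "prediction": [0.3, 0.9, 0.5, 0.8, 0.7],
})

# Top-2 items per user: groupby yields users in sorted order, and
# nlargest sorts each user's rows by descending prediction, so the
# concatenated result is already fully ordered.
df_hit = pd.concat(
    g.nlargest(2, "prediction") for _, g in df.groupby("userID")
).reset_index(drop=True)

# The rank() call removed by this PR...
via_rank = df_hit.groupby("userID")["prediction"].rank(
    method="first", ascending=False
)
# ...produces the same values as a plain cumulative count over the
# already-sorted rows.
via_cumcount = df_hit.groupby("userID").cumcount() + 1

assert via_rank.tolist() == [1.0, 2.0, 1.0, 2.0]
assert via_cumcount.tolist() == [1, 2, 1, 2]
```

Since rank() has to sort within each group while cumcount() only counts rows, dropping the rank() call avoids the extra per-group sort.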

Related Issues

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.

Remove redundant and unnecessary sorts
Refactor get_top_k_items to return DataFrame with 'rank' column,
    same as pyspark's
@miguelgfierro (Collaborator) left a comment

LGTM

@gramhagen (Collaborator) left a comment

this is great, small improvement suggested

top_k_items = (
    dataframe.groupby(col_user, as_index=False)
    .apply(lambda x: x.nlargest(k, col_rating))
    .reset_index(drop=True)
)
top_k_items["rank"] = top_k_items.groupby(col_user).cumcount() + 1

you can avoid the repeated groupby too

groups = dataframe.groupby(col_user, as_index=False)
top_k_items = groups.apply(lambda x: x.nlargest(k, col_rating)).reset_index(drop=True)
top_k_items["rank"] = groups.cumcount() + 1

@@ -651,14 +648,16 @@ def get_top_k_items(
        k (int): number of items for each user

    Returns:
-        pd.DataFrame: DataFrame of top k items for each user
+        pd.DataFrame: DataFrame of top k items for each user, sorted by `col_user` and `"rank"`

i would remove the double quotes from rank to match just the backticks like col_user

@gramhagen (Collaborator) commented Jul 12, 2019

also, in the returns section of get_top_k_items =)

@loomlike (Collaborator, Author)
good catch!

@yueguoguo (Collaborator) left a comment

Great

@gramhagen (Collaborator) left a comment

One more "rank" is there; if you can fix that then we're good.

@loomlike (Collaborator, Author)

@gramhagen A few changes since the last review:

  1. Changed "rank" to rank.
  2. Caching groups turns out not to work, since nlargest sorts the ratings while the cached group object still refers to the unsorted rows. I changed it back to use groupby again, but added sort=False so the groupby is performed efficiently (groupby without sorting still preserves the within-group order, and the rows were already sorted by nlargest).
  3. I found the above issue through Spark's unit tests, which match the Spark evaluation function results against Python's. The Python evaluation tests couldn't catch the error because the test-case users and items were already sorted; I made a simple tweak to the test case so that it can catch such errors in the future.
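The cached-groupby pitfall described in (2) can be sketched as follows. This is a minimal illustration with hypothetical `userID`/`rating` columns and k=2; the `pd.concat` over groups stands in for the module's groupby/apply call:

```python
import pandas as pd

# User 1's ratings arrive deliberately unsorted.
df = pd.DataFrame({
    "userID": [1, 1, 2, 2],
    "rating": [1.0, 5.0, 4.0, 3.0],
})

groups = df.groupby("userID")
top_k = pd.concat(
    g.nlargest(2, "rating") for _, g in groups
).reset_index(drop=True)

# Pitfall: the cached `groups` object still sees the original row
# order, so its cumcount() gives rating 5.0 a "rank" of 2 even though
# nlargest() put it first for user 1.
wrong = groups.cumcount() + 1
assert wrong[df["rating"] == 5.0].iloc[0] == 2

# Fix: re-group the sorted output. sort=False skips re-sorting the
# group keys and preserves the within-group order produced by nlargest().
top_k["rank"] = top_k.groupby("userID", sort=False).cumcount() + 1
assert top_k.loc[top_k["rating"] == 5.0, "rank"].iloc[0] == 1
```

The second groupby is cheap here because sort=False avoids re-sorting keys that are already in order, which is the efficiency point made above.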

@gramhagen (Collaborator)

Oh interesting, I didn't realize we use the Python evaluation to validate test results for Spark. We should remove that linkage; I'll add a separate feature request.

@gramhagen (Collaborator)

Oh, I take it back: I guess that's an additional check just to ensure they match, and it helped in this case.

@miguelgfierro (Collaborator)

@loomlike feel free to merge when you think it is convenient

@loomlike loomlike merged commit 793799a into staging Jul 15, 2019
@loomlike loomlike deleted the jumin/evaluation-fix branch July 15, 2019 14:49
yueguoguo pushed a commit that referenced this pull request Sep 9, 2019
* Python evaluator module fix

Remove redundant and unnecessary sorts
Refactor get_top_k_items to return DataFrame with 'rank' column
    same as pyspark's

* Update test to catch corner case